SFU NatLangLab
Natural Language Processing
Angel Xuan Chang (angelxuanchang.github.io/nlp-class)
adapted from lecture slides from Anoop Sarkar
Simon Fraser University
January 23, 2020
Part 1: Classification tasks in NLP
Classification tasks in NLP
Naive Bayes Classifier
Log linear models
Sentiment classification: Movie reviews
- neg: unbelievably disappointing
- pos: Full of zany characters and richly applied satire, and some great plot twists
- pos: this is the greatest screwball comedy ever filmed
- neg: It was pathetic. The worst part about it was the boxing scenes.
Intent Detection
- ADDR CHANGE: I just moved and want to change my address.
- ADDR CHANGE: Please help me update my address.
- FILE CLAIM: I just got into a terrible accident and I want to file a claim.
- CLOSE ACCOUNT: I'm moving and I want to disconnect my service.
Prepositional Phrases
- noun attach: I bought the shirt with pockets
- verb attach: I bought the shirt with my credit card
- noun attach: I washed the shirt with mud
- verb attach: I washed the shirt with soap
- Attachment depends on the meaning of the entire sentence; it needs world knowledge, etc.
- Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words
Ambiguity Resolution: Prepositional Phrases in English
Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
...        ...          ...   ...       ...
Prepositional Phrase Attachment

Method                              Accuracy
Always noun attachment              59.0
Most likely for each preposition    72.2
Average Human (4 head words only)   88.2
Average Human (whole sentence)      93.2
Back-off Smoothing
- Random variable a represents attachment.
- a = n1 or a = v (two-class classification)
- We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2).
- The probability of verb attachment is 1 - p(a = n1 | v, n1, p, n2).
Back-off Smoothing
1. If $f(v, n_1, p, n_2) > 0$ and $\hat{p} \neq 0.5$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p, n_2)}{f(v, n_1, p, n_2)}$
2. Else if $f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2) > 0$ and $\hat{p} \neq 0.5$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p) + f(a_{n_1}, v, p, n_2) + f(a_{n_1}, n_1, p, n_2)}{f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2)}$
3. Else if $f(v, p) + f(n_1, p) + f(p, n_2) > 0$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, p) + f(a_{n_1}, n_1, p) + f(a_{n_1}, p, n_2)}{f(v, p) + f(n_1, p) + f(p, n_2)}$
4. Else if $f(p) > 0$ (try choosing attachment based on the preposition alone):
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, p)}{f(p)}$
5. Else $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = 1.0$
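As a concrete illustration, here is a minimal Python sketch of this back-off procedure; the count tables, their tuple keys, and the function name are illustrative assumptions, not taken from Collins and Brooks.

```python
from collections import Counter

def backoff_pp_attach(v, n1, p, n2, counts, counts_n1):
    """Estimate p-hat(noun attachment | v, n1, p, n2) by backing off over tuples.

    counts[t]    = number of training examples whose head words match tuple t
    counts_n1[t] = number of those examples labelled as noun attachment
    """
    stages = [
        [(v, n1, p, n2)],                       # 1. the full quadruple
        [(v, n1, p), (v, p, n2), (n1, p, n2)],  # 2. triples containing the preposition
        [(v, p), (n1, p), (p, n2)],             # 3. pairs containing the preposition
        [(p,)],                                 # 4. the preposition alone
    ]
    for i, tuples in enumerate(stages):
        denom = sum(counts[t] for t in tuples)
        if denom > 0:
            p_hat = sum(counts_n1[t] for t in tuples) / denom
            if i >= 2 or p_hat != 0.5:          # the p-hat != 0.5 tie-break applies to stages 1-2 only
                return p_hat
    return 1.0                                  # 5. default to noun attachment

# Toy usage: a single training example ("bought shirt with pockets", noun attachment).
counts, counts_n1 = Counter(), Counter()
for t in [("bought", "shirt", "with", "pockets"), ("bought", "shirt", "with"),
          ("bought", "with", "pockets"), ("shirt", "with", "pockets"),
          ("bought", "with"), ("shirt", "with"), ("with", "pockets"), ("with",)]:
    counts[t] += 1
    counts_n1[t] += 1
print(backoff_pp_attach("bought", "shirt", "with", "pockets", counts, counts_n1))  # 1.0
```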
Prepositional Phrase Attachment: Results
- Collins and Brooks 1995: 84.5% accuracy, with the use of some limited word classes for dates, numbers, etc.
- Toutanova, Manning, and Ng, 2004: use a sophisticated smoothing model for PP attachment; 86.18% with words & stems, 87.54% with word classes
- Merlo, Crocker and Berthouzoz, 1997: test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PP: 69.6%, 3 PP: 43.6%
Part 2: Probabilistic Classifiers
(adapted from lecture slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan)
Classification Task
- Input:
  - A document d
  - A set of classes C = {c1, c2, ..., cm}
- Output: predicted class c for document d
- Example:
  - neg: unbelievably disappointing
  - pos: this is the greatest screwball comedy ever filmed
Supervised learning: Let's use statistics!
- Inputs:
  - Set of m classes C = {c1, c2, ..., cm}
  - Set of n labeled documents: {(d1, c1), (d2, c2), ..., (dn, cn)}
- Output: trained classifier F : d → c
  - What form should F take?
  - How to learn F?
Types of supervised classifiers
Naive Bayes Classifier
- x is the input, represented as d independent features f_j, 1 ≤ j ≤ d
- y is the output classification
- $P(y \mid x) = \frac{P(y) \cdot P(x \mid y)}{P(x)}$ (Bayes Rule)
- $P(x \mid y) = \prod_{j=1}^{d} P(f_j \mid y)$
- $P(y \mid x) \propto P(y) \cdot \prod_{j=1}^{d} P(f_j \mid y)$
- We can ignore P(x) in the above equation because it is a constant scaling factor for each y.
Naive Bayes Classifier for text classification
- For text classification: input x = document d = (w1, ..., wk)
- Use the words w_j as our features, 1 ≤ j ≤ |V|, where V is our vocabulary
- c is the output classification
- Assume that the position of each word is irrelevant and that the words are conditionally independent given class c:
  $P(w_1, w_2, \ldots, w_k \mid c) = P(w_1 \mid c) P(w_2 \mid c) \cdots P(w_k \mid c)$
- Maximum a posteriori estimate:
  $c_{MAP} = \arg\max_c P(c) P(d \mid c) = \arg\max_c \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$
Bag of words
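The figure for this slide did not survive extraction; the idea it illustrates is that a document is reduced to its word counts, with word order thrown away. A minimal sketch:

```python
from collections import Counter

doc = "the greatest screwball comedy ever filmed , simply the greatest"
bag = Counter(doc.split())
print(bag)  # Counter({'the': 2, 'greatest': 2, 'screwball': 1, 'comedy': 1, ...})
# Only the counts reach the Naive Bayes classifier; positions are discarded.
```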
Estimating probabilities
Maximum likelihood estimate:
  $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$
Smoothing:
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
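A worked example of the smoothed estimate (the numbers are invented for illustration): suppose class $c_j$ contains 14 word tokens in total, the word "great" occurs 3 times in class $c_j$, $|V| = 6$, and $\alpha = 1$. Then

$\hat{P}(\text{great} \mid c_j) = \frac{3 + 1}{14 + 6 \cdot 1} = \frac{4}{20} = 0.2$

and a word that never occurs with class $c_j$ still gets probability $1/20 = 0.05$ rather than zero, so a single unseen word no longer vetoes the whole class.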
Overall process
Input: set of labeled documents {(d_i, c_i)}, i = 1, ..., n
- Compute the vocabulary V of all words
- Calculate $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
- Calculate $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
- Prediction: given document d = (w1, ..., wk),
  $c_{MAP} = \arg\max_c \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$
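A minimal Python sketch of this training and prediction loop, using add-α smoothing and log-space scoring (as recommended in the summary slide later); the function names and toy data are illustrative, not from the original slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)                    # word_counts[c][w] = Count(w, c)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    n = len(docs)
    log_prior = {c: math.log(class_counts[c] / n) for c in class_counts}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(words, log_prior, log_lik, vocab):
    """Return the MAP class: argmax_c [ log P(c) + sum_i log P(w_i | c) ]."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy usage on the movie-review examples from the earlier slides.
train = [("unbelievably disappointing".split(), "neg"),
         ("this is the greatest screwball comedy ever filmed".split(), "pos"),
         ("it was pathetic the worst part was the boxing scenes".split(), "neg"),
         ("full of zany characters and richly applied satire".split(), "pos")]
model = train_nb(train)
print(predict_nb("a zany comedy".split(), *model))  # pos
```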
Naive Bayes Example
Tokenization
Tokenization matters: it can affect your vocabulary.
- "aren't" can come out as: aren't | arent | are n't | aren t
- Emails, URLs, phone numbers, dates, and emoticons also need care
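A small illustration of how the choice of tokenizer changes the vocabulary; the regular expressions below are just examples, not a recommended scheme.

```python
import re

text = "Aren't these reviews great? Contact user@example.com."

print(text.split())
# ["Aren't", 'these', 'reviews', 'great?', 'Contact', 'user@example.com.']
print(re.findall(r"\w+", text.lower()))
# ['aren', 't', 'these', 'reviews', 'great', 'contact', 'user', 'example', 'com']
print(re.findall(r"\w+'\w+|\w+|[^\w\s]", text.lower()))
# ["aren't", 'these', 'reviews', 'great', '?', 'contact', 'user', '@', 'example', '.', 'com', '.']
```

Each scheme yields a different vocabulary, so probabilities estimated under one tokenizer are not directly comparable to another.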
Features
- Remember: Naive Bayes can use any set of features
- Capitalization, subword features (ends with -ing), etc.
- Domain knowledge is crucial for performance
[Figure: top features for spam detection, from Alqatawna et al., IJCNSS 2015]
Evaluation
- Table of predictions (binary classification): the confusion matrix of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN)
- Ideally we want all predictions to land in the TP and TN cells
Evaluation Metrics
Accuracy = (TP + TN) / Total = 200 / 250 = 80%
Precision and Recall
[Figure: precision and recall illustration, from Wikipedia]
F-Score
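The figure on this slide did not survive extraction; for reference, these are the standard textbook definitions rather than text copied from the slide:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$

$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \;\; (\beta = 1)$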
Choosing Beta
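β sets the precision/recall trade-off: β > 1 weights recall more heavily, β < 1 weights precision more heavily, and β = 1 (F1) weights them equally. A small sketch computing these metrics from raw counts; the counts are hypothetical, not the ones behind the slides' figures.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=40, fp=10, fn=20))            # F1
print(precision_recall_f(tp=40, fp=10, fn=20, beta=2.0))  # F2, leans towards recall
```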
Aggregating scores
- We have Precision, Recall, and F1 for each class
- How do we combine them into an overall score?
- Macro-average: compute the metric for each class, then average
- Micro-average: pool predictions across all classes and evaluate jointly
Macro vs Micro average
- Macro-averaged precision: (0.5 + 0.9) / 2 = 0.7
- Micro-averaged precision: 100 / 120 = 0.83
- The micro-averaged score is dominated by the common classes
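A sketch that reproduces the numbers above; the per-class counts (class 1: TP=10, FP=10; class 2: TP=90, FP=10) are an assumption chosen to be consistent with the stated precisions, not figures given in the extracted text.

```python
# Per-class true positive and false positive counts (assumed, see note above).
classes = {"class1": {"tp": 10, "fp": 10},
           "class2": {"tp": 90, "fp": 10}}

# Macro-average: compute precision per class, then take the unweighted mean.
per_class = [c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()]
macro_precision = sum(per_class) / len(per_class)

# Micro-average: pool the counts across classes first, then compute a single precision.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro_precision = tp / (tp + fp)

print(macro_precision)  # (0.5 + 0.9) / 2 = 0.7
print(micro_precision)  # 100 / 120 = 0.833...
```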
Validation
- Choose a metric: Precision / Recall / F1
- Optimize for the metric on a Validation (aka Development) set
- Finally, evaluate on an 'unseen' Test set
- Cross-validation (see the sketch below):
  - Repeatedly sample several train-validation splits
  - Reduces bias due to sampling errors
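A minimal sketch of k-fold cross-validation; the scoring function passed in is a stand-in for whichever classifier and metric you have chosen.

```python
import random

def cross_validate(data, evaluate, k=5, seed=0):
    """k-fold cross-validation: evaluate(train, val) fits on train and returns a score on val."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]       # round-robin split into k folds
    scores = []
    for i in range(k):
        val = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(evaluate(train, val))
    return sum(scores) / k

# Toy usage: a dummy majority-class "classifier" scored by accuracy.
data = [("doc%d" % i, "pos" if i % 3 else "neg") for i in range(30)]
def majority_accuracy(train, val):
    labels = [c for _, c in train]
    majority = max(set(labels), key=labels.count)
    return sum(c == majority for _, c in val) / len(val)
print(cross_validate(data, majority_accuracy))
```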
Advantages of Naive Bayes
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- A good, dependable baseline for text classification
When to use Naive Bayes
- Small data sizes: Naive Bayes is great! Rule-based classifiers can work well too
- Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
- Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)
Failings of Naive Bayes (1)
Independence assumptions are too strong
- XOR problem: Naive Bayes cannot learn the decision boundary
- Both variables are jointly required to predict the class. Independence assumption broken!
Failings of Naive Bayes (2)
Class Imbalance
- One or more classes have more instances than others
- Data skew causes NB to prefer one class over the other
Failings of Naive Bayes (3)
Weight magnitude errors
- Classes with larger weights are preferred
- 10 documents with class=MA and "Boston" occurring once each
- 10 documents with class=CA and "San Francisco" occurring once each
- New document d: "Boston Boston Boston San Francisco San Francisco"
  P(class = CA | d) > P(class = MA | d)
Naive Bayes Summary
- Domain knowledge is crucial to selecting good features
- Handle class imbalance by re-weighting the classes
- Use log-space operations instead of multiplying probabilities:
  $c_{NB} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_i \log P(x_i \mid c_j) \right]$
- The model is now just a max over sums of weights
Log linear model
- The model classifies input x into output labels y ∈ Y
- Let there be m features, f_k(x, y) for k = 1, ..., m
- Define a parameter vector v ∈ R^m
- Each (x, y) pair is mapped to a score:
  $s(x, y) = \sum_k v_k \cdot f_k(x, y)$
- Using inner product notation: $v \cdot f(x, y) = \sum_k v_k \cdot f_k(x, y)$, so $s(x, y) = v \cdot f(x, y)$
- To get a probability from the score: renormalize!
  $\Pr(y \mid x; v) = \frac{\exp(s(x, y))}{\sum_{y' \in Y} \exp(s(x, y'))}$
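A minimal sketch of the scoring and renormalization step, with features represented as sparse dictionaries; the feature function here is a made-up example, not one from the slides.

```python
import math

def score(v, feats):
    """s(x, y) = v . f(x, y) for a sparse feature dict {feature_name: value}."""
    return sum(v.get(k, 0.0) * val for k, val in feats.items())

def prob(v, x, y, labels, feature_fn):
    """Pr(y | x; v) = exp(s(x, y)) / sum_{y'} exp(s(x, y'))."""
    scores = {yp: score(v, feature_fn(x, yp)) for yp in labels}
    m = max(scores.values())                       # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

# Made-up feature function: indicator features pairing each word of x with the label y.
def feature_fn(x, y):
    return {(w, y): 1.0 for w in x.split()}

v = {("great", "pos"): 1.5, ("terrible", "neg"): 2.0}
print(prob(v, "a great movie", "pos", ["pos", "neg"], feature_fn))  # ~0.82
```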
Log linear model
- The name 'log-linear model' comes from:
  $\log \Pr(y \mid x; v) = \underbrace{v \cdot f(x, y)}_{\text{linear term}} - \underbrace{\log \sum_{y'} \exp(v \cdot f(x, y'))}_{\text{normalization term}}$
- Once the weights v are learned, we can make predictions using these features.
- The goal: find the v that maximizes the log-likelihood L(v) of the labeled training set containing (x_i, y_i) for i = 1, ..., n:
  $L(v) = \sum_i \log \Pr(y_i \mid x_i; v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp(v \cdot f(x_i, y'))$
Log linear model
- Maximize:
  $L(v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp(v \cdot f(x_i, y'))$
- Calculate the gradient:
  $\frac{dL(v)}{dv} = \sum_i f(x_i, y_i) - \sum_i \frac{1}{\sum_{y''} \exp(v \cdot f(x_i, y''))} \sum_{y'} f(x_i, y') \exp(v \cdot f(x_i, y'))$
  $= \sum_i f(x_i, y_i) - \sum_i \sum_{y'} f(x_i, y') \frac{\exp(v \cdot f(x_i, y'))}{\sum_{y''} \exp(v \cdot f(x_i, y''))}$
  $= \underbrace{\sum_i f(x_i, y_i)}_{\text{Observed counts}} - \underbrace{\sum_i \sum_{y'} f(x_i, y') \Pr(y' \mid x_i; v)}_{\text{Expected counts}}$
Gradient ascent
- Init: v^(0) = 0
- t ← 0
- Iterate until convergence:
  - Calculate: $\Delta = \frac{dL(v)}{dv}\big|_{v = v^{(t)}}$
  - Find $\beta^* = \arg\max_\beta L(v^{(t)} + \beta \Delta)$
  - Set $v^{(t+1)} \leftarrow v^{(t)} + \beta^* \Delta$
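A minimal sketch of this optimization, using the observed-minus-expected form of the gradient derived above; a fixed step size η replaces the line search over β for simplicity, and the feature function and data are illustrative.

```python
import math
from collections import defaultdict

def feature_fn(x, y):
    """Toy indicator features: (word, label) pairs. Illustrative only."""
    return {(w, y): 1.0 for w in x.split()}

def probs(v, x, labels):
    """Pr(y | x; v) for every label y."""
    scores = {y: sum(v[k] * val for k, val in feature_fn(x, y).items()) for y in labels}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(scores[y] - m) / z for y in labels}

def gradient_ascent(data, labels, eta=0.5, iters=100):
    v = defaultdict(float)                           # v^(0) = 0
    for _ in range(iters):
        grad = defaultdict(float)
        for x, y in data:
            for k, val in feature_fn(x, y).items():
                grad[k] += val                       # observed counts
            p = probs(v, x, labels)
            for yp in labels:
                for k, val in feature_fn(x, yp).items():
                    grad[k] -= val * p[yp]           # expected counts
        for k, g in grad.items():
            v[k] += eta * g                          # fixed step instead of the line search on beta
    return v

data = [("great fun", "pos"), ("truly terrible", "neg")]
v = gradient_ascent(data, ["pos", "neg"])
print(probs(v, "great", ["pos", "neg"]))  # most of the probability mass should be on "pos"
```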
Learning the weights v: Generalized Iterative Scaling
$f^{\#} = \max_{x,y} \sum_j f_j(x, y)$ (the maximum possible total feature value; needed for scaling)

Initialize $v^{(0)}$
For each iteration t:
  expected[j] ← 0 for j = 1 .. number of features
  For i = 1 to |training data|:
    For each feature f_j and each label y':
      expected[j] += f_j(x_i, y') · P(y' | x_i; v^(t))
  For each feature f_j:
    observed[j] = Σ_i f_j(x_i, y_i)
  For each feature f_j:
    $v_j^{(t+1)} \leftarrow v_j^{(t)} \cdot \left(\frac{\mathrm{observed}[j]}{\mathrm{expected}[j]}\right)^{1/f^{\#}}$

cf. Goodman, NIPS '01
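For comparison, here is the same update written in code. The weights are updated additively in log space, v_j += (1/f#) · log(observed[j] / expected[j]), which is the f#-th-root multiplicative update above applied to exp(v_j); the feature function and data are illustrative, and features with zero observed count are simply left at zero.

```python
import math
from collections import defaultdict

def feature_fn(x, y):
    """Toy indicator features: (word, label) pairs. Illustrative only."""
    return {(w, y): 1.0 for w in x.split()}

def probs(v, x, labels):
    scores = {y: sum(v[k] * val for k, val in feature_fn(x, y).items()) for y in labels}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(scores[y] - m) / z for y in labels}

def gis(data, labels, iters=50):
    # f# = maximum total feature value over any (x, y); needed for scaling the update.
    f_sharp = max(sum(feature_fn(x, y).values()) for x, _ in data for y in labels)
    observed = defaultdict(float)
    for x, y in data:
        for k, val in feature_fn(x, y).items():
            observed[k] += val                       # observed feature counts
    v = defaultdict(float)                           # initialize v^(0) = 0
    for _ in range(iters):
        expected = defaultdict(float)
        for x, _ in data:
            p = probs(v, x, labels)
            for yp in labels:
                for k, val in feature_fn(x, yp).items():
                    expected[k] += val * p[yp]       # expected feature counts under the model
        for k in observed:
            if expected[k] > 0:
                v[k] += math.log(observed[k] / expected[k]) / f_sharp
    return v

data = [("great fun", "pos"), ("truly terrible", "neg")]
v = gis(data, ["pos", "neg"])
print(probs(v, "truly terrible", ["pos", "neg"]))  # most of the mass should be on "neg"
```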
Acknowledgements
Many slides borrowed or inspired from lecture notes by Anoop Sarkar, Danqi Chen, Karthik Narasimhan, Dan Jurafsky, Michael Collins, Chris Dyer, Kevin Knight, Chris Manning, Philipp Koehn, Adam Lopez, Graham Neubig, Richard Socher and Luke Zettlemoyer from their NLP course materials.
All mistakes are my own.
A big thank you to all the students who read through these notes and helped me improve them.