SFU NatLangLab
Natural Language Processing
Angel Xuan Chang (angelxuanchang.github.io/nlp-class)
adapted from lecture slides from Anoop Sarkar
Simon Fraser University
January 23, 2020
Part 1: Classification tasks in NLP
Classification tasks in NLP
Naive Bayes Classifier
Log linear models
Sentiment classification: Movie reviews
- neg: unbelievably disappointing
- pos: Full of zany characters and richly applied satire, and some great plot twists
- pos: this is the greatest screwball comedy ever filmed
- neg: It was pathetic. The worst part about it was the boxing scenes.
Intent Detection
- ADDR CHANGE: I just moved and want to change my address.
- ADDR CHANGE: Please help me update my address.
- FILE CLAIM: I just got into a terrible accident and I want to file a claim.
- CLOSE ACCOUNT: I'm moving and I want to disconnect my service.
Prepositional Phrases
- noun attach: I bought the shirt with pockets
- verb attach: I bought the shirt with my credit card
- noun attach: I washed the shirt with mud
- verb attach: I washed the shirt with soap
- Attachment depends on the meaning of the entire sentence; it needs world knowledge, etc.
- Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words
Ambiguity Resolution: Prepositional Phrases in English
Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
...        ...          ...   ...       ...
Prepositional Phrase Attachment

Method                              Accuracy
Always noun attachment              59.0
Most likely for each preposition    72.2
Average Human (4 head words only)   88.2
Average Human (whole sentence)      93.2
Back-off Smoothing
- Random variable a represents attachment.
- a = n1 or a = v (two-class classification)
- We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2).
- The probability of verb attachment is 1 - p(a = n1 | v, n1, p, n2).
Back-off Smoothing
1. If $f(v, n_1, p, n_2) > 0$ and $\hat{p} \neq 0.5$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p, n_2)}{f(v, n_1, p, n_2)}$
2. Else if $f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2) > 0$ and $\hat{p} \neq 0.5$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p) + f(a_{n_1}, v, p, n_2) + f(a_{n_1}, n_1, p, n_2)}{f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2)}$
3. Else if $f(v, p) + f(n_1, p) + f(p, n_2) > 0$:
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, p) + f(a_{n_1}, n_1, p) + f(a_{n_1}, p, n_2)}{f(v, p) + f(n_1, p) + f(p, n_2)}$
4. Else if $f(p) > 0$ (try choosing attachment based on the preposition alone):
   $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, p)}{f(p)}$
5. Else $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = 1.0$
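As a concrete illustration, here is a minimal Python sketch of this back-off procedure; the count tables, their tuple keys, and the function name are illustrative assumptions, not taken from Collins and Brooks.

```python
from collections import Counter

def backoff_pp_attach(v, n1, p, n2, counts, counts_n1):
    """Estimate p-hat(noun attachment | v, n1, p, n2) by backing off over tuples.

    counts[t]    = number of training examples whose head words match tuple t
    counts_n1[t] = number of those examples labelled as noun attachment
    """
    stages = [
        [(v, n1, p, n2)],                       # 1. the full quadruple
        [(v, n1, p), (v, p, n2), (n1, p, n2)],  # 2. triples containing the preposition
        [(v, p), (n1, p), (p, n2)],             # 3. pairs containing the preposition
        [(p,)],                                 # 4. the preposition alone
    ]
    for i, tuples in enumerate(stages):
        denom = sum(counts[t] for t in tuples)
        if denom > 0:
            p_hat = sum(counts_n1[t] for t in tuples) / denom
            if i >= 2 or p_hat != 0.5:          # the p-hat != 0.5 tie-break applies to stages 1-2 only
                return p_hat
    return 1.0                                  # 5. default to noun attachment

# Toy usage: a single training example ("bought shirt with pockets", noun attachment).
counts, counts_n1 = Counter(), Counter()
for t in [("bought", "shirt", "with", "pockets"), ("bought", "shirt", "with"),
          ("bought", "with", "pockets"), ("shirt", "with", "pockets"),
          ("bought", "with"), ("shirt", "with"), ("with", "pockets"), ("with",)]:
    counts[t] += 1
    counts_n1[t] += 1
print(backoff_pp_attach("bought", "shirt", "with", "pockets", counts, counts_n1))  # 1.0
```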
Prepositional Phrase Attachment: Results
- Collins and Brooks 1995: 84.5% accuracy, with the use of some limited word classes for dates, numbers, etc.
- Toutanova, Manning, and Ng, 2004: use a sophisticated smoothing model for PP attachment; 86.18% with words & stems, 87.54% with word classes
- Merlo, Crocker and Berthouzoz, 1997: test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PP: 69.6%, 3 PP: 43.6%
Part 2: Probabilistic Classifiers
(adapted from lecture slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan)
Classification Task
- Input:
  - A document d
  - A set of classes C = {c1, c2, ..., cm}
- Output: predicted class c for document d
- Example:
  - neg: unbelievably disappointing
  - pos: this is the greatest screwball comedy ever filmed
Supervised learning: Let's use statistics!
- Inputs:
  - Set of m classes C = {c1, c2, ..., cm}
  - Set of n labeled documents: {(d1, c1), (d2, c2), ..., (dn, cn)}
- Output: trained classifier F : d → c
  - What form should F take?
  - How to learn F?
Types of supervised classifiers
Naive Bayes Classifier
- x is the input, represented as d independent features f_j, 1 ≤ j ≤ d
- y is the output classification
- $P(y \mid x) = \frac{P(y) \cdot P(x \mid y)}{P(x)}$ (Bayes Rule)
- $P(x \mid y) = \prod_{j=1}^{d} P(f_j \mid y)$
- $P(y \mid x) \propto P(y) \cdot \prod_{j=1}^{d} P(f_j \mid y)$
- We can ignore P(x) in the above equation because it is a constant scaling factor for each y.
Naive Bayes Classifier for text classification
- For text classification: input x = document d = (w1, ..., wk)
- Use the words w_j as our features, 1 ≤ j ≤ |V|, where V is our vocabulary
- c is the output classification
- Assume that the position of each word is irrelevant and that the words are conditionally independent given class c:
  $P(w_1, w_2, \ldots, w_k \mid c) = P(w_1 \mid c) P(w_2 \mid c) \cdots P(w_k \mid c)$
- Maximum a posteriori estimate:
  $c_{MAP} = \arg\max_c P(c) P(d \mid c) = \arg\max_c \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$
Bag of words
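The figure for this slide did not survive extraction; the idea it illustrates is that a document is reduced to its word counts, with word order thrown away. A minimal sketch:

```python
from collections import Counter

doc = "the greatest screwball comedy ever filmed , simply the greatest"
bag = Counter(doc.split())
print(bag)  # Counter({'the': 2, 'greatest': 2, 'screwball': 1, 'comedy': 1, ...})
# Only the counts reach the Naive Bayes classifier; positions are discarded.
```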
Estimating probabilities
Maximum likelihood estimate:
  $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$
Smoothing:
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
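A worked example of the smoothed estimate (the numbers are invented for illustration): suppose class $c_j$ contains 14 word tokens in total, the word "great" occurs 3 times in class $c_j$, $|V| = 6$, and $\alpha = 1$. Then

$\hat{P}(\text{great} \mid c_j) = \frac{3 + 1}{14 + 6 \cdot 1} = \frac{4}{20} = 0.2$

and a word that never occurs with class $c_j$ still gets probability $1/20 = 0.05$ rather than zero, so a single unseen word no longer vetoes the whole class.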
Overall process
Input: set of labeled documents {(d_i, c_i)}, i = 1, ..., n
- Compute the vocabulary V of all words
- Calculate $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
- Calculate $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
- Prediction: given document d = (w1, ..., wk),
  $c_{MAP} = \arg\max_c \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$
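A minimal Python sketch of this training and prediction loop, using add-α smoothing and log-space scoring (as recommended in the summary slide later); the function names and toy data are illustrative, not from the original slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)                    # word_counts[c][w] = Count(w, c)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    n = len(docs)
    log_prior = {c: math.log(class_counts[c] / n) for c in class_counts}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(words, log_prior, log_lik, vocab):
    """Return the MAP class: argmax_c [ log P(c) + sum_i log P(w_i | c) ]."""
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in words if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy usage on the movie-review examples from the earlier slides.
train = [("unbelievably disappointing".split(), "neg"),
         ("this is the greatest screwball comedy ever filmed".split(), "pos"),
         ("it was pathetic the worst part was the boxing scenes".split(), "neg"),
         ("full of zany characters and richly applied satire".split(), "pos")]
model = train_nb(train)
print(predict_nb("a zany comedy".split(), *model))  # pos
```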
Naive Bayes Example
Tokenization
Tokenization matters: it can affect your vocabulary.
- "aren't" can come out as: aren't | arent | are n't | aren t
- Emails, URLs, phone numbers, dates, and emoticons also need care
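A small illustration of how the choice of tokenizer changes the vocabulary; the regular expressions below are just examples, not a recommended scheme.

```python
import re

text = "Aren't these reviews great? Contact user@example.com."

print(text.split())
# ["Aren't", 'these', 'reviews', 'great?', 'Contact', 'user@example.com.']
print(re.findall(r"\w+", text.lower()))
# ['aren', 't', 'these', 'reviews', 'great', 'contact', 'user', 'example', 'com']
print(re.findall(r"\w+'\w+|\w+|[^\w\s]", text.lower()))
# ["aren't", 'these', 'reviews', 'great', '?', 'contact', 'user', '@', 'example', '.', 'com', '.']
```

Each scheme yields a different vocabulary, so probabilities estimated under one tokenizer are not directly comparable to another.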
Features
- Remember: Naive Bayes can use any set of features
- Capitalization, subword features (ends with -ing), etc.
- Domain knowledge is crucial for performance
[Figure: top features for spam detection, from Alqatawna et al., IJCNSS 2015]
Evaluation
- Table of predictions (binary classification): the confusion matrix of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN)
- Ideally we want all predictions to land in the TP and TN cells
Evaluation Metrics
Accuracy = (TP + TN) / Total = 200 / 250 = 80%
Precision and Recall
[Figure: precision and recall illustration, from Wikipedia]
F-Score
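The figure on this slide did not survive extraction; for reference, these are the standard textbook definitions rather than text copied from the slide:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$

$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \;\; (\beta = 1)$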
Choosing Beta
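β sets the precision/recall trade-off: β > 1 weights recall more heavily, β < 1 weights precision more heavily, and β = 1 (F1) weights them equally. A small sketch computing these metrics from raw counts; the counts are hypothetical, not the ones behind the slides' figures.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(precision_recall_f(tp=40, fp=10, fn=20))            # F1
print(precision_recall_f(tp=40, fp=10, fn=20, beta=2.0))  # F2, leans towards recall
```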
Aggregating scores
- We have Precision, Recall, and F1 for each class
- How do we combine them into an overall score?
- Macro-average: compute the metric for each class, then average
- Micro-average: pool predictions across all classes and evaluate jointly
Macro vs Micro average
- Macro-averaged precision: (0.5 + 0.9) / 2 = 0.7
- Micro-averaged precision: 100 / 120 = 0.83
- The micro-averaged score is dominated by the common classes
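A sketch that reproduces the numbers above; the per-class counts (class 1: TP=10, FP=10; class 2: TP=90, FP=10) are an assumption chosen to be consistent with the stated precisions, not figures given in the extracted text.

```python
# Per-class true positive and false positive counts (assumed, see note above).
classes = {"class1": {"tp": 10, "fp": 10},
           "class2": {"tp": 90, "fp": 10}}

# Macro-average: compute precision per class, then take the unweighted mean.
per_class = [c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()]
macro_precision = sum(per_class) / len(per_class)

# Micro-average: pool the counts across classes first, then compute a single precision.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro_precision = tp / (tp + fp)

print(macro_precision)  # (0.5 + 0.9) / 2 = 0.7
print(micro_precision)  # 100 / 120 = 0.833...
```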
Validation
- Choose a metric: Precision / Recall / F1
- Optimize for the metric on a Validation (aka Development) set
- Finally, evaluate on an 'unseen' Test set
- Cross-validation (see the sketch below):
  - Repeatedly sample several train-validation splits
  - Reduces bias due to sampling errors
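A minimal sketch of k-fold cross-validation; the scoring function passed in is a stand-in for whichever classifier and metric you have chosen.

```python
import random

def cross_validate(data, evaluate, k=5, seed=0):
    """k-fold cross-validation: evaluate(train, val) fits on train and returns a score on val."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]       # round-robin split into k folds
    scores = []
    for i in range(k):
        val = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(evaluate(train, val))
    return sum(scores) / k

# Toy usage: a dummy majority-class "classifier" scored by accuracy.
data = [("doc%d" % i, "pos" if i % 3 else "neg") for i in range(30)]
def majority_accuracy(train, val):
    labels = [c for _, c in train]
    majority = max(set(labels), key=labels.count)
    return sum(c == majority for _, c in val) / len(val)
print(cross_validate(data, majority_accuracy))
```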
Advantages of Naive Bayes
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- A good, dependable baseline for text classification
When to use Naive Bayes
- Small data sizes: Naive Bayes is great! Rule-based classifiers can work well too
- Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
- Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)
Failings of Naive Bayes (1)
Independence assumptions are too strong
- XOR problem: Naive Bayes cannot learn the decision boundary
- Both variables are jointly required to predict the class. Independence assumption broken!
Failings of Naive Bayes (2)
Class Imbalance
- One or more classes have more instances than others
- Data skew causes NB to prefer one class over the other
Failings of Naive Bayes (3)
Weight magnitude errors
- Classes with larger weights are preferred
- 10 documents with class=MA and "Boston" occurring once each
- 10 documents with class=CA and "San Francisco" occurring once each
- New document d: "Boston Boston Boston San Francisco San Francisco"
  P(class = CA | d) > P(class = MA | d)
Naive Bayes Summary
- Domain knowledge is crucial to selecting good features
- Handle class imbalance by re-weighting the classes
- Use log-space operations instead of multiplying probabilities:
  $c_{NB} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_i \log P(x_i \mid c_j) \right]$
- The model is now just a max over sums of weights
Log linear model
- The model classifies input x into output labels y ∈ Y
- Let there be m features, f_k(x, y) for k = 1, ..., m
- Define a parameter vector v ∈ R^m
- Each (x, y) pair is mapped to a score:
  $s(x, y) = \sum_k v_k \cdot f_k(x, y)$
- Using inner product notation: $v \cdot f(x, y) = \sum_k v_k \cdot f_k(x, y)$, so $s(x, y) = v \cdot f(x, y)$
- To get a probability from the score: renormalize!
  $\Pr(y \mid x; v) = \frac{\exp(s(x, y))}{\sum_{y' \in Y} \exp(s(x, y'))}$
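A minimal sketch of the scoring and renormalization step, with features represented as sparse dictionaries; the feature function here is a made-up example, not one from the slides.

```python
import math

def score(v, feats):
    """s(x, y) = v . f(x, y) for a sparse feature dict {feature_name: value}."""
    return sum(v.get(k, 0.0) * val for k, val in feats.items())

def prob(v, x, y, labels, feature_fn):
    """Pr(y | x; v) = exp(s(x, y)) / sum_{y'} exp(s(x, y'))."""
    scores = {yp: score(v, feature_fn(x, yp)) for yp in labels}
    m = max(scores.values())                       # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

# Made-up feature function: indicator features pairing each word of x with the label y.
def feature_fn(x, y):
    return {(w, y): 1.0 for w in x.split()}

v = {("great", "pos"): 1.5, ("terrible", "neg"): 2.0}
print(prob(v, "a great movie", "pos", ["pos", "neg"], feature_fn))  # ~0.82
```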
Log linear model
- The name 'log-linear model' comes from:
  $\log \Pr(y \mid x; v) = \underbrace{v \cdot f(x, y)}_{\text{linear term}} - \underbrace{\log \sum_{y'} \exp(v \cdot f(x, y'))}_{\text{normalization term}}$
- Once the weights v are learned, we can make predictions using these features.
- The goal: find the v that maximizes the log-likelihood L(v) of the labeled training set containing (x_i, y_i) for i = 1, ..., n:
  $L(v) = \sum_i \log \Pr(y_i \mid x_i; v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp(v \cdot f(x_i, y'))$
Log linear model
- Maximize:
  $L(v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp(v \cdot f(x_i, y'))$
- Calculate the gradient:
  $\frac{dL(v)}{dv} = \sum_i f(x_i, y_i) - \sum_i \frac{1}{\sum_{y''} \exp(v \cdot f(x_i, y''))} \sum_{y'} f(x_i, y') \exp(v \cdot f(x_i, y'))$
  $= \sum_i f(x_i, y_i) - \sum_i \sum_{y'} f(x_i, y') \frac{\exp(v \cdot f(x_i, y'))}{\sum_{y''} \exp(v \cdot f(x_i, y''))}$
  $= \underbrace{\sum_i f(x_i, y_i)}_{\text{Observed counts}} - \underbrace{\sum_i \sum_{y'} f(x_i, y') \Pr(y' \mid x_i; v)}_{\text{Expected counts}}$
Gradient ascent
- Init: v^(0) = 0
- t ← 0
- Iterate until convergence:
  - Calculate: $\Delta = \frac{dL(v)}{dv}\big|_{v = v^{(t)}}$
  - Find $\beta^* = \arg\max_\beta L(v^{(t)} + \beta \Delta)$
  - Set $v^{(t+1)} \leftarrow v^{(t)} + \beta^* \Delta$
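A minimal sketch of this optimization, using the observed-minus-expected form of the gradient derived above; a fixed step size η replaces the line search over β for simplicity, and the feature function and data are illustrative.

```python
import math
from collections import defaultdict

def feature_fn(x, y):
    """Toy indicator features: (word, label) pairs. Illustrative only."""
    return {(w, y): 1.0 for w in x.split()}

def probs(v, x, labels):
    """Pr(y | x; v) for every label y."""
    scores = {y: sum(v[k] * val for k, val in feature_fn(x, y).items()) for y in labels}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(scores[y] - m) / z for y in labels}

def gradient_ascent(data, labels, eta=0.5, iters=100):
    v = defaultdict(float)                           # v^(0) = 0
    for _ in range(iters):
        grad = defaultdict(float)
        for x, y in data:
            for k, val in feature_fn(x, y).items():
                grad[k] += val                       # observed counts
            p = probs(v, x, labels)
            for yp in labels:
                for k, val in feature_fn(x, yp).items():
                    grad[k] -= val * p[yp]           # expected counts
        for k, g in grad.items():
            v[k] += eta * g                          # fixed step instead of the line search on beta
    return v

data = [("great fun", "pos"), ("truly terrible", "neg")]
v = gradient_ascent(data, ["pos", "neg"])
print(probs(v, "great", ["pos", "neg"]))  # most of the probability mass should be on "pos"
```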
Learning the weights v: Generalized Iterative Scaling
$f^{\#} = \max_{x,y} \sum_j f_j(x, y)$ (the maximum possible total feature value; needed for scaling)

Initialize $v^{(0)}$
For each iteration t:
  expected[j] ← 0 for j = 1 .. number of features
  For i = 1 to |training data|:
    For each feature f_j and each label y':
      expected[j] += f_j(x_i, y') · P(y' | x_i; v^(t))
  For each feature f_j:
    observed[j] = Σ_i f_j(x_i, y_i)
  For each feature f_j:
    $v_j^{(t+1)} \leftarrow v_j^{(t)} \cdot \left(\frac{\mathrm{observed}[j]}{\mathrm{expected}[j]}\right)^{1/f^{\#}}$

cf. Goodman, NIPS '01
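For comparison, here is the same update written in code. The weights are updated additively in log space, v_j += (1/f#) · log(observed[j] / expected[j]), which is the f#-th-root multiplicative update above applied to exp(v_j); the feature function and data are illustrative, and features with zero observed count are simply left at zero.

```python
import math
from collections import defaultdict

def feature_fn(x, y):
    """Toy indicator features: (word, label) pairs. Illustrative only."""
    return {(w, y): 1.0 for w in x.split()}

def probs(v, x, labels):
    scores = {y: sum(v[k] * val for k, val in feature_fn(x, y).items()) for y in labels}
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {y: math.exp(scores[y] - m) / z for y in labels}

def gis(data, labels, iters=50):
    # f# = maximum total feature value over any (x, y); needed for scaling the update.
    f_sharp = max(sum(feature_fn(x, y).values()) for x, _ in data for y in labels)
    observed = defaultdict(float)
    for x, y in data:
        for k, val in feature_fn(x, y).items():
            observed[k] += val                       # observed feature counts
    v = defaultdict(float)                           # initialize v^(0) = 0
    for _ in range(iters):
        expected = defaultdict(float)
        for x, _ in data:
            p = probs(v, x, labels)
            for yp in labels:
                for k, val in feature_fn(x, yp).items():
                    expected[k] += val * p[yp]       # expected feature counts under the model
        for k in observed:
            if expected[k] > 0:
                v[k] += math.log(observed[k] / expected[k]) / f_sharp
    return v

data = [("great fun", "pos"), ("truly terrible", "neg")]
v = gis(data, ["pos", "neg"])
print(probs(v, "truly terrible", ["pos", "neg"]))  # most of the mass should be on "neg"
```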
Acknowledgements
Many slides borrowed or inspired from lecture notes by Anoop Sarkar, Danqi Chen, Karthik Narasimhan, Dan Jurafsky, Michael Collins, Chris Dyer, Kevin Knight, Chris Manning, Philipp Koehn, Adam Lopez, Graham Neubig, Richard Socher and Luke Zettlemoyer from their NLP course materials.
All mistakes are my own.
A big thank you to all the students who read through these notes and helped me improve them.