8/4/2019 Collins - Statistical Methods in Natural Language Processing (Slides)
http://slidepdf.com/reader/full/collins-statistical-methods-in-natural-language-processing-slides 1/96
Statistical Methods in Natural Language Processing
Michael Collins, AT&T Labs-Research
Overview

Some NLP problems:

- Information extraction (named entities, relationships between entities, etc.)

- Finding linguistic structure: part-of-speech tagging, "chunking", parsing

Techniques:

- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs); PCFGs with enriched non-terminals

- Discriminative methods: conditional MRFs, perceptron algorithms, kernel methods
Some NLP Problems
- Information extraction
  - Named entities
  - Relationships between entities
  - More complex relationships

- Finding linguistic structure
  - Part-of-speech tagging
  - "Chunking" (low-level syntactic structure)
  - Parsing

- Machine translation
Common Themes
- Need to learn a mapping from one discrete structure to another
  - Strings to hidden state sequences: named-entity extraction, part-of-speech tagging
  - Strings to strings: machine translation
  - Strings to underlying trees: parsing
  - Strings to relational data structures: information extraction

- Speech recognition is similar (and shares many techniques)
Two Fundamental Problems
TAGGING: Strings to Tagged Sequences
a b e e a f h j  →  a/C b/D e/C e/C a/D f/C h/D j/C

PARSING: Strings to Trees

d e f g  →  (A (B (D d) (E e)) (C (F f) (G g)))
Information Extraction: Named Entities
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits soared at [Company Boeing Co.], easily topping forecasts on
[Location Wall Street], as their CEO [Person Alan Mulally] announced
first quarter results.
Information Extraction: Relationships between Entities
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

OUTPUT:

Relationship = Company-Location
Company = Boeing
Location = Seattle

Relationship = Employer-Employee
Employer = Boeing Co.
Employee = Alan Mulally
Information Extraction: More Complex Relationships
INPUT: Alan Mulally resigned as Boeing CEO yesterday. He will be succeeded by Jane Swift, who was previously the president at Rolls Royce.
OUTPUT:
Relationship = Management Succession
Company = Boeing Co.
Role = CEO
Out = Alan Mulally
In = Jane Swift
Relationship = Management Succession
Company = Rolls Royce
Role = president
Out = Jane Swift
Part-of-Speech Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
“Chunking” (Low-level syntactic structure)
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
[NP Profits] soared at [NP Boeing Co.], easily topping [NP forecasts]
on [NP Wall Street], as [NP their CEO Alan Mulally] announced
[NP first quarter results].

[NP ...] = non-recursive noun phrase
Chunking as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/S soared/N at/N Boeing/S Co./C ,/N easily/N topping/N
forecasts/S on/N Wall/S Street/C ,/N as/N their/S CEO/C Alan/C
Mulally/C announced/N first/S quarter/C results/C ./N
N = Not part of noun-phrase
S = Start noun-phrase
C = Continue noun-phrase
Named Entity Extraction as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA
their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA
quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
SP = Start Person
CP = Continue Person
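The Start/Continue tags above can be decoded back into typed entity spans with a short routine. A minimal sketch (the tag names follow the slide; the decoding routine itself is not from the slides):

```python
# Decode the Start/Continue named-entity encoding:
# SC/CC = start/continue Company, SL/CL = Location, SP/CP = Person, NA = no entity.
STARTS = {"SC": "Company", "SL": "Location", "SP": "Person"}
CONTINUES = {"CC": "Company", "CL": "Location", "CP": "Person"}

def decode_entities(words, tags):
    """Return (entity_type, phrase) pairs from a tagged sentence."""
    entities, current = [], None          # current = (type, [words])
    for word, tag in zip(words, tags):
        if tag in STARTS:
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = (STARTS[tag], [word])
        elif tag in CONTINUES and current and CONTINUES[tag] == current[0]:
            current[1].append(word)
        else:                              # NA or an inconsistent Continue tag
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = None
    if current:
        entities.append((current[0], " ".join(current[1])))
    return entities

words = ["Profits", "soared", "at", "Boeing", "Co.", ",", "on", "Wall", "Street"]
tags  = ["NA", "NA", "NA", "SC", "CC", "NA", "NA", "SL", "CL"]
print(decode_entities(words, tags))
# → [('Company', 'Boeing Co.'), ('Location', 'Wall Street')]
```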
Parsing (Syntactic Structure)
INPUT: Boeing is located in Seattle.
OUTPUT:

(S (NP (N Boeing))
   (VP (V is)
       (VP (V located)
           (PP (P in)
               (NP (N Seattle))))))
Machine Translation
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.
OUTPUT:
Boeing ist in Seattle. Alan Mulally ist der CEO.
Summary
Problem                          Well-Studied?   Class of Learning Problem
Named entity extraction          Yes             Tagging
Relationships between entities   A little        Parsing
More complex relationships       No              ??
Part-of-speech tagging           Yes             Tagging
Chunking                         Yes             Tagging
Syntactic structure              Yes             Parsing
Machine translation              Yes             ??
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Log-Linear Taggers: Notation
- Set of possible words = V, possible tags = T

- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

- Training data is N tagged sentences, where the i'th sentence is of
  length n_i: (w_i[1:n_i], t_i[1:n_i]) for i = 1 ... N
Log-Linear Taggers: Independence Assumptions
- The basic idea:

  P(t[1:n] | w[1:n])
    = ∏_{j=1..n} P(t_j | t_1 ... t_{j-1}, w[1:n])     (Chain rule)
    = ∏_{j=1..n} P(t_j | t_{j-1}, t_{j-2}, w[1:n])    (Independence assumptions)

- Two questions:

  1. How to parameterize P(t_j | t_{j-1}, t_{j-2}, w[1:n])?

  2. How to find argmax_{t[1:n]} P(t[1:n] | w[1:n])?
The Parameterization Problem
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- There are many possible tags in the position ??

- Need to learn a function from (context, tag) pairs to a probability
  P(tag | context)
Representation: Histories
- A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} are the previous two tags.

- w[1:n] are the n words in the input sentence.

- i is the index of the word being tagged.
Representation: Histories
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- History = ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} = DT, JJ

- w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

- i = 6
Feature–Vector Representations
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features representing tagging decision t
  in context h. For example:

  φ_1000(h, t) = 1 if current word w_i is "base" and t = VB
                 0 otherwise

  φ_1001(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                 0 otherwise
Representation: Histories
- A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} are the previous two tags.

- w[1:n] are the n words in the input sentence.

- i is the index of the word being tagged.

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- t_{-1}, t_{-2} = DT, JJ

- w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

- i = 6
Feature–Vector Representations
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features representing tagging decision t
  in context h.

Example: POS Tagging [Ratnaparkhi 96]

- Word/tag features

  φ_100(h, t) = 1 if current word w_i is "base" and t = VB
                0 otherwise

  φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG
                0 otherwise

- Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                0 otherwise
Part-of-Speech (POS) Tagging [Ratnaparkhi 96]
- Word/tag features

  φ_100(h, t) = 1 if current word w_i is "base" and t = VB
                0 otherwise

- Spelling features

  φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG
                0 otherwise

  φ_102(h, t) = 1 if current word w_i starts with "pre" and t = NN
                0 otherwise
Ratnaparkhi’s POS Tagger
- Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                0 otherwise

  φ_104(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, VB⟩
                0 otherwise

  φ_105(h, t) = 1 if ⟨t⟩ = ⟨VB⟩
                0 otherwise

  φ_106(h, t) = 1 if previous word w_{i-1} = "the" and t = VB
                0 otherwise

  φ_107(h, t) = 1 if next word w_{i+1} = "the" and t = VB
                0 otherwise
Log-Linear (Maximum-Entropy) Models
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features

- W_s for s = 1 ... d are parameters

- The parameters define a conditional distribution

  P(t | h) = exp( Σ_s W_s φ_s(h, t) ) / Z(h, W)

  where Z(h, W) = Σ_{t' ∈ T} exp( Σ_s W_s φ_s(h, t') )
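The conditional distribution above is a softmax over the tag set. A minimal sketch, with a toy feature map and weight vector (both invented for illustration):

```python
import math

# P(t | h) = exp(sum_s W_s * phi_s(h, t)) / Z(h, W).
# phi(h, t) is assumed to return a dict of active feature names -> values.
def conditional(h, phi, W, tagset):
    scores = {t: sum(W.get(s, 0.0) * v for s, v in phi(h, t).items())
              for t in tagset}
    Z = sum(math.exp(s) for s in scores.values())   # normalisation Z(h, W)
    return {t: math.exp(s) / Z for t, s in scores.items()}

# Toy instance: one indicator feature per (word, tag) pair.
phi = lambda h, t: {("word=" + h, t): 1.0}
W = {("word=base", "VB"): 2.0}                      # unseen features weigh 0
p = conditional("base", phi, W, ["VB", "NN"])
print(p["VB"] > p["NN"])  # → True
```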
Log-Linear (Maximum Entropy) Models
- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

- Histories h_j = ⟨t_{j-1}, t_{j-2}, w[1:n], j⟩

  log P(t[1:n] | w[1:n])
    = Σ_{j=1..n} log P(t_j | h_j)
    = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)    (Linear Score)
      − Σ_{j=1..n} log Z(h_j, W)          (Local Normalization Terms)
Log-Linear Models
- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

  log P(t[1:n] | w[1:n])
    = Σ_{j=1..n} log P(t_j | h_j)
    = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j) − Σ_{j=1..n} log Z(h_j, W)

  where h_j = ⟨t_{j-2}, t_{j-1}, w[1:n], j⟩
Log-Linear Models
- Parameter estimation:
  Maximize the likelihood of the training data through gradient descent
  or iterative scaling

- Search for argmax_{t[1:n]} P(t[1:n] | w[1:n]):
  Dynamic programming, O(n|T|^3) complexity

- Experimental results:

  - Almost 97% accuracy for POS tagging [Ratnaparkhi 96]
  - Over 90% accuracy for named-entity extraction [Borthwick et al. 98]
  - Around 93% precision/recall for NP chunking
  - Better results than an HMM for FAQ segmentation [McCallum et al. 2000]
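The dynamic-programming search mentioned above is Viterbi decoding over tag trigrams. A sketch, assuming a caller-supplied local log-probability function; the toy scores at the bottom are invented for illustration:

```python
# Viterbi search for argmax_{t[1:n]} P(t[1:n] | w[1:n]) under the trigram
# decomposition.  log_p(t2, t1, t, words, j) is assumed to return
# log P(t_j = t | t_{j-2} = t2, t_{j-1} = t1, w[1:n]).  Runs in O(n |T|^3).
def viterbi(words, tagset, log_p):
    n = len(words)
    pi = {("*", "*"): 0.0}            # best log-prob for each final tag bigram
    back = []
    for j in range(n):
        new_pi, bp = {}, {}
        for (t2, t1), score in pi.items():
            for t in tagset:
                s = score + log_p(t2, t1, t, words, j)
                if s > new_pi.get((t1, t), float("-inf")):
                    new_pi[(t1, t)], bp[(t1, t)] = s, t2
        pi = new_pi
        back.append(bp)
    t1, t = max(pi, key=pi.get)       # best final tag bigram
    tags = [t1, t]
    for j in range(n - 1, 1, -1):     # follow back-pointers
        tags.insert(0, back[j][(tags[0], tags[1])])
    return tags[-n:]

# Toy scores: reward D on "the", N on "man", V on "sleeps".
log_p = lambda t2, t1, t, words, j: (
    0.0 if (words[j], t) in {("the", "D"), ("man", "N"), ("sleeps", "V")}
    else -5.0)
print(viterbi(["the", "man", "sleeps"], ["D", "N", "V"], log_p))  # → ['D', 'N', 'V']
```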
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Data for Parsing Experiments
- Penn WSJ Treebank = 50,000 sentences with associated trees

- Usual set-up: 40,000 training sentences, 2,400 test sentences

An example tree, for the sentence:

  Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its
  natural gas and electric utility businesses in Alberta , where the company
  serves about 800,000 customers .

  [Penn Treebank parse tree for this sentence, with part-of-speech tags and
  phrase labels NP, VP, PP, QP, ADVP, WHADVP, SBAR, S, TOP]
The Information Conveyed by Parse Trees
1) Part of speech for each word
(N = noun, V = verb, D = determiner)
   (S (NP (D the) (N burglar))
      (VP (V robbed)
          (NP (D the) (N apartment))))
2) Phrases

   (S (NP (DT the) (N burglar))
      (VP (V robbed)
          (NP (DT the) (N apartment))))

Noun Phrases (NP): "the burglar", "the apartment"
Verb Phrases (VP): "robbed the apartment"
Sentences (S): "the burglar robbed the apartment"
3) Useful Relationships

   (S (NP subject) (VP (V verb) ...))

   (S (NP (DT the) (N burglar))
      (VP (V robbed)
          (NP (DT the) (N apartment))))

   ⇒ "the burglar" is the subject of "robbed"
An Example Application: Machine Translation
- English word order is subject – verb – object

- Japanese word order is subject – object – verb

  English:  IBM bought Lotus
  Japanese: IBM Lotus bought

  English:  Sources said that IBM bought Lotus yesterday
  Japanese: Sources yesterday IBM Lotus bought that said
Context-Free Grammars
[Hopcroft and Ullman 1979]

A context-free grammar is a 4-tuple G = (N, Σ, R, S) where:

- N is a set of non-terminal symbols

- Σ is a set of terminal symbols

- R is a set of rules of the form X → Y_1 Y_2 ... Y_n
  for n ≥ 0, X ∈ N, Y_i ∈ (N ∪ Σ)

- S ∈ N is a distinguished start symbol
A Context-Free Grammar for English
N = {S, NP, VP, PP, D, Vi, Vt, N, P}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}

R:  S  → NP VP       Vi → sleeps
    VP → Vi          Vt → saw
    VP → Vt NP       N  → man
    VP → VP PP       N  → woman
    NP → D N         N  → telescope
    NP → NP PP       D  → the
    PP → P NP        P  → with
                     P  → in

Note: S = sentence, VP = verb phrase, NP = noun phrase, PP = prepositional
phrase, D = determiner, Vi = intransitive verb, Vt = transitive verb,
N = noun, P = preposition

Left-Most Derivations
A left-most derivation is a sequence of strings s_1 ... s_n, where

- s_1 = S, the start symbol

- s_n ∈ Σ*, i.e. s_n is made up of terminal symbols only

- Each s_i for i = 2 ... n is derived from s_{i-1} by picking the
  left-most non-terminal X in s_{i-1} and replacing it by some β,
  where X → β is a rule in R

For example: [S], [NP VP], [D N VP], [the N VP], [the man VP],
[the man Vi], [the man sleeps]

Representation of a derivation as a tree:

  (S (NP (D the) (N man))
     (VP (Vi sleeps)))
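The derivation steps above can be replayed mechanically. A sketch that rewrites the left-most non-terminal at each step, using the toy grammar from the previous slide (the rule choices are fixed by hand to reproduce the slide's example):

```python
# A left-most derivation under the toy English grammar.
NONTERMS = {"S", "NP", "VP", "PP", "D", "Vi", "Vt", "N", "P"}

def expand_leftmost(s, rule):
    """Replace the left-most non-terminal in s with the rule's right-hand side."""
    lhs, rhs = rule
    i = next(i for i, sym in enumerate(s) if sym in NONTERMS)
    assert s[i] == lhs, "rule must rewrite the left-most non-terminal"
    return s[:i] + rhs + s[i + 1:]

steps = [["S"]]
for rule in [("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("D", ["the"]),
             ("N", ["man"]), ("VP", ["Vi"]), ("Vi", ["sleeps"])]:
    steps.append(expand_leftmost(steps[-1], rule))
print(steps[-1])  # → ['the', 'man', 'sleeps']
```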
Notation
Notation
- We use T_G to denote the set of all left-most derivations (trees)
  allowed by a grammar G.

- We use T_G(x), for a string x ∈ Σ*, to denote the set of all
  derivations whose final string ("yield") is x.
The Problem with Parsing: Ambiguity
INPUT: She announced a program to promote safety in trucks and vans

  ↓

POSSIBLE OUTPUTS: [multiple candidate parse trees] ... and there are more ...
An Example Tree
Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its
natural gas and electric utility businesses in Alberta , where the
company serves about 800,000 customers .

  [Penn Treebank parse tree for this sentence, as shown earlier]
A Probabilistic Context-Free Grammar
S  → NP VP      1.0

VP → Vi         0.4
VP → Vt NP      0.4
VP → VP PP      0.2

NP → D N        0.3
NP → NP PP      0.7

PP → P NP       1.0

Vi → sleeps     1.0
Vt → saw        1.0
N  → man        0.7
N  → woman      0.2
N  → telescope  0.1
D  → the        1.0
P  → with       0.5
P  → in         0.5

- Probability of a tree with rules α_i → β_i is ∏_i P(α_i → β_i | α_i)

- Maximum Likelihood estimation:

  P(VP → V NP | VP) = Count(VP → V NP) / Count(VP)
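Both formulas above, the tree probability as a product over rules and the maximum-likelihood rule estimates, can be sketched in a few lines. The nested-tuple tree encoding is an assumption of this sketch, not the tutorial's:

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); a leaf is a bare string.
def rules(tree):
    """Yield (lhs, rhs-tuple) for every rule used in the tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        yield from rules(c)

def mle(trees):
    """Maximum-likelihood rule probabilities: Count(rule) / Count(lhs)."""
    rule_count, lhs_count = Counter(), Counter()
    for t in trees:
        for lhs, rhs in rules(t):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {r: c / lhs_count[r[0]] for r, c in rule_count.items()}

def tree_prob(tree, q):
    """Probability of a tree = product of its rule probabilities."""
    p = 1.0
    for r in rules(tree):
        p *= q[r]
    return p

t = ("S", ("NP", ("D", "the"), ("N", "man")), ("VP", ("Vi", "sleeps")))
q = mle([t])
print(tree_prob(t, q))  # → 1.0 (every rule has count 1 out of 1)
```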
PCFGs
[Booth and Thompson 73] showed that a CFG with rule probabilities
correctly defines a distribution over the set of derivations, provided
that:

1. The rule probabilities define conditional distributions over the
   different ways of rewriting each non-terminal.

2. A technical condition on the rule probabilities ensures that the
   probability of the derivation terminating in a finite number of
   steps is 1. (This condition is not really a practical concern.)
(TOP (S (NP (N IBM))
        (VP (V bought)
            (NP (N Lotus)))))

PROB = P(TOP → S)
     × P(S → NP VP | S)  × P(NP → N | NP)  × P(N → IBM | N)
     × P(VP → V NP | VP) × P(V → bought | V)
     × P(NP → N | NP)    × P(N → Lotus | N)
The SPATTER Parser: (Magerman 95;Jelinek et al 94)
- For each rule, identify the "head" child (the child that contributes
  its word to the parent):

  S  → NP VP
  VP → V NP
  NP → DT N

- Add the head word to each non-terminal:

  (S(questioned)
     (NP(lawyer) (DT the) (N lawyer))
     (VP(questioned) (V questioned)
        (NP(witness) (DT the) (N witness))))
A Lexicalized PCFG
S(questioned)  → NP(lawyer) VP(questioned)   ??
VP(questioned) → V(questioned) NP(witness)   ??
NP(lawyer)     → D(the) N(lawyer)            ??
NP(witness)    → D(the) N(witness)           ??

- The big question: how to estimate the rule probabilities??
CHARNIAK (1997)

  S(questioned)

    ↓  P(NP VP | S(questioned))

  S(questioned) → NP VP(questioned)

    ↓  P(lawyer | S, VP, NP, questioned)

  S(questioned) → NP(lawyer) VP(questioned)
Smoothed Estimation
P(NP VP | S(questioned))
  = λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
  + λ_2 × Count(S → NP VP) / Count(S)

- where 0 ≤ λ_1, λ_2 ≤ 1, and λ_1 + λ_2 = 1
Smoothed Estimation
P(lawyer | S,NP,VP,questioned)
  = λ_1 × Count(lawyer, S,NP,VP,questioned) / Count(S,NP,VP,questioned)
  + λ_2 × Count(lawyer, S,NP,VP) / Count(S,NP,VP)
  + λ_3 × Count(lawyer, NP) / Count(NP)

- where 0 ≤ λ_1, λ_2, λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1
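The interpolated estimate above backs off from the most specific context to the least. A sketch with invented toy counts and λ weights:

```python
from collections import Counter

# Linearly interpolated ("smoothed") estimate.  counts maps a context to a
# Counter of head words; the lambda weights must sum to 1.
def smoothed(word, contexts, counts, lambdas):
    p = 0.0
    for lam, ctx in zip(lambdas, contexts):
        total = sum(counts[ctx].values())
        if total > 0:
            p += lam * counts[ctx][word] / total
    return p

# Toy counts, most specific context first.
counts = {
    ("S", "NP", "VP", "questioned"): Counter({"lawyer": 1}),
    ("S", "NP", "VP"): Counter({"lawyer": 3, "witness": 1}),
    ("NP",): Counter({"lawyer": 10, "witness": 30, "man": 60}),
}
contexts = [("S", "NP", "VP", "questioned"), ("S", "NP", "VP"), ("NP",)]
p = smoothed("lawyer", contexts, counts, [0.5, 0.3, 0.2])
print(round(p, 3))  # 0.5*1.0 + 0.3*0.75 + 0.2*0.1 → 0.745
```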
P(NP(lawyer) VP(questioned) | S(questioned))

  = ( λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
    + λ_2 × Count(S → NP VP) / Count(S) )

  × ( λ_1 × Count(lawyer, S,NP,VP,questioned) / Count(S,NP,VP,questioned)
    + λ_2 × Count(lawyer, S,NP,VP) / Count(S,NP,VP)
    + λ_3 × Count(lawyer, NP) / Count(NP) )
Lexicalized Probabilistic Context-Free Grammars
- Transformation to lexicalized rules:
  S → NP VP
  vs. S(questioned) → NP(lawyer) VP(questioned)

- Smoothed estimation techniques "blend" different counts

- Search for the most probable tree through dynamic programming

- Perform vastly better than PCFGs (88% vs. 73% accuracy)
Independence Assumptions

- PCFGs

  (S (NP (DT the) (N lawyer))
     (VP (V questioned)
         (NP (DT the) (N witness))))

- Lexicalized PCFGs

  (S(questioned)
     (NP(lawyer) (DT the) (N lawyer))
     (VP(questioned) (V questioned)
        (NP(witness) (DT the) (N witness))))
Results
Method                                              Accuracy
PCFGs (Charniak 97)                                 73.0%
Conditional Models – Decision Trees (Magerman 95)   84.2%
Lexical Dependencies (Collins 96)                   85.5%
Conditional Models – Logistic (Ratnaparkhi 97)      86.9%
Generative Lexicalized Model (Charniak 97)          86.7%
Generative Lexicalized Model (Collins 97)           88.2%
Logistic-inspired Model (Charniak 99)               89.6%
Boosting (Collins 2000)                             89.8%

- Accuracy = average recall/precision
Parsing for Information Extraction:
Relationships between Entities

INPUT: Boeing is located in Seattle.

OUTPUT:

Relationship = Company-Location
Company = Boeing
Location = Seattle
A Generative Model [Miller et al. 2000]

[Miller et al. 2000] use non-terminals to carry lexical items and
semantic tags:

  (S-is/CL
     (NP-Boeing/COMPANY Boeing)
     (VP-is/CL-LOC (V is)
        (VP-located/CL-LOC (V located)
           (PP-in/CL-LOC (P in)
              (NP-Seattle/LOCATION Seattle)))))

In a label such as PP-in/CL-LOC, "in" is the lexical head and CL-LOC is
the semantic tag.
A Generative Model [Miller et. al 2000]
We’re now left with an even more complicated estimation problem,

  P(S-is/CL → NP-Boeing/COMPANY VP-is/CL-LOC)

See [Miller et al. 2000] for the details.

- The parsing algorithm recovers annotated trees ⇒ it simultaneously
  recovers syntactic structure and named-entity relationships

- Accuracy (precision/recall) is greater than 80% in recovering
  relations
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Linear Models for Parsing and Tagging
- Three components:

  GEN is a function from a string to a set of candidates

  Φ maps a candidate to a feature vector

  W is a parameter vector

Component 1: GEN
- GEN enumerates a set of candidates for a sentence

  She announced a program to promote safety in trucks and vans

    ↓ GEN

  [six candidate parse trees for the sentence]

Examples of GEN
- A context-free grammar

- A finite-state machine

- Top most probable analyses from a probabilistic grammar
Component 2: Φ

- Φ maps a candidate to a feature vector ∈ R^d

- Φ defines the representation of a candidate

  [a candidate parse tree]

    ↓ Φ

  ⟨1, 0, 2, 0, 0, 1⟩
Features
- A "feature" is a function on a structure, e.g.,

  f(x) = number of times the rule A → B C is seen in x

  T_1 = (A (B (D d) (E e))
           (C (F f) (G g)))

  T_2 = (A (B (D d) (E e))
           (C (F h) (A (B b) (C c))))

  f(T_1) = 1    f(T_2) = 2
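The feature f above is just a rule counter over trees. A sketch, using a nested-tuple tree encoding (an assumption of this sketch, not the tutorial's):

```python
# Count how often the rule lhs -> rhs appears in a tree.
# Trees are nested tuples (label, child, ...); leaves are strings.
def count_rule(tree, lhs, rhs):
    if isinstance(tree, str):
        return 0
    label, *children = tree
    labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    here = 1 if label == lhs and labels == rhs else 0
    return here + sum(count_rule(c, lhs, rhs) for c in children)

T1 = ("A", ("B", ("D", "d"), ("E", "e")), ("C", ("F", "f"), ("G", "g")))
T2 = ("A", ("B", ("D", "d"), ("E", "e")),
          ("C", ("F", "h"), ("A", ("B", "b"), ("C", "c"))))
print(count_rule(T1, "A", ("B", "C")), count_rule(T2, "A", ("B", "C")))  # → 1 2
```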
Feature Vectors
- A set of functions f_1 ... f_d define a feature vector

  Φ(x) = ⟨f_1(x), f_2(x), ..., f_d(x)⟩

  T_1 = (A (B (D d) (E e))
           (C (F f) (G g)))

  T_2 = (A (B (D d) (E e))
           (C (F h) (A (B b) (C c))))

  Φ(T_1) = ⟨1, 0, 0, 3⟩    Φ(T_2) = ⟨2, 0, 1, 1⟩
Component 3: W

- W is a parameter vector ∈ R^d

- Φ and W together map a candidate to a real-valued score

  [a candidate parse tree]

    ↓ Φ

  ⟨1, 0, 2, 0, 0, 1⟩

    ↓ Φ · W

  ⟨1, 0, 2, 0, 0, 1⟩ · ⟨1.0, 3.0, 2.1, 3.0, 1.0, 2.3⟩ = 7.5
Putting it all Together
- X is a set of sentences, Y is a set of possible outputs (e.g. trees)

- Need to learn a function F : X → Y

- GEN, Φ, W define

  F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W

  Choose the highest scoring tree as the most plausible structure

- Given examples (x_i, y_i), how to set W?
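The function F(x) assembles the three components. A minimal sketch, with GEN, Φ and W replaced by toy stand-ins invented for illustration:

```python
# F(x) = argmax over y in GEN(x) of Phi(y) . W
def score(phi_y, W):
    return sum(p * w for p, w in zip(phi_y, W))

def F(x, GEN, Phi, W):
    return max(GEN(x), key=lambda y: score(Phi(y), W))

# Toy instance: two "candidate parses" represented only by their feature vectors.
GEN = lambda x: ["parse-1", "parse-2"]
Phi = {"parse-1": [1, 0, 2], "parse-2": [0, 1, 1]}.get
W = [1.0, 3.0, 2.1]
print(F("a sentence", GEN, Phi, W))  # scores 5.2 vs 5.1 → parse-1
```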
She announced a program to promote safety in trucks and vans

    ⇓ GEN

  [six candidate parse trees, i.e. different bracketings of the sentence]

    ⇓ Φ    (one feature vector per candidate)

    ⇓ Φ·W

  13.6    12.2    12.1    3.3    9.4    11.1

    ⇓ argmax

  [S [NP She] [VP announced [NP [NP a program] [VP to [VP promote [NP safety [PP in [NP [NP trucks] and [NP vans]]]]]]]]]
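The scoring pipeline on this slide can be sketched directly. Below is a minimal Python version; the candidates, feature names, and weight values are invented for illustration, and only the argmax-over-GEN structure comes from the slides:

```python
# Global linear model: F(x) = argmax_{y in GEN(x)} Phi(y) . W
# Candidates, features and weights below are illustrative, not from the slides.

def score(phi, w):
    """Inner product Phi(y) . W over a sparse feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def F(candidates, phi, w):
    """Return the highest-scoring candidate in GEN(x)."""
    return max(candidates, key=lambda y: score(phi(y), w))

# A toy GEN(x): two bracketings of the same sentence, each carrying count features.
phi = lambda y: y[1]          # each candidate carries its own feature dict
candidates = [
    ("high attachment", {"S->NP VP": 1, "PP attaches to VP": 1}),
    ("low attachment",  {"S->NP VP": 1, "PP attaches to NP": 1}),
]
w = {"S->NP VP": 1.0, "PP attaches to VP": 0.5, "PP attaches to NP": -0.3}

best = F(candidates, phi, w)  # the candidate with the highest Phi(y).W
```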
Markov Random Fields
• Parameters W define a conditional distribution over candidates:

    P(y | x, W) = exp( Φ(y)·W ) / Σ_{y′ ∈ GEN(x)} exp( Φ(y′)·W )

• Gaussian prior:  log P(W) = −||W||² / (2σ²) + C

• MAP parameter estimates maximise

    Σ_i log [ exp( Φ(y_i)·W ) / Σ_{y ∈ GEN(x_i)} exp( Φ(y)·W ) ]  −  ||W||² / (2σ²)

Note: This is a “globally normalised” model
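The globally normalised distribution P(y | x, W) is a softmax over candidate scores, and can be sketched as follows (the feature dicts and weights are invented for illustration):

```python
import math

def conditional_probs(phis, w):
    """P(y | x, W) = exp(Phi(y).W) / sum_y' exp(Phi(y').W) over GEN(x).

    phis: list of sparse feature dicts, one per candidate in GEN(x).
    """
    scores = [sum(w.get(f, 0.0) * v for f, v in phi.items()) for phi in phis]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                        # the global normalization term
    return [e / z for e in exps]

# Toy GEN(x) with two candidates (illustrative feature/weight values):
phis = [{"f1": 1}, {"f1": 2}]
w = {"f1": 0.5}
probs = conditional_probs(phis, w)
# probabilities sum to one; the higher-scoring candidate gets more mass
```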
Markov Random Fields Example 1: [Johnson et al. 1999]
• GEN is the set of parses for a sentence with a hand-crafted grammar (a Lexical Functional Grammar)

• Φ can include arbitrary features of the candidate parses

• W is estimated using conjugate gradient descent
Markov Random Fields Example 2: [Lafferty et al. 2001]
Going back to tagging:

• Inputs x are sentences w[1:n]

• GEN(w[1:n]) = T^n, i.e. all tag sequences of length n

• Global representations Φ are composed from local feature vectors φ:

    Φ(w[1:n], t[1:n]) = Σ_{j=1..n} φ(h_j, t_j)

  where h_j = ⟨t_{j−2}, t_{j−1}, w[1:n], j⟩
Markov Random Fields Example 2: [Lafferty et al. 2001]
• Typically, local features are indicator functions, e.g.,

    φ_101(h, t) = 1 if the current word w_j ends in "ing" and t = VBG, 0 otherwise

• and global features are then counts,

    Φ_101(w[1:n], t[1:n]) = Number of times a word ending in "ing" is tagged as VBG in (w[1:n], t[1:n])
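The local-indicator/global-count relationship can be sketched as below, with the history h_j reduced to just the current word for brevity (the words and tags are invented for illustration):

```python
def phi_101(history, tag):
    """Local indicator: 1 if the current word ends in "ing" and the tag is VBG."""
    word = history["word"]
    return 1 if word.endswith("ing") and tag == "VBG" else 0

def global_phi_101(words, tags):
    """Global feature: the local feature summed over positions j = 1..n,
    i.e. the number of times an "ing" word is tagged VBG."""
    return sum(phi_101({"word": w}, t) for w, t in zip(words, tags))

words = ["she", "is", "promoting", "safety", "singing"]   # illustrative
tags  = ["PRP", "VBZ", "VBG", "NN", "NN"]
count = global_phi_101(words, tags)   # only "promoting"/VBG matches
```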
Markov Random Fields Example 2: [Lafferty et al. 2001]
Conditional random fields are globally normalised models:
    log P(t[1:n] | w[1:n], W)

      = Φ(w[1:n], t[1:n])·W − log Z(w[1:n], W)

      = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)      [linear model]
        − log Z(w[1:n], W)                    [normalization]

    where Z(w[1:n], W) = Σ_{t[1:n] ∈ T^n} exp( Φ(w[1:n], t[1:n])·W )

Log-linear taggers (see the earlier part of the tutorial) are locally normalised models:

    log P(t[1:n] | w[1:n])

      = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)      [linear model]
        − Σ_{j=1..n} log Z(h_j, W)            [local normalization]
Problems with Locally Normalized Models
• “Label bias” problem [Lafferty et al. 2001]. See also [Klein and Manning 2002]

• Example of a conditional distribution that locally normalized models can’t capture (under a bigram tag representation):

    a b c  →  A B C    with P(A B C | a b c) = 1

    a b e  →  A D E    with P(A D E | a b e) = 1

• Impossible to find parameters that satisfy

    P(A | a) × P(B | b, A) × P(C | c, B) = 1

    P(A | a) × P(D | b, A) × P(E | e, D) = 1
Markov Random Fields Example 2: Parameter Estimation [Lafferty et al. 2001]
• Need to calculate the gradient of the log-likelihood,

    d/dW Σ_i log P(t_i[1:n] | w_i[1:n], W)

      = d/dW [ Σ_i Φ(w_i[1:n], t_i[1:n])·W − Σ_i log Z(w_i[1:n], W) ]

      = Σ_i Φ(w_i[1:n], t_i[1:n])
        − Σ_i Σ_{u[1:n] ∈ T^n} P(u[1:n] | w_i[1:n], W) Φ(w_i[1:n], u[1:n])

The last term looks difficult to compute. But because Φ is defined through “local” features, it can be calculated efficiently using dynamic programming. (This is a very similar problem to that solved by the EM algorithm for HMMs.) See [Lafferty et al. 2001].
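For a tiny tag set the gradient can be checked by exhaustive enumeration over T^n; the dynamic program replaces exactly this inner sum in practice. The features, words, and tags below are invented for illustration:

```python
import math
from itertools import product

# Brute-force gradient of the CRF log-likelihood: empirical feature counts
# minus expected feature counts under the model.  Tiny illustrative model
# with one feature per (word, tag) pair and one per tag bigram.

TAGS = ["N", "V"]

def phi(words, tags):
    """Global feature vector as a sparse dict of counts."""
    feats = {}
    prev = "*"
    for w, t in zip(words, tags):
        for f in [("word-tag", w, t), ("bigram", prev, t)]:
            feats[f] = feats.get(f, 0) + 1
        prev = t
    return feats

def score(words, tags, w):
    return sum(w.get(f, 0.0) * v for f, v in phi(words, tags).items())

def gradient(words, gold_tags, w):
    """d/dW log P(gold | words, W) = Phi(gold) - E_P[Phi], by enumeration."""
    all_seqs = list(product(TAGS, repeat=len(words)))
    scores = [score(words, seq, w) for seq in all_seqs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    grad = dict(phi(words, gold_tags))            # empirical counts
    for seq, e in zip(all_seqs, exps):            # minus model expectations
        for f, v in phi(words, seq).items():
            grad[f] = grad.get(f, 0.0) - (e / z) * v
    return grad

g = gradient(["dogs", "bark"], ("N", "V"), {})
# With W = 0 every tag sequence is equally likely, so each expectation is an
# average count; the gradient pushes weight toward the gold features.
```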
Techniques Covered in this Tutorial
• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:

  – Conditional Markov Random Fields

  – Perceptron algorithms

  – Kernels over NLP structures
A Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n

Initialization: W = 0

Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

Algorithm: For t = 1 … T, i = 1 … n

    z_i = F(x_i)

    If z_i ≠ y_i then W = W + Φ(y_i) − Φ(z_i)

Output: Parameters W
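A direct transcription of the algorithm in Python; GEN, Φ, and the single training example below are toy stand-ins for real structures:

```python
# Structured perceptron: on each mistake, add the correct candidate's
# features and subtract the proposed candidate's features.

def dot(phi, w):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def perceptron(examples, GEN, Phi, T=5):
    """examples: list of (x, y) pairs with y in GEN(x)."""
    w = {}
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda c: dot(Phi(c), w))   # z = F(x) under current W
            if z != y:                                      # mistake: update W
                for f, v in Phi(y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in Phi(z).items():
                    w[f] = w.get(f, 0.0) - v
    return w

# Toy problem: GEN proposes two fixed candidates, Phi is one indicator each.
GEN = lambda x: ["left", "right"]
Phi = lambda c: {c: 1.0}
examples = [("x1", "right")]
w = perceptron(examples, GEN, Phi)   # one mistake, then converges
```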
Theory Underlying the Algorithm
• Definition: GEN̄(x_i) = GEN(x_i) − {y_i}

• Definition: The training set is separable with margin δ, if there is a vector U ∈ ℝ^d with ||U|| = 1 such that

    ∀i, ∀z ∈ GEN̄(x_i),  U·Φ(y_i) − U·Φ(z) ≥ δ

Theorem: For any training sequence (x_i, y_i) which is separable with margin δ, then for the perceptron algorithm

    Number of mistakes ≤ R²/δ²

where R is a constant such that ∀i, ∀z ∈ GEN̄(x_i), ||Φ(y_i) − Φ(z)|| ≤ R.

Proof: Direct modification of the proof for the classification case. See [Collins 2002].
More Theory for the Perceptron Algorithm
• Question 1: what if the data is not separable?
  [Freund and Schapire 99] give a modified theorem for this case

• Question 2: performance on training data is all very well, but what about performance on new test examples?

  Assume some distribution P(x, y) underlying the examples.

  Theorem [Helmbold and Warmuth 95]: For any distribution P(x, y) generating examples, if e is the expected number of mistakes of an online algorithm on a sequence of m + 1 examples, then a randomized algorithm trained on m samples will have probability e/(m + 1) of making an error on a newly drawn example from P.

  [Freund and Schapire 99] use this to define the Voted Perceptron.
Perceptron Algorithm 1: Tagging
• Score for a (w[1:n], t[1:n]) pair is

    S(w[1:n], t[1:n]) = Σ_j Σ_s W_s φ_s(h_j, t_j) = Σ_s W_s Φ_s(w[1:n], t[1:n])

• Note: there are no normalization terms

• Note: S(w[1:n], t[1:n]) is not a log probability

• Viterbi algorithm for

    argmax_{t[1:n] ∈ T^n} S(w[1:n], t[1:n])
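The Viterbi search over T^n can be sketched as below. The local score is a stand-in for Σ_s W_s φ_s(h_j, t_j), with the history reduced to the previous tag plus the word position, and the toy weights are invented for illustration:

```python
# Viterbi decoding for the unnormalized score S(w[1:n], t[1:n]).

def viterbi(words, tags, local_score):
    """argmax over all tag sequences of the summed local scores."""
    # delta[t] = best score of any prefix ending in tag t; back[] for recovery
    delta = {t: local_score("*", t, words, 0) for t in tags}
    back = []
    for j in range(1, len(words)):
        new, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: delta[p] + local_score(p, t, words, j))
            bp[t] = prev
            new[t] = delta[prev] + local_score(prev, t, words, j)
        delta, back = new, back + [bp]
    # follow the back-pointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for bp in reversed(back):
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

# Toy local scores: "the" wants DT, and NN likes to follow DT.
def local_score(prev, t, words, j):
    s = 0.0
    if words[j] == "the" and t == "DT":
        s += 2.0
    if prev == "DT" and t == "NN":
        s += 1.0
    return s

path = viterbi(["the", "dog"], ["DT", "NN"], local_score)
```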
Training the Parameters
Inputs: Training set (w_i[1:n_i], t_i[1:n_i]) for i = 1 … n.

Initialization: W = 0

Algorithm: For t = 1 … T, i = 1 … n

    z[1:n_i] = argmax_{u[1:n_i] ∈ T^{n_i}} Σ_s W_s Φ_s(w_i[1:n_i], u[1:n_i])

    (z[1:n_i] is the output on the i’th sentence with the current parameters)

    If z[1:n_i] ≠ t_i[1:n_i] then

        W_s = W_s + Φ_s(w_i[1:n_i], t_i[1:n_i]) − Φ_s(w_i[1:n_i], z[1:n_i])
                    (correct tags’ feature value)  (incorrect tags’ feature value)

Output: Parameter vector W.
An Example
Say the correct tags for the i’th sentence are

    the/DT man/NN bit/VBD the/DT dog/NN

Under the current parameters, the output is

    the/DT man/NN bit/NN the/DT dog/NN

Assume also that the features track: (1) all tag bigrams; (2) word/tag pairs

Parameters incremented:  ⟨NN, VBD⟩  ⟨VBD, DT⟩  ⟨VBD → bit⟩

Parameters decremented:  ⟨NN, NN⟩  ⟨NN, DT⟩  ⟨NN → bit⟩
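The update on this slide is just Φ(correct) − Φ(output) under the stated feature set; computing the difference directly recovers exactly the incremented and decremented parameters:

```python
# Reproducing the update on this example: features are tag bigrams and
# word/tag pairs; the update is Phi(correct) - Phi(output).

def phi(tagged):
    feats = {}
    prev = "*"
    for word, tag in tagged:
        for f in [("bigram", prev, tag), ("word-tag", word, tag)]:
            feats[f] = feats.get(f, 0) + 1
        prev = tag
    return feats

correct = [("the", "DT"), ("man", "NN"), ("bit", "VBD"), ("the", "DT"), ("dog", "NN")]
output  = [("the", "DT"), ("man", "NN"), ("bit", "NN"),  ("the", "DT"), ("dog", "NN")]

update = dict(phi(correct))
for f, v in phi(output).items():
    update[f] = update.get(f, 0) - v
update = {f: v for f, v in update.items() if v != 0}
# Incremented: ("bigram","NN","VBD"), ("bigram","VBD","DT"), ("word-tag","bit","VBD")
# Decremented: ("bigram","NN","NN"),  ("bigram","NN","DT"),  ("word-tag","bit","NN")
```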
Experiments
• Wall Street Journal part-of-speech tagging data
  Perceptron = 2.89% error, Max-ent = 3.28% error
  (11.9% relative error reduction)

• [Ramshaw and Marcus 95] NP chunking data
  Perceptron = 93.63%, Max-ent = 93.29%
  (5.1% relative error reduction)

See [Collins 2002]
Perceptron Algorithm 2: Reranking Approaches
• GEN(x) is the top n most probable candidates from a base model

  – Parsing: a lexicalized probabilistic context-free grammar

  – Tagging: a “maximum entropy” tagger

  – Speech recognition: an existing recogniser
Parsing Experiments
• GEN: beam search is used to parse training and test sentences, giving around 27 parses for each sentence

• Φ(x) = ⟨L(x), f_1(x), …, f_m(x)⟩, where L(x) is the log-likelihood from the first-pass parser and f_1 … f_m are indicator functions, e.g.

    f_1(x) = 1 if x contains ⟨S → NP VP⟩, 0 otherwise

  [S [NP She] [VP announced [NP [NP a program] [VP to [VP promote [NP safety [PP in [NP [NP trucks] and [NP vans]]]]]]]]]

    ⇓ Φ

  ⟨L, 0, 0, 1, 1, 0, 1, …⟩
Named Entities
• GEN: the top 20 segmentations from a “maximum-entropy” tagger

• Φ(x) = ⟨L(x), f_1(x), …, f_m(x)⟩, with e.g.

    f_1(x) = 1 if x contains a boundary = “[The”, 0 otherwise

Candidate segmentations, each mapped by Φ to a feature vector:

  Whether you’re an aging flower child or a clueless [Gen-Xer], “[The Day They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

  Whether you’re an aging flower child or a clueless Gen-Xer, “The Day [They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

  Whether you’re an aging flower child or a clueless [Gen-Xer], “The Day [They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.
Experiments
Parsing Wall Street Journal Treebank

    Training set = 40,000 sentences, test set = 2,416 sentences

    State-of-the-art parser: 88.2% F-measure
    Reranked model: 89.5% F-measure (11% relative error reduction)
    Boosting: 89.7% F-measure (13% relative error reduction)

Recovering Named Entities in Web Data

    Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)

    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.9% F-measure (17.7% relative error reduction)
    Boosting: 87.6% F-measure (15.6% relative error reduction)
Perceptron Algorithm 3: Kernel Methods (Work with Nigel Duffy)
• It’s simple to derive a “dual form” of the perceptron algorithm

• If we can compute the inner product Φ(x)·Φ(y) efficiently, we can learn efficiently with the representation Φ
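In the dual form, W is a weighted sum of feature vectors from past mistakes, so every score Φ(c)·W unrolls into inner products K(·, ·) alone. A sketch with a toy kernel (the candidates, kernel, and data below are invented for illustration):

```python
# Dual-form perceptron: W = sum over past mistakes of Phi(y) - Phi(z),
# so Phi(c).W = sum of K(y, c) - K(z, c), needing only the kernel K.

def dual_perceptron(examples, GEN, K, T=5):
    """alpha[(y, z)] counts mistakes where z was proposed instead of y."""
    alpha = {}                                  # (correct, proposed) -> count
    def score(c):
        return sum(n * (K(y, c) - K(z, c)) for (y, z), n in alpha.items())
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=score)
            if z != y:
                alpha[(y, z)] = alpha.get((y, z), 0) + 1
    return alpha, score

# Toy structures: candidates are strings; K counts shared characters.
K = lambda a, b: sum(min(a.count(ch), b.count(ch)) for ch in set(a))
GEN = lambda x: ["ab", "cd"]
examples = [("x1", "cd")]
alpha, score = dual_perceptron(examples, GEN, K)
```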
“All Subtrees” Representation [Bod 98]
• Given: Non-terminal symbols {A, B, C, …} and terminal symbols {a, b, c, …}

• An infinite set of subtrees, e.g.

    [A B C]    [A [B b] E]    [A [B b] C]    [A B]    …

• Step 1: choose an (arbitrary) mapping from subtrees to integers, and define

    h_i(x) = Number of times subtree i is seen in x

    Φ(x) = ⟨h_1(x), h_2(x), h_3(x), …⟩
All Subtrees Representation
• Φ is now huge

• But the inner product Φ(T_1)·Φ(T_2) can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002]
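The recursion behind this dynamic program (following [Collins and Duffy 2001]) counts, for each pair of nodes, the number of common subtrees rooted there. A compact sketch with trees as nested tuples:

```python
# All-subtrees kernel: K(T1, T2) = sum over node pairs of C(n1, n2), the
# number of common subtrees rooted at n1 and n2.
# Trees are nested tuples (label, child, child, ...); leaves are strings.

def nodes(t):
    if isinstance(t, str):
        return []
    return [t] + [n for c in t[1:] for n in nodes(c)]

def production(t):
    # the rule at this node, e.g. ("A", "B", "C")
    return (t[0],) + tuple(c if isinstance(c, str) else c[0] for c in t[1:])

def C(n1, n2):
    if production(n1) != production(n2):
        return 0
    if all(isinstance(c, str) for c in n1[1:]):   # preterminal: one subtree
        return 1
    r = 1
    for c1, c2 in zip(n1[1:], n2[1:]):            # product over child pairs
        if not isinstance(c1, str):
            r *= 1 + C(c1, c2)
    return r

def tree_kernel(t1, t2):
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("A", ("B", "b"), ("C", "c"))
t2 = ("A", ("B", "b"), ("C", "d"))
k = tree_kernel(t1, t1)   # t1 has 6 subtrees, each once, so K(t1, t1) = 6
```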
Similar Kernels Exist for Tagged Sequences
  Whether you’re an aging flower child or a clueless [Gen-Xer], “[The Day They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

    ⇓ Φ

  [a feature vector counting tagged subsequences of the sentence, e.g. “Whether”, “[Gen-Xer],”, “Day They … John Lennon]”, …]
Experiments
Parsing Wall Street Journal Treebank

    Training set = 40,000 sentences, test set = 2,416 sentences

    State-of-the-art parser: 88.5% F-measure
    Reranked model: 89.1% F-measure (5% relative error reduction)

Recovering Named Entities in Web Data

    Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)

    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.6% F-measure (15.6% relative error reduction)
Conclusions
Some Other Topics in Statistical NLP:
• Machine translation

• Unsupervised/partially supervised methods

• Finite state machines

• Generation

• Question answering

• Coreference

• Language modeling for speech recognition

• Lexical semantics

• Word sense disambiguation

• Summarization
Machine Translation (Brown et al.)
• Training corpus: Canadian parliament proceedings (French-English translations)

• Task: learn a mapping from a French sentence f to an English sentence e

• Noisy channel model:

    translation(f) = argmax_e P(e | f) = argmax_e P(e) P(f | e)

• Parameterization:

    P(f | e) = Σ_a P(a, f | e)

  where the sum is over possible alignments a from English to French. Model estimation is through EM.
References

[Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.
[Booth and Thompson 73] Booth, T., and Thompson, R. 1973. Applying probability measures to
abstract languages. IEEE Transactions on Computers, C-22(5), pages 442–450.
[Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting
Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. Proc.
of the Sixth Workshop on Very Large Corpora.
[Collins and Duffy 2001] Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural
Language. In Proceedings of NIPS 14.
[Collins and Duffy 2002] Collins, M. and Duffy, N. (2002). New Ranking Algorithms for Parsing
and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings
of ACL 2002.

[Collins 2002] Collins, M. (2002). Discriminative Training Methods for Hidden Markov Models:
Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.
[Freund and Schapire 99] Freund, Y. and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277–296.
[Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551-573.
[Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to automata
theory, languages, and computation. Reading, Mass.: Addison–Wesley.
[Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

[Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282-289.
[MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330.
[McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov
models for information extraction and segmentation. In Proceedings of ICML 2000.
[Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.
[Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, Association for Computational Linguistics.
[Ratnaparkhi 96] Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.