8/4/2019 Collins - Statistical Methods in Natural Language Processing (Slides)
http://slidepdf.com/reader/full/collins-statistical-methods-in-natural-language-processing-slides 1/96
Statistical Methods in Natural Language Processing
Michael Collins, AT&T Labs-Research
Overview

Some NLP problems:

- Information extraction (named entities, relationships between entities, etc.)

- Finding linguistic structure: part-of-speech tagging, "chunking", parsing

Techniques:

- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs); PCFGs with enriched non-terminals

- Discriminative methods: conditional MRFs, perceptron algorithms, kernel methods
Some NLP Problems
- Information extraction
  - Named entities
  - Relationships between entities
  - More complex relationships

- Finding linguistic structure
  - Part-of-speech tagging
  - "Chunking" (low-level syntactic structure)
  - Parsing

- Machine translation
Common Themes
- Need to learn a mapping from one discrete structure to another
  - Strings to hidden state sequences: named-entity extraction, part-of-speech tagging
  - Strings to strings: machine translation
  - Strings to underlying trees: parsing
  - Strings to relational data structures: information extraction

- Speech recognition is similar (and shares many techniques)
Two Fundamental Problems
TAGGING: Strings to Tagged Sequences
a b e e a f h j  →  a/C b/D e/C e/C a/D f/C h/D j/C

PARSING: Strings to Trees

d e f g  →  (A (B (D d) (E e)) (C (F f) (G g)))
Information Extraction: Named Entities
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits soared at [Company Boeing Co.], easily topping forecasts on
[Location Wall Street], as their CEO [Person Alan Mulally] announced
first quarter results.
Information Extraction: Relationships between Entities
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

OUTPUT:

Relationship = Company-Location
Company = Boeing
Location = Seattle

Relationship = Employer-Employee
Employer = Boeing Co.
Employee = Alan Mulally
Information Extraction: More Complex Relationships
INPUT: Alan Mulally resigned as Boeing CEO yesterday. He will be succeeded by Jane Swift, who was previously the president at Rolls Royce.
OUTPUT:
Relationship = Management Succession
Company = Boeing Co.
Role = CEO
Out = Alan Mulally
In = Jane Swift
Relationship = Management Succession
Company = Rolls Royce
Role = president
Out = Jane Swift
Part-of-Speech Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V
forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N
Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective
“Chunking” (Low-level syntactic structure)
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
[NP Profits] soared at [NP Boeing Co.], easily topping [NP forecasts]
on [NP Wall Street], as [NP their CEO Alan Mulally] announced
[NP first quarter results].

[NP ...] = non-recursive noun phrase
Chunking as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/S soared/N at/N Boeing/S Co./C ,/N easily/N topping/N
forecasts/S on/N Wall/S Street/C ,/N as/N their/S CEO/C Alan/C
Mulally/C announced/N first/S quarter/C results/C ./N
N = Not part of noun-phrase
S = Start noun-phrase
C = Continue noun-phrase
Named Entity Extraction as Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT:
Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA
their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA
quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
SP = Start Person
CP = Continue Person
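The Start/Continue tags above can be decoded back into typed entity spans with a short routine. A minimal sketch (the tag names follow the slide; the decoding routine itself is not from the slides):

```python
# Decode the Start/Continue named-entity encoding:
# SC/CC = start/continue Company, SL/CL = Location, SP/CP = Person, NA = no entity.
STARTS = {"SC": "Company", "SL": "Location", "SP": "Person"}
CONTINUES = {"CC": "Company", "CL": "Location", "CP": "Person"}

def decode_entities(words, tags):
    """Return (entity_type, phrase) pairs from a tagged sentence."""
    entities, current = [], None          # current = (type, [words])
    for word, tag in zip(words, tags):
        if tag in STARTS:
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = (STARTS[tag], [word])
        elif tag in CONTINUES and current and CONTINUES[tag] == current[0]:
            current[1].append(word)
        else:                              # NA or an inconsistent Continue tag
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = None
    if current:
        entities.append((current[0], " ".join(current[1])))
    return entities

words = ["Profits", "soared", "at", "Boeing", "Co.", ",", "on", "Wall", "Street"]
tags  = ["NA", "NA", "NA", "SC", "CC", "NA", "NA", "SL", "CL"]
print(decode_entities(words, tags))
# → [('Company', 'Boeing Co.'), ('Location', 'Wall Street')]
```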
Parsing (Syntactic Structure)
INPUT: Boeing is located in Seattle.
OUTPUT:

(S (NP (N Boeing))
   (VP (V is)
       (VP (V located)
           (PP (P in)
               (NP (N Seattle))))))
Machine Translation
INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.
OUTPUT:
Boeing ist in Seattle. Alan Mulally ist der CEO.
Summary
Problem                          Well-Studied?   Class of Learning Problem
Named entity extraction          Yes             Tagging
Relationships between entities   A little        Parsing
More complex relationships       No              ??
Part-of-speech tagging           Yes             Tagging
Chunking                         Yes             Tagging
Syntactic structure              Yes             Parsing
Machine translation              Yes             ??
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Log-Linear Taggers: Notation
- Set of possible words = V, possible tags = T

- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

- Training data is N tagged sentences, where the i'th sentence is of
  length n_i: (w_i[1:n_i], t_i[1:n_i]) for i = 1 ... N
Log-Linear Taggers: Independence Assumptions
- The basic idea:

  P(t[1:n] | w[1:n])
    = ∏_{j=1..n} P(t_j | t_1 ... t_{j-1}, w[1:n])     (Chain rule)
    = ∏_{j=1..n} P(t_j | t_{j-1}, t_{j-2}, w[1:n])    (Independence assumptions)

- Two questions:

  1. How to parameterize P(t_j | t_{j-1}, t_{j-2}, w[1:n])?

  2. How to find argmax_{t[1:n]} P(t[1:n] | w[1:n])?
The Parameterization Problem
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- There are many possible tags in the position ??

- Need to learn a function from (context, tag) pairs to a probability
  P(tag | context)
Representation: Histories
- A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} are the previous two tags.

- w[1:n] are the n words in the input sentence.

- i is the index of the word being tagged.
Representation: Histories
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- History = ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} = DT, JJ

- w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

- i = 6
Feature–Vector Representations
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features representing tagging decision t
  in context h. For example:

  φ_1000(h, t) = 1 if current word w_i is "base" and t = VB
                 0 otherwise

  φ_1001(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                 0 otherwise
Representation: Histories
- A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

- t_{-1}, t_{-2} are the previous two tags.

- w[1:n] are the n words in the input sentence.

- i is the index of the word being tagged.

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from
which Spain expanded its empire into the rest of the Western Hemisphere.

- t_{-1}, t_{-2} = DT, JJ

- w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

- i = 6
Feature–Vector Representations
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features representing tagging decision t
  in context h.

Example: POS Tagging [Ratnaparkhi 96]

- Word/tag features

  φ_100(h, t) = 1 if current word w_i is "base" and t = VB
                0 otherwise

  φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG
                0 otherwise

- Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                0 otherwise
Part-of-Speech (POS) Tagging [Ratnaparkhi 96]
- Word/tag features

  φ_100(h, t) = 1 if current word w_i is "base" and t = VB
                0 otherwise

- Spelling features

  φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG
                0 otherwise

  φ_102(h, t) = 1 if current word w_i starts with "pre" and t = NN
                0 otherwise
Ratnaparkhi’s POS Tagger
- Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
                0 otherwise

  φ_104(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, VB⟩
                0 otherwise

  φ_105(h, t) = 1 if ⟨t⟩ = ⟨VB⟩
                0 otherwise

  φ_106(h, t) = 1 if previous word w_{i-1} = "the" and t = VB
                0 otherwise

  φ_107(h, t) = 1 if next word w_{i+1} = "the" and t = VB
                0 otherwise
Log-Linear (Maximum-Entropy) Models
- Take a history/tag pair (h, t).

- φ_s(h, t) for s = 1 ... d are features

- W_s for s = 1 ... d are parameters

- The parameters define a conditional distribution

  P(t | h) = exp( Σ_s W_s φ_s(h, t) ) / Z(h, W)

  where Z(h, W) = Σ_{t' ∈ T} exp( Σ_s W_s φ_s(h, t') )
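The conditional distribution above is a softmax over the tag set. A minimal sketch, with a toy feature map and weight vector (both invented for illustration):

```python
import math

# P(t | h) = exp(sum_s W_s * phi_s(h, t)) / Z(h, W).
# phi(h, t) is assumed to return a dict of active feature names -> values.
def conditional(h, phi, W, tagset):
    scores = {t: sum(W.get(s, 0.0) * v for s, v in phi(h, t).items())
              for t in tagset}
    Z = sum(math.exp(s) for s in scores.values())   # normalisation Z(h, W)
    return {t: math.exp(s) / Z for t, s in scores.items()}

# Toy instance: one indicator feature per (word, tag) pair.
phi = lambda h, t: {("word=" + h, t): 1.0}
W = {("word=base", "VB"): 2.0}                      # unseen features weigh 0
p = conditional("base", phi, W, ["VB", "NN"])
print(p["VB"] > p["NN"])  # → True
```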
Log-Linear (Maximum Entropy) Models
- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

- Histories h_j = ⟨t_{j-1}, t_{j-2}, w[1:n], j⟩

  log P(t[1:n] | w[1:n])
    = Σ_{j=1..n} log P(t_j | h_j)
    = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)    (Linear Score)
      − Σ_{j=1..n} log Z(h_j, W)          (Local Normalization Terms)
Log-Linear Models
- Word sequence w[1:n] = [w_1, w_2, ..., w_n]

- Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

  log P(t[1:n] | w[1:n])
    = Σ_{j=1..n} log P(t_j | h_j)
    = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j) − Σ_{j=1..n} log Z(h_j, W)

  where h_j = ⟨t_{j-2}, t_{j-1}, w[1:n], j⟩
Log-Linear Models
- Parameter estimation:
  Maximize the likelihood of the training data through gradient descent
  or iterative scaling

- Search for argmax_{t[1:n]} P(t[1:n] | w[1:n]):
  Dynamic programming, O(n|T|^3) complexity

- Experimental results:

  - Almost 97% accuracy for POS tagging [Ratnaparkhi 96]
  - Over 90% accuracy for named-entity extraction [Borthwick et al. 98]
  - Around 93% precision/recall for NP chunking
  - Better results than an HMM for FAQ segmentation [McCallum et al. 2000]
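The dynamic-programming search mentioned above is Viterbi decoding over tag trigrams. A sketch, assuming a caller-supplied local log-probability function; the toy scores at the bottom are invented for illustration:

```python
# Viterbi search for argmax_{t[1:n]} P(t[1:n] | w[1:n]) under the trigram
# decomposition.  log_p(t2, t1, t, words, j) is assumed to return
# log P(t_j = t | t_{j-2} = t2, t_{j-1} = t1, w[1:n]).  Runs in O(n |T|^3).
def viterbi(words, tagset, log_p):
    n = len(words)
    pi = {("*", "*"): 0.0}            # best log-prob for each final tag bigram
    back = []
    for j in range(n):
        new_pi, bp = {}, {}
        for (t2, t1), score in pi.items():
            for t in tagset:
                s = score + log_p(t2, t1, t, words, j)
                if s > new_pi.get((t1, t), float("-inf")):
                    new_pi[(t1, t)], bp[(t1, t)] = s, t2
        pi = new_pi
        back.append(bp)
    t1, t = max(pi, key=pi.get)       # best final tag bigram
    tags = [t1, t]
    for j in range(n - 1, 1, -1):     # follow back-pointers
        tags.insert(0, back[j][(tags[0], tags[1])])
    return tags[-n:]

# Toy scores: reward D on "the", N on "man", V on "sleeps".
log_p = lambda t2, t1, t, words, j: (
    0.0 if (words[j], t) in {("the", "D"), ("man", "N"), ("sleeps", "V")}
    else -5.0)
print(viterbi(["the", "man", "sleeps"], ["D", "N", "V"], log_p))  # → ['D', 'N', 'V']
```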
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Data for Parsing Experiments
- Penn WSJ Treebank = 50,000 sentences with associated trees

- Usual set-up: 40,000 training sentences, 2,400 test sentences

An example tree, for the sentence:

  Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its
  natural gas and electric utility businesses in Alberta , where the company
  serves about 800,000 customers .

  [Penn Treebank parse tree for this sentence, with part-of-speech tags and
  phrase labels NP, VP, PP, QP, ADVP, WHADVP, SBAR, S, TOP]
The Information Conveyed by Parse Trees
1) Part of speech for each word
(N = noun, V = verb, D = determiner)
   (S (NP (D the) (N burglar))
      (VP (V robbed)
          (NP (D the) (N apartment))))
2) Phrases

   (S (NP (DT the) (N burglar))
      (VP (V robbed)
          (NP (DT the) (N apartment))))

Noun Phrases (NP): "the burglar", "the apartment"
Verb Phrases (VP): "robbed the apartment"
Sentences (S): "the burglar robbed the apartment"
3) Useful Relationships

   (S (NP subject) (VP (V verb) ...))

   (S (NP (DT the) (N burglar))
      (VP (V robbed)
          (NP (DT the) (N apartment))))

   ⇒ "the burglar" is the subject of "robbed"
An Example Application: Machine Translation
- English word order is subject – verb – object

- Japanese word order is subject – object – verb

  English:  IBM bought Lotus
  Japanese: IBM Lotus bought

  English:  Sources said that IBM bought Lotus yesterday
  Japanese: Sources yesterday IBM Lotus bought that said
Context-Free Grammars
[Hopcroft and Ullman 1979]

A context-free grammar is a 4-tuple G = (N, Σ, R, S) where:

- N is a set of non-terminal symbols

- Σ is a set of terminal symbols

- R is a set of rules of the form X → Y_1 Y_2 ... Y_n
  for n ≥ 0, X ∈ N, Y_i ∈ (N ∪ Σ)

- S ∈ N is a distinguished start symbol
A Context-Free Grammar for English
N = {S, NP, VP, PP, D, Vi, Vt, N, P}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}

R:  S  → NP VP       Vi → sleeps
    VP → Vi          Vt → saw
    VP → Vt NP       N  → man
    VP → VP PP       N  → woman
    NP → D N         N  → telescope
    NP → NP PP       D  → the
    PP → P NP        P  → with
                     P  → in

Note: S = sentence, VP = verb phrase, NP = noun phrase, PP = prepositional
phrase, D = determiner, Vi = intransitive verb, Vt = transitive verb,
N = noun, P = preposition

Left-Most Derivations
A left-most derivation is a sequence of strings s_1 ... s_n, where

- s_1 = S, the start symbol

- s_n ∈ Σ*, i.e. s_n is made up of terminal symbols only

- Each s_i for i = 2 ... n is derived from s_{i-1} by picking the
  left-most non-terminal X in s_{i-1} and replacing it by some β,
  where X → β is a rule in R

For example: [S], [NP VP], [D N VP], [the N VP], [the man VP],
[the man Vi], [the man sleeps]

Representation of a derivation as a tree:

  (S (NP (D the) (N man))
     (VP (Vi sleeps)))
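The derivation steps above can be replayed mechanically. A sketch that rewrites the left-most non-terminal at each step, using the toy grammar from the previous slide (the rule choices are fixed by hand to reproduce the slide's example):

```python
# A left-most derivation under the toy English grammar.
NONTERMS = {"S", "NP", "VP", "PP", "D", "Vi", "Vt", "N", "P"}

def expand_leftmost(s, rule):
    """Replace the left-most non-terminal in s with the rule's right-hand side."""
    lhs, rhs = rule
    i = next(i for i, sym in enumerate(s) if sym in NONTERMS)
    assert s[i] == lhs, "rule must rewrite the left-most non-terminal"
    return s[:i] + rhs + s[i + 1:]

steps = [["S"]]
for rule in [("S", ["NP", "VP"]), ("NP", ["D", "N"]), ("D", ["the"]),
             ("N", ["man"]), ("VP", ["Vi"]), ("Vi", ["sleeps"])]:
    steps.append(expand_leftmost(steps[-1], rule))
print(steps[-1])  # → ['the', 'man', 'sleeps']
```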
Notation
Notation
- We use T_G to denote the set of all left-most derivations (trees)
  allowed by a grammar G.

- We use T_G(x), for a string x ∈ Σ*, to denote the set of all
  derivations whose final string ("yield") is x.
The Problem with Parsing: Ambiguity
INPUT: She announced a program to promote safety in trucks and vans

  ↓

POSSIBLE OUTPUTS: [multiple candidate parse trees] ... and there are more ...
An Example Tree
Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its
natural gas and electric utility businesses in Alberta , where the
company serves about 800,000 customers .

  [Penn Treebank parse tree for this sentence, as shown earlier]
A Probabilistic Context-Free Grammar
S  → NP VP      1.0

VP → Vi         0.4
VP → Vt NP      0.4
VP → VP PP      0.2

NP → D N        0.3
NP → NP PP      0.7

PP → P NP       1.0

Vi → sleeps     1.0
Vt → saw        1.0
N  → man        0.7
N  → woman      0.2
N  → telescope  0.1
D  → the        1.0
P  → with       0.5
P  → in         0.5

- Probability of a tree with rules α_i → β_i is ∏_i P(α_i → β_i | α_i)

- Maximum Likelihood estimation:

  P(VP → V NP | VP) = Count(VP → V NP) / Count(VP)
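Both formulas above, the tree probability as a product over rules and the maximum-likelihood rule estimates, can be sketched in a few lines. The nested-tuple tree encoding is an assumption of this sketch, not the tutorial's:

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); a leaf is a bare string.
def rules(tree):
    """Yield (lhs, rhs-tuple) for every rule used in the tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        yield from rules(c)

def mle(trees):
    """Maximum-likelihood rule probabilities: Count(rule) / Count(lhs)."""
    rule_count, lhs_count = Counter(), Counter()
    for t in trees:
        for lhs, rhs in rules(t):
            rule_count[(lhs, rhs)] += 1
            lhs_count[lhs] += 1
    return {r: c / lhs_count[r[0]] for r, c in rule_count.items()}

def tree_prob(tree, q):
    """Probability of a tree = product of its rule probabilities."""
    p = 1.0
    for r in rules(tree):
        p *= q[r]
    return p

t = ("S", ("NP", ("D", "the"), ("N", "man")), ("VP", ("Vi", "sleeps")))
q = mle([t])
print(tree_prob(t, q))  # → 1.0 (every rule has count 1 out of 1)
```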
PCFGs
[Booth and Thompson 73] showed that a CFG with rule probabilities
correctly defines a distribution over the set of derivations, provided
that:

1. The rule probabilities define conditional distributions over the
   different ways of rewriting each non-terminal.

2. A technical condition on the rule probabilities ensures that the
   probability of the derivation terminating in a finite number of
   steps is 1. (This condition is not really a practical concern.)
(TOP (S (NP (N IBM))
        (VP (V bought)
            (NP (N Lotus)))))

PROB = P(TOP → S)
     × P(S → NP VP | S)  × P(NP → N | NP)  × P(N → IBM | N)
     × P(VP → V NP | VP) × P(V → bought | V)
     × P(NP → N | NP)    × P(N → Lotus | N)
The SPATTER Parser: (Magerman 95;Jelinek et al 94)
- For each rule, identify the "head" child (the child that contributes
  its word to the parent):

  S  → NP VP
  VP → V NP
  NP → DT N

- Add the head word to each non-terminal:

  (S(questioned)
     (NP(lawyer) (DT the) (N lawyer))
     (VP(questioned) (V questioned)
        (NP(witness) (DT the) (N witness))))
A Lexicalized PCFG
S(questioned)  → NP(lawyer) VP(questioned)   ??
VP(questioned) → V(questioned) NP(witness)   ??
NP(lawyer)     → D(the) N(lawyer)            ??
NP(witness)    → D(the) N(witness)           ??

- The big question: how to estimate the rule probabilities??
CHARNIAK (1997)

  S(questioned)

    ↓  P(NP VP | S(questioned))

  S(questioned) → NP VP(questioned)

    ↓  P(lawyer | S, VP, NP, questioned)

  S(questioned) → NP(lawyer) VP(questioned)
Smoothed Estimation
P(NP VP | S(questioned))
  = λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
  + λ_2 × Count(S → NP VP) / Count(S)

- where 0 ≤ λ_1, λ_2 ≤ 1, and λ_1 + λ_2 = 1
Smoothed Estimation
P(lawyer | S,NP,VP,questioned)
  = λ_1 × Count(lawyer, S,NP,VP,questioned) / Count(S,NP,VP,questioned)
  + λ_2 × Count(lawyer, S,NP,VP) / Count(S,NP,VP)
  + λ_3 × Count(lawyer, NP) / Count(NP)

- where 0 ≤ λ_1, λ_2, λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1
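The interpolated estimate above backs off from the most specific context to the least. A sketch with invented toy counts and λ weights:

```python
from collections import Counter

# Linearly interpolated ("smoothed") estimate.  counts maps a context to a
# Counter of head words; the lambda weights must sum to 1.
def smoothed(word, contexts, counts, lambdas):
    p = 0.0
    for lam, ctx in zip(lambdas, contexts):
        total = sum(counts[ctx].values())
        if total > 0:
            p += lam * counts[ctx][word] / total
    return p

# Toy counts, most specific context first.
counts = {
    ("S", "NP", "VP", "questioned"): Counter({"lawyer": 1}),
    ("S", "NP", "VP"): Counter({"lawyer": 3, "witness": 1}),
    ("NP",): Counter({"lawyer": 10, "witness": 30, "man": 60}),
}
contexts = [("S", "NP", "VP", "questioned"), ("S", "NP", "VP"), ("NP",)]
p = smoothed("lawyer", contexts, counts, [0.5, 0.3, 0.2])
print(round(p, 3))  # 0.5*1.0 + 0.3*0.75 + 0.2*0.1 → 0.745
```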
P(NP(lawyer) VP(questioned) | S(questioned))

  = ( λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
    + λ_2 × Count(S → NP VP) / Count(S) )

  × ( λ_1 × Count(lawyer, S,NP,VP,questioned) / Count(S,NP,VP,questioned)
    + λ_2 × Count(lawyer, S,NP,VP) / Count(S,NP,VP)
    + λ_3 × Count(lawyer, NP) / Count(NP) )
Lexicalized Probabilistic Context-Free Grammars
- Transformation to lexicalized rules:
  S → NP VP
  vs. S(questioned) → NP(lawyer) VP(questioned)

- Smoothed estimation techniques "blend" different counts

- Search for the most probable tree through dynamic programming

- Perform vastly better than PCFGs (88% vs. 73% accuracy)
Independence Assumptions

- PCFGs

  (S (NP (DT the) (N lawyer))
     (VP (V questioned)
         (NP (DT the) (N witness))))

- Lexicalized PCFGs

  (S(questioned)
     (NP(lawyer) (DT the) (N lawyer))
     (VP(questioned) (V questioned)
        (NP(witness) (DT the) (N witness))))
Results
Method                                              Accuracy
PCFGs (Charniak 97)                                 73.0%
Conditional Models – Decision Trees (Magerman 95)   84.2%
Lexical Dependencies (Collins 96)                   85.5%
Conditional Models – Logistic (Ratnaparkhi 97)      86.9%
Generative Lexicalized Model (Charniak 97)          86.7%
Generative Lexicalized Model (Collins 97)           88.2%
Logistic-inspired Model (Charniak 99)               89.6%
Boosting (Collins 2000)                             89.8%

- Accuracy = average recall/precision
Parsing for Information Extraction:
Relationships between Entities

INPUT: Boeing is located in Seattle.

OUTPUT:

Relationship = Company-Location
Company = Boeing
Location = Seattle
A Generative Model [Miller et al. 2000]

[Miller et al. 2000] use non-terminals to carry lexical items and
semantic tags:

  (S-is/CL
     (NP-Boeing/COMPANY Boeing)
     (VP-is/CL-LOC (V is)
        (VP-located/CL-LOC (V located)
           (PP-in/CL-LOC (P in)
              (NP-Seattle/LOCATION Seattle)))))

In a label such as PP-in/CL-LOC, "in" is the lexical head and CL-LOC is
the semantic tag.
A Generative Model [Miller et. al 2000]
We’re now left with an even more complicated estimation problem,

  P(S-is/CL → NP-Boeing/COMPANY VP-is/CL-LOC)

See [Miller et al. 2000] for the details.

- The parsing algorithm recovers annotated trees ⇒ it simultaneously
  recovers syntactic structure and named-entity relationships

- Accuracy (precision/recall) is greater than 80% in recovering
  relations
Techniques Covered in this Tutorial
- Log-linear (maximum-entropy) taggers

- Probabilistic context-free grammars (PCFGs)

- PCFGs with enriched non-terminals

- Discriminative methods:
  - Conditional Markov Random Fields
  - Perceptron algorithms
  - Kernels over NLP structures
Linear Models for Parsing and Tagging
- Three components:

  GEN is a function from a string to a set of candidates

  Φ maps a candidate to a feature vector

  W is a parameter vector

Component 1: GEN
- GEN enumerates a set of candidates for a sentence

  She announced a program to promote safety in trucks and vans

    ↓ GEN

  [six candidate parse trees for the sentence]

Examples of GEN
- A context-free grammar

- A finite-state machine

- Top most probable analyses from a probabilistic grammar
Component 2: Φ

- Φ maps a candidate to a feature vector ∈ R^d

- Φ defines the representation of a candidate

  [a candidate parse tree]

    ↓ Φ

  ⟨1, 0, 2, 0, 0, 1⟩
Features
- A "feature" is a function on a structure, e.g.,

  f(x) = number of times the rule A → B C is seen in x

  T_1 = (A (B (D d) (E e))
           (C (F f) (G g)))

  T_2 = (A (B (D d) (E e))
           (C (F h) (A (B b) (C c))))

  f(T_1) = 1    f(T_2) = 2
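The feature f above is just a rule counter over trees. A sketch, using a nested-tuple tree encoding (an assumption of this sketch, not the tutorial's):

```python
# Count how often the rule lhs -> rhs appears in a tree.
# Trees are nested tuples (label, child, ...); leaves are strings.
def count_rule(tree, lhs, rhs):
    if isinstance(tree, str):
        return 0
    label, *children = tree
    labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    here = 1 if label == lhs and labels == rhs else 0
    return here + sum(count_rule(c, lhs, rhs) for c in children)

T1 = ("A", ("B", ("D", "d"), ("E", "e")), ("C", ("F", "f"), ("G", "g")))
T2 = ("A", ("B", ("D", "d"), ("E", "e")),
          ("C", ("F", "h"), ("A", ("B", "b"), ("C", "c"))))
print(count_rule(T1, "A", ("B", "C")), count_rule(T2, "A", ("B", "C")))  # → 1 2
```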
Feature Vectors
- A set of functions f_1 ... f_d define a feature vector

  Φ(x) = ⟨f_1(x), f_2(x), ..., f_d(x)⟩

  T_1 = (A (B (D d) (E e))
           (C (F f) (G g)))

  T_2 = (A (B (D d) (E e))
           (C (F h) (A (B b) (C c))))

  Φ(T_1) = ⟨1, 0, 0, 3⟩    Φ(T_2) = ⟨2, 0, 1, 1⟩
Component 3: W

- W is a parameter vector ∈ R^d

- Φ and W together map a candidate to a real-valued score

  [a candidate parse tree]

    ↓ Φ

  ⟨1, 0, 2, 0, 0, 1⟩

    ↓ Φ · W

  ⟨1, 0, 2, 0, 0, 1⟩ · ⟨1.0, 3.0, 2.1, 3.0, 1.0, 2.3⟩ = 7.5
Putting it all Together
- X is a set of sentences, Y is a set of possible outputs (e.g. trees)

- Need to learn a function F : X → Y

- GEN, Φ, W define

  F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W

  Choose the highest scoring tree as the most plausible structure

- Given examples (x_i, y_i), how to set W?
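The function F(x) assembles the three components. A minimal sketch, with GEN, Φ and W replaced by toy stand-ins invented for illustration:

```python
# F(x) = argmax over y in GEN(x) of Phi(y) . W
def score(phi_y, W):
    return sum(p * w for p, w in zip(phi_y, W))

def F(x, GEN, Phi, W):
    return max(GEN(x), key=lambda y: score(Phi(y), W))

# Toy instance: two "candidate parses" represented only by their feature vectors.
GEN = lambda x: ["parse-1", "parse-2"]
Phi = {"parse-1": [1, 0, 2], "parse-2": [0, 1, 1]}.get
W = [1.0, 3.0, 2.1]
print(F("a sentence", GEN, Phi, W))  # scores 5.2 vs 5.1 → parse-1
```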
She announced a program to promote safety in trucks and vans

    ⇓ GEN

  [six candidate parse trees, i.e. different bracketings of the sentence]

    ⇓ Φ    (one feature vector per candidate)

    ⇓ Φ·W

  13.6    12.2    12.1    3.3    9.4    11.1

    ⇓ argmax

  [S [NP She] [VP announced [NP [NP a program] [VP to [VP promote [NP safety [PP in [NP [NP trucks] and [NP vans]]]]]]]]]
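The scoring pipeline on this slide can be sketched directly. Below is a minimal Python version; the candidates, feature names, and weight values are invented for illustration, and only the argmax-over-GEN structure comes from the slides:

```python
# Global linear model: F(x) = argmax_{y in GEN(x)} Phi(y) . W
# Candidates, features and weights below are illustrative, not from the slides.

def score(phi, w):
    """Inner product Phi(y) . W over a sparse feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def F(candidates, phi, w):
    """Return the highest-scoring candidate in GEN(x)."""
    return max(candidates, key=lambda y: score(phi(y), w))

# A toy GEN(x): two bracketings of the same sentence, each carrying count features.
phi = lambda y: y[1]          # each candidate carries its own feature dict
candidates = [
    ("high attachment", {"S->NP VP": 1, "PP attaches to VP": 1}),
    ("low attachment",  {"S->NP VP": 1, "PP attaches to NP": 1}),
]
w = {"S->NP VP": 1.0, "PP attaches to VP": 0.5, "PP attaches to NP": -0.3}

best = F(candidates, phi, w)  # the candidate with the highest Phi(y).W
```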
Markov Random Fields
• Parameters W define a conditional distribution over candidates:

    P(y | x, W) = exp( Φ(y)·W ) / Σ_{y′ ∈ GEN(x)} exp( Φ(y′)·W )

• Gaussian prior:  log P(W) = −||W||² / (2σ²) + C

• MAP parameter estimates maximise

    Σ_i log [ exp( Φ(y_i)·W ) / Σ_{y ∈ GEN(x_i)} exp( Φ(y)·W ) ]  −  ||W||² / (2σ²)

Note: This is a “globally normalised” model
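The globally normalised distribution P(y | x, W) is a softmax over candidate scores, and can be sketched as follows (the feature dicts and weights are invented for illustration):

```python
import math

def conditional_probs(phis, w):
    """P(y | x, W) = exp(Phi(y).W) / sum_y' exp(Phi(y').W) over GEN(x).

    phis: list of sparse feature dicts, one per candidate in GEN(x).
    """
    scores = [sum(w.get(f, 0.0) * v for f, v in phi.items()) for phi in phis]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                        # the global normalization term
    return [e / z for e in exps]

# Toy GEN(x) with two candidates (illustrative feature/weight values):
phis = [{"f1": 1}, {"f1": 2}]
w = {"f1": 0.5}
probs = conditional_probs(phis, w)
# probabilities sum to one; the higher-scoring candidate gets more mass
```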
Markov Random Fields Example 1: [Johnson et al. 1999]
• GEN is the set of parses for a sentence with a hand-crafted grammar (a Lexical Functional Grammar)

• Φ can include arbitrary features of the candidate parses

• W is estimated using conjugate gradient descent
Markov Random Fields Example 2: [Lafferty et al. 2001]
Going back to tagging:

• Inputs x are sentences w[1:n]

• GEN(w[1:n]) = T^n, i.e. all tag sequences of length n

• Global representations Φ are composed from local feature vectors φ:

    Φ(w[1:n], t[1:n]) = Σ_{j=1..n} φ(h_j, t_j)

  where h_j = ⟨t_{j−2}, t_{j−1}, w[1:n], j⟩
Markov Random Fields Example 2: [Lafferty et al. 2001]
• Typically, local features are indicator functions, e.g.,

    φ_101(h, t) = 1 if the current word w_j ends in "ing" and t = VBG, 0 otherwise

• and global features are then counts,

    Φ_101(w[1:n], t[1:n]) = Number of times a word ending in "ing" is tagged as VBG in (w[1:n], t[1:n])
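The local-indicator/global-count relationship can be sketched as below, with the history h_j reduced to just the current word for brevity (the words and tags are invented for illustration):

```python
def phi_101(history, tag):
    """Local indicator: 1 if the current word ends in "ing" and the tag is VBG."""
    word = history["word"]
    return 1 if word.endswith("ing") and tag == "VBG" else 0

def global_phi_101(words, tags):
    """Global feature: the local feature summed over positions j = 1..n,
    i.e. the number of times an "ing" word is tagged VBG."""
    return sum(phi_101({"word": w}, t) for w, t in zip(words, tags))

words = ["she", "is", "promoting", "safety", "singing"]   # illustrative
tags  = ["PRP", "VBZ", "VBG", "NN", "NN"]
count = global_phi_101(words, tags)   # only "promoting"/VBG matches
```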
Markov Random Fields Example 2: [Lafferty et al. 2001]
Conditional random fields are globally normalised models:
    log P(t[1:n] | w[1:n], W)

      = Φ(w[1:n], t[1:n])·W − log Z(w[1:n], W)

      = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)      [linear model]
        − log Z(w[1:n], W)                    [normalization]

    where Z(w[1:n], W) = Σ_{t[1:n] ∈ T^n} exp( Φ(w[1:n], t[1:n])·W )

Log-linear taggers (see the earlier part of the tutorial) are locally normalised models:

    log P(t[1:n] | w[1:n])

      = Σ_{j=1..n} Σ_s W_s φ_s(h_j, t_j)      [linear model]
        − Σ_{j=1..n} log Z(h_j, W)            [local normalization]
Problems with Locally Normalized Models
• “Label bias” problem [Lafferty et al. 2001]. See also [Klein and Manning 2002]

• Example of a conditional distribution that locally normalized models can’t capture (under a bigram tag representation):

    a b c  →  A B C    with P(A B C | a b c) = 1

    a b e  →  A D E    with P(A D E | a b e) = 1

• Impossible to find parameters that satisfy

    P(A | a) × P(B | b, A) × P(C | c, B) = 1

    P(A | a) × P(D | b, A) × P(E | e, D) = 1
Markov Random Fields Example 2: Parameter Estimation [Lafferty et al. 2001]
• Need to calculate the gradient of the log-likelihood,

    d/dW Σ_i log P(t_i[1:n] | w_i[1:n], W)

      = d/dW [ Σ_i Φ(w_i[1:n], t_i[1:n])·W − Σ_i log Z(w_i[1:n], W) ]

      = Σ_i Φ(w_i[1:n], t_i[1:n])
        − Σ_i Σ_{u[1:n] ∈ T^n} P(u[1:n] | w_i[1:n], W) Φ(w_i[1:n], u[1:n])

The last term looks difficult to compute. But because Φ is defined through “local” features, it can be calculated efficiently using dynamic programming. (This is a very similar problem to that solved by the EM algorithm for HMMs.) See [Lafferty et al. 2001].
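For a tiny tag set the gradient can be checked by exhaustive enumeration over T^n; the dynamic program replaces exactly this inner sum in practice. The features, words, and tags below are invented for illustration:

```python
import math
from itertools import product

# Brute-force gradient of the CRF log-likelihood: empirical feature counts
# minus expected feature counts under the model.  Tiny illustrative model
# with one feature per (word, tag) pair and one per tag bigram.

TAGS = ["N", "V"]

def phi(words, tags):
    """Global feature vector as a sparse dict of counts."""
    feats = {}
    prev = "*"
    for w, t in zip(words, tags):
        for f in [("word-tag", w, t), ("bigram", prev, t)]:
            feats[f] = feats.get(f, 0) + 1
        prev = t
    return feats

def score(words, tags, w):
    return sum(w.get(f, 0.0) * v for f, v in phi(words, tags).items())

def gradient(words, gold_tags, w):
    """d/dW log P(gold | words, W) = Phi(gold) - E_P[Phi], by enumeration."""
    all_seqs = list(product(TAGS, repeat=len(words)))
    scores = [score(words, seq, w) for seq in all_seqs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    grad = dict(phi(words, gold_tags))            # empirical counts
    for seq, e in zip(all_seqs, exps):            # minus model expectations
        for f, v in phi(words, seq).items():
            grad[f] = grad.get(f, 0.0) - (e / z) * v
    return grad

g = gradient(["dogs", "bark"], ("N", "V"), {})
# With W = 0 every tag sequence is equally likely, so each expectation is an
# average count; the gradient pushes weight toward the gold features.
```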
Techniques Covered in this Tutorial
• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:

  – Conditional Markov Random Fields

  – Perceptron algorithms

  – Kernels over NLP structures
A Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n

Initialization: W = 0

Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

Algorithm: For t = 1 … T, i = 1 … n

    z_i = F(x_i)

    If z_i ≠ y_i then W = W + Φ(y_i) − Φ(z_i)

Output: Parameters W
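A direct transcription of the algorithm in Python; GEN, Φ, and the single training example below are toy stand-ins for real structures:

```python
# Structured perceptron: on each mistake, add the correct candidate's
# features and subtract the proposed candidate's features.

def dot(phi, w):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def perceptron(examples, GEN, Phi, T=5):
    """examples: list of (x, y) pairs with y in GEN(x)."""
    w = {}
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda c: dot(Phi(c), w))   # z = F(x) under current W
            if z != y:                                      # mistake: update W
                for f, v in Phi(y).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in Phi(z).items():
                    w[f] = w.get(f, 0.0) - v
    return w

# Toy problem: GEN proposes two fixed candidates, Phi is one indicator each.
GEN = lambda x: ["left", "right"]
Phi = lambda c: {c: 1.0}
examples = [("x1", "right")]
w = perceptron(examples, GEN, Phi)   # one mistake, then converges
```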
Theory Underlying the Algorithm
• Definition: GEN̄(x_i) = GEN(x_i) − {y_i}

• Definition: The training set is separable with margin δ, if there is a vector U ∈ ℝ^d with ||U|| = 1 such that

    ∀i, ∀z ∈ GEN̄(x_i),  U·Φ(y_i) − U·Φ(z) ≥ δ

Theorem: For any training sequence (x_i, y_i) which is separable with margin δ, then for the perceptron algorithm

    Number of mistakes ≤ R²/δ²

where R is a constant such that ∀i, ∀z ∈ GEN̄(x_i), ||Φ(y_i) − Φ(z)|| ≤ R.

Proof: Direct modification of the proof for the classification case. See [Collins 2002].
More Theory for the Perceptron Algorithm
• Question 1: what if the data is not separable?
  [Freund and Schapire 99] give a modified theorem for this case

• Question 2: performance on training data is all very well, but what about performance on new test examples?

  Assume some distribution P(x, y) underlying the examples.

  Theorem [Helmbold and Warmuth 95]: For any distribution P(x, y) generating examples, if e is the expected number of mistakes of an online algorithm on a sequence of m + 1 examples, then a randomized algorithm trained on m samples will have probability e/(m + 1) of making an error on a newly drawn example from P.

  [Freund and Schapire 99] use this to define the Voted Perceptron.
Perceptron Algorithm 1: Tagging
• Score for a (w[1:n], t[1:n]) pair is

    S(w[1:n], t[1:n]) = Σ_j Σ_s W_s φ_s(h_j, t_j) = Σ_s W_s Φ_s(w[1:n], t[1:n])

• Note: there are no normalization terms

• Note: S(w[1:n], t[1:n]) is not a log probability

• Viterbi algorithm for

    argmax_{t[1:n] ∈ T^n} S(w[1:n], t[1:n])
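The Viterbi search over T^n can be sketched as below. The local score is a stand-in for Σ_s W_s φ_s(h_j, t_j), with the history reduced to the previous tag plus the word position, and the toy weights are invented for illustration:

```python
# Viterbi decoding for the unnormalized score S(w[1:n], t[1:n]).

def viterbi(words, tags, local_score):
    """argmax over all tag sequences of the summed local scores."""
    # delta[t] = best score of any prefix ending in tag t; back[] for recovery
    delta = {t: local_score("*", t, words, 0) for t in tags}
    back = []
    for j in range(1, len(words)):
        new, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: delta[p] + local_score(p, t, words, j))
            bp[t] = prev
            new[t] = delta[prev] + local_score(prev, t, words, j)
        delta, back = new, back + [bp]
    # follow the back-pointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for bp in reversed(back):
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

# Toy local scores: "the" wants DT, and NN likes to follow DT.
def local_score(prev, t, words, j):
    s = 0.0
    if words[j] == "the" and t == "DT":
        s += 2.0
    if prev == "DT" and t == "NN":
        s += 1.0
    return s

path = viterbi(["the", "dog"], ["DT", "NN"], local_score)
```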
Training the Parameters
Inputs: Training set (w_i[1:n_i], t_i[1:n_i]) for i = 1 … n.

Initialization: W = 0

Algorithm: For t = 1 … T, i = 1 … n

    z[1:n_i] = argmax_{u[1:n_i] ∈ T^{n_i}} Σ_s W_s Φ_s(w_i[1:n_i], u[1:n_i])

    (z[1:n_i] is the output on the i’th sentence with the current parameters)

    If z[1:n_i] ≠ t_i[1:n_i] then

        W_s = W_s + Φ_s(w_i[1:n_i], t_i[1:n_i]) − Φ_s(w_i[1:n_i], z[1:n_i])
                    (correct tags’ feature value)  (incorrect tags’ feature value)

Output: Parameter vector W.
An Example
Say the correct tags for the i’th sentence are

    the/DT man/NN bit/VBD the/DT dog/NN

Under the current parameters, the output is

    the/DT man/NN bit/NN the/DT dog/NN

Assume also that the features track: (1) all tag bigrams; (2) word/tag pairs

Parameters incremented:  ⟨NN, VBD⟩  ⟨VBD, DT⟩  ⟨VBD → bit⟩

Parameters decremented:  ⟨NN, NN⟩  ⟨NN, DT⟩  ⟨NN → bit⟩
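The update on this slide is just Φ(correct) − Φ(output) under the stated feature set; computing the difference directly recovers exactly the incremented and decremented parameters:

```python
# Reproducing the update on this example: features are tag bigrams and
# word/tag pairs; the update is Phi(correct) - Phi(output).

def phi(tagged):
    feats = {}
    prev = "*"
    for word, tag in tagged:
        for f in [("bigram", prev, tag), ("word-tag", word, tag)]:
            feats[f] = feats.get(f, 0) + 1
        prev = tag
    return feats

correct = [("the", "DT"), ("man", "NN"), ("bit", "VBD"), ("the", "DT"), ("dog", "NN")]
output  = [("the", "DT"), ("man", "NN"), ("bit", "NN"),  ("the", "DT"), ("dog", "NN")]

update = dict(phi(correct))
for f, v in phi(output).items():
    update[f] = update.get(f, 0) - v
update = {f: v for f, v in update.items() if v != 0}
# Incremented: ("bigram","NN","VBD"), ("bigram","VBD","DT"), ("word-tag","bit","VBD")
# Decremented: ("bigram","NN","NN"),  ("bigram","NN","DT"),  ("word-tag","bit","NN")
```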
Experiments
• Wall Street Journal part-of-speech tagging data
  Perceptron = 2.89% error, Max-ent = 3.28% error
  (11.9% relative error reduction)

• [Ramshaw and Marcus 95] NP chunking data
  Perceptron = 93.63%, Max-ent = 93.29%
  (5.1% relative error reduction)

See [Collins 2002]
Perceptron Algorithm 2: Reranking Approaches
• GEN(x) is the top n most probable candidates from a base model

  – Parsing: a lexicalized probabilistic context-free grammar

  – Tagging: a “maximum entropy” tagger

  – Speech recognition: an existing recogniser
Parsing Experiments
• GEN: beam search is used to parse training and test sentences, giving around 27 parses for each sentence

• Φ(x) = ⟨L(x), f_1(x), …, f_m(x)⟩, where L(x) is the log-likelihood from the first-pass parser and f_1 … f_m are indicator functions, e.g.

    f_1(x) = 1 if x contains ⟨S → NP VP⟩, 0 otherwise

  [S [NP She] [VP announced [NP [NP a program] [VP to [VP promote [NP safety [PP in [NP [NP trucks] and [NP vans]]]]]]]]]

    ⇓ Φ

  ⟨L, 0, 0, 1, 1, 0, 1, …⟩
Named Entities
• GEN: the top 20 segmentations from a “maximum-entropy” tagger

• Φ(x) = ⟨L(x), f_1(x), …, f_m(x)⟩, with e.g.

    f_1(x) = 1 if x contains a boundary = “[The”, 0 otherwise

Candidate segmentations, each mapped by Φ to a feature vector:

  Whether you’re an aging flower child or a clueless [Gen-Xer], “[The Day They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

  Whether you’re an aging flower child or a clueless Gen-Xer, “The Day [They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

  Whether you’re an aging flower child or a clueless [Gen-Xer], “The Day [They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.
Experiments
Parsing Wall Street Journal Treebank

    Training set = 40,000 sentences, test set = 2,416 sentences

    State-of-the-art parser: 88.2% F-measure
    Reranked model: 89.5% F-measure (11% relative error reduction)
    Boosting: 89.7% F-measure (13% relative error reduction)

Recovering Named Entities in Web Data

    Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)

    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.9% F-measure (17.7% relative error reduction)
    Boosting: 87.6% F-measure (15.6% relative error reduction)
Perceptron Algorithm 3: Kernel Methods (Work with Nigel Duffy)
• It’s simple to derive a “dual form” of the perceptron algorithm

• If we can compute the inner product Φ(x)·Φ(y) efficiently, we can learn efficiently with the representation Φ
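In the dual form, W is a weighted sum of feature vectors from past mistakes, so every score Φ(c)·W unrolls into inner products K(·, ·) alone. A sketch with a toy kernel (the candidates, kernel, and data below are invented for illustration):

```python
# Dual-form perceptron: W = sum over past mistakes of Phi(y) - Phi(z),
# so Phi(c).W = sum of K(y, c) - K(z, c), needing only the kernel K.

def dual_perceptron(examples, GEN, K, T=5):
    """alpha[(y, z)] counts mistakes where z was proposed instead of y."""
    alpha = {}                                  # (correct, proposed) -> count
    def score(c):
        return sum(n * (K(y, c) - K(z, c)) for (y, z), n in alpha.items())
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=score)
            if z != y:
                alpha[(y, z)] = alpha.get((y, z), 0) + 1
    return alpha, score

# Toy structures: candidates are strings; K counts shared characters.
K = lambda a, b: sum(min(a.count(ch), b.count(ch)) for ch in set(a))
GEN = lambda x: ["ab", "cd"]
examples = [("x1", "cd")]
alpha, score = dual_perceptron(examples, GEN, K)
```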
“All Subtrees” Representation [Bod 98]
• Given: Non-terminal symbols {A, B, C, …} and terminal symbols {a, b, c, …}

• An infinite set of subtrees, e.g.

    [A B C]    [A [B b] E]    [A [B b] C]    [A B]    …

• Step 1: choose an (arbitrary) mapping from subtrees to integers, and define

    h_i(x) = Number of times subtree i is seen in x

    Φ(x) = ⟨h_1(x), h_2(x), h_3(x), …⟩
All Subtrees Representation
• Φ is now huge

• But the inner product Φ(T_1)·Φ(T_2) can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002]
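The recursion behind this dynamic program (following [Collins and Duffy 2001]) counts, for each pair of nodes, the number of common subtrees rooted there. A compact sketch with trees as nested tuples:

```python
# All-subtrees kernel: K(T1, T2) = sum over node pairs of C(n1, n2), the
# number of common subtrees rooted at n1 and n2.
# Trees are nested tuples (label, child, child, ...); leaves are strings.

def nodes(t):
    if isinstance(t, str):
        return []
    return [t] + [n for c in t[1:] for n in nodes(c)]

def production(t):
    # the rule at this node, e.g. ("A", "B", "C")
    return (t[0],) + tuple(c if isinstance(c, str) else c[0] for c in t[1:])

def C(n1, n2):
    if production(n1) != production(n2):
        return 0
    if all(isinstance(c, str) for c in n1[1:]):   # preterminal: one subtree
        return 1
    r = 1
    for c1, c2 in zip(n1[1:], n2[1:]):            # product over child pairs
        if not isinstance(c1, str):
            r *= 1 + C(c1, c2)
    return r

def tree_kernel(t1, t2):
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("A", ("B", "b"), ("C", "c"))
t2 = ("A", ("B", "b"), ("C", "d"))
k = tree_kernel(t1, t1)   # t1 has 6 subtrees, each once, so K(t1, t1) = 6
```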
Similar Kernels Exist for Tagged Sequences
  Whether you’re an aging flower child or a clueless [Gen-Xer], “[The Day They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

    ⇓ Φ

  [a feature vector counting tagged subsequences of the sentence, e.g. “Whether”, “[Gen-Xer],”, “Day They … John Lennon]”, …]
Experiments
Parsing Wall Street Journal Treebank

    Training set = 40,000 sentences, test set = 2,416 sentences

    State-of-the-art parser: 88.5% F-measure
    Reranked model: 89.1% F-measure (5% relative error reduction)

Recovering Named Entities in Web Data

    Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)

    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.6% F-measure (15.6% relative error reduction)
Conclusions
Some Other Topics in Statistical NLP:
• Machine translation

• Unsupervised/partially supervised methods

• Finite state machines

• Generation

• Question answering

• Coreference

• Language modeling for speech recognition

• Lexical semantics

• Word sense disambiguation

• Summarization
Machine Translation (Brown et al.)
• Training corpus: Canadian parliament proceedings (French-English translations)

• Task: learn a mapping from a French sentence f to an English sentence e

• Noisy channel model:

    translation(f) = argmax_e P(e | f) = argmax_e P(e) P(f | e)

• Parameterization:

    P(f | e) = Σ_a P(a, f | e)

  where the sum is over possible alignments a from English to French. Model estimation is through EM.
References

[Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.
[Booth and Thompson 73] Booth, T., and Thompson, R. 1973. Applying probability measures to
abstract languages. IEEE Transactions on Computers, C-22(5), pages 442–450.
[Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting
Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. Proc.
of the Sixth Workshop on Very Large Corpora.
[Collins and Duffy 2001] Collins, M. and Duffy, N. (2001). Convolution Kernels for Natural
Language. In Proceedings of NIPS 14.
[Collins and Duffy 2002] Collins, M. and Duffy, N. (2002). New Ranking Algorithms for Parsing
and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings
of ACL 2002.

[Collins 2002] Collins, M. (2002). Discriminative Training Methods for Hidden Markov Models:
Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.
[Freund and Schapire 99] Freund, Y. and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277–296.
[Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551-573.
[Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to automata
theory, languages, and computation. Reading, Mass.: Addison–Wesley.
[Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

[Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282-289.
[MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330.
[McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov
models for information extraction and segmentation. In Proceedings of ICML 2000.
[Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.
[Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, Association for Computational Linguistics.
[Ratnaparkhi 96] Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.