Statistical Natural Language Parsing

Mausam

(Based on slides of Michael Collins, Dan Jurafsky, Dan Klein, Chris Manning, Ray Mooney, Luke Zettlemoyer)

Two views of linguistic structure: 1. Constituency (phrase structure)

• Phrase structure organizes words into nested constituents

• How do we know what is a constituent? (Not that linguists don't argue about some cases.)

• Distribution: a constituent behaves as a unit that can appear in different places
• John talked [to the children] [about drugs]
• John talked [about drugs] [to the children]
• *John talked drugs to the children about

• Substitution/expansion/pro-forms
• I sat [on the box / right on top of the box / there]

• Coordination, regular internal structure, no intrusion, fragments, semantics, …

Two views of linguistic structure: 2. Dependency structure

• Dependency structure shows which words depend on (modify or are arguments of) which other words

The boy put the tortoise on the rug

[Figure: dependency tree in which "put" is the head, with dependents "boy", "tortoise", and "on"; "on" heads "rug"; each article "The"/"the" depends on its noun]

Why Parse?

• Part-of-speech information

• Phrase information

• Useful relationships

The rise of annotated data

The Penn Treebank

( (S (NP-SBJ (DT The) (NN move))
     (VP (VBD followed)
         (NP (NP (DT a) (NN round))
             (PP (IN of)
                 (NP (NP (JJ similar) (NNS increases))
                     (PP (IN by) (NP (JJ other) (NNS lenders)))
                     (PP (IN against) (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
         (, ,)
         (S-ADV (NP-SBJ (-NONE- *))
                (VP (VBG reflecting)
                    (NP (NP (DT a) (VBG continuing) (NN decline))
                        (PP-LOC (IN in) (NP (DT that) (NN market)))))))
     (. .)))

[Marcus et al. 1993, Computational Linguistics]

Penn Treebank Non-terminals

The rise of annotated data

• Starting off, building a treebank seems a lot slower and less useful than building a grammar

• But a treebank gives us many things:
• Reusability of the labor
• Many parsers, POS taggers, etc.
• Valuable resource for linguistics
• Broad coverage
• Frequencies and distributional information
• A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications:

• High-precision question answering [Pasca and Harabagiu, SIGIR 2001]
• Improving biological named entity finding [Finkel et al., JNLPBA 2004]
• Syntactically based sentence compression [Lin and Wilbur, 2007]
• Extracting opinions about products [Bloom et al., NAACL 2007]
• Improved interaction in computer games [Gorniak and Roy, 2005]
• Helping linguists find data [Resnik et al., BLS 2005]
• Source sentence analysis for machine translation [Xu et al., 2009]
• Relation extraction systems [Fundel et al., Bioinformatics 2006]

Example Application: Machine Translation

• The boy put the tortoise on the rug
• लड़के ने रखा कछुआ ऊपर कालीन (the Hindi rendering)
• SVO vs. SOV; preposition vs. post-position

[Figure, repeated over several build slides: the English parse tree, S with NP (the boy), VP (put the tortoise), and PP (on the rug), shown step by step alongside the corresponding tree over the Hindi words, illustrating how the constituents are reordered and the preposition becomes a post-position]

Pre-1990 ("Classical") NLP Parsing

• Goes back to Chomsky's PhD thesis in the 1950s

• Wrote symbolic grammar (CFG or often richer) and lexicon:
  S → NP VP        NN → interest
  NP → (DT) NN     NNS → rates
  NP → NN NNS      NNS → raises
  NP → NNP         VBP → interest
  VP → V NP        VBZ → rates

• Used grammar/proof systems to prove parses from words

• This scaled very badly and didn't give coverage. For the sentence
  Fed raises interest rates 0.5% in effort to control inflation:
• Minimal grammar: 36 parses
• Simple 10-rule grammar: 592 parses
• Real-size broad-coverage grammar: millions of parses

Classical NLP Parsing: The problem and its solution

• Categorical constraints can be added to grammars to limit unlikely/weird parses for sentences
• But this attempt makes the grammars not robust: in traditional systems, commonly 30% of sentences in even an edited text would have no parse

• A less constrained grammar can parse more sentences
• But simple sentences end up with ever more parses, with no way to choose between them

• We need mechanisms that allow us to find the most likely parse(s) for a sentence
• Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s)

Context-Free Grammars and Ambiguities

Context-Free Grammars

Context-Free Grammars in NLP

• A context-free grammar G in NLP = (N, C, Σ, S, L, R)
• Σ is a set of terminal symbols
• C is a set of preterminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• L is the lexicon, a set of items of the form X → x, where X ∈ C and x ∈ Σ
• R is the grammar, a set of items of the form X → γ, where X ∈ N and γ ∈ (N ∪ C)*
• By usual convention, S is the start symbol, but in statistical NLP we usually have an extra node at the top (ROOT, TOP)
• We usually write e for an empty sequence, rather than nothing
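To make the definition concrete, here is a minimal sketch (my own illustrative encoding, not from the slides) of how the components of G = (N, C, Σ, S, L, R) might be held in plain Python containers:

# Illustrative encoding of a tiny CFG; the container layout and names are assumptions.
cfg = {
    "N": {"S", "NP", "VP", "PP"},                              # nonterminals
    "C": {"DT", "NN", "V", "P"},                               # preterminals
    "Sigma": {"the", "boy", "put", "tortoise", "on", "rug"},   # terminals
    "S": "S",                                                  # start symbol (often wrapped in ROOT)
    "L": {"DT": {"the"}, "NN": {"boy", "tortoise", "rug"},     # lexicon: preterminal -> words
          "V": {"put"}, "P": {"on"}},
    "R": {"S": [("NP", "VP")], "VP": [("V", "NP", "PP")],      # rules: X -> sequence over N ∪ C
          "NP": [("DT", "NN")], "PP": [("P", "NP")]},
}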

A Context-Free Grammar of English

Left-Most Derivations

Properties of CFGs

Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
• PPs, adverbial or participial phrases, infinitives, coordinations, etc.

• Catalan numbers: Cn = (2n)! / [(n+1)! n!]
• An exponentially growing series, which arises in many tree-like contexts
• E.g. the number of possible triangulations of a polygon with n+2 sides
• Turns up in triangulation of probabilistic graphical models…
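Since the Catalan numbers come up repeatedly, a quick way to see how fast they grow (a small illustrative sketch; math.comb is a standard-library function):

from math import comb

def catalan(n: int) -> int:
    # C_n = (2n)! / ((n+1)! n!) = C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])   # [1, 1, 2, 5, 14, 42, 132, 429]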

Attachments

• I cleaned the dishes from dinner
• I cleaned the dishes with detergent
• I cleaned the dishes in my pajamas
• I cleaned the dishes in the sink

Syntactic Ambiguities I

• Prepositional phrases: They cooked the beans in the pot on the stove with handles
• Particle vs. preposition: The lady dressed up the staircase
• Complement structures: The tourists objected to the guide that they couldn't hear. She knows you like the back of her hand
• Gerund vs. participial adjective: Visiting relatives can be boring. Changing schedules frequently confused passengers

Syntactic Ambiguities II

• Modifier scope within NPs: impractical design requirements; plastic cup holder
• Multiple gap constructions: The chicken is ready to eat. The contractors are rich enough to sue
• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

• Dislocation / gapping:
• Which book should Peter buy?
• A debate arose which continued until the election

• Binding / reference:
• The IRS audits itself

• Control:
• I want to go
• I want you to go

A Fragment of a Noun Phrase Grammar

Extended Grammar with Prepositional Phrases

Verbs, Verb Phrases, and Sentences

PPs Modifying Verb Phrases

Complementizers and SBARs

More Verbs

Coordination

Much more remains…

Parsing: Two problems to solve
1. Repeated work…

Parsing: Two problems to solve
2. Choosing the correct parse

• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment, even absent full understanding
• Moscow sent more than 100,000 soldiers into Afghanistan …
• Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics

Probabilistic Context-Free Grammars

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (T, N, S, R, P)
• T is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ
• P is a probability function, P: R → [0,1]
• A grammar G generates a language model L: ∑_{γ ∈ T*} P(γ) = 1

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S → NP VP       1.0
VP → Vi         0.4
VP → Vt NP      0.4
VP → VP PP      0.2
NP → DT NN      0.3
NP → NP PP      0.7
PP → P NP       1.0

Vi → sleeps     1.0
Vt → saw        1.0
NN → man        0.7
NN → woman      0.2
NN → telescope  0.1
DT → the        1.0
IN → with       0.5
IN → in         0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is
  p(t) = ∏_{i=1}^{n} q(αi → βi)
  where q(α → β) is the probability for rule α → β

Example of a PCFG

Probability of a Parse

• t1 = the parse of "The man sleeps":
  (S (NP (DT The) (NN man)) (VP (Vi sleeps)))
  p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084

• t2 = a parse of "The man saw the woman with the telescope", with the PP attached inside the VP:
  p(t2) is likewise the product of the probabilities of its rules (S → NP VP, VP → VP PP, VP → Vt NP, PP → P NP, the three NP → DT NN expansions, and the lexical rules)
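As a sanity check on the formula, a tiny sketch (illustrative code, not from the slides) that recomputes p(t1) as the product of its rule probabilities:

# Rule probabilities q(alpha -> beta) for the rules used in t1, taken from the PCFG above.
q = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.3, ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 0.7, ("VP", ("Vi",)): 0.4, ("Vi", ("sleeps",)): 1.0,
}

def tree_prob(rules):
    p = 1.0
    for rule in rules:       # p(t) = product over the rules used in the tree
        p *= q[rule]
    return p

t1_rules = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
            ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]
print(tree_prob(t1_rules))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084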

PCFGs: Learning and Inference

• Model: the probability of a tree t with n rules αi → βi, i = 1..n

• Learning: read the rules off of labeled sentences, use ML estimates for the probabilities, and use all of our standard smoothing tricks

• Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whole leaves, read left to right, match the words in s)
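The "read the rules off" step is just relative-frequency counting: q_ML(X → β) = count(X → β) / count(X). A small sketch (the two-tree "treebank" below is invented purely for illustration):

from collections import Counter

treebank = [   # each tree listed as the rules it uses (hypothetical toy data)
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("Vi",))],
    [("S", ("NP", "VP")), ("NP", ("NP", "PP")), ("NP", ("DT", "NN")),
     ("PP", ("P", "NP")), ("VP", ("Vt", "NP"))],
]

rule_counts, lhs_counts = Counter(), Counter()
for tree_rules in treebank:
    for lhs, rhs in tree_rules:
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1

q = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
print(q[("NP", ("DT", "NN"))])   # 2 of the 3 NP expansions, so 0.666...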

Grammar Transforms

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
• That is, it recognizes the same language (but maybe with different trees)
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2), as sketched below
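The n-ary step on its own can be sketched in a few lines (illustrative code; the VP_V-style naming follows the symbols used later in these slides, everything else is an assumption):

def binarize(rules):
    # Split each X -> Y1 Y2 ... Yn (n > 2) left-to-right by introducing new symbols.
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_sym = f"{lhs}_{rhs[0]}"          # e.g. VP -> V NP PP introduces VP_V
            out.append((lhs, (rhs[0], new_sym)))
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, tuple(rhs)))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]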

A phrase structure grammar

S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
NP → e
PP → P NP

N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (after removing the empty NP → e, every rule containing an NP also gets a variant with that NP dropped)

S → NP VP | VP
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (the unary S → VP is eliminated by expanding it)

S → NP VP | V NP | V | V NP PP | V PP
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (the unary S → V is eliminated: S now also rewrites directly to the verbs)

S → NP VP | V NP | V NP PP | V PP | people | fish | tanks
VP → V NP | V | V NP PP | V PP
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (the unary VP → V is eliminated: VP now also rewrites directly to the verbs)

S → NP VP | V NP | V NP PP | V PP | people | fish | tanks
VP → V NP | V NP PP | V PP | people | fish | tanks
NP → NP NP | NP | NP PP | PP | N
PP → P NP | P
N → people | fish | tanks | rods
V → people | fish | tanks
P → with

Chomsky Normal Form steps (the remaining unaries NP → NP, NP → PP, NP → N and PP → P are eliminated)

S → NP VP | V NP | V NP PP | V PP | people | fish | tanks
VP → V NP | V NP PP | V PP | people | fish | tanks
NP → NP NP | NP PP | P NP | people | fish | tanks | rods
PP → P NP | with
V → people | fish | tanks
P → with

Chomsky Normal Form steps (the ternary rules are binarized with the new symbols VP_V and S_V)

S → NP VP | V NP | V S_V | V PP | people | fish | tanks
S_V → NP PP
VP → V NP | V VP_V | V PP | people | fish | tanks
VP_V → NP PP
NP → NP NP | NP PP | P NP | people | fish | tanks | rods
PP → P NP | with
V → people | fish | tanks
P → with


Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
• Reconstructing n-aries is easy; reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

An example: before binarization…

[Figure: the tree ROOT → S → NP VP for "people fish tanks with rods", with NP → N (people) and the ternary rule VP → V NP PP (fish / tanks / with rods)]

After binarization…

[Figure: the same tree with VP → V VP_V and VP_V → NP PP, so every branching node is binary]

Treebank: empties and unaries

[Figure: the PTB tree ROOT → S-HLN → (NP-SUBJ → -NONE- e) (VP → VB "Atone"), and its successive simplifications: NoFuncTags (function tags dropped), NoEmpties (the empty NP removed), NoUnaries (the unary chain collapsed, with a high vs. low choice of the remaining label)]

Parsing

Constituency Parsing

• Input sentence: fish people fish tanks
• PCFG rules with probabilities θi:
  S → NP VP     θ0
  NP → NP NP    θ1
  …
  N → fish      θ42
  N → people    θ43
  V → fish      θ44
  …

[Figure: a parse tree for the sentence under this PCFG, S → NP VP, with NPs over "fish people" and a VP over "fish tanks"]

Cocke-Kasami-Younger (CKY) Constituency Parsing

• Sentence: fish people fish tanks
• Viterbi (max) scores

[Figure: two adjacent chart cells, for "people" (NP 0.35, V 0.1, N 0.5) and "fish" (VP 0.06, NP 0.14, V 0.6, N 0.2), shown alongside the grammar listed below ("The grammar: binary, no epsilons")]

Extended CKY parsing

• Unaries can be incorporated into the algorithm: messy, but doesn't increase algorithmic complexity
• Empties can be incorporated: use fenceposts; doesn't increase complexity, essentially like unaries
• Binarization is vital: without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
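The same recurrence with memoization (dynamic programming) avoids the repeated work noted earlier. A sketch in Python (the grammar encoding, lexical_q and binary_q dictionaries, is my own assumption, not the slides'):

from functools import lru_cache

def make_best_score(lexical_q, binary_q, words):
    # lexical_q[(X, word)] = q(X -> word); binary_q[(X, Y, Z)] = q(X -> Y Z)
    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        if i == j:
            return lexical_q.get((X, words[i]), 0.0)
        best = 0.0
        for (A, Y, Z), q in binary_q.items():
            if A != X:
                continue
            for k in range(i, j):                # split point, as in the pseudocode
                best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
        return best
    return best_score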

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965)… extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965)… extended to unaries

The grammar: binary, no epsilons

S → NP VP      0.9
S → VP         0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0

N → people     0.5
N → fish       0.2
N → tanks      0.2
N → rods       0.1
V → people     0.1
V → fish       0.6
V → tanks      0.3
P → with       1.0
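A compact runnable rendering of the two pseudocode fragments above (a sketch in Python rather than the slides' notation; scores only, backpointers for buildTree omitted), using exactly this grammar. It reproduces the chart values stepped through next:

from collections import defaultdict

binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
          ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
           ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def apply_unaries(score, begin, end):
    added = True
    while added:                                  # unary closure, as in the pseudocode
        added = False
        for (A, B), p in unary.items():
            prob = p * score[begin, end, B]
            if prob > score[begin, end, A]:
                score[begin, end, A] = prob
                added = True

def cky(words):
    n = len(words)
    score = defaultdict(float)                    # (begin, end, A) -> best probability
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, A] = p
        apply_unaries(score, i, i + 1)
    for span in range(2, n + 1):
        for begin in range(n - span + 1):
            end = begin + span
            for (A, B, C), p in binary.items():
                for split in range(begin + 1, end):
                    prob = score[begin, split, B] * score[split, end, C] * p
                    score[begin, end, A] = max(score[begin, end, A], prob)
            apply_unaries(score, begin, end)
    return score

chart = cky("fish people fish tanks".split())
print(chart[0, 4, "S"])                           # about 0.00018522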

Worked example: the CKY chart for "fish people fish tanks"

[The original slides step through the chart cells score[begin][end] one at a time, with the grammar and the relevant pseudocode fragment repeated alongside each step; the recoverable cell values are summarized here.]

• Lexical step, score[i][i+1][A] = P(A → words[i]):
  fish: N 0.2, V 0.6 · people: N 0.5, V 0.1 · fish: N 0.2, V 0.6 · tanks: N 0.2, V 0.3

• Unary closure over each single word (NP → N, VP → V, S → VP):
  fish: NP 0.14, VP 0.06, S 0.006 · people: NP 0.35, VP 0.01, S 0.001 · tanks: NP 0.14, VP 0.03, S 0.003

• Span-2 cells (binary rules, then unaries again):
  [0,2] "fish people": NP 0.0049, VP 0.105, S 0.0105 (via S → VP)
  [1,3] "people fish": NP 0.0049, VP 0.007, S 0.0189 (via S → NP VP)
  [2,4] "fish tanks":  NP 0.00196, VP 0.042, S 0.0042 (via S → VP)

• Span-3 cells:
  [0,3]: NP 0.0000686, VP 0.00147, S 0.000882
  [1,4]: NP 0.0000686, VP 0.000098, S 0.01323

• Span-4 cell (the whole sentence):
  [0,4]: NP 0.0000009604, VP 0.00002058, S 0.00018522 (via S → NP VP, splitting after "fish people")

Call buildTree(score, back) to get the best parse

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
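The arithmetic above as a tiny sketch (the bracket triples are just (label, start, end); this is illustrative code, not a standard evalb implementation):

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

matched = len(gold & cand)            # 3 brackets agree on label and span
lp = matched / len(cand)              # labeled precision: 3/7
lr = matched / len(gold)              # labeled recall:    3/8
f1 = 2 * lp * lr / (lp + lr)
print(f"LP={lp:.1%}  LR={lr:.1%}  F1={f1:.1%}")   # LP=42.9%  LR=37.5%  F1=40.0%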

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust: usually admit everything, but with low probability
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not so good, because the independence assumptions are too strong
• Gives a probabilistic language model, but in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment

• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
• So the two analyses receive the same probability

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Figure: a tree with S → NP VP and NP → DT NN; the NP's expansion depends only on the label NP, not on anything outside the NP]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

[Figure: bar charts of NP expansion frequencies (NP PP vs. DT NN vs. PRP) for all NPs, for NPs under S, and for NPs under VP; pronoun (PRP) expansions dominate subject NPs under S, while NP PP and DT NN expansions are far more common for object NPs under VP]

Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong
• (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

Horizontal Markovization

• Horizontal Markovization merges states

[Figure: parsing accuracy (roughly 70-74 F1) and grammar size (roughly 3,000-12,000 symbols) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]
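One way to picture the state merging: during binarization, let the introduced symbols remember only the last h children already generated, so different long rules collapse onto shared intermediate states. A sketch (illustrative only; the symbol-naming scheme is an assumption, not the parser's actual code):

def binarize_markov(lhs, rhs, h=1):
    # Left-to-right binarization of lhs -> rhs, remembering only the last h
    # already-generated children in the intermediate symbol names.
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules, seen, current, rest = [], [rhs[0]], lhs, list(rhs)
    while len(rest) > 2:
        new_sym = f"@{lhs}|{'_'.join(seen[-h:])}"
        rules.append((current, (rest[0], new_sym)))
        current, rest = new_sym, rest[1:]
        seen.append(rest[0])
    rules.append((current, tuple(rest)))
    return rules

print(binarize_markov("VP", ("V", "NP", "PP", "PP"), h=1))
# [('VP', ('V', '@VP|V')), ('@VP|V', ('NP', '@VP|NP')), ('@VP|NP', ('PP', 'PP'))]

With h=2 the second intermediate symbol would instead be '@VP|V_NP', i.e. less merging and a larger grammar.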

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e. parent annotation)

[Figure: example trees for order 1 vs. order 2, and plots of parsing accuracy (roughly 72-79 F1) and grammar size (up to about 25,000 symbols) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model    F1    Size
v=h=2v   77.8  7.5K

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1    Size
80.4  8.1K
80.5  8.1K
81.2  8.5K
81.6  9.0K
81.7  9.1K
81.8  9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non-)recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

[Figure: a PP attaching high to a VP site that dominates a verb (marked v) vs. low to an NP site that does not (marked -v)]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

Lexicalized CKY

[Figure: combining Y[h] over span i..k with Z[h′] over span k..j into X[h], e.g. VP → VBD[saw] NP[her] building (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of
      max over k and X -> Y Z of score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k and X -> Y Z of score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n^5) CKY
• Remember only a few hypotheses for each span <i,j>
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why? each of the O(n) split points combines at most K left hypotheses with K right hypotheses)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser              LP    LR    F1
Magerman 95         84.9  84.6  84.7
Collins 96          86.3  85.8  86.0
Klein & Manning 03  86.9  85.7  86.3
Charniak 97         87.4  87.5  87.4
Collins 99          88.7  88.6  88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits

• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations
• Disadvantages: performance leveled out; manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories
• Can learn with EM, like Forward-Backward for HMMs (Forward corresponds to Outside, Backward to Inside)

[Figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]

Automatic Annotation Induction

• Label all nodes with latent variables; same number k of subcategories for all categories
• Advantages: automatically learned
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7

Refinement of the DT tag

[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

• Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two (per non-terminal), then keep splitting
• Initialize each EM run with the output of the last

Adaptive Splitting

• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful [Petrov et al. 06]
• Evaluate the loss in likelihood from removing each split: data likelihood with the split reversed vs. data likelihood with the split
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. 06         90.2              89.7

Hierarchical Pruning

• Parse multiple times with grammars at different levels of granularity:
  coarse:         … QP NP VP …
  split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight: … and so on

Bracket Posteriors

[Figure: parse time drops from 1621 min to 111 min, 35 min, and finally 15 min with hierarchical coarse-to-fine pruning, at 91.2 F1 with no search error]

Page 2: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Two views of linguistic structure 1 Constituency (phrase structure)

bull Phrase structure organizes words into nested constituents

bull How do we know what is a constituent (Not that linguists donrsquot argue about some cases)bull Distribution a constituent behaves as a unit that can appear in different

places

bull John talked [to the children] [about drugs]

bull John talked [about drugs] [to the children]

bull John talked drugs to the children about

bull Substitutionexpansionpro-forms

bull I sat [on the boxright on top of the boxthere]

bull Coordination regular internal structure no intrusion fragments semantics hellip

Two views of linguistic structure 2 Dependency structure

bull Dependency structure shows which words depend on (modify or are arguments of) which other words

The boy put the tortoise on the rug

rug

the

the

ontortoise

put

boy

The

Why Parse

bull Part of speech information

bull Phrase information

bull Useful relationships

8

The rise of annotated data

The Penn Treebank

( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)

(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)

(NP (JJ other) (NNS lenders)))(PP (IN against)

(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV

(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)

(NP (DT that) (NN market)))))))( )))

[Marcus et al 1993 Computational Linguistics]

Penn Treebank Non-terminals

The rise of annotated data

bull Starting off building a treebank seems a lot slower and less useful than building a grammar

bull But a treebank gives us many thingsbull Reusability of the labor

bull Many parsers POS taggers etc

bull Valuable resource for linguistics

bull Broad coverage

bull Frequencies and distributional information

bull A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model (a sum-based CKY sketch follows this slide)

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model
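The "probabilistic language model" point is just CKY with max replaced by sum: the inside algorithm, which computes P(sentence) summed over all parses. A minimal Python sketch for a binary-rule PCFG (unary rules are omitted for brevity, and the rule-table format is an assumption, not an existing API):

from collections import defaultdict

def sentence_prob(words, lex, binary, start="S"):
    """Inside algorithm: lex[(A, w)] = P(A -> w), binary[(A, B, C)] = P(A -> B C)."""
    n = len(words)
    inside = defaultdict(float)          # inside[(i, j, A)] = P(A derives words[i:j])
    for i, w in enumerate(words):
        for (A, word), p in lex.items():
            if word == w:
                inside[(i, i + 1, A)] += p
    for span in range(2, n + 1):
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    inside[(begin, end, A)] += (p * inside[(begin, split, B)]
                                                  * inside[(split, end, C)])
    return inside[(0, n, start)]

Replacing the sums with max (and keeping backpointers) recovers the Viterbi CKY used in the chart example above.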

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)

bull The two analyses therefore receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

[Diagram: a tree with an NP node; the rules S -> NP VP and NP -> DT NN apply outside and inside it, and any statistical influence between the two must pass through the NP label]

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

Expansion   All NPs   NPs under S   NPs under VP

NP PP       11%       9%            23%

DT NN       9%        9%            7%

PRP         6%        21%           4%
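Counts like the ones in this chart can be read directly off a treebank. A sketch using NLTK's bundled 10% sample of the Penn Treebank (this assumes nltk is installed and the "treebank" corpus has been downloaded; function tags such as -SBJ are stripped before comparing labels):

from collections import Counter
from nltk.corpus import treebank          # requires nltk.download("treebank")
from nltk.tree import ParentedTree

counts = Counter()
for sent in treebank.parsed_sents():
    tree = ParentedTree.convert(sent)
    for np in tree.subtrees(lambda t: t.label().split("-")[0] == "NP"):
        parent = np.parent().label().split("-")[0] if np.parent() else "ROOT"
        expansion = " ".join(child.label() for child in np)
        counts[parent, expansion] += 1

for (parent, expansion), n in counts.most_common(10):
    print(f"NP -> {expansion}   (under {parent}): {n}")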

Non-Independence II

bull Symptoms of overly strong assumptions

bull Rewrites get used where they don't belong

[Example figure: in the PTB this construction is for possessives]

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

[Charts: merged-state F1 (roughly 70-74) and grammar size in symbols (roughly 3K-12K) as a function of horizontal Markov order 0, 1, 2v, 2, inf]
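In practice horizontal markovization happens during binarization: the intermediate symbols introduced for an n-ary rule remember only the last h sisters already generated, so intermediate states that share that short history get merged. A Python sketch (the @-style symbol naming is an illustrative convention, not taken from any particular parser):

def binarize_rule(lhs, rhs, h=1):
    """Right-branching binarization of lhs -> rhs (a list of symbols), keeping
    at most the last h generated sisters in the intermediate symbol names."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules, prev, seen = [], lhs, []
    for sym in rhs[:-2]:
        seen = (seen + [sym])[-h:] if h > 0 else []
        new = f"@{lhs}->..._{'_'.join(seen)}" if seen else f"@{lhs}->..."
        rules.append((prev, (sym, new)))
        prev = new
    rules.append((prev, (rhs[-2], rhs[-1])))
    return rules

binarize_rule("VP", ["V", "NP", "PP", "PP"], h=1) yields VP -> V @VP->..._V, then @VP->..._V -> NP @VP->..._NP, then @VP->..._NP -> PP PP; with h=0 all the intermediate symbols collapse to @VP->..., which is exactly the state merging plotted in the chart.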

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 vs Order 2 [example trees without and with parent annotation]

[Charts: F1 (roughly 72-79) and grammar size in symbols (up to about 25K) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model F1 Size

v=h=2v 77.8 7.5K
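Order-2 vertical markovization (parent annotation) can be implemented as a preprocessing pass over the treebank trees before the rules are read off. A Python sketch over trees written as (label, child, child, ...) tuples (the tree representation is an assumption; real implementations often skip the POS level):

def parent_annotate(tree, parent=None):
    """Append the parent label to every node: an NP under S becomes NP^S."""
    if isinstance(tree, str):            # a word: leave unchanged
        return tree
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, *(parent_annotate(c, label) for c in children))

Applied to ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"))), this returns ("S", ("NP^S", ("N^NP", "people")), ("VP^S", ("V^VP", "fish"))); reading rules off the annotated trees then gives the order-2 grammar.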

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 77.8 7.5K

UNARY 78.3 8.0K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solution

bull Subdivide the IN tag

Annotation F1 Size

Previous 78.3 8.0K

SPLIT-IN 80.3 8.1K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U ("the X" vs "those")

bull UNARY-RB mark phrasal adverbs as RB^U ("quickly" vs "very")

bull TAG-PA mark tags with non-canonical parents ("not" is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with -AUX [cf Charniak 97]

bull SPLIT-CC separate "but" and "&" from other conjunctions

bull SPLIT-% "%" gets its own tag

F1 Size

80.4 8.1K

80.5 8.1K

81.2 8.5K

81.6 9.0K

81.7 9.1K

81.8 9.3K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examples

bull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 82.3 9.7K

POSS-NP 83.1 9.8K

SPLIT-VP 85.7 10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sites

bull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 85.7 10.5K

BASE-NP 86.0 11.7K

DOMINATES-V 86.9 14.1K

RIGHT-REC-NP 87.0 15.2K

[Diagram: NP, VP and PP attachment sites in a tree, marked v or -v according to whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 84.9 84.6 84.7

Collins 96 86.3 85.8 86.0

Klein & Manning 03 86.9 85.7 86.3

Charniak 97 87.4 87.5 87.4

Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
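The head tables themselves are not reproduced in this transcript; what they encode is, for each parent category, a scan direction and a priority list of child labels. A simplified Python sketch of that idea (the priority lists below are illustrative stand-ins, not Collins' actual tables):

HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NP", "JJ"]),
    "VP": ("left",  ["Vi", "Vt", "VB", "VBD", "VP"]),
    "PP": ("left",  ["IN", "P"]),
    "S":  ("left",  ["VP", "S"]),
}

def find_head(parent, children):
    """Return the index of the head child for a rule parent -> children."""
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    idxs = list(range(len(children)))
    if direction == "right":
        idxs.reverse()
    for label in priorities:             # highest-priority label wins
        for i in idxs:
            if children[i] == label:
                return i
    return idxs[0]                       # fallback: first child in scan direction

find_head("VP", ["Vt", "NP", "PP"]) returns 0, so the verb's word becomes the headword propagated up to the VP node.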

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] -> Y[h] Z[h'] built over span i..j, split at k, with head positions h and h']

e.g. (VP -> VBD[saw] NP[her]), i.e. the rule (VP -> VBD NP) headed by "saw"

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return the max over split points k, other head w, and rules X -> Y Z of:
      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]

bull Essentially run the O(n^5) CKY

bull Remember only a few hypotheses for each span <i,j>

bull If we keep K hypotheses at each span then we do at most O(nK^2) work per span (why?)

bull Keeps things more or less cubic (a small pruning sketch follows the diagram below)

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: X[h] -> Y[h] Z[h'] over span i..j with split point k]
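A per-cell beam can be as simple as keeping the K highest-scoring entries of a span's cell before that span is combined into larger ones. A Python sketch (the cell layout, a dictionary from (category, head) pairs to scores, is an assumption):

import heapq

def prune_cell(cell, k=10):
    """Keep only the k best (category, head) -> score entries for one span <i, j>."""
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

With at most K surviving entries per child span, combining two spans tries at most K × K pairs, which is where the O(nK^2) per-span bound above comes from.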

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 84.9 84.6 84.7

Collins 96 86.3 85.8 86.0

Klein & Manning 03 86.9 85.7 86.3

Charniak 97 87.4 87.5 87.4

Collins 99 88.7 88.6 88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve statistical fit of the grammar

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categories

bull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantages

bull Fairly compact grammar

bull Linguistic motivations

bull Disadvantages

bull Performance leveled out

bull Manually annotated

Learning Latent Annotations

Latent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

[Diagram: parse tree over "He was right" with latent node labels X1 ... X7]

Can learn with EM, like Forward-Backward for HMMs (Forward ~ Outside, Backward ~ Inside)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein & Manning '03 86.3

Matsuzaki et al '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split as the ratio: (data likelihood with the split reversed) / (data likelihood with the split)

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 88.4

With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser  F1 (≤ 40 words)  F1 (all words)

Klein & Manning '03  86.3  85.7

Matsuzaki et al '05  86.7  86.1

Collins '99  88.6  88.2

Charniak & Johnson '05  90.1  89.6

Petrov et al 06  90.2  89.7

Hierarchical Pruning

coarse: ... QP NP VP ...

split in two: ... QP1 QP2 NP1 NP2 VP1 VP2 ...

split in four: ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...

split in eight: ... ... ... ... ... ... ... ...

Parse multiple times with grammars at different levels of granularity
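One common way to realize this is coarse-to-fine pruning: parse with the coarser grammar first, and allow a refined symbol in a chart cell only if its coarse projection had non-negligible posterior probability there. A Python sketch (the posterior table and the projection map are assumed inputs, not a real parser API):

def prune_chart(spans, fine_symbols, project, coarse_posterior, threshold=1e-4):
    """Decide which fine-grained symbols are allowed in each span.
    project[fine_symbol] gives its coarse symbol (e.g. "NP3" -> "NP");
    coarse_posterior[(i, j, coarse_symbol)] comes from the previous, coarser pass."""
    allowed = {}
    for (i, j) in spans:
        allowed[(i, j)] = [sym for sym in fine_symbols
                           if coarse_posterior.get((i, j, project[sym]), 0.0) > threshold]
    return allowed

Each pass only has to consider the symbols its predecessor left alive, which is what makes the final fine-grained pass affordable.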

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Page 3: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Two views of linguistic structure 2 Dependency structure

bull Dependency structure shows which words depend on (modify or are arguments of) which other words

The boy put the tortoise on the rug

rug

the

the

ontortoise

put

boy

The

Why Parse

bull Part of speech information

bull Phrase information

bull Useful relationships

8

The rise of annotated data

The Penn Treebank

( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)

(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)

(NP (JJ other) (NNS lenders)))(PP (IN against)

(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV

(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)

(NP (DT that) (NN market)))))))( )))

[Marcus et al 1993 Computational Linguistics]

Penn Treebank Non-terminals

The rise of annotated data

bull Starting off building a treebank seems a lot slower and less useful than building a grammar

bull But a treebank gives us many thingsbull Reusability of the labor

bull Many parsers POS taggers etc

bull Valuable resource for linguistics

bull Broad coverage

bull Frequencies and distributional information

bull A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 4: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Why Parse

bull Part of speech information

bull Phrase information

bull Useful relationships

8

The rise of annotated data

The Penn Treebank

( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)

(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)

(NP (JJ other) (NNS lenders)))(PP (IN against)

(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV

(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)

(NP (DT that) (NN market)))))))( )))

[Marcus et al 1993 Computational Linguistics]

Penn Treebank Non-terminals

The rise of annotated data

bull Starting off building a treebank seems a lot slower and less useful than building a grammar

bull But a treebank gives us many thingsbull Reusability of the labor

bull Many parsers POS taggers etc

bull Valuable resource for linguistics

bull Broad coverage

bull Frequencies and distributional information

bull A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

The chart for "fish people fish tanks" (word positions 0-4), filled with the grammar
above. Final Viterbi scores in each cell score[begin][end], with the rule that
produced the best score for each label:

Single-word spans (lexicon, then unaries):
  [0,1] fish:    N 0.2   V 0.6   NP → N 0.14   VP → V 0.06   S → VP 0.006
  [1,2] people:  N 0.5   V 0.1   NP → N 0.35   VP → V 0.01   S → VP 0.001
  [2,3] fish:    N 0.2   V 0.6   NP → N 0.14   VP → V 0.06   S → VP 0.006
  [3,4] tanks:   N 0.2   V 0.3   NP → N 0.14   VP → V 0.03   S → VP 0.003

Two-word spans (binary rules, then unaries):
  [0,2]: NP → NP NP 0.0049    VP → V NP 0.105   S → VP 0.0105
  [1,3]: NP → NP NP 0.0049    VP → V NP 0.007   S → NP VP 0.0189
  [2,4]: NP → NP NP 0.00196   VP → V NP 0.042   S → VP 0.0042

Three-word spans:
  [0,3]: NP → NP NP 0.0000686   VP → V NP 0.00147    S → NP VP 0.000882
  [1,4]: NP → NP NP 0.0000686   VP → V NP 0.000098   S → NP VP 0.01323

Whole sentence:
  [0,4]: NP → NP NP 0.0000009604   VP → V NP 0.00002058   S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
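As a cross-check of the chart above, here is a compact, runnable Python sketch (not part of the slides) of the Viterbi CKY recursion with the unary loop; it uses the toy grammar above and fills only the score chart, so backpointers and buildTree are omitted.

from collections import defaultdict

lexicon = {          # P(tag -> word)
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
    ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3,
    ("P", "with"): 1.0,
}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}            # A -> B
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,                 # A -> B C
          ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
          ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}

def apply_unaries(cell):
    # Keep applying unary rules while some label's best score improves.
    added = True
    while added:
        added = False
        for (A, B), p in unary.items():
            if cell[B] > 0 and p * cell[B] > cell[A]:
                cell[A] = p * cell[B]
                added = True

def cky(words):
    n = len(words)
    score = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # lexical cells
        for (A, word), p in lexicon.items():
            if word == w:
                score[i][i + 1][A] = p
        apply_unaries(score[i][i + 1])
    for span in range(2, n + 1):                       # larger spans
        for begin in range(0, n - span + 1):
            end = begin + span
            cell = score[begin][end]
            for split in range(begin + 1, end):        # binary rules
                for (A, B, C), p in binary.items():
                    prob = score[begin][split][B] * score[split][end][C] * p
                    if prob > cell[A]:
                        cell[A] = prob
            apply_unaries(cell)
    return score

chart = cky("fish people fish tanks".split())
print({A: p for A, p in chart[0][4].items() if p > 0})
# whole-sentence cell, e.g. S around 0.00018522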

Evaluating constituency parsing

Gold standard brackets:
  S-(0,11)  NP-(0,2)  VP-(2,9)  VP-(3,9)  NP-(4,6)  PP-(6,9)  NP-(7,9)  NP-(9,10)

Candidate brackets:
  S-(0,11)  NP-(0,2)  VP-(2,10)  VP-(3,10)  NP-(4,6)  PP-(6,10)  NP-(7,10)

Labeled Precision:   3/7 = 42.9%
Labeled Recall:      3/8 = 37.5%
LP/LR F1:            40.0%
Tagging Accuracy:    11/11 = 100.0%
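The bracket counting above can be reproduced in a few lines of Python; this sketch (not from the slides) treats brackets as (label, start, end) triples.

from collections import Counter

def parseval(gold, candidate):
    # Compare bracket multisets so repeated brackets are counted correctly.
    g, c = Counter(gold), Counter(candidate)
    matched = sum((g & c).values())
    precision = matched / sum(c.values())
    recall = matched / sum(g.values())
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
print(parseval(gold, cand))   # approximately (0.429, 0.375, 0.400)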

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity
[tree diagrams omitted]

A Case of Coordination Ambiguity
[tree diagrams omitted]

Structural Preferences: Close Attachment
[tree diagrams omitted]

• Example: John was believed to have been shot by Bill
• The low attachment analysis (Bill does the shooting) contains the same rules
  as the high attachment analysis (Bill does the believing)
  • The two analyses receive the same probability

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material
    outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a
    node must flow through that node's label
• E.g. once a subtree is labeled NP, rules like S → NP VP above it and
  NP → DT NN inside it are chosen independently of each other

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP
  (i.e. subjects vs. objects):

                  NP → PP    NP → DT NN    NP → PRP
  All NPs           11%          9%           6%
  NPs under S        9%          9%          21%
  NPs under VP      23%          7%           4%

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  • [figure omitted] In the PTB, this construction is for possessives

Advanced Unlexicalized Parsing

Horizontal Markovization

• Horizontal Markovization merges states

[Charts omitted: parsing F1 (roughly 70-74) and number of grammar symbols
(roughly 0-12,000) as a function of the horizontal Markov order: 0, 1, 2v, 2, inf]

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes
  (i.e. parent annotation)

[Charts omitted: parsing F1 (roughly 72-79) and number of grammar symbols
(roughly 0-25,000) as a function of the vertical Markov order: 1, 2v, 2, 3v, 3]

Model      F1     Size
v=h=2v     77.8   7.5K
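Parent annotation (vertical order 2) is just a tree transform; the sketch below (not from the slides) shows one way to write it, leaving the POS tags unannotated since annotating tags with their parents is the separate TAG-PA option discussed later.

def parent_annotate(tree, parent=None):
    """tree: (label, children), where children is a list of subtrees or a word string.
    Phrasal labels are split by their parent label, e.g. an NP under S becomes NP^S."""
    label, children = tree
    if isinstance(children, str):                 # preterminal: keep tag and word as-is
        return (label, children)
    new_label = label if parent is None else f"{label}^{parent}"
    return (new_label, [parent_annotate(c, label) for c in children])

t = ("S", [("NP", [("PRP", "She")]),
           ("VP", [("VBD", "saw"), ("NP", [("DT", "the"), ("NN", "man")])])])
print(parent_annotate(t))
# ('S', [('NP^S', [('PRP', 'She')]), ('VP^S', [('VBD', 'saw'), ('NP^VP', ...)])])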

Unary Splits

• Problem: unary rewrites are used to transmute categories so a
  high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating
  conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside
  its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

A Fully Annotated Tree
[tree diagram omitted]

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Lexicalised PCFGs

Heads in Context-Free Rules
Heads
Rules to Recover Heads: An Example for NPs
Rules to Recover Heads: An Example for VPs
Adding Headwords to Trees
Lexicalized CFGs in Chomsky Normal Form
Example

[The head-rule tables and worked examples on these slides are figures and are omitted]
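Since the head-rule tables themselves are not reproduced here, the sketch below (not from the slides) only illustrates the mechanism with deliberately simplified priority lists; real parsers use the Magerman/Collins head tables.

HEAD_RULES = {
    # label: (search direction, priority list of child labels), simplified for illustration
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP"]),
    "PP": ("left",  ["IN", "P"]),
    "S":  ("left",  ["VP"]),
}

def find_head(label, child_labels):
    """Return the index of the head child of a rule label -> child_labels."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    indices = list(range(len(child_labels)))
    if direction == "right":
        indices.reverse()
    for wanted in priorities:          # first matching priority wins
        for i in indices:
            if child_labels[i] == wanted:
                return i
    return indices[0]                  # fallback: first child in search order

print(find_head("VP", ["VBD", "NP", "PP"]))   # -> 0  (the verb heads the VP)
print(find_head("NP", ["DT", "NN"]))          # -> 1  (the noun heads the NP)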

Lexicalized CKY

Each constituent X[h] over a span (i, j) carries a head word h; a binary rule
combines Y[h] over (i, k) and Z[h'] over (k, j), with the head passed up from
one of the two children, e.g. (VP → VBD NP)[saw] built from VBD[saw] and NP[her].

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max over split points k, other head words w, and rules X → Y Z of:
      score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per
    span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation
  (crucial for speed)
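A per-cell beam can be sketched in a few lines (not from the slides): each span's cell keeps only its K best (label, head word) hypotheses before larger spans are built.

import heapq

def prune_cell(cell, K=10):
    """cell: dict mapping (label, head) -> score; keep only the K highest-scoring."""
    best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(best)

cell = {("NP", "man"): 0.02, ("NP", "telescope"): 0.0004, ("VP", "saw"): 0.015}
print(prune_cell(cell, K=2))   # keeps the two best hypotheses for this span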

Parameter Estimation

A Model from Charniak (1997)

Other Details

[The model equations and examples on these slides are figures and are omitted]

Final Test Set Results

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

[The tables and figures on these slides are omitted]

The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve the statistical fit of
  the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering?

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories
  [figure omitted: a tree with latent symbols X1 … X7 over "He was right"]
• Can learn with EM, like Forward-Backward for HMMs: Backward/Inside and
  Forward/Outside probabilities over the tree

Automatic Annotation Induction

• Label all nodes with latent variables; the same number k of subcategories
  for all categories
• Advantages
  • Automatically learned
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT → DT-1, DT-2, DT-3, DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two subcategories per non-terminal, then keep splitting
• Initialize each EM run with the output of the last
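The "start with two and keep splitting" initialization can be sketched as follows (not from the slides). It uses the usual trick of dividing each rule's probability mass uniformly over the split versions plus a little symmetry-breaking noise, and glosses over renormalization across rules with the same parent; it is not the exact Berkeley-parser procedure.

import itertools, random

def split_grammar(binary_rules, k=2, noise=0.01):
    """binary_rules: dict (A, B, C) -> prob; returns the split rule dict."""
    out = {}
    for (A, B, C), p in binary_rules.items():
        for a, b, c in itertools.product(range(k), repeat=3):
            share = p / (k * k)        # each parent subcategory splits p over k*k child pairs
            out[(f"{A}-{a}", f"{B}-{b}", f"{C}-{c}")] = share * (1 + random.uniform(-noise, noise))
    return out

print(len(split_grammar({("S", "NP", "VP"): 0.9}, k=2)))   # 8 split versions of one rule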

Adaptive Splitting    [Petrov et al. 06]

• Want to split complex categories more
• Idea: split everything, then roll back the splits which were least useful
• Evaluate the loss in likelihood from removing each split:
  (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model               F1
Previous            88.4
With 50% Merging    89.5

Number of Phrasal Subcategories
[chart omitted]

Final Results

Parser                    F1 (<= 40 words)   F1 (all words)
Klein & Manning '03       86.3               85.7
Matsuzaki et al. '05      86.7               86.1
Collins '99               88.6               88.2
Charniak & Johnson '05    90.1               89.6
Petrov et al. 06          90.2               89.7

Hierarchical Pruning

Parse multiple times with grammars at different levels of granularity:

  coarse:          … QP  NP  VP …
  split in two:    … QP1 QP2   NP1 NP2   VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4   NP1 NP2 NP3 NP4   VP1 VP2 VP3 VP4 …
  split in eight:  … … … … … … … …
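A sketch (not from the slides) of the pruning decision: spans that score too poorly under the coarser grammar are blocked before parsing with the next, more split grammar. Real systems threshold bracket posteriors from inside-outside; here a Viterbi score stands in for simplicity.

def allowed_spans(coarse_chart, threshold=1e-8):
    """coarse_chart[(i, j)] maps coarse labels to scores for span (i, j);
    return the set of spans the finer grammar is allowed to build."""
    allowed = set()
    for (i, j), cell in coarse_chart.items():
        if cell and max(cell.values()) >= threshold:
            allowed.add((i, j))
    return allowed

coarse = {(0, 2): {"NP": 3e-3}, (1, 3): {"VP": 2e-11}, (0, 4): {"S": 1e-4}}
print(allowed_spans(coarse))   # {(0, 2), (0, 4)}: span (1, 3) is pruned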

Bracket Posteriors

[Charts omitted: posterior bracket charts at increasing levels of refinement,
with parsing times of 1621 min, 111 min, 35 min, and 15 min (91.2 F1, no search error)]

Page 5: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

The rise of annotated data

The Penn Treebank

( (S(NP-SBJ (DT The) (NN move))(VP (VBD followed)

(NP(NP (DT a) (NN round))(PP (IN of)(NP(NP (JJ similar) (NNS increases))(PP (IN by)

(NP (JJ other) (NNS lenders)))(PP (IN against)

(NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))( )(S-ADV

(NP-SBJ (-NONE- ))(VP (VBG reflecting)(NP(NP (DT a) (VBG continuing) (NN decline))(PP-LOC (IN in)

(NP (DT that) (NN market)))))))( )))

[Marcus et al 1993 Computational Linguistics]

Penn Treebank Non-terminals

The rise of annotated data

bull Starting off building a treebank seems a lot slower and less useful than building a grammar

bull But a treebank gives us many thingsbull Reusability of the labor

bull Many parsers POS taggers etc

bull Valuable resource for linguistics

bull Broad coverage

bull Frequencies and distributional information

bull A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 6: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Penn Treebank Non-terminals

The rise of annotated data

bull Starting off building a treebank seems a lot slower and less useful than building a grammar

bull But a treebank gives us many thingsbull Reusability of the labor

bull Many parsers POS taggers etc

bull Valuable resource for linguistics

bull Broad coverage

bull Frequencies and distributional information

bull A way to evaluate systems

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

An example before binarization…

    (ROOT (S (NP (N people))
             (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))

After binarization…

    (ROOT (S (NP (N people))
             (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))

Treebank empties and unaries

PTB Tree:     (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags:   (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties:    (ROOT (S (VP (VB Atone))))
NoUnaries:    (ROOT (S Atone))  [high]   or   (ROOT (VB Atone))  [low]

Parsing

66

Constituency Parsing

fish people fish tanks

PCFG:
Rule            Prob
S  → NP VP      θ0
NP → NP NP      θ1
…
N → fish        θ42
N → people      θ43
V → fish        θ44
…

[Tree over the sentence: (S (NP (N fish) (N people)) (VP (V fish) (NP (N tanks))))]

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (max) scores, e.g. for the cells over "people" and "fish":
    people: N 0.5, V 0.1, NP 0.35
    fish:   N 0.2, V 0.6, NP 0.14, VP 0.06

S  → NP VP     0.9
S  → VP        0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
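For later reference, the same grammar can be written down as plain Python dictionaries (the variable names q_bin, q_unary, and q_lex are ours); the sketches further below reuse this format:

# The toy grammar above as Python dictionaries; lexical rules omit decimals nowhere.
q_bin = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
         ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
         ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
q_unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
q_lex = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
         ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0}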

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
           q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
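This recursion translates almost line for line into memoized Python. A sketch for the purely binary case only (no unaries, no backpointers), with q_lex[(X, word)] and q_bin[(X, Y, Z)] holding lexical and binary rule probabilities in the dictionary format shown a few slides back:

from functools import lru_cache

def make_parser(q_lex, q_bin):
    """Return a bestScore(X, i, j, sentence) function for the given grammar."""

    @lru_cache(maxsize=None)
    def _best(X, i, j, s):
        if i == j:                                   # single-word span
            return q_lex.get((X, s[i]), 0.0)
        best = 0.0
        for (A, Y, Z), p in q_bin.items():           # every binary rule A -> Y Z with A == X
            if A != X:
                continue
            for k in range(i, j):                    # every split point
                cand = p * _best(Y, i, k, s) * _best(Z, k + 1, j, s)
                best = max(best, cand)
        return best

    def best_score(X, i, j, sentence):
        return _best(X, i, j, tuple(sentence))       # tuple so the cache can hash it

    return best_score

With the toy grammar above you would also need the unary rules (handled in a later sketch) before whole sentences score above zero; this covers only the binary backbone of the recurrence.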

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
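The "handle unaries" block above is just a fixed-point closure over unary rules within a single cell. That step in isolation, as a sketch (cell maps a nonterminal to its best probability so far; unary_rules[(A, B)] = P(A → B), as in the q_unary dictionary earlier):

def apply_unary_closure(cell, unary_rules):
    """Repeatedly apply unary rules A -> B inside one chart cell until
    no entry improves, mirroring the 'handle unaries' loop of CKY."""
    added = True
    while added:
        added = False
        for (A, B), p in unary_rules.items():
            if B in cell:
                prob = p * cell[B]
                if prob > cell.get(A, 0.0):
                    cell[A] = prob
                    added = True
    return cell

# e.g. the cell over the word "fish": {'N': 0.2, 'V': 0.6}
# with NP -> N 0.7, VP -> V 0.1, S -> VP 0.1 this converges to
# {'N': 0.2, 'V': 0.6, 'NP': 0.14, 'VP': 0.06, 'S': 0.006},
# matching the values in the worked chart below.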

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
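The bracket scoring itself is a few lines; a sketch that reproduces the numbers above, with brackets encoded as (label, start, end) triples:

def prf1(gold, guess):
    """Labeled precision, recall and F1 over bracket sets."""
    gold, guess = set(gold), set(guess)
    correct = len(gold & guess)
    p = correct / len(guess)
    r = correct / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

gold  = {("S",0,11), ("NP",0,2), ("VP",2,9),  ("VP",3,9),  ("NP",4,6),
         ("PP",6,9),  ("NP",7,9), ("NP",9,10)}
guess = {("S",0,11), ("NP",0,2), ("VP",2,10), ("VP",3,10), ("NP",4,6),
         ("PP",6,10), ("NP",7,10)}
print(prf1(gold, guess))   # precision 3/7 = 0.429, recall 3/8 = 0.375, F1 = 0.4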

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill

• Low attachment analysis (Bill does the shooting) contains the same rules as high attachment analysis (Bill does the believing)
  • Two analyses receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions

• At any node, the material inside that node is independent of the material outside that node, given the label of that node

• Any information that statistically connects behavior inside and outside a node must flow through that node's label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

• The independence assumptions of a PCFG are often too strong

• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

[Chart: distribution of NP expansions]
Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%

Non-Independence II

• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
• (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Charts: parsing accuracy (roughly 70-74 F1) and grammar size (0-12,000 symbols) as a function of horizontal Markov order: 0, 1, 2v, 2, ∞]

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation); Order 1 vs. Order 2

[Charts: parsing accuracy (roughly 72-79 F1) and grammar size (0-25,000 symbols) as a function of vertical Markov order: 1, 2v, 2, 3v, 3]

Model     F1     Size
v=h=2v    77.8   7.5K
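To make the two annotations concrete, here is a small illustrative sketch of how they change symbol names during binarization. The ^parent and @X->..._sibling naming conventions are typical of unlexicalized parsers, but the exact strings below are our own choice:

def parent_annotate(label, parent):
    """Vertical Markov order 2: annotate a symbol with its parent,
    so an NP under a VP becomes NP^VP."""
    return f"{label}^{parent}"

def horiz_markov_symbols(lhs, rhs, order=1):
    """Intermediate symbols created while binarizing lhs -> rhs left-to-right,
    remembering only the last `order` siblings already generated."""
    symbols = []
    for i in range(1, len(rhs) - 1):
        context = rhs[max(0, i - order):i]           # older siblings are forgotten
        symbols.append("@" + lhs + "->..._" + "_".join(context))
    return symbols

print(parent_annotate("NP", "VP"))
# NP^VP
print(horiz_markov_symbols("NP", ["DT", "JJ", "JJ", "JJ", "NN"], order=1))
# ['@NP->..._DT', '@NP->..._JJ', '@NP->..._JJ']  (the two JJ-context symbols merge)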

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

• Solution: mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse

• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

• Partial Solution:
  • Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")

• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")

• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]

• SPLIT-CC: separate "but" and "&" from other conjunctions

• SPLIT-%: "%" gets its own tag

F1     Size
80.4   8.1K
80.5   8.1K
81.2   8.5K
81.6   9.0K
81.7   9.1K
81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads

• Solution: annotate future elements into nodes

Annotation    F1     Size
tag splits    82.3   9.7K
POSS-NP       83.1   9.8K
SPLIT-VP      85.7   10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Diagram: an NP attachment site dominating a verb (v) vs. one that does not (-v)]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
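The actual head tables are shown as figures on these slides. As an illustration only, a Collins-style head finder for NPs could look like the following simplified sketch (the priority list here is abbreviated, not the full published table):

def np_head_child(children):
    """Pick the head child of an NP, Collins-style (simplified sketch):
    prefer the rightmost NN/NNS/NNP/NNPS, else the rightmost NP,
    else fall back to the last child."""
    for tag_set in (("NN", "NNS", "NNP", "NNPS"), ("NP",)):
        for child in reversed(children):
            if child in tag_set:
                return child
    return children[-1]

print(np_head_child(["DT", "JJ", "NN"]))     # NN
print(np_head_child(["NP", "PP"]))           # NP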

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: build X[h] over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j); e.g. (VP -> VBD[saw] NP[her]) gives (VP->VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max(
      max over k, w, X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w),
      max over k, w, X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h) )

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i, j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
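A sketch of the per-cell beam itself: keep only the K best labeled hypotheses for a span before longer spans can use them (the cell representation below is illustrative):

import heapq

def prune_cell(cell, K=10):
    """Keep only the K highest-scoring hypotheses in one chart cell;
    everything else is dropped before longer spans build on this one."""
    if len(cell) <= K:
        return cell
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

# cell maps (label, head_index) -> best score for this span
cell = {("NP", 2): 0.01, ("VP", 1): 0.004, ("S", 1): 0.0001, ("NP", 3): 2e-6}
print(prune_cell(cell, K=2))   # keeps the two best: ('NP', 2) and ('VP', 1)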

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
  Parent annotation [Johnson '98]
  Head lexicalization [Collins '99, Charniak '00]
  Automatic clustering

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional

• Advantages
  • Fairly compact grammar
  • Linguistic motivations

• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: parse tree over "He was right." with latent subcategory variables X1 … X7 at the nodes]

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside, Backward/Inside)

Automatic Annotation Induction

Label all nodes with latent variables; same number k of subcategories for all categories

• Advantages
  • Automatically learned

• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate loss in likelihood from removing each split =
      (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Statistical parsing applications

Statistical parsers are now robust and widely used in larger NLP applications

bull High precision question answering [Pasca and Harabagiu SIGIR 2001]

bull Improving biological named entity finding [Finkel et al JNLPBA 2004]

bull Syntactically based sentence compression [Lin and Wilbur 2007]

bull Extracting opinions about products [Bloom et al NAACL 2007]

bull Improved interaction in computer games [Gorniak and Roy 2005]

bull Helping linguists find data [Resnik et al BLS 2005]

bull Source sentence analysis for machine translation [Xu et al 2009]

bull Relation extraction systems [Fundel et al Bioinformatics 2006]

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

• How do we work out the correct attachment?
  • She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment
  • Even absent full understanding
  • Moscow sent more than 100,000 soldiers into Afghanistan …
  • Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (T, N, S, R, P)
  • T is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function
    • P: R → [0, 1]
    • for each X ∈ N, ∑_{X→γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L: ∑_{γ ∈ T*} P(γ) = 1

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, ..., αn → βn is

p(t) = ∏_{i=1}^{n} q(αi → βi)

where q(α → β) is the probability for rule α → β

44
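As a concrete illustration (added here, not from the slides), the product above can be computed directly from a list of the rules used in a tree; the rule list below is the parse of "The man sleeps" under this example grammar.

# q maps each rule to its probability, as in the example PCFG above.
q = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.3,
    ("VP", ("Vi",)): 0.4,
    ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 0.7,
    ("Vi", ("sleeps",)): 1.0,
}

def tree_prob(rules):
    """p(t) = product of q(alpha_i -> beta_i) over the rules used in t."""
    p = 1.0
    for rule in rules:
        p *= q[rule]
    return p

t1 = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
      ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]
print(tree_prob(t1))   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 ≈ 0.084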

Example of a PCFG

48

Probability of a Parse

The man sleeps

The man saw the woman with the telescope

[Tree figures for the two parses above, annotated with rule probabilities:
t1: parse of "The man sleeps" (S → NP VP, NP → DT NN, DT → the, NN → man, VP → Vi, Vi → sleeps), with p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0
t2: parse of "The man saw the woman with the telescope", with p(t2) likewise the product of the probabilities of its rules]

PCFGs Learning and Inference

Model: The probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi).

Learning: Read the rules off of labeled sentences, use ML estimates for probabilities, and use all of our standard smoothing tricks.

Inference: For input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s).
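A minimal sketch (mine, not from the slides) of the learning step: the maximum-likelihood estimate of q(X → γ) is just count(X → γ) / count(X), before any smoothing.

from collections import Counter

def mle_rule_probs(trees):
    """trees: each tree is a list of (lhs, rhs) rules read off a treebank parse.
    Returns q(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in trees:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# Toy treebank of two tiny trees (illustrative only):
trees = [
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP")), ("NP", ("N",))],
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V",))],
]
print(mle_rule_probs(trees))
# e.g. q(VP -> V NP) = 1/2, q(VP -> V) = 1/2, q(NP -> N) = 1.0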

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w
  • X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)
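For illustration (added here, not from the slides), a small sketch of the n-ary step; the '@'-style intermediate symbol names are just one possible convention.

def binarize_rule(lhs, rhs):
    """Turn X -> A B C D into X -> A @X_1, @X_1 -> B @X_2, @X_2 -> C D."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules, current = [], lhs
    for i, child in enumerate(rhs[:-2]):
        new_sym = "@%s_%d" % (lhs, i + 1)          # fresh intermediate nonterminal
        rules.append((current, (child, new_sym)))
        current = new_sym
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

print(binarize_rule("VP", ["V", "NP", "PP"]))
# [('VP', ('V', '@VP_1')), ('@VP_1', ('NP', 'PP'))]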

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

[Tree figure: parse of "people fish tanks with rods" before binarization, with a flat VP → V NP PP]

An example before binarization…

[Tree figure: the same parse after binarization, with VP → V VP_V and VP_V → NP PP]

After binarization…

Treebank empties and unaries

[Tree figures for the word "Atone", showing the same parse under successive transformations: the original PTB tree (ROOT, S-HLN, NP-SUBJ containing an empty -NONE- element e, VP, VB), then NoFuncTags, NoEmpties, NoUnaries keeping the highest label, and NoUnaries keeping the lowest label]

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 0.35, V 0.1, N 0.5

VP 0.06, NP 0.14, V 0.6, N 0.2

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over split k and rules X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
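As written, this recursion recomputes the same (X, i, j) sub-problems exponentially often; memoizing them is exactly what the bottom-up CKY algorithm shown next does. A rough Python rendering (mine, assuming a toy binary grammar and lexicon) might look like:

from functools import lru_cache

# Illustrative grammar format (binary rules only) and lexicon:
# grammar[X] = list of (Y, Z, prob); lexicon[(X, word)] = prob
grammar = {"S": [("NP", "VP", 0.9)], "VP": [("V", "NP", 0.5)], "NP": [("NP", "NP", 0.1)]}
lexicon = {("NP", "fish"): 0.2, ("NP", "people"): 0.5, ("V", "fish"): 0.6}

def best_score(X, i, j, s):
    @lru_cache(maxsize=None)
    def rec(X, i, j):
        if i == j:                                   # single word: use the lexicon
            return lexicon.get((X, s[i]), 0.0)
        best = 0.0
        for Y, Z, p in grammar.get(X, []):           # try every rule X -> Y Z
            for k in range(i, j):                    # and every split point
                best = max(best, p * rec(Y, i, k) * rec(Z, k + 1, j))
        return best
    return rec(X, i, j)

print(best_score("S", 0, 2, ["people", "fish", "fish"]))   # 0.027 with this toy grammar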

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i=0; i<#(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    //handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A->B in grammar
          prob = P(A->B)*score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965)… extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B]*score[split][end][C]*P(A->BC)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      //handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A->B)*score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (19601965)hellip extended to unaries

The grammar: Binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob > score[begin][end][A])

score[begin][end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob > score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob > score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob > score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0

Tagging Accuracy: 11/11 = 100.0%
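The bracket arithmetic above is easy to reproduce; a small helper (added here, not from the slides) that scores candidate labelled spans against the gold standard:

def parseval(gold, candidate):
    """gold, candidate: sets of (label, start, end) brackets."""
    matched = len(gold & candidate)
    precision = matched / len(candidate)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, cand))   # (0.4285..., 0.375, 0.4)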

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
  • Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

[Chart, approximately:

             All NPs   NPs under S   NPs under VP
  NP PP      11%       9%            23%
  DT NN      9%        9%            7%
  PRP        6%        21%           4%  ]

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  • (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

[Charts: parsing F1 (roughly 70-74) and number of grammar symbols (up to ~12,000) as a function of horizontal Markov order: 0, 1, 2v, 2, inf]
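One way to see the merging (sketch added here, not from the slides): with horizontal Markov order h, the intermediate symbols created during binarization remember only the last h siblings already generated, so many distinct full histories collapse onto the same state. The symbol-naming convention below is made up for illustration.

def markovized_symbols(parent, children, h):
    """Intermediate symbol names for a right-factored rule, remembering
    at most the last h already-generated siblings."""
    symbols, seen = [], []
    for child in children[:-1]:
        seen.append(child)
        history = seen[-h:] if h > 0 else []
        symbols.append("@%s->...%s" % (parent, "_".join(history)))
    return symbols

# A large h keeps full histories; order 1 collapses many of them:
print(markovized_symbols("VP", ["V", "NP", "PP", "PP"], h=10))
# ['@VP->...V', '@VP->...V_NP', '@VP->...V_NP_PP']
print(markovized_symbols("VP", ["V", "NP", "PP", "PP"], h=1))
# ['@VP->...V', '@VP->...NP', '@VP->...PP']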

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e. parent annotation)

Order 1 vs. Order 2

[Charts: parsing F1 (roughly 72-79) and number of grammar symbols (up to ~25,000) as a function of vertical Markov order: 1, 2v, 2, 3v, 3]

Model F1 Size

v=h=2v 77.8 7.5K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 77.8 7.5K

UNARY 78.3 8.0K

Solution: Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

• Partial Solution:
  • Subdivide the IN tag

Annotation F1 Size

Previous 78.3 8.0K

SPLIT-IN 80.3 8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1 Size

80.4 8.1K

80.5 8.1K

81.2 8.5K

81.6 9.0K

81.7 9.1K

81.8 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation F1 Size

tag splits 82.3 9.7K

POSS-NP 83.1 9.8K

SPLIT-VP 85.7 10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation F1 Size

Previous 85.7 10.5K

BASE-NP 86.0 11.7K

DOMINATES-V 86.9 14.1K

RIGHT-REC-NP 87.0 15.2K

[Figure: a VP containing an NP and a PP, with attachment sites marked v / -v]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser LP LR F1

Magerman 95 84.9 84.6 84.7

Collins 96 86.3 85.8 86.0

Klein & Manning 03 86.9 85.7 86.3

Charniak 97 87.4 87.5 87.4

Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[h′]

X[h]

i h k h′ j

(VP → VBD[saw] NP[her])

(VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return the larger of:
      max over split k, head word w, and rules X → Y Z of
        score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over split k, head word w, and rules X → Y Z of
        score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span ⟨i, j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[h′]

X[h]

i h k h′ j
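A sketch (mine, not from the slides) of the per-cell beam idea: after a span's cell is filled, keep only the K highest-scoring (label, head) entries before larger spans are built from it.

def prune_cell(cell, K=10):
    """cell: dict mapping (label, head) -> score for one span <i, j>.
    Keep only the K best hypotheses, as in per-cell beam search."""
    best = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K]
    return dict(best)

cell = {("NP", "man"): 0.02, ("NP", "telescope"): 0.001, ("VP", "saw"): 0.015}
print(prune_cell(cell, K=2))   # keeps the two highest-scoring entries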

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: a binary tree with latent subcategory variables X1…X7 over the sentence "He was right"]

Can learn with EM, like Forward-Backward for HMMs (Forward ~ Outside, Backward ~ Inside)

Automatic Annotation Induction

• Advantages
  • Automatically learned
    Label all nodes with latent variables
    Same number k of subcategories for all categories
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model F1

Klein & Manning '03 86.3

Matsuzaki et al. '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea: split everything, roll back splits which were least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate loss in likelihood from removing each split = (Data likelihood with split reversed) / (Data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 88.4

With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (≤ 40 words) F1 (all words)

Klein & Manning '03 86.3 85.7

Matsuzaki et al. '05 86.7 86.1

Collins '99 88.6 88.2

Charniak & Johnson '05 90.1 89.6

Petrov et al. 06 90.2 89.7

Hierarchical Pruning

coarse: … QP NP VP …

split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight: … … … … … … … …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

[Figure: bracket posteriors at successive coarse-to-fine pruning stages; parsing times 1621 min, 111 min, 35 min, and 15 min, with 91.2 F1 and no search error at the final stage]

Page 9: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 10: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPIN NP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we 'attach' various constituents
bull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers: Cn = (2n)! / [(n+1)! n!]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical models…
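As a quick illustration (my own, not from the slides), a few lines of Python show how fast the Catalan numbers, and hence the number of possible binary bracketings of a sentence, grow:

    from math import comb

    def catalan(n):
        # C_n = (2n)! / ((n+1)! n!) = comb(2n, n) / (n+1)
        return comb(2 * n, n) // (n + 1)

    # number of binary bracketings over n+1 words
    for n in range(2, 11):
        print(n, catalan(n))
    # 2 2, 3 5, 4 14, 5 42, ..., 10 16796 -- exponential growth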

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrases: They cooked the beans in the pot on the stove with handles

bull Particle vs preposition: The lady dressed up the staircase

bull Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand

bull Gerund vs participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPs: impractical design requirements / plastic cup holder

bull Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue

bull Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation / gapping
  bull Which book should Peter buy?
  bull A debate arose which continued until the election

bull Binding
  bull Reference: The IRS audits itself

bull Control
  bull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing: Two problems to solve
1. Repeated work…

Parsing: Two problems to solve
2. Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem 'AI complete'? Yes but …

bull Words are good predictors of attachment
bull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

bull G = (Σ, N, S, R, P)
bull Σ is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S ∈ N)

bull R is a set of rules/productions of the form X → β

bull P is a probability function

bull P: R → [0,1]

bull For each nonterminal X, the probabilities of its rules sum to one: Σ_{X → β ∈ R} P(X → β) = 1

bull A grammar G generates a language model L:

Σ_{g ∈ T} P(g) = 1   (summing over the strings g that G generates)
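A minimal sketch (my own illustration, not the course code) of how such a grammar can be represented, with a check that each nonterminal's rule probabilities sum to one:

    from collections import defaultdict

    # rule probabilities q(X -> beta), keyed by (lhs, rhs)
    pcfg = {
        ("S",  ("NP", "VP")): 1.0,
        ("NP", ("DT", "NN")): 0.3,
        ("NP", ("NP", "PP")): 0.7,
        ("VP", ("Vt", "NP")): 0.4,
        ("VP", ("Vi",)):      0.4,
        ("VP", ("VP", "PP")): 0.2,
    }

    totals = defaultdict(float)
    for (lhs, rhs), p in pcfg.items():
        totals[lhs] += p

    for lhs, total in totals.items():
        assert abs(total - 1.0) < 1e-9, f"rules for {lhs} do not sum to 1"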

PCFG Example
A Probabilistic Context-Free Grammar (PCFG)

S  ⇒ NP VP    1.0
VP ⇒ Vi       0.4
VP ⇒ Vt NP    0.4
VP ⇒ VP PP    0.2
NP ⇒ DT NN    0.3
NP ⇒ NP PP    0.7
PP ⇒ P NP     1.0

Vi ⇒ sleeps      1.0
Vt ⇒ saw         1.0
NN ⇒ man         0.7
NN ⇒ woman       0.2
NN ⇒ telescope   0.1
DT ⇒ the         1.0
IN ⇒ with        0.5
IN ⇒ in          0.5

bull Probability of a tree t with rules

α1 → β1, α2 → β2, …, αn → βn

is

p(t) = ∏_{i=1}^{n} q(αi → βi)

where q(α → β) is the probability for rule α → β

44

Example of a PCFG

48

Probability of a Parse
(same PCFG and tree-probability formula as above)

The man sleeps

The man saw the woman with the telescope

[Slide figures: two parse trees under this grammar.
t1 = the parse of "The man sleeps" (S → NP VP; NP → DT NN; VP → Vi), with
p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084
t2 = a parse of "The man saw the woman with the telescope", whose probability p(t2) is likewise the product of the probabilities of all rules used in the tree]
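A small sketch (my own, using the toy grammar above) of how a tree's probability is computed as a product over its rules:

    from math import prod

    # rules used in t1, the parse of "The man sleeps", with their probabilities
    t1_rules = [
        ("S  -> NP VP",  1.0),
        ("NP -> DT NN",  0.3),
        ("DT -> the",    1.0),
        ("NN -> man",    0.7),
        ("VP -> Vi",     0.4),
        ("Vi -> sleeps", 1.0),
    ]

    p_t1 = prod(q for _, q in t1_rules)
    print(p_t1)  # ≈ 0.084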

PCFGs Learning and Inference

bull Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)

bull Learning: read the rules off of labeled sentences, use ML estimates for probabilities

and use all of our standard smoothing tricks

bull Inference: for input sentence s, define T(s) to be the set of trees whose yield is s

(whose leaves read left to right match the words in s)
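A sketch (my own illustration) of the maximum-likelihood estimate: count each rule in the treebank and divide by the count of its left-hand side:

    from collections import Counter

    def ml_estimates(treebank_rules):
        """treebank_rules: list of (lhs, rhs) pairs read off the training trees."""
        rule_count = Counter(treebank_rules)
        lhs_count = Counter(lhs for lhs, _ in treebank_rules)
        # q(X -> beta) = count(X -> beta) / count(X)
        return {(lhs, rhs): c / lhs_count[lhs] for (lhs, rhs), c in rule_count.items()}

    rules = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("NP", "PP")),
             ("VP", ("V", "NP"))]
    print(ml_estimates(rules))
    # {('NP', ('DT', 'NN')): 0.666..., ('NP', ('NP', 'PP')): 0.333..., ('VP', ('V', 'NP')): 1.0}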

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X → Y Z or X → w
bull X, Y, Z ∈ N and w ∈ Σ

bull A transformation to this form doesn't change the weak generative capacity of a CFG
bull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n > 2)
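A sketch (my own, not the course code) of this binarization step: an n-ary rule is split into a chain of binary rules with new intermediate symbols, following the naming used on the later slides (VP_V etc.):

    def binarize(lhs, rhs):
        """Turn X -> A B C ... into binary rules using new symbols like X_A, X_A_B."""
        if len(rhs) <= 2:
            return [(lhs, tuple(rhs))]
        rules = []
        prev = lhs
        for i in range(len(rhs) - 2):
            new_sym = lhs + "_" + "_".join(rhs[:i + 1])   # e.g. VP_V
            rules.append((prev, (rhs[i], new_sym)))
            prev = new_sym
        rules.append((prev, (rhs[-2], rhs[-1])))
        return rules

    print(binarize("VP", ["V", "NP", "PP"]))
    # [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]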

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a pain
bull Reconstructing n-aries is easy

bull Reconstructing unaries/empties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

An example before binarization…

[Slide figure: ROOT → S → NP VP, with NP → N (people) and VP → V NP PP (fish / tanks / with rods), using the flat ternary rule VP → V NP PP]

After binarization…

[Slide figure: the same tree after binarization, where VP → V NP PP has been replaced by VP → V VP_V and VP_V → NP PP]

Treebank empties and unaries

[Slide figure: the same PTB fragment ("Atone") shown at successive levels of preprocessing — the original PTB tree (ROOT → S-HLN → NP-SUBJ (-NONE- e) VP → VB "Atone"), then NoFuncTags, NoEmpties, and NoUnaries with high vs low unary removal]

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithm
bull Messy but doesn't increase algorithmic complexity

bull Empties can be incorporated
bull Use fenceposts

bull Doesn't increase complexity essentially like unaries

bull Binarization is vital
bull Without binarization you don't get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules) but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k, X -> Y Z of
      q(X -> Y Z) *
      bestScore(Y, i, k, s) *
      bestScore(Z, k+1, j, s)
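A direct, memoized rendering of this recursion (my own sketch; the rule and lexicon probabilities below are assumed to be given as dicts with hypothetical names):

    from functools import lru_cache

    # binary_rules: dict X -> list of (Y, Z, prob); lexicon: dict (X, word) -> prob
    def make_best_score(binary_rules, lexicon, words):
        @lru_cache(maxsize=None)
        def best_score(X, i, j):
            if i == j:
                return lexicon.get((X, words[i]), 0.0)
            best = 0.0
            for (Y, Z, q) in binary_rules.get(X, []):
                for k in range(i, j):
                    best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
            return best
        return best_score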

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (1960/1965) … extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (1960/1965) … extended to unaries
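For concreteness, a compact runnable sketch (my own) of the binary-rule core of CKY; the unary-handling loops shown above are omitted, so the tiny grammar below is a simplified, hypothetical one that goes straight from preterminals to S:

    def cky(words, lexicon, binary_rules):
        """lexicon: dict (tag, word) -> prob; binary_rules: list of (A, B, C, prob)."""
        n = len(words)
        score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
        back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            for (tag, word), p in lexicon.items():
                if word == w:
                    score[i][i + 1][tag] = p
        for span in range(2, n + 1):
            for begin in range(0, n - span + 1):
                end = begin + span
                for split in range(begin + 1, end):
                    for (A, B, C, p) in binary_rules:
                        if B in score[begin][split] and C in score[split][end]:
                            prob = score[begin][split][B] * score[split][end][C] * p
                            if prob > score[begin][end].get(A, 0.0):
                                score[begin][end][A] = prob
                                back[begin][end][A] = (split, B, C)
        return score, back

    lexicon = {("N", "fish"): 0.2, ("V", "fish"): 0.6,
               ("N", "people"): 0.5, ("V", "people"): 0.1,
               ("N", "tanks"): 0.2, ("V", "tanks"): 0.3}
    binary_rules = [("S", "N", "VP", 0.9), ("VP", "V", "N", 0.5)]
    score, back = cky(["people", "fish", "tanks"], lexicon, binary_rules)
    print(score[0][3].get("S"))  # ≈ 0.027: N(people) with VP(fish tanks)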

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
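A small sketch (my own) of this PARSEVAL-style computation over labeled bracket sets:

    def parseval(gold, cand):
        """gold, cand: sets of (label, start, end) brackets."""
        correct = len(gold & cand)
        precision = correct / len(cand)
        recall = correct / len(gold)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
            ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
    cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
            ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
    print(parseval(gold, cand))  # ≈ (0.429, 0.375, 0.400)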

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

[Slide chart: how often each NP expansion is used, by position]
                 NP PP   DT NN   PRP
All NPs           11%      9%     6%
NPs under S        9%      9%    21%
NPs under VP      23%      7%     4%

Non-Independence II

bull Symptoms of overly strong assumptions
bull Rewrites get used where they don't belong

[Slide figure: example tree — in the PTB this construction is for possessives]

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

[Slide charts: parsing F1 (roughly 70–74) and grammar size in symbols (roughly 3000–12000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1    Order 2

[Slide charts: parsing F1 (roughly 72–79) and grammar size in symbols (up to ~25000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model       F1     Size
v=h=2v      77.8   7.5K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1     Size
80.4   8.1K
80.5   8.1K
81.2   8.5K
81.6   9.0K
81.7   9.1K
81.8   9.3K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Slide figure: an NP/VP/PP attachment configuration with nodes marked v / -v for whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

bull Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112
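The slide itself shows a table of NP head rules. As a rough illustration only (my own simplified sketch, loosely in the spirit of Collins-style head finding, not the exact table on the slide), an NP head can be guessed by scanning the children for noun-like tags:

    def np_head(children):
        """children: list of (tag, word) pairs for an NP; returns a guessed head word."""
        # prefer the rightmost noun-like child, then fall back to the last child
        for tag, word in reversed(children):
            if tag in ("NN", "NNS", "NNP", "NNPS", "NX", "POS"):
                return word
        return children[-1][1]

    print(np_head([("DT", "the"), ("JJ", "fast"), ("NN", "car")]))  # car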

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Slide figure: a span i..j with head h, built from Y[h] over i..k and Z[h'] over k..j; e.g. (VP -> VBD[saw] NP[her]) ≡ (VP -> VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max(
      max over k, X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w),
      max over k, X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h) )

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]
bull Essentially run the O(n⁵) CKY

bull Remember only a few hypotheses for each span <i,j>

bull If we keep K hypotheses at each span then we do at most O(nK²) work per span (why?)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Slide figure: the lexicalized span X[h] over i..j built from Y[h] and Z[h']]
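A minimal sketch (my own) of per-cell beam pruning: after filling a chart cell, keep only the K highest-scoring entries:

    import heapq

    def prune_cell(cell, K=10):
        """cell: dict mapping chart entries (e.g. (label, head)) to scores.
        Keep only the K best entries."""
        if len(cell) <= K:
            return cell
        best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
        return dict(best)

    cell = {("NP", "man"): 0.02, ("NP", "telescope"): 0.001, ("S", "saw"): 0.0005}
    print(prune_cell(cell, K=2))  # keeps the two highest-scoring entries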

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

Learning Latent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

[Slide figure: a tree over "He was right" whose nodes carry latent subcategory variables X1 … X7]

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside, Backward/Inside)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al '05    86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

    (Data likelihood with split reversed) / (Data likelihood with split)

bull No loss in accuracy when 50% of the splits are reversed
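A schematic sketch (my own) of this merge criterion: score each split by how little the data likelihood drops when it is undone, and reverse the least useful fraction:

    def least_useful_splits(splits, fraction=0.5):
        """splits: dict split_id -> (likelihood_with_split, likelihood_with_split_reversed).
        Ratios near 1 mean the split barely helped, so those are the ones to merge back."""
        ratio = {s: reversed_ / with_ for s, (with_, reversed_) in splits.items()}
        ranked = sorted(ratio, key=ratio.get, reverse=True)
        return ranked[: int(len(ranked) * fraction)]

    example = {"NP-1/NP-2": (0.95, 0.94), "DT-1/DT-2": (0.95, 0.50)}
    print(least_useful_splits(example, fraction=0.5))  # ['NP-1/NP-2']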

Adaptive Splitting Results

Model               F1
Previous            88.4
With 50% Merging    89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al '05       86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al 06           90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …

split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight:  … … … … … … … … …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Page 11: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN V NPINNP

DT NNDT NNthe

boyput

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNthe

boy

put

tortoisethethe rug

on

Example Application Machine Translation

bull The boy put the tortoise on the rug

bull लड़क न रखा कछआ ऊपर कालीनbull SVO vs SOV preposition vs post-position

S

NP VP PP

DT NN V NP IN NP

DT NN DT NNtheboy

put

tortoisethethe rug

on

S

NP VPPP

DT NN VNPINNP

DT NNDT NNलड़क न रखाकछआकालीन

ऊपर

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 12: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Pre 1990 (ldquoClassicalrdquo) NLP Parsing

bull Goes back to Chomskyrsquos PhD thesis in 1950s

bull Wrote symbolic grammar (CFG or often richer) and lexiconS NP VP NN interest

NP (DT) NN NNS rates

NP NN NNS NNS raises

NP NNP VBP interest

VP V NP VBZ rates

bull Used grammarproof systems to prove parses from words

bull This scaled very badly and didnrsquot give coverage For sentence

Fed raises interest rates 05 in effort to control inflationbull Minimal grammar 36 parses

bull Simple 10 rule grammar 592 parses

bull Real-size broad-coverage grammar millions of parses

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

• A context free grammar G in NLP = (N, C, Σ, S, L, R)
• Σ is a set of terminal symbols
• C is a set of preterminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• L is the lexicon, a set of items of the form X → x
  where X ∈ C and x ∈ Σ
• R is the grammar, a set of items of the form X → γ
  where X ∈ N and γ ∈ (N ∪ C)*

• By usual convention S is the start symbol, but in statistical NLP we usually have an extra node at the top (ROOT, TOP)
• We usually write e for an empty sequence, rather than nothing
22
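As a concrete illustration of the (N, C, Σ, S, L, R) pieces, a toy grammar can be written down directly as data. This is a minimal sketch of my own (the names and representation are illustrative, not from the slides):

from collections import OrderedDict

# A toy CFG: R maps nonterminals (N) to lists of right-hand sides over N ∪ C,
# and the lexicon L maps preterminals (C) to terminal words (Σ).
grammar_rules = OrderedDict([            # R: X → γ, X in N
    ("S",  [["NP", "VP"]]),
    ("NP", [["DT", "NN"], ["NP", "PP"]]),
    ("VP", [["V", "NP"], ["VP", "PP"]]),
    ("PP", [["P", "NP"]]),
])
lexicon = {                               # L: X → x, X in C, x in Σ
    "DT": ["the"], "NN": ["man", "telescope"],
    "V": ["saw"], "P": ["with"],
}
start_symbol = "S"                        # S (with an extra ROOT node on top in practice)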

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
• PPs, adverbial or participial phrases, infinitives, coordinations, etc.

Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
• PPs, adverbial or participial phrases, infinitives, coordinations, etc.

• Catalan numbers: Cn = (2n)! / [(n+1)! n!]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip
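To get a feel for how fast this grows, here is a quick sketch of my own (not from the slides) computing Cn directly from the factorial formula:

from math import factorial

def catalan(n):
    # C_n = (2n)! / ((n+1)! * n!)
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

# The number of possible bracketings (attachment structures) grows exponentially:
for n in range(1, 11):
    print(n, catalan(n))   # 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796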

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

• Prepositional phrases:
They cooked the beans in the pot on the stove with handles

• Particle vs. preposition:
The lady dressed up the staircase

• Complement structures:
The tourists objected to the guide that they couldn't hear
She knows you like the back of her hand

• Gerund vs. participial adjective:
Visiting relatives can be boring
Changing schedules frequently confused passengers

Syntactic Ambiguities II

• Modifier scope within NPs:
impractical design requirements
plastic cup holder

• Multiple gap constructions:
The chicken is ready to eat
The contractors are rich enough to sue

• Coordination scope:
Small rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

• Dislocation / gapping
• Which book should Peter buy?
• A debate arose which continued until the election

• Binding
• Reference: The IRS audits itself

• Control
• I want to go
• I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing: Two problems to solve
1. Repeated work…

Parsing: Two problems to solve
1. Repeated work…

Parsing: Two problems to solve
2. Choosing the correct parse

• How do we work out the correct attachment?
• She saw the man with a telescope

• Is the problem 'AI complete'? Yes, but …

• Words are good predictors of attachment
• Even absent full understanding
• Moscow sent more than 100,000 soldiers into Afghanistan …
• Sydney Water breached an agreement with NSW Health …

• Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (Σ, N, S, R, P)
• Σ is a set of terminal symbols
• N is a set of nonterminal symbols
• S is the start symbol (S ∈ N)
• R is a set of rules/productions of the form X → γ
• P is a probability function
• P: R → [0,1]
• ∑γ P(X → γ) = 1 for every nonterminal X

• A grammar G generates a language model L:
∑ P(s) = 1, summing over all strings s ∈ Σ*

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

• Probability of a tree t with rules
α1 → β1, α2 → β2, …, αn → βn
is
p(t) = ∏_{i=1}^{n} q(αi → βi)
where q(α → β) is the probability for rule α → β

44

Example of a PCFG

48

Probability of a Parse: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

• Probability of a tree t with rules
α1 → β1, α2 → β2, …, αn → βn
is
p(t) = ∏_{i=1}^{n} q(αi → βi)
where q(α → β) is the probability for rule α → β

44

The man sleeps                              The man saw the woman with the telescope

[Tree diagrams: t1 is the parse of "The man sleeps" (S → NP VP, NP → DT NN, VP → Vi);
t2 is the parse of "The man saw the woman with the telescope" with the PP attached to the VP]

t1: p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0
    (S → NP VP, NP → DT NN, DT → the, NN → man, VP → Vi, Vi → sleeps)

t2: p(t2) = the product of the probabilities of all rules used in t2,
    including VP → VP PP, VP → Vt NP, PP → P NP, and the lexical rules
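A small sketch of my own showing the product computation for t1, with the rule probabilities written out as a list:

# p(t1) for "The man sleeps" as a product of its rule probabilities
rules_in_t1 = [
    ("S -> NP VP",   1.0),
    ("NP -> DT NN",  0.3),
    ("DT -> the",    1.0),
    ("NN -> man",    0.7),
    ("VP -> Vi",     0.4),
    ("Vi -> sleeps", 1.0),
]
p_t1 = 1.0
for rule, q in rules_in_t1:
    p_t1 *= q
print(p_t1)   # 0.084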

PCFGs Learning and Inference

Model: The probability of a tree t with n rules αi → βi, i = 1..n

Learning: Read the rules off of labeled sentences, use ML estimates for probabilities

and use all of our standard smoothing tricks

Inference: For input sentence s, define T(s) to be the set of trees whose yield is s

(whose leaves, read left to right, match the words in s)
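For learning, the maximum-likelihood estimate is just the relative frequency of each rule among all expansions of its left-hand side. A minimal sketch of my own, assuming trees are given as nested (label, children) tuples with (tag, word) leaves (this representation is an assumption, not from the slides):

from collections import Counter, defaultdict

def count_rules(tree, counts):
    # tree = (label, [child_trees]) for internal nodes, (tag, word) for leaves
    label, children = tree
    if isinstance(children, str):                      # preterminal -> word
        counts[(label, (children,))] += 1
        return
    counts[(label, tuple(c[0] for c in children))] += 1
    for c in children:
        count_rules(c, counts)

def ml_estimates(trees):
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    # q(X -> gamma) = count(X -> gamma) / count(X)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

tree = ("S", [("NP", [("N", "people")]), ("VP", [("V", "fish")])])
print(ml_estimates([tree]))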

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w
• X, Y, Z ∈ N and w ∈ Σ

• A transformation to this form doesn't change the weak generative capacity of a CFG
• That is, it recognizes the same language
• But maybe with different trees

• Empties and unaries are removed recursively

• n-ary rules are divided by introducing new nonterminals (n > 2)
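A sketch of my own of the n-ary binarization step (the intermediate-symbol naming convention here is an assumption; the deck itself uses names like VP_V):

def binarize(lhs, rhs):
    # Turn X -> A B C D into X -> A @X_A, @X_A -> B @X_A_B, @X_A_B -> C D
    rules = []
    current_lhs = lhs
    remaining = list(rhs)
    while len(remaining) > 2:
        first = remaining.pop(0)
        new_sym = "@%s_%s" % (current_lhs.lstrip("@"), first)
        rules.append((current_lhs, [first, new_sym]))
        current_lhs = new_sym
    rules.append((current_lhs, remaining))
    return rules

print(binarize("VP", ["V", "NP", "PP"]))
# [('VP', ['V', '@VP_V']), ('@VP_V', ['NP', 'PP'])]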

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

• In practice, full Chomsky Normal Form is a pain
• Reconstructing n-aries is easy
• Reconstructing unaries/empties is trickier

• Binarization is crucial for cubic time CFG parsing

• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

• Unaries can be incorporated into the algorithm
• Messy, but doesn't increase algorithmic complexity

• Empties can be incorporated
• Use fenceposts
• Doesn't increase complexity; essentially like unaries

• Binarization is vital
• Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar

• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of:
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
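The repeated-work problem noted earlier goes away once bestScore memoizes its results. A rough Python rendering of my own (the grammar is assumed to arrive as a lexicon dict and a list of binary rules; these data structures are not from the slides):

from functools import lru_cache

def make_parser(unary_lex, binary_rules, words):
    # unary_lex[(X, word)] = q(X -> word); binary_rules = [(X, Y, Z, q), ...]
    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        if i == j:
            return unary_lex.get((X, words[i]), 0.0)
        best = 0.0
        for (A, Y, Z, q) in binary_rules:
            if A != X:
                continue
            for k in range(i, j):
                best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
        return best
    return best_score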

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
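A compact runnable rendering of the same algorithm in Python, using the toy fish/people grammar from the following slides. This is a sketch of my own: it returns only the best score for each span and omits backpointers.

from collections import defaultdict

lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
           ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
           ("V", "tanks"): 0.3, ("P", "with"): 1.0}
unary = [("NP", "N", 0.7), ("VP", "V", 0.1), ("S", "VP", 0.1)]
binary = [("S", "NP", "VP", 0.9), ("VP", "V", "NP", 0.5), ("VP", "V", "VP_V", 0.3),
          ("VP", "V", "PP", 0.1), ("VP_V", "NP", "PP", 1.0), ("NP", "NP", "NP", 0.1),
          ("NP", "NP", "PP", 0.2), ("PP", "P", "NP", 1.0)]

def cky(words):
    n = len(words)
    score = defaultdict(float)               # score[(begin, end, A)] = best probability
    def apply_unaries(b, e):
        added = True
        while added:
            added = False
            for A, B, q in unary:
                p = q * score[(b, e, B)]
                if p > score[(b, e, A)]:
                    score[(b, e, A)] = p
                    added = True
    for i, w in enumerate(words):
        for (A, word), q in lexicon.items():
            if word == w:
                score[(i, i + 1, A)] = q
        apply_unaries(i, i + 1)
    for span in range(2, n + 1):
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for A, B, C, q in binary:
                    p = score[(begin, split, B)] * score[(split, end, C)] * q
                    if p > score[(begin, end, A)]:
                        score[(begin, end, A)] = p
            apply_unaries(begin, end)
    return score[(0, n, "S")]

print(cky("fish people fish tanks".split()))   # best S score; matches the top cell of the worked example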

The grammar: Binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)

Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
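A sketch of my own of the bracket-scoring computation, with brackets represented as (label, start, end) triples (this representation is an assumption):

def parseval(gold, candidate):
    # gold, candidate: sets of (label, start, end) brackets
    matched = len(gold & candidate)
    precision = matched / len(candidate)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, cand))   # (0.4286, 0.375, 0.4)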

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
• The two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

[Chart: how often an NP expands as NP PP vs. DT NN vs. PRP, shown separately for all NPs, for NPs under S (subjects), and for NPs under VP (objects); the three distributions differ substantially, e.g. subject NPs are far more often pronouns]

Non-Independence II

• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
• (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Charts: parsing F1 (roughly 70–74) and grammar size (number of symbols, roughly 3,000–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞; limiting the order merges states and shrinks the grammar with little loss in accuracy]
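Horizontal markovization shows up in how the intermediate symbols created during binarization are named: instead of remembering the whole right-hand side produced so far, keep only the last v sibling(s), so that different rules share intermediate states. A sketch of my own (the naming convention is an assumption):

def binarize_markov(lhs, rhs, order=1):
    # Intermediate symbols remember only the last `order` siblings already generated,
    # so different n-ary rules can share (merge) intermediate states.
    rules = []
    current = lhs
    for i in range(len(rhs) - 2):
        history = rhs[max(0, i + 1 - order):i + 1]        # last `order` siblings
        new_sym = "@%s->...%s" % (lhs, "_".join(history))
        rules.append((current, [rhs[i], new_sym]))
        current = new_sym
    rules.append((current, rhs[-2:]))
    return rules

print(binarize_markov("VP", ["V", "NP", "PP", "PP"], order=1))
# [('VP', ['V', '@VP->...V']), ('@VP->...V', ['NP', '@VP->...NP']), ('@VP->...NP', ['PP', 'PP'])]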

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes
(i.e. parent annotation)

Order 1 vs. Order 2

[Charts: parsing F1 (roughly 72–79) and grammar size (number of symbols, up to ~25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3; deeper vertical context helps accuracy but blows up the symbol count]

Model F1 Size
v=h=2v 77.8 7.5K
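Vertical markovization of order 2 is just parent annotation. A sketch of my own of annotating each nonterminal with its parent's label, using the same (label, children) tree tuples as in the earlier sketch (hypothetical representation):

def parent_annotate(tree, parent="ROOT"):
    # (label, children) nodes become (label^parent, children); (tag, word) leaves are left alone
    label, children = tree
    if isinstance(children, str):
        return (label, children)                  # preterminal -> word
    new_label = "%s^%s" % (label, parent)
    return (new_label, [parent_annotate(c, label) for c in children])

tree = ("S", [("NP", [("PRP", "He")]), ("VP", [("VBD", "was"), ("ADJP", [("JJ", "right")])])])
print(parent_annotate(tree))
# ('S^ROOT', [('NP^S', [('PRP', 'He')]), ('VP^S', [('VBD', 'was'), ('ADJP^VP', [('JJ', 'right')])])])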

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K

Solution: Mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse

• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

• Partial Solution:
• Subdivide the IN tag

Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1 Size
80.4 8.1K
80.5 8.1K
81.2 8.5K
81.6 9.0K
81.7 9.1K
81.8 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:
• Possessive NPs
• Finite vs. infinite VPs
• Lexical heads

• Solution: annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 9.7K
POSS-NP 83.1 9.8K
SPLIT-VP 85.7 10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
• Contains a verb
• Is (non)-recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation F1 Size
Previous 85.7 10.5K
BASE-NP 86.0 11.7K
DOMINATES-V 86.9 14.1K
RIGHT-REC-NP 87.0 15.2K

[Diagram: NP attachment to VP vs. NP, with the attachment sites marked v / -v]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
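The head-rule tables themselves are simple: for each parent category, scan the children in a specified direction for the first child whose label is on a priority list. A sketch of my own (the rules below are illustrative and are not the exact Collins head table):

# Illustrative head-finding rules: (scan direction, priority list of child labels)
HEAD_RULES = {
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NP", "JJ"]),
    "VP": ("left-to-right", ["VBD", "VBZ", "VBP", "VB", "VP"]),
    "PP": ("left-to-right", ["IN", "TO"]),
    "S":  ("left-to-right", ["VP", "S"]),
}

def find_head(parent, children):
    direction, priorities = HEAD_RULES.get(parent, ("left-to-right", []))
    order = children if direction == "left-to-right" else list(reversed(children))
    for label in priorities:
        for child in order:
            if child == label:
                return child
    return order[0]                      # default: first child in scan order

print(find_head("NP", ["DT", "JJ", "NN"]))      # NN
print(find_head("VP", ["VBD", "NP", "PP"]))     # VBD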

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: building X[h] over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j), where h is the head word]
e.g. (VP → VBD[saw] NP[her]) vs. (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of:
      max over k, X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n^5) CKY
• Remember only a few hypotheses for each span <i,j>
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
• Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: X[h] built from Y[h] and Z[h'] over span (i, j)]
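A sketch of my own of the per-cell beam: after filling a span, keep only the K best (category, head) hypotheses (the data structures here are assumptions, not the Collins parser's actual ones):

import heapq

def prune_cell(cell, K=10):
    # cell: dict mapping (category, head) -> score for one span <i, j>;
    # keep only the K highest-scoring hypotheses
    if len(cell) <= K:
        return cell
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

cell = {("NP", "man"): 0.03, ("NP", "telescope"): 0.001, ("S", "saw"): 0.0005}
print(prune_cell(cell, K=2))   # keeps the two best hypotheses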

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:

• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering?

Manual Splits

• Manually split categories:
• NP: subject vs. object
• DT: determiners vs. demonstratives
• IN: sentential vs. prepositional

• Advantages:
• Fairly compact grammar
• Linguistic motivations

• Disadvantages:
• Performance leveled out
• Manually annotated

Learning Latent Annotations

Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Diagram: a parse tree for "He was right" with latent subcategory variables X1 … X7 at each node]

Can learn with EM: like Forward-Backward for HMMs, with Forward/Outside and Backward/Inside passes over trees

Automatic Annotation Induction

• Advantages:
• Automatically learned:
Label all nodes with latent variables
Same number k of subcategories for all categories

• Disadvantages:
• Grammar gets too large
• Most categories are oversplit, while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al. '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea: split everything, roll back splits which were least useful

[Petrov et al. 06]

Adaptive Splitting

• Evaluate loss in likelihood from removing each split =
(Data likelihood with split reversed) / (Data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1
Previous 88.4
With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al. '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al. 06 90.2 89.7

Hierarchical Pruning

Parse multiple times with grammars at different levels of granularity:

coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … … … … … … … …

Bracket Posteriors

1621 min
111 min
35 min
15 min [91.2 F1] (no search error)

Page 15: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Classical NLP ParsingThe problem and its solution

bull Categorical constraints can be added to grammars to limit unlikelyweird parses for sentencesbull But the attempt make the grammars not robust

bull In traditional systems commonly 30 of sentences in even an edited text would have no parse

bull A less constrained grammar can parse more sentencesbull But simple sentences end up with ever more parses with no way to

choose between them

bull We need mechanisms that allow us to find the most likely parse(s) for a sentencebull Statistical parsing lets us work with very loose grammars that admit

millions of parses for sentences but still quickly find the best parse(s)

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 16: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Context Free Grammars and Ambiguities

20

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

• How do we work out the correct attachment?
  • She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment
  • Even absent full understanding
  • Moscow sent more than 100,000 soldiers into Afghanistan …
  • Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (T, N, S, R, P)
  • T is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function
    • P: R → [0, 1]
    • for each X ∈ N: Σ_{X→γ ∈ R} P(X → γ) = 1
• A grammar G generates a language model L:
  Σ_{γ ∈ T*} P(γ) = 1
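A small sanity check one can write (a sketch, not from the slides): with P stored as a dict of rule probabilities, the probabilities of all rules sharing a left-hand side should sum to 1.

from collections import defaultdict

def check_pcfg(rule_probs, tol=1e-9):
    # rule_probs: {(lhs, rhs_tuple): probability}
    totals = defaultdict(float)
    for (lhs, _rhs), p in rule_probs.items():
        totals[lhs] += p
    # return the offending nonterminals, if any
    return {lhs: t for lhs, t in totals.items() if abs(t - 1.0) > tol}

rules = {("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2}
print(check_pcfg(rules))   # {} -> the VP rules are properly normalized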

PCFG Example
A Probabilistic Context-Free Grammar (PCFG)

  S ⇒ NP VP      1.0
  VP ⇒ Vi        0.4
  VP ⇒ Vt NP     0.4
  VP ⇒ VP PP     0.2
  NP ⇒ DT NN     0.3
  NP ⇒ NP PP     0.7
  PP ⇒ P NP      1.0

  Vi ⇒ sleeps     1.0
  Vt ⇒ saw        1.0
  NN ⇒ man        0.7
  NN ⇒ woman      0.2
  NN ⇒ telescope  0.1
  DT ⇒ the        1.0
  IN ⇒ with       0.5
  IN ⇒ in         0.5

• Probability of a tree t with rules
    α1 → β1, α2 → β2, …, αn → βn
  is
    p(t) = Π_{i=1}^{n} q(αi → βi)
  where q(α → β) is the probability for rule α → β

44
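The product p(t) = Π q(αi → βi) is easy to compute once a tree is represented as its list of rules; a minimal sketch (not from the slides), using the example grammar above:

q = {  # rule probabilities from the example PCFG above
    ("S", ("NP", "VP")): 1.0, ("NP", ("DT", "NN")): 0.3, ("VP", ("Vi",)): 0.4,
    ("DT", ("the",)): 1.0, ("NN", ("man",)): 0.7, ("Vi", ("sleeps",)): 1.0,
}

def tree_prob(rules_used):
    p = 1.0
    for rule in rules_used:
        p *= q[rule]
    return p

# "The man sleeps": S -> NP VP, NP -> DT NN, DT -> the, NN -> man, VP -> Vi, Vi -> sleeps
t1 = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
      ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]
print(tree_prob(t1))   # 0.084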

Example of a PCFG

48

Probability of a Parse (using the PCFG above)

t1 = the parse of "The man sleeps":
     (S (NP (DT the) (NN man)) (VP (Vi sleeps)))

  p(t1) = q(S → NP VP) × q(NP → DT NN) × q(DT → the) × q(NN → man) × q(VP → Vi) × q(Vi → sleeps)
        = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = the parse of "The man saw the woman with the telescope" in which the PP
     "with the telescope" attaches to the VP; p(t2) is likewise the product of
     the probabilities of all the rules used in t2.

PCFGs: Learning and Inference

• Model: the probability of a tree t with n rules αi → βi, i = 1..n, is
    p(t) = Π_{i=1}^{n} q(αi → βi)

• Learning: read the rules off of labeled sentences, use ML estimates for the
  probabilities, and use all of our standard smoothing tricks

• Inference: for input sentence s, define T(s) to be the set of trees whose yield is s
  (whose leaves, read left to right, match the words in s)
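The ML estimate is just relative frequency over the treebank: q(α → β) = count(α → β) / count(α). A sketch (not from the slides), assuming each tree is handed over as its list of rules:

from collections import Counter

def ml_estimates(treebank):
    # treebank: iterable of trees, each tree a list of (lhs, rhs_tuple) rules
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in tree:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {(lhs, rhs): c / lhs_counts[lhs]
            for (lhs, rhs), c in rule_counts.items()}

trees = [[("S", ("NP", "VP")), ("VP", ("Vi",))],
         [("S", ("NP", "VP")), ("VP", ("Vt", "NP"))]]
print(ml_estimates(trees))   # VP -> Vi and VP -> Vt NP each get 0.5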

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w
  • X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
    • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)
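The n-ary part of the transformation is mechanical; here is a small illustrative sketch (not the full CNF conversion, and not from the slides) that splits a rule X → Y1 … Yn into binary rules by introducing intermediate symbols named like the VP_V used later in these slides:

def binarize(lhs, rhs):
    # Turn X -> Y1 Y2 ... Yn (n > 2) into binary rules, introducing new
    # symbols named after the material already consumed.
    rules = []
    current = lhs
    for i in range(len(rhs) - 2):
        new_sym = "%s_%s" % (lhs, "_".join(rhs[: i + 1]))
        rules.append((current, (rhs[i], new_sym)))
        current = new_sym
    rules.append((current, tuple(rhs[-2:])))
    return rules

print(binarize("VP", ["V", "NP", "PP"]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]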

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with


Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

An example: before binarization…

  (ROOT (S (NP (N people))
           (VP (V fish)
               (NP (N tanks))
               (PP (P with) (NP (N rods))))))

After binarization…

  (ROOT (S (NP (N people))
           (VP (V fish)
               (VP_V (NP (N tanks))
                     (PP (P with) (NP (N rods)))))))

Treebank empties and unaries

  PTB Tree:     (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
  NoFuncTags:   (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
  NoEmpties:    (ROOT (S (VP (VB Atone))))
  NoUnaries:    high: (ROOT (S Atone))      low: (ROOT (VB Atone))

Parsing

66

Constituency Parsing

fish people fish tanks

PCFG:
  Rule            Prob
  S → NP VP       θ0
  NP → NP NP      θ1
  …
  N → fish        θ42
  N → people      θ43
  V → fish        θ44
  …

[example tree: (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks))))]

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores — e.g. for the cells over "people" and "fish":
  people:  NP 0.35, V 0.1, N 0.5
  fish:    VP 0.06, NP 0.14, V 0.6, N 0.2

  S → NP VP     0.9
  S → VP        0.1
  VP → V NP     0.5
  VP → V        0.1
  VP → V VP_V   0.3
  VP → V PP     0.1
  VP_V → NP PP  1.0
  NP → NP NP    0.1
  NP → NP PP    0.2
  NP → N        0.7
  PP → P NP     1.0

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the
    sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

  bestScore(X, i, j, s):
    if (j == i)
      return q(X -> s[i])
    else
      return max over k, X -> Y Z of
        q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
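The "repeated work" problem disappears once bestScore memoizes on (X, i, j); a sketch of that idea in Python (not from the slides), with illustrative names for the rule tables:

from functools import lru_cache

def make_parser(lex_prob, bin_rules, words):
    # lex_prob[(X, word)] = q(X -> word); bin_rules: list of (X, Y, Z, q)
    @lru_cache(maxsize=None)                 # memoization = CKY's chart
    def best_score(X, i, j):
        if i == j:
            return lex_prob.get((X, words[i]), 0.0)
        best = 0.0
        for (A, Y, Z, q) in bin_rules:
            if A != X:
                continue
            for k in range(i, j):
                best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
        return best
    return best_score

words = ["fish", "people", "fish", "tanks"]
# best = make_parser(lex_prob, bin_rules, words)("S", 0, len(words) - 1)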

The CKY algorithm (1960/1965) … extended to unaries

  function CKY(words, grammar) returns [most_probable_parse, prob]
    score = new double[#(words)+1][#(words)+1][#(nonterms)]
    back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
    for i = 0; i < #(words); i++
      for A in nonterms
        if A -> words[i] in grammar
          score[i][i+1][A] = P(A -> words[i])
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          if score[i][i+1][B] > 0 && A -> B in grammar
            prob = P(A -> B) * score[i][i+1][B]
            if prob > score[i][i+1][A]
              score[i][i+1][A] = prob
              back[i][i+1][A] = B
              added = true

    for span = 2 to #(words)
      for begin = 0 to #(words) - span
        end = begin + span
        for split = begin+1 to end-1
          for A, B, C in nonterms
            prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
            if prob > score[begin][end][A]
              score[begin][end][A] = prob
              back[begin][end][A] = new Triple(split, B, C)
        // handle unaries
        boolean added = true
        while added
          added = false
          for A, B in nonterms
            prob = P(A -> B) * score[begin][end][B]
            if prob > score[begin][end][A]
              score[begin][end][A] = prob
              back[begin][end][A] = B
              added = true
    return buildTree(score, back)

The grammar: binary, no epsilons

  S → NP VP      0.9        N → people   0.5
  S → VP         0.1        N → fish     0.2
  VP → V NP      0.5        N → tanks    0.2
  VP → V         0.1        N → rods     0.1
  VP → V VP_V    0.3        V → people   0.1
  VP → V PP      0.1        V → fish     0.6
  VP_V → NP PP   1.0        V → tanks    0.3
  NP → NP NP     0.1        P → with     1.0
  NP → NP PP     0.2
  NP → N         0.7
  PP → P NP      1.0
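For following the worked example below, the same grammar can also be written as plain data (a sketch, not from the slides); the decimals are the ones in the table above:

unary_and_lexical = {            # X -> Y (unaries) and X -> word
    ("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7,
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
    ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}
binary = {                       # X -> Y Z
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
    ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}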

The empty chart: one cell score[i][j] for each span over

  0 fish 1 people 2 fish 3 tanks 4

i.e. score[0][1] … score[3][4] for single words, up to score[0][4] for the whole sentence.

Filling the width-1 cells with the lexical rules:

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])

  [0,1] fish:    N → fish 0.2,    V → fish 0.6
  [1,2] people:  N → people 0.5,  V → people 0.1
  [2,3] fish:    N → fish 0.2,    V → fish 0.6
  [3,4] tanks:   N → tanks 0.2,   V → tanks 0.3

Then close the width-1 cells under the unary rules:

  // handle unaries
  boolean added = true
  while added
    added = false
    for A, B in nonterms
      if score[i][i+1][B] > 0 && A -> B in grammar
        prob = P(A -> B) * score[i][i+1][B]
        if prob > score[i][i+1][A]
          score[i][i+1][A] = prob
          back[i][i+1][A] = B
          added = true

  [0,1] fish:    N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
  [1,2] people:  N 0.5, V 0.1, NP → N 0.35, VP → V 0.01, S → VP 0.001
  [2,3] fish:    N 0.2, V 0.6, NP → N 0.14, VP → V 0.06, S → VP 0.006
  [3,4] tanks:   N 0.2, V 0.3, NP → N 0.14, VP → V 0.03, S → VP 0.003

Width-2 spans, combining pairs of cells with the binary rules:

  prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
  if (prob > score[begin][end][A])
    score[begin][end][A] = prob
    back[begin][end][A] = new Triple(split, B, C)

  [0,2] fish people:  NP → NP NP 0.0049,   VP → V NP 0.105,   S → NP VP 0.00126
  [1,3] people fish:  NP → NP NP 0.0049,   VP → V NP 0.007,   S → NP VP 0.0189
  [2,4] fish tanks:   NP → NP NP 0.00196,  VP → V NP 0.042,   S → NP VP 0.00378

Unary closure on the width-2 cells:

  // handle unaries
  boolean added = true
  while added
    added = false
    for A, B in nonterms
      prob = P(A -> B) * score[begin][end][B]
      if prob > score[begin][end][A]
        score[begin][end][A] = prob
        back[begin][end][A] = B
        added = true

  [0,2] fish people:  NP 0.0049,   VP 0.105,   S → VP 0.0105
  [1,3] people fish:  NP 0.0049,   VP 0.007,   S → NP VP 0.0189
  [2,4] fish tanks:   NP 0.00196,  VP 0.042,   S → VP 0.0042

Width-3 spans, maximizing over split points:

  for split = begin+1 to end-1
    for A, B, C in nonterms
      prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
      if prob > score[begin][end][A]
        score[begin][end][A] = prob
        back[begin][end][A] = new Triple(split, B, C)

  [0,3] fish people fish:  NP → NP NP 0.0000686,  VP → V NP 0.00147,  S → NP VP 0.000882

The same loop fills the other width-3 cell:

  [1,4] people fish tanks:  NP → NP NP 0.0000686,  VP → V NP 0.000098,  S → NP VP 0.01323

Finally the full-sentence span [0,4], giving the completed chart:

  [0,1] fish:    N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
  [1,2] people:  N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
  [2,3] fish:    N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
  [3,4] tanks:   N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003
  [0,2]          NP 0.0049,        VP 0.105,       S 0.0105
  [1,3]          NP 0.0049,        VP 0.007,       S 0.0189
  [2,4]          NP 0.00196,       VP 0.042,       S 0.0042
  [0,3]          NP 0.0000686,     VP 0.00147,     S 0.000882
  [1,4]          NP 0.0000686,     VP 0.000098,    S 0.01323
  [0,4]          NP 0.0000009604,  VP 0.00002058,  S 0.00018522


Call buildTree(score, back) to get the best parse.
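A minimal sketch (not from the slides) of what buildTree could look like, reconstructing the best tree from the back-pointers; it assumes back[(i, j, A)] holds a string B for a unary step, a (split, B, C) triple for a binary step, or nothing for a lexical rule:

def build_tree(back, words, i, j, A):
    entry = back.get((i, j, A))
    if entry is None:                      # lexical rule A -> words[i]
        return (A, words[i])
    if isinstance(entry, str):             # unary rule A -> B
        return (A, build_tree(back, words, i, j, entry))
    split, B, C = entry                    # binary rule A -> B C
    return (A,
            build_tree(back, words, i, split, B),
            build_tree(back, words, split, j, C))

# e.g. tree = build_tree(back, "fish people fish tanks".split(), 0, 4, "S")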

Evaluating constituency parsing

Gold standard brackets:
  S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets:
  S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision:  3/7 = 42.9%
Labeled Recall:     3/8 = 37.5%
LP/LR F1:           40.0%
Tagging Accuracy:   11/11 = 100.0%
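The bookkeeping behind those numbers (a sketch, not from the slides): count the labeled brackets in common between candidate and gold, then compute precision, recall, and F1.

from collections import Counter

def labeled_prf(gold, candidate):
    # gold, candidate: lists of (label, start, end) brackets
    matched = sum((Counter(gold) & Counter(candidate)).values())
    precision = matched / len(candidate)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
print(labeled_prf(gold, cand))   # (0.4285..., 0.375, 0.4)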

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree, given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • So the two analyses receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label

[figure: an S → NP VP tree; the highlighted NP expands by NP → DT NN independently of the outside context]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

  Expansion   All NPs   NPs under S   NPs under VP
  NP PP         11%          9%           23%
  DT NN          9%          9%            7%
  PRP            6%         21%            4%

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
    (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[two plots over horizontal Markov order 0, 1, 2v, 2, inf:
 parsing F1 (roughly 70–74) and number of symbols (up to ~12,000)]

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes
  (i.e. parent annotation)

[two plots over vertical Markov order 1, 2v, 2, 3v, 3:
 parsing F1 (roughly 72–79) and number of symbols (up to ~25,000)]

  Model     F1     Size
  v=h=2v    77.8   7.5K
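Order-2 vertical Markovization is just parent annotation of the node labels; a small sketch (not from the slides) over trees written as nested tuples:

def parent_annotate(tree, parent="ROOT"):
    # tree: (label, child, child, ...) with plain strings at the leaves
    if isinstance(tree, str):          # a word: leave it alone
        return tree
    label, children = tree[0], tree[1:]
    new_label = "%s^%s" % (label, parent)
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP^NP', 'He')),
#  ('VP^S', ('VBD^VP', 'was'), ('ADJP^VP', ('JJ', 'right'))))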

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

  Annotation   F1     Size
  Base         77.8   7.5K
  UNARY        78.3   8.0K

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
  • Subdivide the IN tag

  Annotation   F1     Size
  Previous     78.3   8.0K
  SPLIT-IN     80.3   8.1K

Other Tag Splits

  Annotation                                                           F1     Size
  UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")          80.4   8.1K
  UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")        80.5   8.1K
  TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)     81.2   8.5K
  SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]          81.6   9.0K
  SPLIT-CC: separate "but" and "&" from other conjunctions             81.7   9.1K
  SPLIT-%: "%" gets its own tag                                        81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

  Annotation   F1      Size
  tag splits   82.3     9.7K
  POSS-NP      83.1     9.8K
  SPLIT-VP     85.7    10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
    • Base NPs [cf. Collins 99]
    • Right-recursive NPs

  Annotation     F1      Size
  Previous       85.7    10.5K
  BASE-NP        86.0    11.7K
  DOMINATES-V    86.9    14.1K
  RIGHT-REC-NP   87.0    15.2K

[figure: a VP with NP and PP attachment sites marked v / -v for dominates-a-verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

  Parser               LP     LR     F1
  Magerman 95          84.9   84.6   84.7
  Collins 96           86.3   85.8   86.0
  Klein & Manning 03   86.9   85.7   86.3
  Charniak 97          87.4   87.5   87.4
  Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j),
 e.g. (VP -> VBD[saw] NP[her]), i.e. (VP -> VBD NP)[saw]]

  bestScore(X, i, j, h):
    if (j = i)
      return score(X, s[i])
    else
      return max of
        max over k, w, X -> Y Z of
          score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
        max over k, w, X -> Y Z of
          score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[diagram: X[h] over (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
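A per-cell beam is simple to state in code; a sketch (not the Collins parser itself) of keeping only the top-K scoring hypotheses for each span:

import heapq

def prune_cell(cell, K):
    # cell: dict mapping an item (e.g. a labeled, lexicalized edge) to its score.
    # Keep only the K best-scoring hypotheses for this span.
    if len(cell) <= K:
        return cell
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

cell = {("NP", "man"): 0.02, ("NP", "telescope"): 0.001, ("S", "saw"): 0.0005}
print(prune_cell(cell, 2))   # keeps the two highest-scoring entries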

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

  Parser               LP     LR     F1
  Magerman 95          84.9   84.6   84.7
  Collins 96           86.3   85.8   86.0
  Klein & Manning 03   86.9   85.7   86.3
  Charniak 97          87.4   87.5   87.4
  Collins 99           88.7   88.6   88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve statistical fit of the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]

• Can learn with EM, like Forward-Backward for HMMs
  (Forward/Outside and Backward/Inside passes)

Automatic Annotation Induction

• Label all nodes with latent variables
  • Same number k of subcategories for all categories
• Advantages
  • Automatically learned
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

  Model                 F1
  Klein & Manning '03   86.3
  Matsuzaki et al '05   86.7

Refinement of the DT tag

[figure: DT split into DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last

Adaptive Splitting

• Want to split complex categories more
• Idea: split everything, roll back the splits which were least useful   [Petrov et al. 06]

• Evaluate the loss in likelihood from removing each split:
    (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

  Model              F1
  Previous           88.4
  With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

  Parser                    F1 (≤ 40 words)   F1 (all words)
  Klein & Manning '03       86.3              85.7
  Matsuzaki et al '05       86.7              86.1
  Collins '99               88.6              88.2
  Charniak & Johnson '05    90.1              89.6
  Petrov et al 06           90.2              89.7

Hierarchical Pruning

  coarse:          … QP NP VP …
  split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
  split in eight:  … (and so on) …

• Parse multiple times, with grammars at different levels of granularity

Bracket Posteriors

[figure: bracket posterior plots at each level of the grammar hierarchy;
 parsing time drops from 1621 min to 111 min, 35 min, and finally
 15 min at 91.2 F1 (no search error)]

Page 17: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Context-Free Grammars

21

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 18: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Context-Free Grammars in NLP

bull A context free grammar G in NLP = (N C Σ S L R)bull Σ is a set of terminal symbols

bull C is a set of preterminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull L is the lexicon a set of items of the form X x

bull X isin C and x isin Σ

bull R is the grammar a set of items of the form X

bull X isin N and isin (N cup C)

bull By usual convention S is the start symbol but in statistical NLP we usually have an extra node at the top (ROOT TOP)

bull We usually write e for an empty sequence rather than nothing22

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

• You should think of this as a transformation for efficient parsing

• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform

• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier

• Binarization is crucial for cubic time CFG parsing

• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

An example before binarization… (tree for "people fish tanks with rods")

  ROOT → S → NP VP, with NP → N (people) and a ternary VP → V NP PP
  (V = fish, NP → N = tanks, PP → P NP with P = with and NP → N = rods)

After binarization…

  The ternary VP is split in two: VP → V VP_V and VP_V → NP PP; the rest of the tree is unchanged.
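A small sketch of the n-ary part of the transform illustrated above (unaries and empties are not handled here); the intermediate symbol names follow the VP_V style of the example, and the tuple tree encoding is the same illustrative one used earlier:

    def binarize(tree):
        # Right-binarize n-ary nodes: VP -> V NP PP becomes VP -> V VP_V, VP_V -> NP PP.
        if isinstance(tree, str):                    # a word
            return tree
        label = tree[0]
        children = [binarize(c) for c in tree[1:]]
        if len(children) <= 2:
            return tuple([label] + children)
        first = children[0]
        first_label = first if isinstance(first, str) else first[0]
        rest = tuple([label + '_' + first_label] + children[1:])
        return (label, first, binarize(rest))

    vp = ('VP', ('V', 'fish'), ('NP', ('N', 'tanks')),
                ('PP', ('P', 'with'), ('NP', ('N', 'rods'))))
    print(binarize(vp))
    # ('VP', ('V', 'fish'), ('VP_V', ('NP', ('N', 'tanks')), ('PP', ...)))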

Treebank empties and unaries

Successive simplifications of the PTB tree for the one-word sentence "Atone":

  PTB Tree:    ROOT → S-HLN → (NP-SUBJ → -NONE- → e) (VP → VB → Atone)
  NoFuncTags:  ROOT → S → (NP → -NONE- → e) (VP → VB → Atone)
  NoEmpties:   ROOT → S → VP → VB → Atone
  NoUnaries:   either ROOT → S → Atone (high) or ROOT → VB → Atone (low)

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people: N 0.5, V 0.1, NP 0.35
fish:   N 0.2, V 0.6, NP 0.14, VP 0.06

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity

• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries

• Binarization is vital
  • Without binarization you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965)… extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)

The CKY algorithm (1960/1965)… extended to unaries

The grammar: Binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
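Below is a compact, runnable Python rendering of the pseudocode above on exactly this grammar; it is a sketch that follows the slides' recurrence (binary rules plus a unary closure per cell, back pointers omitted for brevity), not the original course code:

    from collections import defaultdict

    binary = {('S', ('NP', 'VP')): 0.9, ('VP', ('V', 'NP')): 0.5,
              ('VP', ('V', 'VP_V')): 0.3, ('VP', ('V', 'PP')): 0.1,
              ('VP_V', ('NP', 'PP')): 1.0, ('NP', ('NP', 'NP')): 0.1,
              ('NP', ('NP', 'PP')): 0.2, ('PP', ('P', 'NP')): 1.0}
    unary = {('S', 'VP'): 0.1, ('VP', 'V'): 0.1, ('NP', 'N'): 0.7}
    lexicon = {('N', 'people'): 0.5, ('N', 'fish'): 0.2, ('N', 'tanks'): 0.2,
               ('N', 'rods'): 0.1, ('V', 'people'): 0.1, ('V', 'fish'): 0.6,
               ('V', 'tanks'): 0.3, ('P', 'with'): 1.0}

    def apply_unaries(cell):
        # Close one chart cell under unary rules A -> B (the "handle unaries" loop).
        added = True
        while added:
            added = False
            for (a, b), p in unary.items():
                if b in cell and p * cell[b] > cell.get(a, 0.0):
                    cell[a] = p * cell[b]
                    added = True

    def cky(words):
        n = len(words)
        score = defaultdict(dict)            # (begin, end) -> {symbol: best prob}
        for i, w in enumerate(words):
            cell = score[(i, i + 1)]
            for (tag, word), p in lexicon.items():
                if word == w:
                    cell[tag] = p
            apply_unaries(cell)
        for span in range(2, n + 1):
            for begin in range(n - span + 1):
                end = begin + span
                cell = score[(begin, end)]
                for split in range(begin + 1, end):
                    left, right = score[(begin, split)], score[(split, end)]
                    for (a, (b, c)), p in binary.items():
                        if b in left and c in right:
                            prob = p * left[b] * right[c]
                            if prob > cell.get(a, 0.0):
                                cell[a] = prob
                apply_unaries(cell)
        return score[(0, n)]

    print(cky('fish people fish tanks'.split()))
    # The best S score, 0.00018522, matches the worked chart that follows.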

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP → NP NP 0.0000009604
VP → V NP 0.00002058
S → NP VP 0.00018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse
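The pseudocode finishes by calling buildTree(score, back); here is a sketch of that back-pointer reconstruction, assuming back maps (begin, end, symbol) either to a single child symbol (a unary step) or to a (split, B, C) triple (a binary step), which is how the pseudocode above fills it:

    def build_tree(back, words, begin, end, symbol):
        # Follow back-pointers down from (begin, end, symbol) to rebuild the best parse.
        entry = back.get((begin, end, symbol))
        if entry is None:                            # lexical rule: symbol -> words[begin]
            return (symbol, words[begin])
        if isinstance(entry, str):                   # unary rule over the same span
            return (symbol, build_tree(back, words, begin, end, entry))
        split, b, c = entry                          # binary rule split at 'split'
        return (symbol,
                build_tree(back, words, begin, split, b),
                build_tree(back, words, split, end, c))

    # root = build_tree(back, words, 0, len(words), 'S')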

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
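The arithmetic above is just set counting over labeled spans; a small sketch (the bracket encoding is chosen here for illustration) that reproduces the example numbers:

    def labeled_prf(gold, cand):
        # gold, cand: sets of (label, start, end) brackets.
        gold, cand = set(gold), set(cand)
        correct = len(gold & cand)
        p, r = correct / len(cand), correct / len(gold)
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        return p, r, f1

    gold = {('S', 0, 11), ('NP', 0, 2), ('VP', 2, 9), ('VP', 3, 9),
            ('NP', 4, 6), ('PP', 6, 9), ('NP', 7, 9), ('NP', 9, 10)}
    cand = {('S', 0, 11), ('NP', 0, 2), ('VP', 2, 10), ('VP', 3, 10),
            ('NP', 4, 6), ('PP', 6, 10), ('NP', 7, 10)}
    print(labeled_prf(gold, cand))   # (0.428..., 0.375, 0.4)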

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
  • The two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

Expansion    All NPs    NPs under S    NPs under VP
NP PP          11%           9%             23%
DT NN           9%           9%              7%
PRP             6%          21%              4%
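Distributions like the table above come straight out of counting NP expansions conditioned on the parent category; a minimal sketch over the same illustrative tuple tree encoding used earlier:

    from collections import Counter, defaultdict

    def np_expansions_by_parent(trees):
        # Count how NP rewrites, grouped by the parent category of the NP node.
        counts = defaultdict(Counter)
        def walk(tree, parent):
            if isinstance(tree, str):
                return
            label, children = tree[0], tree[1:]
            rhs = ' '.join(c if isinstance(c, str) else c[0] for c in children)
            if label == 'NP':
                counts[parent][rhs] += 1
                counts['ALL'][rhs] += 1
            for c in children:
                walk(c, label)
        for t in trees:
            walk(t, 'TOP')
        return counts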

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong

(In the PTB, this construction is for possessives.)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

Charts (vs. horizontal Markov order 0, 1, 2v, 2, ∞): parsing F1 rises from about 70 to about 74, while the number of grammar symbols grows from roughly 3000 to 12000.

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 (plain categories) vs. Order 2 (parent-annotated categories)

Charts (vs. vertical Markov order 1, 2v, 2, 3v, 3): parsing F1 rises from about 72 to about 79, while the number of grammar symbols grows to roughly 25000.

Model      F1     Size
v=h=2v     77.8   7.5K
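Parent annotation (vertical order 2) is just a relabelling of the treebank before rule counting; a minimal sketch, again over the illustrative tuple encoding (this version annotates every node, including tags):

    def parent_annotate(tree, parent='ROOT'):
        # Append the parent category to each label, e.g. NP under S becomes NP^S.
        if isinstance(tree, str):
            return tree
        label, children = tree[0], tree[1:]
        return tuple([label + '^' + parent] +
                     [parent_annotate(c, label) for c in children])

    # ('S', ('NP', ('PRP', 'He')), ...)  ->  ('S^ROOT', ('NP^S', ('PRP^NP', 'He')), ...)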

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

• Partial Solution:
  • Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")

• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")

• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]

• SPLIT-CC: separate "but" and "&" from other conjunctions

• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads

• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser                 LP     LR     F1
Magerman 95            84.9   84.6   84.7
Collins 96             86.3   85.8   86.0
Klein & Manning 03     86.9   85.7   86.3
Charniak 97            87.4   87.5   87.4
Collins 99             88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
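The head-rule tables themselves are on the image slides; purely as an illustration of their shape, here is a simplified head finder in the usual priority-list style — the particular lists below are stand-ins, not the course's exact rules:

    # Simplified stand-in head rules: (scan direction, priority list of child labels).
    HEAD_RULES = {
        'NP': ('right', ['NN', 'NNS', 'NNP', 'NP', 'JJ']),
        'VP': ('left',  ['Vi', 'Vt', 'VB', 'VBD', 'VP']),
        'PP': ('left',  ['IN', 'P']),
        'S':  ('left',  ['VP', 'S']),
    }

    def find_head(label, child_labels):
        # Return the index of the head child of a rule  label -> child_labels.
        direction, priorities = HEAD_RULES.get(label, ('left', []))
        order = list(range(len(child_labels)))
        if direction == 'right':
            order.reverse()
        for wanted in priorities:
            for i in order:
                if child_labels[i] == wanted:
                    return i
        return order[0]                              # default: first child in scan order

    # find_head('VP', ['Vt', 'NP']) -> 0   (the verb heads the VP)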

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over k, w, and X -> Y Z of the larger of
      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j
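A sketch of the per-cell beam itself; the cell representation and the value of K here are illustrative, not Collins' actual data structures:

    import heapq

    def prune_cell(cell, k=50):
        # cell: dict mapping (symbol, head_index) -> score; keep only the top-k entries.
        if len(cell) <= k:
            return cell
        return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))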

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser                 LP     LR     F1
Magerman 95            84.9   84.6   84.7
Collins 96             86.3   85.8   86.0
Klein & Manning 03     86.9   85.7   86.3
Charniak 97            87.4   87.5   87.4
Collins 99             88.7   88.6   88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional

• Advantages:
  • Fairly compact grammar
  • Linguistic motivations

• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories (X1 … X7 label the nodes of the tree for "He was right")
• Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside passes)

Automatic Annotation Induction

• Advantages:
  • Automatically learned
  • Label all nodes with latent variables
  • Same number k of subcategories for all categories

• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al '05    86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:
  (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model                F1
Previous             88.4
With 50% Merging     89.5

Number of Phrasal Subcategories

Final Results

Parser                     F1 (≤ 40 words)    F1 (all words)
Klein & Manning '03        86.3               85.7
Matsuzaki et al '05        86.7               86.1
Collins '99                88.6               88.2
Charniak & Johnson '05     90.1               89.6
Petrov et al 06            90.2               89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … … … … … … … … … … …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

Coarse-to-fine pruning timings: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)

Page 19: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Context Free Grammar of English

23

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

• A key parsing decision is how we 'attach' various constituents
  • PPs, adverbial or participial phrases, infinitives, coordinations, etc.

• Catalan numbers: Cₙ = (2n)! / [(n+1)! n!]

• An exponentially growing series, which arises in many tree-like contexts
  • E.g. the number of possible triangulations of a polygon with n+2 sides
  • Turns up in triangulation of probabilistic graphical models…

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

• Prepositional phrases: They cooked the beans in the pot on the stove with handles

• Particle vs. preposition: The lady dressed up the staircase

• Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand

• Gerund vs. participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers

Syntactic Ambiguities II

• Modifier scope within NPs: impractical design requirements / plastic cup holder

• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue

• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

• Dislocation / gapping
  • Which book should Peter buy?
  • A debate arose which continued until the election

• Binding
  • Reference: The IRS audits itself

• Control
  • I want to go
  • I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 20: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Left-Most Derivations

24

Properties of CFGs

25

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A]  = B
            added = true

The CKY algorithm (1960/1965)… extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965)… extended to unaries
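To make the pseudocode concrete, here is a small runnable Python sketch of the same procedure (lexical step, then unary closure, then binary rules plus unary closure for each longer span). It is my own illustration, not the lecture's code; the grammar is the toy grammar from the following slides, and helper names such as apply_unaries are assumptions.

from collections import defaultdict

unary   = {("S", "VP"): 0.1, ("NP", "N"): 0.7, ("VP", "V"): 0.1}
binary  = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5,
           ("VP", "V", "VP_V"): 0.3, ("VP", "V", "PP"): 0.1,
           ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
           ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
lexicon = {"people": {"N": 0.5, "V": 0.1}, "fish": {"N": 0.2, "V": 0.6},
           "tanks": {"N": 0.2, "V": 0.3}, "rods": {"N": 0.1}, "with": {"P": 1.0}}

def cky(words):
    n = len(words)
    score = defaultdict(float)              # (begin, end, label) -> best probability

    def apply_unaries(b, e):
        added = True
        while added:                        # unary closure, as in the pseudocode
            added = False
            for (A, B), q in unary.items():
                p = q * score[b, e, B]
                if p > score[b, e, A]:
                    score[b, e, A] = p
                    added = True

    for i, w in enumerate(words):           # width-1 spans: lexicon, then unaries
        for tag, q in lexicon.get(w, {}).items():
            score[i, i + 1, tag] = q
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):            # longer spans: binary rules, then unaries
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), q in binary.items():
                    p = score[begin, split, B] * score[split, end, C] * q
                    if p > score[begin, end, A]:
                        score[begin, end, A] = p
            apply_unaries(begin, end)
    return score

chart = cky(["fish", "people", "fish", "tanks"])
print(chart[0, 4, "S"])    # ≈ 0.00018522, matching the filled-in chart below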

The grammar: binary, no epsilons

S → NP VP      0.9
S → VP         0.1
VP → V NP      0.5
VP → V         0.1
VP → V VP_V    0.3
VP → V PP      0.1
VP_V → NP PP   1.0
NP → NP NP     0.1
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0
N → people     0.5
N → fish       0.2
N → tanks      0.2
N → rods       0.1
V → people     0.1
V → fish       0.6
V → tanks      0.3
P → with       1.0

The chart has one cell score[begin][end] per span of the sentence fish (0–1), people (1–2), fish (2–3), tanks (3–4):
width 1: [0,1] [1,2] [2,3] [3,4];  width 2: [0,2] [1,3] [2,4];  width 3: [0,3] [1,4];  width 4: [0,4]

Step 1 — lexicon (the loop score[i][i+1][A] = P(A -> words[i])):
  [0,1] fish:    N 0.2   V 0.6
  [1,2] people:  N 0.5   V 0.1
  [2,3] fish:    N 0.2   V 0.6
  [3,4] tanks:   N 0.2   V 0.3

Step 2 — unary closure on the width-1 spans (NP → N, VP → V, S → VP):
  [0,1] fish:    N 0.2   V 0.6   NP 0.14   VP 0.06   S 0.006
  [1,2] people:  N 0.5   V 0.1   NP 0.35   VP 0.01   S 0.001
  [2,3] fish:    N 0.2   V 0.6   NP 0.14   VP 0.06   S 0.006
  [3,4] tanks:   N 0.2   V 0.3   NP 0.14   VP 0.03   S 0.003

Step 3 — binary rules, then unaries, on the width-2 spans:
  [0,2]: NP (NP NP) 0.0049    VP (V NP) 0.105    S (VP) 0.0105
  [1,3]: NP (NP NP) 0.0049    VP (V NP) 0.007    S (NP VP) 0.0189
  [2,4]: NP (NP NP) 0.00196   VP (V NP) 0.042    S (VP) 0.0042

Step 4 — the width-3 spans:
  [0,3]: NP (NP NP) 0.0000686   VP (V NP) 0.00147     S (NP VP) 0.000882
  [1,4]: NP (NP NP) 0.0000686   VP (V NP) 0.0000098   S (NP VP) 0.01323

Step 5 — the full span:
  [0,4]: NP (NP NP) 0.0000009604   VP (V NP) 0.00002058   S (NP VP) 0.00018522

Call buildTree(score back) to get the best parse
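buildTree itself is not spelled out on the slides; below is a minimal sketch under the assumption that back is a dictionary keyed by (begin, end, label), holding either a unary child label or a (split, B, C) triple as in the pseudocode above (my own code and data layout, for illustration only).

def build_tree(back, words, begin, end, label):
    bp = back.get((begin, end, label))
    if bp is None:                       # lexical cell: span covers one word
        return (label, words[begin])
    if isinstance(bp, str):              # unary backpointer: label -> bp
        return (label, build_tree(back, words, begin, end, bp))
    split, B, C = bp                     # binary backpointer: label -> B C
    return (label,
            build_tree(back, words, begin, split, B),
            build_tree(back, words, split, end, C))

# usage: tree = build_tree(back, words, 0, len(words), "S")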

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0

Tagging Accuracy: 11/11 = 100.0%
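The same computation as a small Python sketch (my own helper, not a standard API), with brackets represented as (label, start, end) triples taken from this example:

def parseval(gold, cand):
    matched = len(gold & cand)               # brackets that agree on label and span
    lp = matched / len(cand)
    lr = matched / len(gold)
    f1 = 2 * lp * lr / (lp + lr)
    return lp, lr, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, cand))   # ≈ (0.429, 0.375, 0.400)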

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• The low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
  • The two analyses receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Figure: a tree in which S expands as NP VP (rule S → NP VP) and the NP expands as DT NN (rule NP → DT NN)]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

                 NP PP   DT NN   PRP
  All NPs          11%      9%     6%
  NPs under S       9%      9%    21%
  NPs under VP     23%      7%     4%

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states (a binarization sketch follows below)

[Figures: parsing F1 (roughly 70–74) and grammar size (number of symbols, roughly 3,000–12,000) as a function of horizontal Markov order 0, 1, 2v, 2, inf]
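One way to see what horizontal Markovization does is in the binarization step itself. The sketch below is my own illustration (the intermediate symbol names such as @VP->...V are made up): it keeps only the last h sisters in the intermediate symbols, so different full histories collapse to the same state.

def binarize_markov(lhs, rhs, h=1):
    """Right-binarize lhs -> rhs, remembering at most h previous sisters
    in the names of the intermediate symbols."""
    rules, seen, parent = [], [], lhs
    for i in range(len(rhs) - 2):
        seen.append(rhs[i])
        context = "_".join(seen[-h:]) if h > 0 else ""
        new_sym = f"@{lhs}->...{context}"
        rules.append((parent, [rhs[i], new_sym]))
        parent = new_sym
    rules.append((parent, rhs[-2:]))
    return rules

print(binarize_markov("VP", ["V", "NP", "PP", "ADVP"], h=1))
# [('VP', ['V', '@VP->...V']), ('@VP->...V', ['NP', '@VP->...NP']),
#  ('@VP->...NP', ['PP', 'ADVP'])]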

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation; see the parent-annotation sketch below)

[Figures: example trees for Order 1 vs. Order 2, and parsing F1 (roughly 72–79) and grammar size (number of symbols, up to ~25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model      F1     Size
v=h=2v     77.8   7.5K
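Order-2 vertical Markovization is just parent annotation. A small sketch (my own code, operating on bracketed (label, children) tuples, which is an assumed representation):

def parent_annotate(tree, parent="ROOT"):
    """Annotate every nonterminal with its parent's (unannotated) label."""
    label, children = tree
    new_children = [c if isinstance(c, str) else parent_annotate(c, label)
                    for c in children]
    return (f"{label}^{parent}", new_children)

t = ("S", [("NP", [("N", ["people"])]),
           ("VP", [("V", ["fish"]), ("NP", [("N", ["tanks"])])])])
print(parent_annotate(t))
# ('S^ROOT', [('NP^S', [('N^NP', ['people'])]),
#             ('VP^S', [('V^VP', ['fish']), ('NP^VP', [('N^NP', ['tanks'])])])])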

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

• Solution: mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse

• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

• Partial solution:
  • Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")            F1 80.4   Size 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")          F1 80.5   Size 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)       F1 81.2   Size 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]            F1 81.6   Size 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions               F1 81.7   Size 9.1K
• SPLIT-%: "%" gets its own tag                                          F1 81.8   Size 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads

• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3    9.7K
POSS-NP      83.1    9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Figure: a PP attachment site under the VP (marked v, dominates a verb) vs. under the NP (marked -v)]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] over span i..j is built from Y[h] over i..k and Z[h'] over k..j, where h is the head of X]
e.g. (VP → VBD NP)[saw], i.e. VP[saw] → VBD[saw] NP[her]

bestScore(X, i, j, h)
  if (j = i)
    return score(X, s[i])
  else
    return max over k and rules X -> Y Z of
      max  score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max  score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic (a small pruning sketch follows below)
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram as above: X[h] over i..j built from Y[h] and Z[h']]
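A minimal sketch of the per-cell beam idea (my own illustration; the cell layout and scores below are made up, not the Collins parser's actual data structures):

import heapq

def prune_cell(cell, k=5):
    """cell: dict mapping (label, head) -> score; keep only the k best hypotheses."""
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

cell = {("NP", "tanks"): 0.14, ("VP", "fish"): 0.042, ("S", "fish"): 0.0042,
        ("NP", "fish"): 0.002, ("VP", "tanks"): 0.001, ("S", "tanks"): 0.0001}
print(prune_cell(cell, k=3))   # keeps the three highest-scoring hypotheses for this span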

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering?

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories (X1, X2, …, X7 over the tree for "He was right")

Can learn with EM, like Forward-Backward for HMMs (Forward ≈ Outside, Backward ≈ Inside)

Automatic Annotation Induction

• Advantages
  • Automatically learned: label all nodes with latent variables, with the same number k of subcategories for all categories
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit, while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories — start with two (per non-terminal), then keep splitting; initialize each EM run with the output of the last.

[Figure: the DT tag being successively split]

Adaptive Splitting

Want to split complex categories more.

Idea: split everything, then roll back the splits which were least useful. [Petrov et al. 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

      loss(split) = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model               F1
Previous            88.4
With 50% Merging    89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7

Hierarchical Pruning

Parse multiple times with grammars at different levels of granularity:

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … …

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)


A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 22: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

Attachment ambiguities

bull A key parsing decision is how we lsquoattachrsquo various constituentsbull PPs adverbial or participial phrases infinitives coordinations etc

bull Catalan numbers Cn

= (2n)[(n+1)n]

bull An exponentially growing series which arises in many tree-like contexts

bull Eg the number of possible triangulations of a polygon with n+2 sides

bull Turns up in triangulation of probabilistic graphical modelshellip

Attachments

bull I cleaned the dishes from dinner

bull I cleaned the dishes with detergent

bull I cleaned the dishes in my pajamas

bull I cleaned the dishes in the sink

Syntactic Ambiguities I

bull Prepositional phrasesThey cooked the beans in the pot on the stove with handles

bull Particle vs prepositionThe lady dressed up the staircase

bull Complement structuresThe tourists objected to the guide that they couldnrsquot hearShe knows you like the back of her hand

bull Gerund vs participial adjectiveVisiting relatives can be boringChanging schedules frequently confused passengers

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4    fish people fish tanks
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4    fish people fish tanks
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP 00000009604
VP V NP 000002058
S NP VP 000018522

0

1

2

3

4

1 2 3 4    fish people fish tanks
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse
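The slides call buildTree(score, back) without showing it. Below is a minimal sketch (not from the deck) of how the backpointer table can be unwound into a tree, assuming back is a dict keyed by (begin, end, nonterminal) whose value is either absent (lexical cell), a single nonterminal (unary step), or a (split, B, C) triple (binary step); the array-indexed back[i][j][A] of the pseudocode works the same way. Function and variable names are illustrative.

  def build_tree(back, words, begin, end, A):
      # back[(begin, end, A)] is one of:
      #   missing          -> A -> words[begin] was a lexical rule
      #   "B"              -> unary rule A -> B over the same span
      #   (split, B, C)    -> binary rule A -> B C, B over [begin, split), C over [split, end)
      entry = back.get((begin, end, A))
      if entry is None:
          return (A, words[begin])                        # preterminal
      if isinstance(entry, tuple):
          split, B, C = entry
          return (A, build_tree(back, words, begin, split, B),
                     build_tree(back, words, split, end, C))
      return (A, build_tree(back, words, begin, end, entry))   # follow one unary step

  # Tiny example using the toy grammar above (S -> NP VP, NP -> N, VP -> V):
  back = {(0, 2, "S"): (1, "NP", "VP"), (0, 1, "NP"): "N", (1, 2, "VP"): "V"}
  print(build_tree(back, ["people", "fish"], 0, 2, "S"))
  # ('S', ('NP', ('N', 'people')), ('VP', ('V', 'fish')))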

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0

Tagging Accuracy: 11/11 = 100.0%
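A small sketch (not from the slides) of how the labeled precision / recall / F1 figures above are computed, representing each parse as a collection of labeled brackets (label, start, end):

  from collections import Counter

  def parseval(gold, candidate):
      # gold, candidate: lists of (label, start, end) brackets
      g, c = Counter(gold), Counter(candidate)
      matched = sum((g & c).values())            # brackets identical in label and span
      precision = matched / len(candidate)
      recall = matched / len(gold)
      f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
      return precision, recall, f1

  gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
          ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
  cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
          ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
  print(parseval(gold, cand))   # -> (0.4285..., 0.375, 0.4), i.e. the 42.9 / 37.5 / 40.0 above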

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• The low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
• The two analyses therefore receive the same probability

94
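Spelled out: a PCFG scores a tree by the product of its rule probabilities, so if the low-attachment tree t_low and the high-attachment tree t_high contain exactly the same rules with the same counts, as in this example, then

  p(t_low) = product over rules r of q(r)^count(r) = p(t_high)

and the model has no way to prefer the analysis in which Bill does the shooting.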

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

[Figure: a tree fragment with an NP node inside an S; the rules S -> NP VP and NP -> DT NN illustrate that, given the label NP, what is inside the node is independent of what is outside it]

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

[Bar chart: how often an NP expands as NP PP, DT NN, or PRP, shown for all NPs, for NPs under S (mostly subjects), and for NPs under VP (mostly objects); the three distributions differ sharply, e.g. pronominal (PRP) expansions are far more common in subject position]

Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong
• Example from the figure: in the PTB, this construction is reserved for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

[Charts: parsing accuracy (F1 roughly 70-74) and grammar size (number of symbols, up to about 12000) as a function of horizontal Markov order 0, 1, 2v, 2, inf]

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

[Figure: example trees for vertical order 1 vs. order 2 (parent annotation)]
[Charts: parsing accuracy (F1 roughly 72-79) and grammar size (number of symbols, up to about 25000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model     F1     Size
v=h=2v    77.8   75K
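A minimal sketch (not from the slides) of order-2 vertical markovization, i.e. annotating every node with its parent's label before rules are read off; trees are assumed to be nested tuples such as ("S", ("NP", ...), ("VP", ...)), and whether preterminal tags are also annotated is a separate design choice (the TAG-PA split below).

  def parent_annotate(tree, parent="ROOT"):
      # tree is (label, child, ...) with plain strings at the leaves
      label, children = tree[0], tree[1:]
      if len(children) == 1 and isinstance(children[0], str):
          return (label + "^" + parent, children[0])      # preterminal: keep the word
      return tuple([label + "^" + parent] +
                   [parent_annotate(c, parent=label) for c in children])

  t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks"))))
  print(parent_annotate(t))
  # ('S^ROOT', ('NP^S', ('N^NP', 'people')), ('VP^S', ('V^VP', 'fish'), ('NP^VP', ('N^NP', 'tanks'))))

Distinguishing NP^S (subjects) from NP^VP (objects) directly targets the non-independence illustrated earlier.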

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   75K
UNARY        78.3   80K

Solution: mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   80K
SPLIT-IN     80.3   81K
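SPLIT-IN can be pictured as a deterministic relabeling pass over the treebank before the grammar is read off. The sketch below uses a coarse two-way bucket by parent category purely for illustration; the original subdivision is finer-grained.

  def split_in(tree, parent=None):
      # tree = (label, child, ...); leaves are plain strings
      label, children = tree[0], tree[1:]
      if label == "IN" and isinstance(children[0], str):
          new = "IN^SBAR" if parent == "SBAR" else "IN^PP" if parent == "PP" else "IN"
          return (new, children[0])
      return tuple([label] + [split_in(c, parent=label) if isinstance(c, tuple) else c
                              for c in children])

  print(split_in(("PP", ("IN", "with"), ("NP", ("NN", "rods")))))
  # ('PP', ('IN^PP', 'with'), ('NP', ('NN', 'rods')))

After relabeling, complementizers like "that" (under SBAR) and true prepositions like "with" (under PP) no longer share a tag, so the grammar stops confusing them.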

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1 / Size after each split (cumulative, in the order listed above):
80.4 81K; 80.5 81K; 81.2 85K; 81.6 90K; 81.7 91K; 81.8 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

• Examples:
• Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   97K
POSS-NP      83.1   98K
SPLIT-VP     85.7   105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites:
• Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation      F1     Size
Previous        85.7   105K
BASE-NP         86.0   117K
DOMINATES-V     86.9   141K
RIGHT-REC-NP    87.0   152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
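The head-rule slides above are shown as figures; the sketch below only gives the flavor of such rules (in the spirit of Collins-style head tables, not the exact table): for an NP, scan the children from the right for a noun-like category, otherwise fall back to the rightmost child, and let that child's headword percolate up.

  # Illustrative head-finding rule for NP, not the exact Collins (1999) table.
  NOUN_LIKE = {"NN", "NNS", "NNP", "NNPS", "NP"}

  def np_head_child(children):
      # children: child category labels, left to right
      for label in reversed(children):        # prefer the rightmost nominal child
          if label in NOUN_LIKE:
              return label
      return children[-1]                     # default: rightmost child

  print(np_head_child(["DT", "JJ", "NN"]))    # -> NN  ("the red ball" is headed by "ball")
  print(np_head_child(["NP", "PP"]))          # -> NP  ("the man with a telescope" is headed by "man")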

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Figure: an item X[h] over span (i, j) is built from Y[h] over (i, k) and Z[h'] over (k, j); the head h of one child becomes the head of X]

Rule notation: (VP -> VBD[saw] NP[her]) is an instance of the lexicalized rule (VP -> VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max over split points k, rules X -> Y Z, and head words w of the two cases:
      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)    (head comes from Y)
      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)    (head comes from Z)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially run the O(n^5) CKY
• Remember only a few hypotheses for each span <i, j>
• If we keep K hypotheses at each span then we do at most O(nK^2) work per span (why?)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Figure: item X[h] over (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j)]
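A minimal sketch of the per-cell beam in the spirit described above: once a span's cell is filled, keep only the K best (nonterminal, headword) hypotheses, so that later spans combine at most K x K items per split point, which is the O(nK^2) work per span mentioned above. Names are illustrative.

  import heapq

  def prune_cell(cell, K=10):
      # cell: dict mapping (nonterminal, headword) -> Viterbi score for one span <i, j>
      if len(cell) <= K:
          return cell
      return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

  cell = {("NP", "man"): 3e-4, ("NP", "telescope"): 1e-6, ("VP", "saw"): 2e-5, ("S", "saw"): 8e-7}
  print(prune_cell(cell, K=2))   # keeps only ('NP', 'man') and ('VP', 'saw')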

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: a parse tree over "He was right" with latent subcategory variables X1 ... X7 at its nodes]

Can learn with EM, like Forward-Backward for HMMs (Forward corresponds to Outside, Backward to Inside)

Automatic Annotation Induction

• Advantages:
  • Automatically learned
(Method: label all nodes with latent variables; same number k of subcategories for all categories)
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last

[Figure: the DT tag refined further, one binary split at a time]
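A sketch of the split step only, assuming rule probabilities live in a dict from (parent, children) to probability: each symbol is split in two, each new rule shares its parent rule's mass over the copies, and a little random noise breaks the symmetry before the next EM run (the initialization trick mentioned above). This is illustrative scaffolding, not the full learner of Petrov et al.

  import random
  from itertools import product

  def split_symbol(sym):
      return [sym + "-0", sym + "-1"]

  def split_grammar(rules, noise=0.01, seed=0):
      # rules: {(parent, (child1, child2, ...)): prob}; lowercase children are terminals, left unsplit
      rng = random.Random(seed)
      new_rules = {}
      for (parent, children), p in rules.items():
          for new_parent in split_symbol(parent):
              choices = [split_symbol(c) if c[0].isupper() else [c] for c in children]
              combos = list(product(*choices))
              for combo in combos:
                  # share the mass over the copies, then jitter to break symmetry
                  new_rules[(new_parent, combo)] = (p / len(combos)) * (1 + rng.uniform(-noise, noise))
      return new_rules

  g = {("NP", ("DT", "NN")): 0.3, ("DT", ("the",)): 1.0}
  print(len(split_grammar(g)))   # 2*4 NP rules + 2*1 DT rules = 10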

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:
      loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model               F1
Previous            88.4
With 50% Merging    89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. '06         90.2              89.7

Hierarchical Pruning

coarse:          ... QP NP VP ...
split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight:  ... (and so on for every category) ...

Parse multiple times with grammars at different levels of granularity
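A sketch of the coarse-to-fine idea behind this: parse with the coarser grammar first, record a score for every labeled span, and let the finer pass build an item only if its coarse projection survived a threshold. The chart machinery is abstracted away; project() maps a refined symbol such as NP-3 back to its coarse symbol, and the threshold value is illustrative.

  def project(symbol):
      # NP-3 -> NP, VP-1 -> VP, ...: each refined grammar refines the previous one
      return symbol.split("-")[0]

  def allowed(coarse_scores, i, j, refined_symbol, threshold=1e-6):
      # coarse_scores: dict (i, j, coarse_symbol) -> posterior (or max) score from the coarser pass
      return coarse_scores.get((i, j, project(refined_symbol)), 0.0) >= threshold

  coarse = {(0, 2, "NP"): 0.03, (0, 2, "VP"): 1e-9}
  print(allowed(coarse, 0, 2, "NP-3"), allowed(coarse, 0, 2, "VP-1"))   # True False

  # Inside the fine pass, a cell entry for "NP-3" over (begin, end) is only built
  # if allowed(coarse_scores, begin, end, "NP-3") is true.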

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)



bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 25: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Syntactic Ambiguities I

• Prepositional phrases: They cooked the beans in the pot on the stove with handles

• Particle vs preposition: The lady dressed up the staircase

• Complement structures: The tourists objected to the guide that they couldn't hear / She knows you like the back of her hand

• Gerund vs participial adjective: Visiting relatives can be boring / Changing schedules frequently confused passengers

Syntactic Ambiguities II

• Modifier scope within NPs: impractical design requirements / plastic cup holder

• Multiple gap constructions: The chicken is ready to eat / The contractors are rich enough to sue

• Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

• Dislocation / gapping: Which book should Peter buy? / A debate arose which continued until the election

• Binding (reference): The IRS audits itself

• Control: I want to go / I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing: two problems to solve. 1: Repeated work…

Parsing: two problems to solve. 2: Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

• Is the problem 'AI complete'? Yes, but …

• Words are good predictors of attachment, even absent full understanding:

• Moscow sent more than 100,000 soldiers into Afghanistan …

• Sydney Water breached an agreement with NSW Health …

• Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (Σ, N, S, R, P):

• Σ is a set of terminal symbols

• N is a set of nonterminal symbols

• S is the start symbol (S ∈ N)

• R is a set of rules/productions of the form X → γ

• P is a probability function, P: R → [0,1], with Σγ P(X → γ) = 1 for every nonterminal X

• A grammar G generates a language model L: summed over all terminal strings g, Σ P(g) = 1 (a small sketch of this representation follows)
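To make this concrete, here is a minimal sketch (not from the slides) of a PCFG held in a Python dictionary, using a few rules of the example grammar below, together with the sum-to-one check on each left-hand side:

# A toy PCFG stored as {lhs: [(rhs_tuple, prob), ...]}.
# Rule names follow the example grammar below; this is an illustrative sketch.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("Vi",), 0.4), (("Vt", "NP"), 0.4), (("VP", "PP"), 0.2)],
    "NP": [(("DT", "NN"), 0.3), (("NP", "PP"), 0.7)],
    "PP": [(("P", "NP"), 1.0)],
}

# Sanity check: P defines a distribution over the expansions of each nonterminal.
for lhs, expansions in pcfg.items():
    total = sum(p for _, p in expansions)
    assert abs(total - 1.0) < 1e-9, f"probabilities for {lhs} sum to {total}"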

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is

p(t) = ∏_{i=1}^{n} q(αi → βi)

where q(α → β) is the probability for rule α → β.

44

Example of a PCFG

48

Probability of a Parse (using the same PCFG and tree-probability formula as above)


Two example parses:

t1 = the tree for "The man sleeps": S → NP VP, NP → DT NN, DT → the, NN → man, VP → Vi, Vi → sleeps.
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = the tree for "The man saw the woman with the telescope", with the PP attached to the VP; p(t2) is likewise the product of the probabilities of every rule used in that tree.
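A minimal sketch of the product above for t1, with the rule probabilities taken from the grammar (the rule strings are just labels for illustration):

from functools import reduce

# Rules used in t1 = "The man sleeps", with their probabilities from the PCFG above.
rules_t1 = [
    ("S -> NP VP", 1.0),
    ("NP -> DT NN", 0.3),
    ("DT -> the", 1.0),
    ("NN -> man", 0.7),
    ("VP -> Vi", 0.4),
    ("Vi -> sleeps", 1.0),
]

p_t1 = reduce(lambda acc, rule: acc * rule[1], rules_t1, 1.0)
print(p_t1)  # ≈ 0.084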

PCFGs Learning and Inference

Model: the probability of a tree t with n rules αi → βi, i = 1..n, is the product of the rule probabilities.

Learning: read the rules off of labeled sentences, use ML estimates for the probabilities, q_ML(α → β) = Count(α → β) / Count(α), and use all of our standard smoothing tricks (a small sketch follows below).

Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s).
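A small sketch of the learning step under an assumed tree encoding (nested tuples are a stand-in for whatever treebank reader is actually used): count each rule and divide by the count of its left-hand side.

from collections import defaultdict

def ml_rule_probs(trees):
    """trees: nested tuples like ("S", ("NP", ("N", "people")), ("VP", ("V", "fish")))."""
    rule_counts = defaultdict(int)
    lhs_counts = defaultdict(int)

    def visit(node):
        if isinstance(node, str):          # a word; no rule to record
            return
        lhs, children = node[0], node[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for c in children:
            visit(c)

    for t in trees:
        visit(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Example: two tiny trees give q(VP -> V NP) = 0.5 and q(VP -> V) = 0.5.
trees = [
    ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks")))),
    ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"))),
]
print(ml_rule_probs(trees))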

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ Σ

• A transformation to this form doesn't change the weak generative capacity of a CFG

• That is, it recognizes the same language (but maybe with different trees)

• Empties and unaries are removed recursively

• n-ary rules are divided by introducing new nonterminals (n > 2); a small sketch of this step follows below
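An illustrative sketch of the binarization step, using the same VP_V-style intermediate symbols as the grammars below (this is not the exact tool used for the lecture's grammars):

def binarize(rules):
    """Turn rules like ("VP", ("V", "NP", "PP")) into binary rules by
    introducing intermediate symbols such as VP_V."""
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_sym = f"{lhs}_{rhs[0]}"          # e.g. VP_V
            out.append((lhs, (rhs[0], new_sym)))
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]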

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

• You should think of this as a transformation for efficient parsing

• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform (see the sketch after this list)

• In practice, full Chomsky Normal Form is a pain

• Reconstructing n-aries is easy; reconstructing unaries/empties is trickier

• Binarization is crucial for cubic-time CFG parsing

• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
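For the book-keeping point above, a sketch of the detransform: it assumes the naming convention used here (intermediate labels contain an underscore) and splices such nodes out so their children re-attach to the parent.

def unbinarize(tree):
    """tree: (label, child, child, ...) or a word string.
    Splices out intermediate nodes introduced by binarization (labels with '_')."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [unbinarize(c) for c in tree[1:]]
    flat = []
    for c in children:
        if not isinstance(c, str) and "_" in c[0]:
            flat.extend(c[1:])     # re-attach the intermediate node's children
        else:
            flat.append(c)
    return (label,) + tuple(flat)

binarized = ("VP", ("V", "fish"),
             ("VP_V", ("NP", ("N", "tanks")),
                      ("PP", ("P", "with"), ("NP", ("N", "rods")))))
print(unbinarize(binarized))
# ('VP', ('V', 'fish'), ('NP', ('N', 'tanks')), ('PP', ('P', 'with'), ('NP', ('N', 'rods'))))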

An example before binarization…

(ROOT (S (NP (N people)) (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))

After binarization…

(ROOT (S (NP (N people)) (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))

Treebank empties and unaries (the one-word sentence "Atone"):

PTB tree: (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags: (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties: (ROOT (S (VP (VB Atone))))
NoUnaries, high attachment: (ROOT (S Atone)); low attachment: (ROOT (VB Atone))

Parsing

66

Constituency Parsing

fish people fish tanks

Rule → Prob θi:
S → NP VP θ0
NP → NP NP θ1
…
N → fish θ42
N → people θ43
V → fish θ44
…

PCFG (shown with an example parse of "fish people fish tanks": S → NP VP, with the NP covering "fish people" and the VP covering "fish tanks")

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (max) scores for the one-word spans, e.g. "people": NP 0.35, V 0.1, N 0.5; "fish": VP 0.06, NP 0.14, V 0.6, N 0.2

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

Extended CKY parsing

• Unaries can be incorporated into the algorithm: messy, but doesn't increase algorithmic complexity

• Empties can be incorporated: use fenceposts; doesn't increase complexity, essentially like unaries

• Binarization is vital: without binarization you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar

• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

The CKY algorithm (1960/1965) … extended to unaries

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)
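Below is a compact runnable re-implementation of the same algorithm in Python, offered as an illustration rather than the original course code; it hard-codes the small fish/people grammar listed just below and returns the Viterbi scores for the whole sentence.

from collections import defaultdict

# Binarized PCFG from the worked example (probabilities as in the slides).
binary = {  # (B, C) -> list of (A, prob) for rules A -> B C
    ("NP", "VP"): [("S", 0.9)],
    ("V", "NP"): [("VP", 0.5)],
    ("V", "VP_V"): [("VP", 0.3)],
    ("V", "PP"): [("VP", 0.1)],
    ("NP", "PP"): [("NP", 0.2), ("VP_V", 1.0)],
    ("NP", "NP"): [("NP", 0.1)],
    ("P", "NP"): [("PP", 1.0)],
}
unary = {"N": [("NP", 0.7)], "V": [("VP", 0.1)], "VP": [("S", 0.1)]}  # B -> (A, prob) for A -> B
lexical = {
    "people": [("N", 0.5), ("V", 0.1)],
    "fish": [("N", 0.2), ("V", 0.6)],
    "tanks": [("N", 0.2), ("V", 0.3)],
    "rods": [("N", 0.1)],
    "with": [("P", 1.0)],
}

def apply_unaries(cell):
    changed = True
    while changed:                      # unary closure, as in the pseudocode above
        changed = False
        for b, score_b in list(cell.items()):
            for a, p in unary.get(b, []):
                if p * score_b > cell.get(a, 0.0):
                    cell[a] = p * score_b
                    changed = True

def cky(words):
    n = len(words)
    score = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a, p in lexical.get(w, []):
            score[i][i + 1][a] = p
        apply_unaries(score[i][i + 1])
    for span in range(2, n + 1):
        for begin in range(0, n - span + 1):
            end = begin + span
            cell = score[begin][end]
            for split in range(begin + 1, end):
                left, right = score[begin][split], score[split][end]
                for (b, c), rules in binary.items():
                    if left[b] > 0 and right[c] > 0:
                        for a, p in rules:
                            cand = p * left[b] * right[c]
                            if cand > cell[a]:
                                cell[a] = cand
            apply_unaries(cell)
    return score[0][n]

print(cky("fish people fish tanks".split())["S"])   # ≈ 0.00018522, the best S parse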

The grammar: binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

(CKY chart for "fish people fish tanks", fenceposts 0–4, filled step by step with the grammar above.)

Lexical fill, score[i][i+1][A] = P(A → words[i]):
  [0,1] fish: N 0.2, V 0.6    [1,2] people: N 0.5, V 0.1    [2,3] fish: N 0.2, V 0.6    [3,4] tanks: N 0.2, V 0.3

Unary closure over the one-word spans (NP → N, VP → V, S → VP):
  [0,1] fish: NP 0.14, VP 0.06, S 0.006      [1,2] people: NP 0.35, VP 0.01, S 0.001
  [2,3] fish: NP 0.14, VP 0.06, S 0.006      [3,4] tanks: NP 0.14, VP 0.03, S 0.003

Length-2 spans (binary rules, then unaries again; e.g. in [0,2] the binary pass gives S → NP VP 0.00126, which the unary S → VP then improves to 0.0105):
  [0,2] "fish people": NP → NP NP 0.0049; VP → V NP 0.105; S → VP 0.0105
  [1,3] "people fish": NP → NP NP 0.0049; VP → V NP 0.007; S → NP VP 0.0189
  [2,4] "fish tanks":  NP → NP NP 0.00196; VP → V NP 0.042; S → VP 0.0042

Length-3 spans:
  [0,3]: NP → NP NP 0.0000686; VP → V NP 0.00147; S → NP VP 0.000882
  [1,4]: NP → NP NP 0.0000686; VP → V NP 0.000098; S → NP VP 0.01323

Full span [0,4]:
  NP → NP NP 0.0000009604; VP → V NP 0.00002058; S → NP VP 0.00018522

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)

Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
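A minimal sketch of the bracket-scoring computation, writing each labeled bracket as a (label, start, end) triple:

def parseval(gold, guess):
    gold, guess = set(gold), set(guess)
    correct = len(gold & guess)
    lp = correct / len(guess)
    lr = correct / len(gold)
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
         ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(parseval(gold, guess))   # LP ≈ 42.9%, LR = 37.5%, F1 = 40.0%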

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)

• So the two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

(figure: in a tree with S → NP VP and NP → DT NN, what happens inside the NP is independent of the rest of the tree given the label NP)

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

Expansion   All NPs   NPs under S   NPs under VP
NP PP          11%          9%           23%
DT NN           9%          9%            7%
PRP             6%         21%            4%

Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong

(example: in the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

(figure: as the horizontal Markov order goes from 0 through 1, 2v, 2 to ∞, parsing accuracy stays roughly in the 70–74 F1 range while the number of grammar symbols grows from a few thousand toward about 12,000)

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation); Order 1 vs Order 2

(figure: as the vertical Markov order goes from 1 through 2v, 2, 3v to 3, F1 rises from about 72 toward 79 while the number of symbols grows toward roughly 25,000)

Model    F1     Size
v=h=2v   77.8   7.5K
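A sketch of what these annotations do to symbol names (hypothetical helper functions, just to make the states concrete): vertical markovization appends ancestor labels, and horizontal markovization limits how many already-generated siblings an intermediate binarized symbol remembers.

def vertical_annotate(label, ancestors, v=2):
    """Vertical markovization of order v: annotate with the v-1 closest ancestors.
    `ancestors` lists the parent first, then the grandparent, and so on."""
    return "^".join([label] + ancestors[:v - 1])   # v=2 gives parent annotation, e.g. NP^S

def horizontal_symbol(parent, generated, h=2):
    """Intermediate symbol in a markovized binarization that remembers at most
    the last h siblings already generated under `parent`."""
    remembered = generated[-h:] if h > 0 else []
    return "@" + parent + "->..." + "_".join(remembered)

print(vertical_annotate("NP", ["S", "ROOT"]))           # NP^S
print(horizontal_symbol("VP", ["V", "NP", "PP"], h=2))  # @VP->...NP_PP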

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Solution: mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

• Partial solution: subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs "those")

• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs "very")

• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]

• SPLIT-CC: separate "but" and "&" from other conjunctions

• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

• Examples:
• Possessive NPs
• Finite vs infinite VPs
• Lexical heads

• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites:
• Contains a verb
• Is (non)-recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

(figure: a PP attachment site annotated v or -v depending on whether the intervening material dominates a verb)

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

(diagram: a lexicalized constituent X[h] over span [i, j], split at k into Y[h] over [i, k] and Z[h'] over [k, j]; e.g. (VP → VBD[saw] NP[her]), i.e. (VP → VBD NP)[saw])

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of:
      max over k, w, X -> Y Z of score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X -> Y Z of score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]

• Essentially, run the O(n^5) CKY; remember only a few hypotheses for each span ⟨i, j⟩

• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?); keeps things more or less cubic (see the sketch below)

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
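A minimal sketch of per-cell beam pruning (the hypothesis representation and K are assumptions for illustration): after a span is filled, keep only its K best entries before larger spans are built.

import heapq

def prune_cell(cell, k=10):
    """cell: dict mapping (category, head) hypotheses to scores.
    Keep only the k highest-scoring hypotheses for this span."""
    if len(cell) <= k:
        return cell
    kept = heapq.nlargest(k, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

# Example: a crowded span is cut down to its 2 best entries.
span_cell = {("NP", "man"): 1e-4, ("NP", "telescope"): 2e-6,
             ("VP", "saw"): 5e-5, ("S", "saw"): 1e-7}
print(prune_cell(span_cell, k=2))   # keeps ('NP', 'man') and ('VP', 'saw')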

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis/Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits

• Manually split categories:
• NP: subject vs object
• DT: determiners vs demonstratives
• IN: sentential vs prepositional

• Advantages:
• Fairly compact grammar
• Linguistic motivations

• Disadvantages:
• Performance leveled out
• Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories (figure: a tree over "He was right" with latent labels X1 … X7 at its nodes)

• Can learn with EM, like Forward-Backward for HMMs: Backward/Inside and Forward/Outside passes

Automatic Annotation Induction

Label all nodes with latent variables; same number k of subcategories for all categories

• Advantages:
• Automatically learned

• Disadvantages:
• Grammar gets too large
• Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al '05       86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al '06          90.2              89.7

Hierarchical Pruning

coarse:         … QP NP VP …

split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight: … (and so on) …

Parse multiple times with grammars at different levels of granularity
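A sketch of the coarse-to-fine idea (illustrative, with an assumed naming scheme for split symbols): parse with the coarser grammar first, and allow a refined symbol in a cell only if its coarse projection survived a posterior threshold there.

def coarse_projection(symbol):
    """Map a split symbol such as 'NP3' back to its coarse base category 'NP'."""
    return symbol.rstrip("0123456789")

def allowed_refined_symbols(cell_posteriors, refined_symbols, threshold=1e-4):
    """cell_posteriors: coarse-symbol posteriors for one span from the previous pass."""
    return [s for s in refined_symbols
            if cell_posteriors.get(coarse_projection(s), 0.0) >= threshold]

posteriors = {"NP": 0.3, "QP": 1e-6, "VP": 0.05}
print(allowed_refined_symbols(posteriors, ["NP1", "NP2", "QP1", "QP2", "VP1"]))
# ['NP1', 'NP2', 'VP1']  (the QP splits are pruned away for this span)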

Bracket Posteriors

(bracket posterior heat maps at increasing refinement; parse times: 1621 min, 111 min, 35 min, 15 min [91.2 F1, no search error])

Page 26: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Syntactic Ambiguities II

bull Modifier scope within NPsimpractical design requirementsplastic cup holder

bull Multiple gap constructionsThe chicken is ready to eatThe contractors are rich enough to sue

bull Coordination scopeSmall rats and mice can squeeze into holes or cracks in the wall

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 27: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Non-Local Phenomena

bull Dislocation gappingbull Which book should Peter buy

bull A debate arose which continued until the election

bull Bindingbull Reference

bull The IRS audits itself

bull Controlbull I want to go

bull I want you to go

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing: Two problems to solve
1. Repeated work…

Parsing: Two problems to solve
2. Choosing the correct parse

• How do we work out the correct attachment?
• She saw the man with a telescope
• Is the problem 'AI complete'? Yes, but …
• Words are good predictors of attachment
  • Even absent full understanding
  • Moscow sent more than 100,000 soldiers into Afghanistan …
  • Sydney Water breached an agreement with NSW Health …
• Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G = (Σ, N, S, R, P)
  • Σ is a set of terminal symbols
  • N is a set of nonterminal symbols
  • S is the start symbol (S ∈ N)
  • R is a set of rules/productions of the form X → γ
  • P is a probability function P: R → [0,1], with ∑_γ P(X → γ) = 1 for every nonterminal X
• A grammar G generates a language model L: the probabilities of everything it generates sum to one,

  ∑_{γ ∈ T} P(γ) = 1, where T is the set of trees generated by G

PCFG Example

A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

• Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is

  p(t) = ∏_{i=1}^{n} q(αi → βi)

  where q(α → β) is the probability for rule α → β.

44
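This toy grammar is small enough to write down directly. A minimal sketch in Python (the rules and numbers are the ones listed above; the dictionary encoding and the sum-to-one check are just one way to hold them, not anything prescribed by the slides):

# Toy PCFG from the slide: each left-hand side maps to (right-hand side, probability) pairs.
pcfg = {
    "S":  [(("NP", "VP"), 1.0)],
    "VP": [(("Vi",), 0.4), (("Vt", "NP"), 0.4), (("VP", "PP"), 0.2)],
    "NP": [(("DT", "NN"), 0.3), (("NP", "PP"), 0.7)],
    "PP": [(("P", "NP"), 1.0)],
    "Vi": [(("sleeps",), 1.0)],
    "Vt": [(("saw",), 1.0)],
    "NN": [(("man",), 0.7), (("woman",), 0.2), (("telescope",), 0.1)],
    "DT": [(("the",), 1.0)],
    "IN": [(("with",), 0.5), (("in",), 0.5)],
}

# A PCFG is well-formed if the rule probabilities for each nonterminal sum to 1.
for lhs, rules in pcfg.items():
    total = sum(p for _, p in rules)
    assert abs(total - 1.0) < 1e-9, f"{lhs} rules sum to {total}"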

Example of a PCFG

48

Probability of a Parse

(The same PCFG and tree-probability formula as above, now applied to two example sentences.)

The man sleeps

The man saw the woman with the telescope

[Two parse trees under the PCFG above:
t1 = the parse of "The man sleeps": S → NP VP, NP → DT NN, DT → the, NN → man, VP → Vi, Vi → sleeps;
p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084
t2 = the parse of "The man saw the woman with the telescope", with the PP attached to the VP;
p(t2) is likewise the product of the probabilities of all rules used in that tree.]
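To make the product formula concrete, a small sketch that multiplies out the rule probabilities of t1; the rule list below is read off the tree described above, and the representation is purely illustrative:

# Rules used in t1, the parse of "The man sleeps", with their probabilities.
t1_rules = [
    ("S -> NP VP", 1.0),
    ("NP -> DT NN", 0.3),
    ("DT -> the", 1.0),
    ("NN -> man", 0.7),
    ("VP -> Vi", 0.4),
    ("Vi -> sleeps", 1.0),
]

p_t1 = 1.0
for _, q in t1_rules:
    p_t1 *= q
print(p_t1)   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0, approximately 0.084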

PCFGs: Learning and Inference

Model: The probability of a tree t with n rules αi → βi, i = 1..n

Learning: Read the rules off of labeled sentences, use ML estimates for the probabilities, and use all of our standard smoothing tricks

Inference: For input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
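The learning step ("read the rules off, take ML estimates") is just counting. A sketch, assuming trees are represented as nested (label, children) tuples; that encoding and the helper names are assumptions made for the example:

from collections import Counter, defaultdict

def count_rules(tree, counts):
    # Count every rule label -> (child labels) occurring in one tree.
    label, children = tree
    if isinstance(children, str):                  # preterminal: label -> word
        counts[(label, (children,))] += 1
        return
    counts[(label, tuple(c[0] for c in children))] += 1
    for child in children:
        count_rules(child, counts)

def mle_pcfg(trees):
    # q(alpha -> beta) = count(alpha -> beta) / count(alpha)
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

# Tiny example: one tree for "people fish".
tree = ("S", [("NP", [("N", "people")]), ("VP", [("V", "fish")])])
print(mle_pcfg([tree]))     # every rule seen once, so all estimates are 1.0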

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w
  • X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic-time CFG parsing (a right-binarization sketch follows below)
  • The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
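A sketch of the right-binarization step for the n-ary rules, introducing intermediate symbols named after the consumed prefix in the spirit of the VP_V symbol used on these slides (the function name and rule encoding are illustrative):

def binarize(rules):
    # Split each rule X -> Y1 Y2 ... Yn (n > 2) into binary rules,
    # introducing a new intermediate symbol for the remainder.
    binary = []
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            rest = lhs + "_" + rhs[0]            # e.g. VP_V
            binary.append((lhs, (rhs[0], rest)))
            lhs, rhs = rest, rhs[1:]
        binary.append((lhs, tuple(rhs)))
    return binary

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]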

An example before binarization…

[Tree for "people fish tanks with rods": ROOT → S, S → NP VP, NP → N (people), and the ternary VP → V (fish) NP (N tanks) PP (P with, NP → N rods)]

After binarization…

[The same tree, with VP → V VP_V and VP_V → NP PP, so every rule has at most two children]

Treebank empties and unaries

[Five versions of the same tree for "Atone":
PTB Tree: (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags: (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties: (ROOT (S (VP (VB Atone))))
NoUnaries, cutting high: (ROOT (S Atone))
NoUnaries, cutting low: (ROOT (VB Atone))]

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

[Chart cells for "people fish": people → NP 0.35, V 0.1, N 0.5; fish → VP 0.06, NP 0.14, V 0.6, N 0.2]

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and rules X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)
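Evaluated naively, this recursion recomputes the same (X, i, j) subproblems exponentially often; memoizing them is exactly what turns it into the CKY dynamic program below. A Python sketch of the memoized form, where lexical_prob and binary_rules are hypothetical grammar accessors, not a real API:

from functools import lru_cache

def make_parser(lexical_prob, binary_rules, words):
    # lexical_prob(X, word) -> q(X -> word); binary_rules(X) -> list of (Y, Z, q).
    @lru_cache(maxsize=None)
    def best_score(X, i, j):
        if i == j:
            return lexical_prob(X, words[i])
        best = 0.0
        for Y, Z, q in binary_rules(X):
            for k in range(i, j):
                best = max(best, q * best_score(Y, i, k) * best_score(Z, k + 1, j))
        return best
    return best_score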

function CKY(words, grammar) returns [most_probable_parse, prob]

  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]

  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true

  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries

The grammar: Binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score, back) to get the best parse.

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)

Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
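The same numbers can be reproduced with a few lines of set arithmetic. A sketch using (label, start, end) triples, filled in with the gold and candidate brackets of this example:

def labeled_prf1(gold, guess):
    correct = len(gold & guess)
    precision = correct / len(guess)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
         ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(labeled_prf1(gold, guess))   # 3 correct brackets: (0.429, 0.375, 0.400)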

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • So the two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

• Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Diagram: a tree with an NP node under S; the rules S → NP VP and NP → DT NN illustrate that the material inside the NP is independent of the material outside it, given the label NP]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

[Histograms of NP expansions:
              All NPs   NPs under S   NPs under VP
  NP PP         11%          9%            23%
  DT NN          9%          9%             7%
  PRP            6%         21%             4%]

Non-Independence II

• Symptoms of overly strong assumptions
  • Rewrites get used where they don't belong

[In the PTB, this construction is for possessives]

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal markovization merges states

[Two plots against horizontal Markov order (0, 1, 2v, 2, ∞): parsing F1 (roughly 70–74) and number of grammar symbols (roughly 0–12,000)]
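One way to see what gets merged: when binarizing a rule, the intermediate symbol records only the last h siblings generated so far instead of the full history. A sketch, with an illustrative symbol-naming scheme:

def binarize_markov(lhs, rhs, h=1):
    # Right-binarize X -> Y1 ... Yn, keeping only the last h generated
    # siblings in the intermediate symbol (horizontal Markov order h).
    rules, prev = [], []
    current, rhs = lhs, list(rhs)
    while len(rhs) > 2:
        head = rhs.pop(0)
        prev = (prev + [head])[-h:] if h > 0 else []
        nxt = lhs + "|" + "_".join(prev)         # e.g. VP|V, VP|NP
        rules.append((current, (head, nxt)))
        current = nxt
    rules.append((current, tuple(rhs)))
    return rules

print(binarize_markov("VP", ("V", "NP", "NP", "PP"), h=1))
# [('VP', ('V', 'VP|V')), ('VP|V', ('NP', 'VP|NP')), ('VP|NP', ('NP', 'PP'))]
# With a larger h the second intermediate symbol would be VP|V_NP; lowering h merges such states.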

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e. parent annotation)

[Example trees for order 1 vs. order 2, plus two plots against vertical Markov order (1, 2v, 2, 3v, 3): parsing F1 (roughly 72–79) and number of grammar symbols (up to about 25,000)]
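Order-2 vertical markovization is just parent annotation. A sketch over (label, children) trees; the tree encoding and the ^ separator are illustrative, and POS tags are left unannotated here (annotating them too is the separate TAG-PA split discussed below):

def parent_annotate(tree, parent="ROOT"):
    # NP under S becomes NP^S, VP under S becomes VP^S, and so on.
    label, children = tree
    if isinstance(children, str):          # preterminal (tag over a word): leave it
        return (label, children)
    return (label + "^" + parent, [parent_annotate(c, label) for c in children])

tree = ("S", [("NP", [("PRP", "He")]),
              ("VP", [("VBD", "was"), ("ADJP", [("JJ", "right")])])])
print(parent_annotate(tree))
# ('S^ROOT', [('NP^S', [('PRP', 'He')]), ('VP^S', [('VBD', 'was'), ('ADJP^VP', [('JJ', 'right')])])])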

Model F1 Size
v = h = 2v 77.8 7.5K

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K

Solution: Mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
  • Subdivide the IN tag

Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1 Size
80.4 8.1K
80.5 8.1K
81.2 8.5K
81.6 9.0K
81.7 9.1K
81.8 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 9.7K
POSS-NP 83.1 9.8K
SPLIT-VP 85.7 10.5K

Distance Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation F1 Size
Previous 85.7 10.5K
BASE-NP 86.0 11.7K
DOMINATES-V 86.9 14.1K
RIGHT-REC-NP 87.0 15.2K

[Diagram: a PP attaching under VP vs. under a lower NP, with the intervening nodes marked v / -v for whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] over span (i, j) is built from Y[h] over (i, k) and Z[h′] over (k, j); e.g. (VP → VBD[saw] NP[her]) yields (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max of
      max over k, rules X -> Y Z, and head words w:
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, rules X -> Y Z, and head words w:
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j
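A sketch of the per-cell beam itself: once a span's cell is filled, keep only its K highest-scoring entries. The chart-cell layout is illustrative; the Collins parser's real pruning also uses probability thresholds and the punctuation constraints mentioned above:

import heapq

def prune_cell(cell, K=10):
    # cell maps chart entries for one span, e.g. (label, head), to Viterbi scores.
    if len(cell) <= K:
        return cell
    return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

cell = {("NP", "man"): 0.02, ("NP", "telescope"): 0.001, ("S", "saw"): 0.03}
print(prune_cell(cell, K=2))   # keeps the two highest-scoring entries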

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Analysis/Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve statistical fit of the grammar
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Tree over "He was right" with latent subcategory variables X1 … X7 at each node]

Can learn with EM, like Forward-Backward for HMMs (Forward/Backward ↔ Outside/Inside)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al. '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last
(a sketch of the resulting split-merge loop follows below)

DT
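The split-merge loop can be sketched at a high level; run_em, split_symbols, and merge_least_useful below are hypothetical helpers standing in for the EM and merging steps described on these slides, not a real API:

def split_merge_training(grammar, treebank, rounds=6, merge_fraction=0.5):
    # Hierarchical refinement in the style of Petrov et al. 06: split every
    # subcategory in two, re-estimate with EM (initialized from the previous
    # grammar), then roll back the least useful half of the splits.
    for _ in range(rounds):
        grammar = split_symbols(grammar, factor=2)                       # hypothetical
        grammar = run_em(grammar, treebank)                              # hypothetical
        grammar = merge_least_useful(grammar, treebank, merge_fraction)  # hypothetical
        grammar = run_em(grammar, treebank)
    return grammar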

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

  loss = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1
Previous 88.4
With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al. '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al. '06 90.2 89.7

Hierarchical Pruning

coarse: … QP NP VP …
split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … … …

Parse multiple times with grammars at different levels of granularity (a coarse-to-fine sketch follows below)
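A sketch of that coarse-to-fine loop: each pass keeps only the chart items whose projection back to the previous, coarser grammar had enough posterior mass. parse_chart, posterior, and project are hypothetical helpers for the inside-outside computations:

def coarse_to_fine_parse(sentence, grammars, threshold=1e-4):
    # grammars: increasingly refined grammars, coarsest first.
    allowed, chart = None, None
    for grammar in grammars:
        chart = parse_chart(sentence, grammar, allowed=allowed)     # hypothetical
        allowed = {(i, j, project(label))                           # coarse projection
                   for (i, j, label) in chart
                   if posterior(chart, (i, j, label)) > threshold}
    return chart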

Bracket Posteriors

[Figures: bracket posterior plots under increasingly refined grammars; parsing times 1621 min, 111 min, 35 min, and 15 min, the last at 91.2 F1 (no search error)]

Page 28: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

32

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 29: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Fragment of a Noun Phrase Grammar

33

Extended Grammar with Prepositional Phrases

34

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A]  = B
            added = true
  return buildTree(score, back)

The CKY algorithm (19601965)hellip extended to unaries
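For readers who want to run the algorithm, here is a compact Python sketch of the same procedure; the dictionary-based grammar encoding and the names cky, lexicon, unary and binary are assumptions made for this example rather than part of the original pseudocode:

from collections import defaultdict

def cky(words, lexicon, unary, binary):
    # lexicon[(A, word)] = P(A -> word); unary[(A, B)] = P(A -> B);
    # binary[(A, B, C)] = P(A -> B C).  score[(i, j, A)] is the best probability
    # for nonterminal A over the words between fenceposts i and j.
    n = len(words)
    score = defaultdict(float)
    back = {}

    def apply_unaries(i, j):
        added = True
        while added:                              # unary closure, as in the pseudocode
            added = False
            for (A, B), q in unary.items():
                prob = q * score[(i, j, B)]
                if prob > score[(i, j, A)]:
                    score[(i, j, A)] = prob
                    back[(i, j, A)] = ("unary", B)
                    added = True

    for i, w in enumerate(words):                 # lexical rules on the diagonal
        for (A, word), q in lexicon.items():
            if word == w and q > score[(i, i + 1, A)]:
                score[(i, i + 1, A)] = q
                back[(i, i + 1, A)] = ("word", w)
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):                  # longer spans, shortest first
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), q in binary.items():
                    prob = score[(begin, split, B)] * score[(split, end, C)] * q
                    if prob > score[(begin, end, A)]:
                        score[(begin, end, A)] = prob
                        back[(begin, end, A)] = ("binary", split, B, C)
            apply_unaries(begin, end)
    return score, back

With the grammar below encoded in these dictionaries, cky("fish people fish tanks".split(), lexicon, unary, binary) reproduces the chart values shown in the walkthrough that follows.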

The grammar (binary, no epsilons):

S → NP VP      0.9        N → people   0.5
S → VP         0.1        N → fish     0.2
VP → V NP      0.5        N → tanks    0.2
VP → V         0.1        N → rods     0.1
VP → V VP_V    0.3        V → people   0.1
VP → V PP      0.1        V → fish     0.6
VP_V → NP PP   1.0        V → tanks    0.3
NP → NP NP     0.1        P → with     1.0
NP → NP PP     0.2
NP → N         0.7
PP → P NP      1.0

The chart for "fish people fish tanks" uses fenceposts 0–4; cell score[i][j] records, for every nonterminal, the best probability found over the span from fencepost i to fencepost j.

First the lexical loop fills the diagonal cells score[i][i+1] with P(A → words[i]), and the unary closure then adds the NP → N, VP → V and S → VP entries:

score[0][1] (fish):    N → fish 0.2,   V → fish 0.6,   NP → N 0.14,  VP → V 0.06,  S → VP 0.006
score[1][2] (people):  N → people 0.5, V → people 0.1, NP → N 0.35,  VP → V 0.01,  S → VP 0.001
score[2][3] (fish):    N → fish 0.2,   V → fish 0.6,   NP → N 0.14,  VP → V 0.06,  S → VP 0.006
score[3][4] (tanks):   N → tanks 0.2,  V → tanks 0.3,  NP → N 0.14,  VP → V 0.03,  S → VP 0.003

Then the binary loop (prob = score[begin][split][B] * score[split][end][C] * P(A → B C)), again followed by the unary closure, fills the longer spans, shortest first:

score[0][2]: NP → NP NP 0.0049,       VP → V NP 0.105,      S → VP 0.0105
score[1][3]: NP → NP NP 0.0049,       VP → V NP 0.007,      S → NP VP 0.0189
score[2][4]: NP → NP NP 0.00196,      VP → V NP 0.042,      S → VP 0.0042
score[0][3]: NP → NP NP 0.0000686,    VP → V NP 0.00147,    S → NP VP 0.000882
score[1][4]: NP → NP NP 0.0000686,    VP → V NP 0.000098,   S → NP VP 0.01323
score[0][4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score back) to get the best parse
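A sketch of buildTree, following the back pointers produced by the Python CKY sketch above (the tuple-based back-pointer encoding is an assumption made there, not something fixed by the pseudocode):

def build_tree(back, i, j, A):
    entry = back[(i, j, A)]
    if entry[0] == "word":                 # preterminal over a single word
        return (A, entry[1])
    if entry[0] == "unary":                # unary chain A -> B
        return (A, build_tree(back, i, j, entry[1]))
    _, split, B, C = entry                 # binary rule A -> B C, split at 'split'
    return (A, build_tree(back, i, split, B), build_tree(back, split, j, C))

On the chart above, build_tree(back, 0, 4, "S") returns the parse (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks)))), whose probability 0.9 × 0.0049 × 0.042 = 0.00018522 matches the S entry in score[0][4].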

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)

Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
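The same numbers can be reproduced with a few lines of Python (a minimal sketch, not the official evalb scorer; brackets are encoded as (label, start, end) tuples):

def prf(gold, cand):
    matched = len(gold & cand)                    # brackets identical in label and span
    precision = matched / len(cand)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}

p, r, f = prf(gold, cand)
print(f"LP={p:.1%} LR={r:.1%} F1={f:.1%}")        # LP=42.9% LR=37.5% F1=40.0%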

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Gives a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • The two analyses receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

                 NP PP   DT NN   PRP
  All NPs         11%      9%     6%
  NPs under S      9%      9%    21%
  NPs under VP    23%      7%     4%

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong (in the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Plots: parsing F1 (roughly 70–74) and grammar size in symbols (up to about 12,000) as a function of horizontal Markov order 0, 1, 2v, 2, ∞]

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation); a small annotation sketch follows after the result table below

[Figures: example trees for order 1 vs. order 2; plots of parsing F1 (roughly 72–79) and grammar size in symbols (up to about 25,000) as a function of vertical Markov order 1, 2v, 2, 3v, 3]

Model     F1     Size
v=h=2v    77.8   7.5K
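As a concrete illustration of parent annotation (vertical Markov order 2), here is a small Python sketch over tuple-encoded trees; the tuple encoding and the function name are assumptions made for this example:

# Order-2 vertical Markovization: every nonterminal is annotated with its
# parent's label, so NP under S becomes NP^S and NP under VP becomes NP^VP.
# Trees are (label, child, child, ...) tuples, with plain strings as words.
def parent_annotate(tree, parent=None):
    if isinstance(tree, str):              # a word: terminals stay unchanged
        return tree
    label, *children = tree
    new_label = f"{label}^{parent}" if parent is not None else label
    return (new_label, *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks"))))
print(parent_annotate(t))
# ('S', ('NP^S', ('N^NP', 'people')), ('VP^S', ('V^VP', 'fish'), ('NP^VP', ('N^NP', 'tanks'))))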

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits

• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution:
  • Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Figure: a VP with NP and PP attachment sites marked v / -v according to whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Figure: a lexicalized binary rule X[h] → Y[h] Z[h′] over the span i … h … k … h′ … j, where the parent takes its head h from one child; e.g. VP → VBD[saw] NP[her], i.e. the lexicalized rule (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of
      max over k and rules X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k and rules X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

(where w ranges over possible head words of the non-head child)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]; see the sketch below
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Figure: a lexicalized span X[h] → Y[h] Z[h′] over i … h … k … h′ … j]
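As a tiny illustration of the per-cell beam idea, the helper below is hypothetical (it is not code from the Collins parser); the cell is assumed to map lexicalized labels to their best scores for one span:

def prune_cell(cell, K=10):
    # keep only the K highest-scoring hypotheses for this span
    best = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K]
    return dict(best)

cell = {"NP[tanks]": 0.042, "VP[fish]": 0.105, "S[fish]": 0.0105, "NP[fish]": 0.00196}
print(prune_cell(cell, K=2))   # {'VP[fish]': 0.105, 'NP[tanks]': 0.042}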

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes; Forward/Outside and Backward/Inside quantities play the roles they do for HMMs]

Can learn with EM, like Forward-Backward for HMMs

Automatic Annotation Induction

• Label all nodes with latent variables, with the same number k of subcategories for all categories
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories. Start with two subcategories per non-terminal, then keep splitting; initialize each EM run with the output of the last.

DT

Adaptive Splitting

Want to split complex categories more.

Idea: split everything, then roll back the splits which were least useful. [Petrov et al. 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. '06        90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)


Page 31: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Verbs Verb Phrases and Sentences

35

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Figure: tree fragment with S → NP VP and NP → DT NN, illustrating that the NP's internal expansion is conditioned only on the label NP]

Non-Independence I

• The independence assumptions of a PCFG are often too strong

• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

[Bar chart: relative frequencies of the NP expansions NP PP, DT NN, and PRP, computed over all NPs, NPs under S, and NPs under VP — the three distributions differ sharply, which a single NP symbol cannot capture]

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  • (In the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Charts: parsing F1 (y-axis roughly 70–74) and number of grammar symbols (y-axis 0–12,000) as a function of horizontal Markov order (0, 1, 2v, 2, inf) — accuracy peaks at intermediate orders while the symbol count grows steadily with the order]
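As a concrete illustration (the symbol-naming scheme below is an assumption, not the lecture's), here is a minimal sketch of binarizing an n-ary rule with horizontal Markovization of order h, so that intermediate symbols remember only the last h sisters already generated.

# Minimal sketch: right-branching binarization with order-h horizontal memory.
def binarize(parent, children, h=1):
    """Return a list of (lhs, (rhs1, rhs2)) binary rules for parent -> children."""
    if len(children) <= 2:
        return [(parent, tuple(children))]
    rules, left, prev = [], parent, children[0]
    for i in range(1, len(children) - 1):
        memory = children[max(0, i - h):i]          # last h sisters generated
        new_sym = "@%s->_%s" % (parent, "_".join(memory))
        rules.append((left, (prev, new_sym)))
        left, prev = new_sym, children[i]
    rules.append((left, (prev, children[-1])))
    return rules

# Example: VP -> V NP PP PP with h = 1
for lhs, rhs in binarize("VP", ["V", "NP", "PP", "PP"], h=1):
    print(lhs, "->", " ".join(rhs))
# VP -> V @VP->_V
# @VP->_V -> NP @VP->_NP
# @VP->_NP -> PP PP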

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)

[Figure: the same tree shown at vertical order 1 (plain labels) and order 2 (parent-annotated labels)]

[Charts: parsing F1 (y-axis roughly 72–79) and number of grammar symbols (y-axis 0–25,000) as a function of vertical Markov order (1, 2v, 2, 3v, 3) — accuracy improves with parent annotation, but the symbol count grows quickly]

Model     F1    Size
v=h=2v    77.8  7.5K
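A minimal sketch of order-2 vertical Markovization (parent annotation) applied to a treebank tree before rules are read off; the tree encoding and function name are illustrative assumptions, not the lecture's code.

# Minimal sketch: annotate every internal nonterminal with its parent's label.
def parent_annotate(tree, parent="ROOT"):
    """tree = [label, child1, child2, ...]; a leaf child is a plain string."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [label, children[0]]          # keep POS tags unsplit here
    new_label = "%s^%s" % (label, parent)
    return [new_label] + [parent_annotate(c, label) for c in children]

t = ["S", ["NP", ["DT", "the"], ["NN", "man"]], ["VP", ["VBD", "slept"]]]
print(parent_annotate(t))
# ['S^ROOT', ['NP^S', ['DT', 'the'], ['NN', 'man']], ['VP^S', ['VBD', 'slept']]]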

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

• Solution: mark unary rewrite sites with -U

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K

Tag Splits

• Problem: treebank tags are too coarse

• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

• Partial solution:
  • Subdivide the IN tag

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1 / Size after each of the above splits in turn: 80.4 / 8.1K, 80.5 / 8.1K, 81.2 / 8.5K, 81.6 / 9.0K, 81.7 / 9.1K, 81.8 / 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads

• Solution: annotate future elements into nodes

Annotation    F1    Size
tag splits    82.3  9.7K
POSS-NP       83.1  9.8K
SPLIT-VP      85.7  10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

[Figure: attachment-ambiguity sketch with NP, VP, PP nodes marked v / -v according to whether the site dominates a verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser                 LP    LR    F1
Magerman 95            84.9  84.6  84.7
Collins 96             86.3  85.8  86.0
Klein & Manning 03     86.9  85.7  86.3
Charniak 97            87.4  87.5  87.4
Collins 99             88.7  88.6  88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
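The head-rule tables on these slides did not survive the export; purely as an illustration, here is a minimal sketch of Collins-style head finding with simplified priority lists (the lists and names below are assumptions, not the exact rules from the lecture).

# Minimal sketch: head percolation with per-category priority lists.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NNPS", "NP", "JJ"]),
    "VP": ("left",  ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "MD", "VP"]),
}

def find_head(label, children):
    """children: list of child labels; returns the index of the head child."""
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()                 # scan right-to-left for nominal heads
    for tag in priorities:
        for i in order:
            if children[i] == tag:
                return i
    return order[0]                     # default: first child in scan order

print(find_head("VP", ["VBD", "NP", "PP"]))   # 0  (the verb heads the VP)
print(find_head("NP", ["DT", "JJ", "NN"]))    # 2  (the noun heads the NP)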

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Figure: a lexicalized binary rule X[h] → Y[h] Z[h'] spanning i..j with split point k; the head word h comes from the left child and h' is the head of the right child]

Example: (VP → VBD[saw] NP[her]) is an instance of the lexicalized rule (VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return the max, over split points k and rules X -> Y Z, of:
      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)   // head from the left child
      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)   // head from the right child

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99] (a minimal sketch follows below)
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i,j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Figure: lexicalized rule X[h] → Y[h] Z[h'] over span i..j with split point k]
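A minimal sketch of what "per-cell beams" means in practice (illustrative names and scores, not the Collins parser's actual code): each chart cell keeps only the K highest-scoring hypotheses for its span.

# Minimal sketch of per-cell beam pruning.
import heapq

def prune_cell(hypotheses, k=5):
    """hypotheses: dict mapping (label, head) -> log-probability.
    Returns a dict containing only the K best entries for this span."""
    best = heapq.nlargest(k, hypotheses.items(), key=lambda kv: kv[1])
    return dict(best)

# Toy usage: a cell with more hypotheses than the beam allows.
cell = {("NP", "man"): -2.1, ("NP", "the"): -7.3, ("S", "saw"): -9.0,
        ("VP", "saw"): -3.4, ("NP", "telescope"): -5.8, ("PP", "with"): -6.6}
print(prune_cell(cell, k=3))
# keeps only the three highest-scoring (label, head) pairs for the span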

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser                 LP    LR    F1
Magerman 95            84.9  84.6  84.7
Collins 96             86.3  85.8  86.0
Klein & Manning 03     86.9  85.7  86.3
Charniak 97            87.4  87.5  87.4
Collins 99             88.7  88.6  88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional

• Advantages:
  • Fairly compact grammar
  • Linguistic motivations

• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: a bracketed tree over "He was right" whose nodes carry latent subcategory variables X1 ... X7]

Can learn with EM, like Forward-Backward for HMMs: the Forward/Outside and Backward/Inside passes play the same roles
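For reference, these are the standard Inside/Outside recursions such an EM would use for a binarized grammar with latent subcategories A_x; this is a minimal textbook statement under that assumption, not notation copied from the slides.

% Inside ("Backward") and Outside ("Forward") scores for a binarized PCFG
% with latent symbols A_x, over fenceposts 0..n of the sentence w_1..w_n.
\begin{aligned}
\beta(A_x, i-1, i) &= q(A_x \rightarrow w_i) \\
\beta(A_x, i, j)   &= \sum_{A_x \rightarrow B_y\, C_z}\ \sum_{k=i+1}^{j-1}
    q(A_x \rightarrow B_y C_z)\,\beta(B_y, i, k)\,\beta(C_z, k, j) \\
\alpha(\mathrm{ROOT}, 0, n) &= 1 \\
\alpha(B_y, i, k)  &= \sum_{A_x \rightarrow B_y\, C_z}\ \sum_{j=k+1}^{n}
    q(A_x \rightarrow B_y C_z)\,\alpha(A_x, i, j)\,\beta(C_z, k, j)
    \;+\; \sum_{A_x \rightarrow C_z\, B_y}\ \sum_{m=0}^{i-1}
    q(A_x \rightarrow C_z B_y)\,\alpha(A_x, m, k)\,\beta(C_z, m, i)
\end{aligned}

The expected rule counts needed by EM are then proportional to products of an outside score, a rule probability, and the inside scores of the two children, normalized by the sentence probability β(ROOT, 0, n).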

Automatic Annotation Induction

• Advantages:
  • Automatically learned: label all nodes with latent variables, same number k of subcategories for all categories

• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

[Figure: the DT tag split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting

• Want to split complex categories more

• Idea: split everything, then roll back the splits which were least useful [Petrov et al 06]

• Evaluate the loss in likelihood from removing each split:

    loss = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                      F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03         86.3              85.7
Matsuzaki et al. '05        86.7              86.1
Collins '99                 88.6              88.2
Charniak & Johnson '05      90.1              89.6
Petrov et al. 06            90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … …

Parse multiple times with grammars at different levels of granularity (see the sketch below)
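A minimal sketch of the pruning idea (illustrative names and thresholds, not the Berkeley parser's implementation): a refined symbol is allowed over a span only if its coarse projection has sufficient posterior mass there.

# Minimal sketch of coarse-to-fine pruning between refinement levels.
def allowed_spans(posteriors_coarse, projection, threshold=1e-4):
    """posteriors_coarse: dict (i, j, coarse_label) -> posterior probability.
    projection: dict mapping a refined label (e.g. 'NP-3') to its coarse
    label (e.g. 'NP'). Returns the set of (i, j, refined_label) to keep."""
    keep = set()
    for refined, coarse in projection.items():
        for (i, j, label), p in posteriors_coarse.items():
            if label == coarse and p >= threshold:
                keep.add((i, j, refined))
    return keep

# Toy usage with hypothetical numbers: NP over span (0, 2) survives, VP does not.
post = {(0, 2, "NP"): 0.63, (0, 2, "VP"): 2e-7}
proj = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "VP-2": "VP"}
print(sorted(allowed_spans(post, proj)))
# [(0, 2, 'NP-1'), (0, 2, 'NP-2')]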

Bracket Posteriors

[Figure: bracket posterior heatmaps at successive refinement levels]

Parsing times at successive levels: 1621 min, 111 min, 35 min, 15 min [91.2 F1] (no search error)

Page 32: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

PPs Modifying Verb Phrases

36

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 33: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Complementizers and SBARs

37

More Verbs

38

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 35: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Coordination

39

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 36: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Much more remainshellip

40

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Worked example: filling the CKY chart for "fish people fish tanks"

(the original slides step through the chart cell by cell; the Viterbi scores below are the cell contents after each pass)

Span 1 (lexical rules, then unaries):
(0,1) fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
(1,2) people: N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
(2,3) fish:   N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
(3,4) tanks:  N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

Span 2 (binary rules, then unaries):
(0,2): NP → NP NP 0.0049, VP → V NP 0.105, S → VP 0.0105
(1,3): NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
(2,4): NP → NP NP 0.00196, VP → V NP 0.042, S → VP 0.0042

Span 3:
(0,3): NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
(1,4): NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Span 4:
(0,4): NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score back) to get the best parse
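The toy grammar is small enough to run this end to end. Below is a minimal Python sketch of the probabilistic CKY pass with unary handling over the grammar above (scores only; backpointers and buildTree are omitted). It is an illustration of the algorithm on these slides, not code from any particular parser; run on "fish people fish tanks" it reproduces the top-cell score S = 0.00018522 from the chart above.

from collections import defaultdict

# Toy grammar from the slides: binary rules, unary rules, lexical rules.
binary = {('S','NP','VP'): 0.9, ('VP','V','NP'): 0.5, ('VP','V','VP_V'): 0.3,
          ('VP','V','PP'): 0.1, ('VP_V','NP','PP'): 1.0, ('NP','NP','NP'): 0.1,
          ('NP','NP','PP'): 0.2, ('PP','P','NP'): 1.0}
unary = {('S','VP'): 0.1, ('VP','V'): 0.1, ('NP','N'): 0.7}
lexical = {('N','people'): 0.5, ('N','fish'): 0.2, ('N','tanks'): 0.2, ('N','rods'): 0.1,
           ('V','people'): 0.1, ('V','fish'): 0.6, ('V','tanks'): 0.3, ('P','with'): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(float)            # (begin, end, A) -> best probability

    def apply_unaries(b, e):
        added = True
        while added:                      # keep closing the cell under unary rules
            added = False
            for (A, B), p in unary.items():
                prob = p * score[b, e, B]
                if prob > score[b, e, A]:
                    score[b, e, A] = prob
                    added = True

    for i, w in enumerate(words):         # span-1 cells: lexical rules, then unaries
        for (A, word), p in lexical.items():
            if word == w:
                score[i, i+1, A] = p
        apply_unaries(i, i+1)

    for span in range(2, n + 1):          # longer spans: binary rules, then unaries
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[begin, split, B] * score[split, end, C] * p
                    if prob > score[begin, end, A]:
                        score[begin, end, A] = prob
            apply_unaries(begin, end)
    return score

chart = cky("fish people fish tanks".split())
print(chart[0, 4, 'S'])                   # 0.00018522 (up to floating-point rounding)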

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%

Labeled Recall: 3/8 = 37.5%

LP/LR F1: 40.0%

Tagging Accuracy: 11/11 = 100.0%
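A small Python sketch of this computation, with brackets represented as (label, start, end) triples (an assumed representation, not from the slides):

def bracket_scores(gold, cand):
    # gold, cand: sets of (label, start, end) tuples
    matched = len(gold & cand)
    precision = matched / len(cand)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {('S',0,11), ('NP',0,2), ('VP',2,9), ('VP',3,9), ('NP',4,6), ('PP',6,9), ('NP',7,9), ('NP',9,10)}
cand = {('S',0,11), ('NP',0,2), ('VP',2,10), ('VP',3,10), ('NP',4,6), ('PP',6,10), ('NP',7,10)}
print(bracket_scores(gold, cand))   # approximately (0.429, 0.375, 0.400)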

How good are PCFGs

bull Penn WSJ parsing accuracy: about 73% LP/LR F1

bull Robust

bull Usually admit everything, but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good, because the independence assumptions are too strong

bull Give a probabilistic language model

bull But, in the simple case, it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)
bull The two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

(figure: a tree with S → NP VP and NP → DT NN; the material inside the circled NP subtree is independent of the material outside it, given the label NP)

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

Expansion     All NPs   NPs under S   NPs under VP
NP → NP PP    11%       9%            23%
NP → DT NN    9%        9%            7%
NP → PRP      6%        21%           4%

Non-Independence II

bull Symptoms of overly strong assumptions:
bull Rewrites get used where they don't belong

(figure: in the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization merges states

(figures: F1 roughly 70–74 and grammar size roughly 3,000–12,000 symbols as the horizontal Markov order varies over 0, 1, 2v, 2, ∞)

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 vs Order 2

(figures: F1 roughly 72–79 and grammar size up to about 25,000 symbols as the vertical Markov order varies over 1, 2v, 2, 3v, 3)

Model F1 Size
v=h=2v 77.8 7.5K
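Vertical markovization is just a relabeling of the training trees before rules are read off. A minimal Python sketch of order-2 annotation (each phrasal label marked with its parent) on nested-tuple trees; this is an illustration of the idea, not the exact transform used by any cited parser:

def parent_annotate(tree, parent='ROOT'):
    """Return a copy of the tree with each phrasal label rewritten as label^parent."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])            # leave preterminal/word nodes alone
    new_label = label + '^' + parent
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

tree = ('S', ('NP', ('PRP', 'He')), ('VP', ('VBD', 'was'), ('ADJP', ('JJ', 'right'))))
print(parent_annotate(tree))
# ('S^ROOT', ('NP^S', ('PRP', 'He')), ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))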

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size
Base 77.8 7.5K
UNARY 78.3 8.0K

Solution: Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solution:
bull Subdivide the IN tag

Annotation F1 Size
Previous 78.3 8.0K
SPLIT-IN 80.3 8.1K

Other Tag Splits

bull UNARY-DT: mark demonstratives as DT^U ("the X" vs "those")

bull UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs "very")

bull TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

bull SPLIT-AUX: mark auxiliary verbs with -AUX [cf Charniak 97]

bull SPLIT-CC: separate "but" and "&" from other conjunctions

bull SPLIT-%: "%" gets its own tag

Annotation F1 Size (one row per split above, applied cumulatively)
UNARY-DT 80.4 8.1K
UNARY-RB 80.5 8.1K
TAG-PA 81.2 8.5K
SPLIT-AUX 81.6 9.0K
SPLIT-CC 81.7 9.1K
SPLIT-% 81.8 9.3K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examples:
bull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 9.7K
POSS-NP 83.1 9.8K
SPLIT-VP 85.7 10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution: mark a property of higher or lower sites:
bull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size
Previous 85.7 10.5K
BASE-NP 86.0 11.7K
DOMINATES-V 86.9 14.1K
RIGHT-REC-NP 87.0 15.2K

(figure: a tree fragment NP–VP–PP–NP in which nodes dominating a verb are marked ^v and nodes not dominating a verb are marked ^-v)

A Fully Annotated Tree

Final Test Set Results

bull Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

(figure: a lexicalized binary rule X[h] over span i..j, built from Y[h] over i..k and Z[h'] over k..j)

(VP -> VBD[saw] NP[her])
(VP -> VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return max over k, w, X -> Y Z of
      max( score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w),
           score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h) )

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]
bull Essentially run the O(n⁵) CKY

bull Remember only a few hypotheses for each span <i,j>

bull If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)

bull Keeps things more or less cubic

bull Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

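A minimal sketch of the per-cell beam, assuming each chart cell is a dict from (category, head index) to Viterbi score; the helper name and representation are illustrative, not Collins's actual implementation:

def prune_cell(cell, beam_size=10):
    """Keep only the top-K (category, head) hypotheses in a chart cell.

    cell: dict mapping (category, head_index) -> score
    """
    top = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
    return dict(top)

# Example: a span whose cell holds four lexicalized hypotheses.
cell = {('NP', 2): 0.03, ('VP', 1): 0.005, ('S', 1): 0.004, ('NP', 3): 0.0001}
print(prune_cell(cell, beam_size=2))   # keeps the two highest-scoring entries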

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Analysis/Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve statistical fit of the grammar

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categories:
bull NP: subject vs object

bull DT: determiners vs demonstratives

bull IN: sentential vs prepositional

bull Advantages:
bull Fairly compact grammar

bull Linguistic motivations

bull Disadvantages:
bull Performance leveled out

bull Manually annotated

Learning Latent Annotations

Latent Annotations:

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

(figure: a parse of "He was right" with latent subcategory variables X1 … X7 at each node)

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside passes)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al '05 86.7

Refinement of the DT tag

(figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4)

Hierarchical refinement: repeatedly learn more fine-grained subcategories

start with two (per non-terminal), then keep splitting

initialize each EM run with the output of the last

Adaptive Splitting

Want to split complex categories more

Idea: split everything, roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split = (data likelihood with split reversed) / (data likelihood with split)

bull No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1
Previous 88.4
With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (≤ 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al 06 90.2 89.7

Hierarchical Pruning

coarse: … QP NP VP …

split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight: … and so on …

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

(figure: bracket posteriors at each pass of the coarse-to-fine hierarchy; parsing time drops from 1621 min to 111 min to 35 min to 15 min, ending at 91.2 F1 with no search error)


Parsing: Two problems to solve

1. Repeated work…

2. Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachment
bull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (T, N, S, R, P)
bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S ∈ N)

bull R is a set of rules/productions of the form X → γ, where X ∈ N and γ ∈ (N ∪ T)*

bull P is a probability function

bull P: R → [0,1]

bull for every X ∈ N, the probabilities of the rules with X on the left-hand side sum to 1

bull A grammar G generates a language model L:

Σ_{γ ∈ T*} P(γ) = 1
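As a concrete illustration, a PCFG like the one on the next slide can be held in a dictionary and checked against the constraint that each nonterminal's rule probabilities sum to 1. This is only a sketch; note the slides write the phrasal rule as PP → P NP while tagging prepositions IN, so the sketch uses IN throughout:

from collections import defaultdict

# Rules as (LHS, RHS-tuple) -> probability, using the toy grammar on the next slide.
rules = {
    ('S',  ('NP', 'VP')): 1.0,
    ('VP', ('Vi',)): 0.4,      ('VP', ('Vt', 'NP')): 0.4, ('VP', ('VP', 'PP')): 0.2,
    ('NP', ('DT', 'NN')): 0.3, ('NP', ('NP', 'PP')): 0.7,
    ('PP', ('IN', 'NP')): 1.0,
    ('Vi', ('sleeps',)): 1.0,  ('Vt', ('saw',)): 1.0,
    ('NN', ('man',)): 0.7,     ('NN', ('woman',)): 0.2,   ('NN', ('telescope',)): 0.1,
    ('DT', ('the',)): 1.0,
    ('IN', ('with',)): 0.5,    ('IN', ('in',)): 0.5,
}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    totals[lhs] += p
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, lhs   # each nonterminal's rule probabilities sum to 1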

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0
Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

bull Probability of a tree t with rules α1 → β1, α2 → β2, …, αn → βn is

p(t) = ∏_{i=1}^{n} q(αi → βi)

where q(α → β) is the probability for rule α → β

44

Example of a PCFG

48

Probability of a Parse

(the same PCFG and tree probability formula as above)


t1 = The man sleeps
(S (NP (DT the) (NN man)) (VP (Vi sleeps)))
p(t1) = 1.0 × 0.3 × 1.0 × 0.7 × 0.4 × 1.0 = 0.084

t2 = The man saw the woman with the telescope
(S (NP (DT the) (NN man)) (VP (VP (Vt saw) (NP (DT the) (NN woman))) (PP (IN with) (NP (DT the) (NN telescope)))))
p(t2) = 1.0 × 0.3 × 1.0 × 0.7 × 0.2 × 0.4 × 1.0 × 0.3 × 1.0 × 0.2 × 1.0 × 0.5 × 0.3 × 1.0 × 0.1 ≈ 1.5 × 10⁻⁵
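A small sketch of this product over a tree represented as nested tuples (a hypothetical representation, not from the slides); the rules dictionary here is just the subset needed for t1:

rules = {
    ('S', ('NP', 'VP')): 1.0, ('NP', ('DT', 'NN')): 0.3, ('VP', ('Vi',)): 0.4,
    ('DT', ('the',)): 1.0, ('NN', ('man',)): 0.7, ('Vi', ('sleeps',)): 1.0,
}

def tree_prob(tree):
    """tree: (label, child, child, ...) with terminal children given as strings."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t1 = ('S', ('NP', ('DT', 'the'), ('NN', 'man')), ('VP', ('Vi', 'sleeps')))
print(tree_prob(t1))   # approximately 0.084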

PCFGs Learning and Inference

Model: The probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_{i=1}^{n} q(αi → βi)

Learning: Read the rules off of labeled sentences, use ML estimates for the probabilities (see the sketch below), and use all of our standard smoothing tricks

Inference: For an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s)
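A sketch of the maximum-likelihood estimate q(α → β) = count(α → β) / count(α), read off a tiny treebank of nested-tuple trees; no smoothing, and the helper names are illustrative:

from collections import Counter

def ml_rule_probs(treebank):
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(tree):
        label, children = tree[0], tree[1:]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
        for c in children:
            if not isinstance(c, str):
                visit(c)
    for t in treebank:
        visit(t)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# Two tiny trees; q(VP -> Vi) = 1/2, q(VP -> Vt NP) = 1/2, q(S -> NP VP) = 1.
treebank = [
    ('S', ('NP', ('N', 'people')), ('VP', ('Vi', 'sleep'))),
    ('S', ('NP', ('N', 'people')), ('VP', ('Vt', 'fish'), ('NP', ('N', 'tanks')))),
]
print(ml_rule_probs(treebank)[('VP', ('Vi',))])   # 0.5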

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X → Y Z or X → w
bull X, Y, Z ∈ N and w ∈ Σ

bull A transformation to this form doesn't change the weak generative capacity of a CFG
bull That is, it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n > 2), as in the sketch below
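A minimal sketch of the n-ary binarization step alone (empty and unary removal omitted), naming intermediate symbols in the VP_V style used on these slides:

def binarize(rule):
    """Split an n-ary rule (lhs, rhs_tuple) into binary rules, left to right.

    Intermediate symbols are named like VP_V, mirroring the slides.
    """
    lhs, rhs = rule
    new_rules = []
    current_lhs = lhs
    while len(rhs) > 2:
        new_sym = current_lhs + '_' + rhs[0]   # e.g. VP and V give VP_V
        new_rules.append((current_lhs, (rhs[0], new_sym)))
        current_lhs, rhs = new_sym, rhs[1:]
    new_rules.append((current_lhs, rhs))
    return new_rules

print(binarize(('VP', ('V', 'NP', 'PP'))))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]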

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice, full Chomsky Normal Form is a pain
bull Reconstructing n-aries is easy

bull Reconstructing unaries/empties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 38: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Parsing Two problems to solve1 Repeated workhellip

Parsing Two problems to solve2 Choosing the correct parse

bull How do we work out the correct attachment

bull She saw the man with a telescope

bull Is the problem lsquoAI completersquo Yes but hellip

bull Words are good predictors of attachmentbull Even absent full understanding

bull Moscow sent more than 100000 soldiers into Afghanistan hellip

bull Sydney Water breached an agreement with NSW Health hellip

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial solution: subdivide the IN tag

Annotation F1 Size
Previous 78.3 80K
SPLIT-IN 80.3 81K

Other Tag Splits

bull UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
bull UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
bull TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
bull SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
bull SPLIT-CC: separate "but" and "&" from other conjunctions
bull SPLIT-"": "" gets its own tag

F1 Size
80.4 81K
80.5 81K
81.2 85K
81.6 90K
81.7 91K
81.8 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examples:
bull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size
tag splits 82.3 97K
POSS-NP 83.1 98K
SPLIT-VP 85.7 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution: mark a property of higher or lower sites:
bull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size
Previous 85.7 105K
BASE-NP 86.0 117K
DOMINATES-V 86.9 141K
RIGHT-REC-NP 87.0 152K

[Figure: tree fragment with NP, VP, PP, NP nodes marked v / -v for the DOMINATES-V annotation]

A Fully Annotated Tree

Final Test Set Results

bull Beats "first generation" lexicalized parsers

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
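The head-rule tables for NPs and VPs on these two slides did not survive the text extraction. As a stand-in, here is a simplified, hypothetical sketch of the general idea behind such rules (Collins-style head finding): for each parent category, scan the children in a fixed direction for the first child whose tag appears in a priority list. The specific lists below are illustrative, not the ones from the slides.

HEAD_RULES = {
    # parent: (scan direction, priority list of child labels) - illustrative only
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NNPS", "NP", "JJ"]),
    "VP": ("left-to-right", ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
}

def find_head(parent, children):
    direction, priorities = HEAD_RULES[parent]
    order = children if direction == "left-to-right" else list(reversed(children))
    for label in priorities:
        for child in order:
            if child == label:
                return children.index(child)
    return len(children) - 1          # fallback: last child

print(find_head("VP", ["VBD", "NP", "PP"]))  # 0 -> the VBD heads the VP
print(find_head("NP", ["DT", "NN"]))         # 1 -> the NN heads the NP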

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of:
      max over k, w, rules X -> Y Z:
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, rules X -> Y Z:
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]
bull Essentially, run the O(n^5) lexicalized CKY
bull Remember only a few hypotheses for each span <i,j>
bull If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
bull Keeps things more or less cubic
bull Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
(A sketch of per-cell beam pruning follows.)

[Figure: lexicalized chart item X[h] built from Y[h] and Z[h'] over i..k..j]
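A rough sketch (not from the slides) of the per-cell beam idea. It assumes a chart represented as a dict mapping each span (i, j) to a dict of {(category, head): score}; only the K highest-scoring hypotheses per span are kept before spans are combined.

def prune_cell(cell, K=10):
    # keep only the K best (category, head) hypotheses in one chart cell
    best = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K]
    return dict(best)

def prune_chart(cells, K=10):
    return {span: prune_cell(cell, K) for span, cell in cells.items()}

# With at most K entries per child span, combining two spans costs O(K^2),
# so the overall work stays more or less cubic in the sentence length.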

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1
Magerman 95 84.9 84.6 84.7
Collins 96 86.3 85.8 86.0
Klein & Manning 03 86.9 85.7 86.3
Charniak 97 87.4 87.5 87.4
Collins 99 88.7 88.6 88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve statistical fit of the grammar:

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categories:
bull NP: subject vs. object
bull DT: determiners vs. demonstratives
bull IN: sentential vs. prepositional

bull Advantages:
bull Fairly compact grammar
bull Linguistic motivations

bull Disadvantages:
bull Performance leveled out
bull Manually annotated

Learning Latent Annotations

Latent Annotations:
bull Brackets are known
bull Base categories are known
bull Hidden variables for subcategories

[Figure: parse tree over "He was right" with latent subcategory variables X1 ... X7 at its nodes]

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside, Backward/Inside)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1
Klein & Manning '03 86.3
Matsuzaki et al. '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more.

Idea: split everything, then roll back the splits which were least useful.

[Petrov et al. 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)

bull No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1
Previous 88.4
With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (<= 40 words) F1 (all words)
Klein & Manning '03 86.3 85.7
Matsuzaki et al. '05 86.7 86.1
Collins '99 88.6 88.2
Charniak & Johnson '05 90.1 89.6
Petrov et al. 06 90.2 89.7

Hierarchical Pruning

coarse: ... QP NP VP ...

split in two: ... QP1 QP2 NP1 NP2 VP1 VP2 ...

split in four: ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...

split in eight: ... (and so on) ...

Parse multiple times with grammars at different levels of granularity
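A small runnable sketch (not from the slides) of how the coarser pass prunes the finer one: symbols whose posterior in a span falls below a threshold are dropped, and only refinements of the surviving symbols are allowed at the next level. The posterior numbers here are made up, standing in for what an inside-outside pass with the coarse grammar would produce.

coarse_posteriors = {                 # span -> {coarse symbol: posterior}
    (0, 2): {"NP": 0.96, "QP": 0.001, "VP": 0.02},
    (2, 4): {"VP": 0.90, "NP": 0.09, "QP": 0.0005},
}
refinements = {"NP": ["NP1", "NP2"], "VP": ["VP1", "VP2"], "QP": ["QP1", "QP2"]}

def allowed_fine_symbols(posteriors, threshold=0.01):
    allowed = {}
    for span, syms in posteriors.items():
        kept = [s for s, p in syms.items() if p > threshold]
        allowed[span] = [f for s in kept for f in refinements[s]]
    return allowed

print(allowed_fine_symbols(coarse_posteriors))
# {(0, 2): ['NP1', 'NP2', 'VP1', 'VP2'], (2, 4): ['VP1', 'VP2', 'NP1', 'NP2']}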

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)


Parsing: Two problems to solve
2. Choosing the correct parse

bull How do we work out the correct attachment?

bull She saw the man with a telescope

bull Is the problem 'AI complete'? Yes, but ...

bull Words are good predictors of attachment
bull Even absent full understanding

bull Moscow sent more than 100,000 soldiers into Afghanistan ...

bull Sydney Water breached an agreement with NSW Health ...

bull Our statistical parsers will try to exploit such statistics

Probabilistic Context Free Grammar

45

Probabilistic - or stochastic - context-free grammars (PCFGs)

bull G = (T, N, S, R, P)
bull T is a set of terminal symbols
bull N is a set of nonterminal symbols
bull S is the start symbol (S ∈ N)
bull R is a set of rules/productions of the form X → γ
bull P is a probability function
bull P: R → [0,1]
bull for every X ∈ N: Σ_{X→γ ∈ R} P(X → γ) = 1

bull A grammar G generates a language model L:
  Σ_{γ ∈ T*} P(γ) = 1
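A minimal sketch (not part of the slides) of this definition in code: a PCFG stored as a rule-to-probability map, with a check that the probabilities of all rules sharing a left-hand side sum to 1. Only the phrasal rules of the example grammar below are included, for brevity.

from collections import defaultdict

rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2,
    ("NP", ("DT", "NN")): 0.3, ("NP", ("NP", "PP")): 0.7,
    ("PP", ("P", "NP")): 1.0,
}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    totals[lhs] += p
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())   # each LHS sums to 1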

PCFG Example: A Probabilistic Context-Free Grammar (PCFG)

S ⇒ NP VP 1.0
VP ⇒ Vi 0.4
VP ⇒ Vt NP 0.4
VP ⇒ VP PP 0.2
NP ⇒ DT NN 0.3
NP ⇒ NP PP 0.7
PP ⇒ P NP 1.0

Vi ⇒ sleeps 1.0
Vt ⇒ saw 1.0
NN ⇒ man 0.7
NN ⇒ woman 0.2
NN ⇒ telescope 0.1
DT ⇒ the 1.0
IN ⇒ with 0.5
IN ⇒ in 0.5

bull Probability of a tree t with rules α1 → β1, α2 → β2, ..., αn → βn is
  p(t) = ∏_{i=1..n} q(αi → βi)
  where q(α → β) is the probability for rule α → β

44

Example of a PCFG

48

Probability of a Parse (using the same PCFG as above)


The man sleeps
The man saw the woman with the telescope

[Figure: parse tree t1 for "The man sleeps" (S → NP VP, NP → DT NN, VP → Vi) and parse tree t2 for "The man saw the woman with the telescope", with the PP attached inside the VP]

p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0
p(t2) = the product of the probabilities of all rules used in t2, computed the same way
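A small sketch (not from the slides) computing p(t1) for "The man sleeps" as the product of its rule probabilities, under the PCFG above:

t1_rules = [
    ("S",  ("NP", "VP"), 1.0),
    ("NP", ("DT", "NN"), 0.3),
    ("DT", ("the",),     1.0),
    ("NN", ("man",),     0.7),
    ("VP", ("Vi",),      0.4),
    ("Vi", ("sleeps",),  1.0),
]

p_t1 = 1.0
for lhs, rhs, q in t1_rules:
    p_t1 *= q
print(p_t1)   # 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084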

PCFGs: Learning and Inference

bull Model: the probability of a tree t with n rules αi → βi, i = 1..n, is p(t) = ∏_i q(αi → βi)

bull Learning: read the rules off of labeled sentences, use ML estimates for the probabilities,
and use all of our standard smoothing tricks

bull Inference: for input sentence s, define T(s) to be the set of trees whose yield is s
(whose leaves, read left to right, match the words in s)
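A sketch (not from the slides) of the "read the rules off labeled trees" step: maximum-likelihood rule probabilities are rule counts normalized by the count of the left-hand side. Trees are assumed to be nested lists [label, child1, ...] with string leaves; the two-tree treebank is made up for illustration.

from collections import Counter

def rule_counts(tree, counts):
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            rule_counts(c, counts)

def ml_estimates(treebank):
    counts = Counter()
    for t in treebank:
        rule_counts(t, counts)
    lhs_totals = Counter()
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}

toy = [["S", ["NP", "people"], ["VP", ["V", "fish"]]],
       ["S", ["NP", "people"], ["VP", ["V", "fish"], ["NP", "tanks"]]]]
print(ml_estimates(toy)[("VP", ("V",))])   # 0.5: one of the two observed VP expansions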

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X → Y Z or X → w
bull X, Y, Z ∈ N and w ∈ T

bull A transformation to this form doesn't change the weak generative capacity of a CFG
bull That is, it recognizes the same language
bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n > 2)
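A short sketch (not from the slides) of the n-ary binarization step: a rule X → A B C ... is replaced by a chain of binary rules through new intermediate symbols. The @-prefixed naming is just for illustration; the slides write the intermediate symbol for VP → V NP PP as VP_V.

def binarize(lhs, rhs):
    # turn one n-ary rule into a list of binary rules
    rules = []
    while len(rhs) > 2:
        new_sym = "@%s_%s" % (lhs, rhs[0])
        rules.append((lhs, (rhs[0], new_sym)))
        lhs, rhs = new_sym, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules

print(binarize("VP", ("V", "NP", "PP")))
# [('VP', ('V', '@VP_V')), ('@VP_V', ('NP', 'PP'))]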

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice, full Chomsky Normal Form is a pain
bull Reconstructing n-aries is easy

bull Reconstructing unaries/empties is trickier

bull Binarization is crucial for cubic-time CFG parsing

bull The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

[Figure: the tree for "Atone" under successive transformations: PTB Tree (with the -NONE- empty subject and functional tags S-HLN, NP-SUBJ), NoFuncTags, NoEmpties, NoUnaries, with unary removal done either high or low]

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people: NP 0.35, V 0.1, N 0.5
fish: VP 0.06, NP 0.14, V 0.6, N 0.2

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0

Extended CKY parsing

bull Unaries can be incorporated into the algorithm
bull Messy, but doesn't increase algorithmic complexity

bull Empties can be incorporated
bull Use fenceposts

bull Doesn't increase complexity; essentially like unaries

bull Binarization is vital
bull Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and rules X -> Y Z of
      q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[#(words)+1][#(words)+1][#(nonterms)]

back = new Pair[#(words)+1][#(words)+1][#(nonterms)]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammar: binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

(A runnable Python sketch of CKY with unaries on this grammar follows.)
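A compact Python sketch of the same CKY-with-unaries procedure, run on the toy grammar above for "fish people fish tanks". This is not the lecture's Java-style pseudocode, just an equivalent sketch; backpointers are stored but tree reconstruction (buildTree) is omitted.

binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
          ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
          ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}
lexicon = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
           ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0}

def cky(words):
    n = len(words)
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    def apply_unaries(i, j):
        added = True
        while added:                      # handle unary chains
            added = False
            for (A, B), p in unary.items():
                if B in score[i][j] and p * score[i][j][B] > score[i][j].get(A, 0.0):
                    score[i][j][A] = p * score[i][j][B]
                    back[i][j][A] = B
                    added = True

    for i, w in enumerate(words):         # lexical rules, then unaries
        for (A, word), p in lexicon.items():
            if word == w:
                score[i][i + 1][A] = p
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):          # binary rules over increasing span widths
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    if B in score[begin][split] and C in score[split][end]:
                        prob = p * score[begin][split][B] * score[split][end][C]
                        if prob > score[begin][end].get(A, 0.0):
                            score[begin][end][A] = prob
                            back[begin][end][A] = (split, B, C)
            apply_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0][4]["S"])   # about 0.00018522, matching the S entry in the top chart cell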

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse


Page 40: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Probabilistic Context Free Grammar

45

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 41: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Probabilistic ndash or stochastic ndash context-free grammars (PCFGs)

bull G = (Σ N S R P)bull T is a set of terminal symbols

bull N is a set of nonterminal symbols

bull S is the start symbol (S isin N)

bull R is a set of rulesproductions of the form X

bull P is a probability function

bull P R [01]

bull

bull A grammar G generates a language model L

P(g) =1g IcircT

aring

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with


Chomsky Normal Form

• You should think of this as a transformation for efficient parsing.
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform.
• In practice, full Chomsky Normal Form is a pain:
  • Reconstructing n-aries is easy.
  • Reconstructing unaries/empties is trickier.
• Binarization is crucial for cubic-time CFG parsing (a small sketch of the binarization step follows below).
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker.
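Here is a minimal sketch of the binarization step, under the assumption that rules are plain (lhs, rhs-tuple) pairs; the naming scheme for the new intermediate symbols is illustrative, loosely following the VP_V style used in the slides.

def binarize(rules):
    """Split n-ary rules (n > 2) into binary ones by introducing new
    nonterminals, e.g. VP -> V NP PP becomes VP -> V VP_V, VP_V -> NP PP.
    Rules are (lhs, rhs) pairs with rhs a tuple of symbols."""
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new_sym = lhs + "_" + rhs[0]      # e.g. "VP_V"; naming is illustrative
            out.append((lhs, (rhs[0], new_sym)))
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, rhs))
    return out

print(binarize([("VP", ("V", "NP", "PP"))]))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]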

An example before binarization…

[ROOT [S [NP [N people]]
         [VP [V fish] [NP [N tanks]] [PP [P with] [NP [N rods]]]]]]

After binarization…

[ROOT [S [NP [N people]]
         [VP [V fish] [VP_V [NP [N tanks]] [PP [P with] [NP [N rods]]]]]]]

Treebank empties and unaries

PTB Tree:          [ROOT [S-HLN [NP-SUBJ [-NONE- e]] [VP [VB Atone]]]]
NoFuncTags:        [ROOT [S [NP [-NONE- e]] [VP [VB Atone]]]]
NoEmpties:         [ROOT [S [VP [VB Atone]]]]
NoUnaries (high):  [ROOT [S Atone]]
NoUnaries (low):   [ROOT [VB Atone]]

Parsing


Constituency Parsing

Sentence: fish people fish tanks

PCFG (rule → probability parameter θi):
S → NP VP    θ0
NP → NP NP   θ1
…
N → fish     θ42
N → people   θ43
V → fish     θ44
…

Candidate parse: [S [NP [N fish] [N people]] [VP [V fish] [NP [N tanks]]]]

Cocke-Kasami-Younger (CKY) Constituency Parsing

Sentence: fish people fish tanks

Viterbi (max) scores, e.g. for the words people and fish:
people: NP 0.35, V 0.1, N 0.5
fish:   VP 0.06, NP 0.14, V 0.6, N 0.2

Grammar:
S → NP VP     0.9
S → VP        0.1
VP → V NP     0.5
VP → V        0.1
VP → V VP_V   0.3
VP → V PP     0.1
VP_V → NP PP  1.0
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0

Extended CKY parsing

• Unaries can be incorporated into the algorithm.
  • Messy, but doesn't increase algorithmic complexity.
• Empties can be incorporated.
  • Use fenceposts; doesn't increase complexity, essentially like unaries.
• Binarization is vital.
  • Without binarization you don't get parsing that is cubic in the length of the sentence and in the number of nonterminals in the grammar.
  • Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there.

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X -> s[i])
  else
    return max over k and rules X -> Y Z of
           q(X -> Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back  = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
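As a cross-check of the pseudocode, here is a compact runnable sketch in Python. The dictionary-based grammar encoding and the helper names are illustrative assumptions, not part of the slides; it fills the Viterbi chart for a binarized PCFG with unary rules.

from collections import defaultdict

def cky(words, lexicon, binary, unary):
    """Viterbi CKY for a PCFG in CNF plus unary rules.
    lexicon: {(A, word): prob}, binary: {(A, B, C): prob}, unary: {(A, B): prob}.
    Returns (score, back) charts keyed by (begin, end, label)."""
    n = len(words)
    score = defaultdict(float)   # (begin, end, A) -> best probability
    back = {}                    # (begin, end, A) -> backpointer

    def apply_unaries(begin, end):
        added = True
        while added:
            added = False
            for (A, B), p in unary.items():
                prob = p * score[(begin, end, B)]
                if prob > score[(begin, end, A)]:
                    score[(begin, end, A)] = prob
                    back[(begin, end, A)] = B
                    added = True

    for i, w in enumerate(words):                      # lexical step
        for (A, word), p in lexicon.items():
            if word == w:
                score[(i, i + 1, A)] = p
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):                       # binary step
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), p in binary.items():
                    prob = score[(begin, split, B)] * score[(split, end, C)] * p
                    if prob > score[(begin, end, A)]:
                        score[(begin, end, A)] = prob
                        back[(begin, end, A)] = (split, B, C)
            apply_unaries(begin, end)
    return score, back

With the grammar and lexicon of the next slide, score[(0, 4, "S")] should come out to about 0.00018522, matching the worked chart below.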

The grammar: binary, no epsilons

S → NP VP     0.9
S → VP        0.1
VP → V NP     0.5
VP → V        0.1
VP → V VP_V   0.3
VP → V PP     0.1
VP_V → NP PP  1.0
NP → NP NP    0.1
NP → NP PP    0.2
NP → N        0.7
PP → P NP     1.0

N → people    0.5
N → fish      0.2
N → tanks     0.2
N → rods      0.1
V → people    0.1
V → fish      0.6
V → tanks     0.3
P → with      1.0

Chart cells score[begin][end] for the sentence fish people fish tanks (word positions 0-3, fenceposts 0-4). The lexical step and unary closure fill the diagonal; larger spans are then built with the binary rules, applying the unary closure after each cell:

score[0][1] (fish):    N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
score[1][2] (people):  N 0.5, V 0.1, NP 0.35, VP 0.01, S 0.001
score[2][3] (fish):    N 0.2, V 0.6, NP 0.14, VP 0.06, S 0.006
score[3][4] (tanks):   N 0.2, V 0.3, NP 0.14, VP 0.03, S 0.003

score[0][2]: NP → NP NP 0.0049, VP → V NP 0.105, S → VP 0.0105
score[1][3]: NP → NP NP 0.0049, VP → V NP 0.007, S → NP VP 0.0189
score[2][4]: NP → NP NP 0.00196, VP → V NP 0.042, S → VP 0.0042

score[0][3]: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882
score[1][4]: NP → NP NP 0.0000686, VP → V NP 0.0000098, S → NP VP 0.01323

score[0][4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.
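A sketch of the buildTree step, reconstructing the best tree from the back pointers produced by the CKY sketch above (same illustrative chart encoding; the root symbol "S" is an assumption of the example call).

def build_tree(back, words, begin, end, label):
    """Follow back pointers to recover the highest-scoring tree for
    (begin, end, label) as a nested tuple."""
    bp = back.get((begin, end, label))
    if bp is None:                       # lexical cell: label -> word
        return (label, words[begin])
    if isinstance(bp, str):              # unary rule: label -> bp
        return (label, build_tree(back, words, begin, end, bp))
    split, B, C = bp                     # binary rule: label -> B C
    return (label,
            build_tree(back, words, begin, split, B),
            build_tree(back, words, split, end, C))

# e.g. build_tree(back, words, 0, len(words), "S")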

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets:     S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0%
Tagging Accuracy:  11/11 = 100.0%
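A minimal sketch of how these numbers are computed from labeled spans. The multiset counting and the exact bracket format are illustrative assumptions; the real evalb tool has more conventions (e.g. for punctuation).

from collections import Counter

def prf1(gold_brackets, cand_brackets):
    """gold/cand are lists of (label, start, end) spans."""
    gold, cand = Counter(gold_brackets), Counter(cand_brackets)
    matched = sum((gold & cand).values())          # multiset intersection
    precision = matched / sum(cand.values())
    recall = matched / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
print(prf1(gold, cand))   # (0.428..., 0.375, 0.400)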

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1.
• Robust: usually admit everything, but with low probability.
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse.
  • But not so good, because the independence assumptions are too strong.
• Give a probabilistic language model.
  • But in the simple case it performs worse than a trigram model.
• The problem seems to be that PCFGs lack the lexicalization of a trigram model.

Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies.
• Lack of sensitivity to lexical information.
  • (A word is independent of the rest of the tree given its POS.)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment

• Example: John was believed to have been shot by Bill.
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.

PCFGs and Independence

• The symbols in a PCFG define independence assumptions.
• At any node, the material inside that node is independent of the material outside that node, given the label of that node.
• Any information that statistically connects behavior inside and outside a node must flow through that node's label.

(Example: in a tree with S → NP VP and NP → DT NN, everything below the NP is independent of the rest of the tree given the label NP.)

Non-Independence I

• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects):

Expansion     All NPs   NPs under S   NPs under VP
NP → NP PP    11%       9%            23%
NP → DT NN    9%        9%            7%
NP → PRP      6%        21%           4%

Non-Independence II

• Symptoms of overly strong assumptions: rewrites get used where they don't belong.
• (In the PTB, the construction shown in the accompanying figure is for possessives.)

Advanced Unlexicalized Parsing


Horizontal Markovization

• Horizontal Markovization merges states.
• (Charts: parsing F1 (roughly 70-74) and number of grammar symbols (roughly 0-12,000) as a function of the horizontal Markov order 0, 1, 2v, 2, ∞.)

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes, i.e. parent annotation (Order 1 vs. Order 2; a small sketch of parent annotation follows below).
• (Charts: parsing F1 (roughly 72-79) and number of grammar symbols (roughly 0-25,000) as a function of the vertical Markov order 1, 2v, 2, 3v, 3.)

Model     F1     Size
v=h=2v    77.8   7.5K
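Parent annotation (vertical order 2) just rewrites each nonterminal label to include its parent's label. A minimal sketch on the nested-tuple trees used earlier; the ^ separator (e.g. NP^S) is the usual convention, and leaving preterminals unsplit is a simplifying assumption.

def parent_annotate(tree, parent="ROOT"):
    """Return a copy of the tree with each nonterminal label rewritten
    as label^parent (vertical Markov order 2)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])          # leave preterminals unsplit
    new_label = label + "^" + parent
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP', 'He')), ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))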

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits

• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

Annotation                                                          F1     Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")         80.4   8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")       80.5   8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)    81.2   8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]         81.6   9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions            81.7   9.1K
SPLIT-%: "%" gets its own tag                                       81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads.
• Solution: annotate future elements into nodes.

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites: contains a verb; is (non)-recursive; base NPs [cf. Collins 99]; right-recursive NPs.
• (Figure: a PP attaching to a VP site marked v vs. an NP site marked -v.)

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers.

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

Lexicalized CKY

(Rule schema: X[h] → Y[h] Z[h'] over a span i … h … k … h' … j, e.g. (VP → VBD NP)[saw] built from VBD[saw] and NP[her].)

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return max of
      max over k, w, X -> Y Z:  score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X -> Y Z:  score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs


Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99].
  • Essentially, run the O(n^5) lexicalized CKY, but remember only a few hypotheses for each span <i, j> (toy sketch below).
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?), which keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
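A toy sketch of the per-cell beam idea. The chart layout mirrors the CKY sketch above; the beam size and the keying of hypotheses by (label, head index) are illustrative assumptions.

import heapq

def prune_cell(cell, K=10):
    """cell maps (label, head_index) -> score; keep only the K best entries."""
    best = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(best)

# After filling chart[(begin, end)] for a span, replace it with
# prune_cell(chart[(begin, end)]) before building larger spans.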

Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar

• Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations.
• Disadvantages: performance leveled out; manually annotated.

Learning Latent Annotations

• Brackets are known.
• Base categories are known.
• Hidden variables for subcategories.
• (Figure: a tree over "He was right" whose nodes carry latent subcategory labels X1 … X7.)
• Can learn with EM, like Forward-Backward for HMMs (forward/outside and backward/inside probabilities).

Automatic Annotation Induction

• Label all nodes with latent variables; same number k of subcategories for all categories.
• Advantages: automatically learned.
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit.

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT → DT-1, DT-2, DT-3, DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories. Start with two subcategories per non-terminal, then keep splitting; initialize each EM run with the output of the last (a toy sketch of the split step follows).
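A toy sketch of the split step only, for binary rules in the dictionary encoding used in the earlier sketches; the EM re-estimation itself is not shown, and the noise level is an illustrative assumption.

import random

def split_nonterminals(rules, k=2, noise=0.01):
    """Split every nonterminal A into A-0 ... A-(k-1) and distribute each rule's
    probability (almost) uniformly over the split variants, with a little random
    noise to break symmetry before EM. Rules: {(A, (B, C)): prob}."""
    new_rules = {}
    for (A, (B, C)), p in rules.items():
        for a in range(k):
            for b in range(k):
                for c in range(k):
                    q = p / (k * k) * (1 + random.uniform(-noise, noise))
                    new_rules[(f"{A}-{a}", (f"{B}-{b}", f"{C}-{c}"))] = q
    return new_rules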

Adaptive Splitting

• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful. [Petrov et al. 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split: the ratio of the data likelihood with the split reversed to the data likelihood with the split (written out below).
• No loss in accuracy when 50% of the splits are reversed.
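One way to write the criterion just described (the notation is mine, not from the slides): for a candidate split of a category,

  \Delta_{\text{split}} = \frac{P(\text{data} \mid \text{grammar with the split merged back})}{P(\text{data} \mid \text{grammar with the split})}

Splits whose \Delta_{\text{split}} is close to 1 cost almost no likelihood and are the first to be rolled back.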

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … … … … … … …

Parse multiple times with grammars at different levels of granularity.

Bracket Posteriors

Parsing time with hierarchical pruning: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error).

Page 42: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

PCFG ExampleA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 43: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Example of a PCFG

48

Probability of a ParseA Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

A Probabilistic Context-Free Grammar (PCFG)

S rArr NP VP 10

VP rArr Vi 04

VP rArr Vt NP 04

VP rArr VP PP 02

NP rArr DT NN 03

NP rArr NP PP 07

PP rArr P NP 10

Vi rArr sleeps 10

Vt rArr saw 10

NN rArr man 07

NN rArr woman 02

NN rArr telescope 01

DT rArr the 10

IN rArr with 05

IN rArr in 05

bull Probability of a tree t with rules

α1 rarr β1α2 rarr β2 αn rarr βn

is

p(t) =n

i = 1

q(α i rarr βi )

where q(α rarr β) is the probability for rule α rarr β

44

The man sleeps

The man saw the woman with the telescope

NNDT Vi

VPNP

NNDT

NP

NNDT

NP

NNDT

NPVt

VP

IN

PP

VP

S

S

t1=

p(t1)=100310070410

10

0403

10 07 10

t2=

p(ts)=180310070204100310020405031001

10

03 03 03

02

04 04

0510

10 10 1007 02 01

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)

Candidate brackets: S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
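The bracket arithmetic above is easy to reproduce; a small sketch (brackets written as (label, start, end) triples; the helper name is ours, not from the slides):

from collections import Counter

def labeled_prf(gold, cand):
    """Labeled precision/recall/F1 over (label, start, end) brackets, counted as multisets."""
    g, c = Counter(gold), Counter(cand)
    match = sum((g & c).values())
    p = match / sum(c.values())
    r = match / sum(g.values())
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)]
cand = [("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)]
print(labeled_prf(gold, cand))   # (0.4285..., 0.375, 0.4) -> 42.9%, 37.5%, 40.0%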

How good are PCFGs

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
  • The two analyses therefore receive the same probability

94

PCFGs and Independence

• The symbols in a PCFG define independence assumptions
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label

[Diagram: a tree with S → NP VP and NP → DT NN, the NP node circled to mark the inside/outside boundary]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

[Bar chart: relative frequency of the NP expansions NP PP, DT NN, and PRP for all NPs, for NPs under S, and for NPs under VP; pronoun (PRP) expansions are far more common for subject NPs under S than for object NPs under VP]

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
  • (Example from the PTB: a construction reserved for possessives gets reused elsewhere)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Charts: parsing F1 (roughly 70–74) and number of grammar symbols (0–12,000) as a function of the horizontal Markov order: 0, 1, 2v, 2, ∞]

Vertical Markovization

• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e. parent annotation)

[Charts: example trees for order 1 vs. order 2, plus parsing F1 (roughly 72–79) and number of grammar symbols (0–25,000) as a function of the vertical Markov order: 1, 2v, 2, 3v, 3]

Model     F1    Size
v=h=2v    77.8  7.5K
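Parent annotation (vertical Markov order 2) is a simple tree transform; a sketch on a plain tuple representation of trees, where each nonterminal label X is rewritten as X^parent:

def parent_annotate(tree, parent="ROOT"):
    """Rewrite each nonterminal label X as X^parent (vertical Markov order 2)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])                  # leave POS tags / words alone
    new_label = f"{label}^{parent}"
    return (new_label,) + tuple(parent_annotate(c, label) for c in children)

t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP', 'He')), ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))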

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation  F1    Size
Base        77.8  7.5K
UNARY       78.3  8.0K

Tag Splits

• Problem: treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial solution: subdivide the IN tag

Annotation  F1    Size
Previous    78.3  8.0K
SPLIT-IN    80.3  8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")            F1 80.4, Size 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")          F1 80.5, Size 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)       F1 81.2, Size 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]            F1 81.6, Size 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions               F1 81.7, Size 9.1K
• SPLIT-%: "%" gets its own tag                                          F1 81.8, Size 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation  F1    Size
tag splits  82.3  9.7K
POSS-NP     83.1  9.8K
SPLIT-VP    85.7  10.5K

Distance Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

[Diagram: an NP–VP–PP–NP attachment configuration with nodes marked v / -v for whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
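The head-rule tables themselves live on the slides and are not recoverable from this text; the idea, though, is a per-category priority list scanned from one end of the rule. The priority lists below are simplified illustrative stand-ins, not the actual Collins rules:

# Illustrative head finder: scan the children using a per-category priority list.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NNPS", "NP", "JJ"]),
    "VP": ("left",  ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
    "S":  ("left",  ["VP", "S"]),
    "PP": ("left",  ["IN", "TO"]),
}

def find_head(parent, children):
    """Return the index of the head child of `parent`, given the child labels."""
    direction, priorities = HEAD_RULES.get(parent, ("right", []))
    order = range(len(children)) if direction == "left" else range(len(children) - 1, -1, -1)
    for label in priorities:                  # first try the priority list in order...
        for i in order:
            if children[i] == label:
                return i
    return 0 if direction == "left" else len(children) - 1   # ...else fall back to an edge child

print(find_head("VP", ["VBD", "NP", "PP"]))   # 0 -> VBD heads the VP
print(find_head("NP", ["DT", "NN"]))          # 1 -> NN heads the NP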

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j); e.g. VP → VBD[saw] NP[her] yields (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X → s[i])
  else
    return max over split points k and rules X → Y Z of
      max( score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w),
           score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h) )

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n^5) CKY
  • Remember only a few hypotheses for each span ⟨i,j⟩
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: X[h] over (i, j) from Y[h] over (i, k) and Z[h'] over (k, j)]
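A per-cell beam is little more than truncating each span's hypothesis list to its K best entries; a sketch (dict-based cells, not the Collins parser's actual data structures):

import heapq

def prune_cell(cell, K):
    """Keep only the K highest-scoring hypotheses for one span <i,j>.
    `cell` maps (symbol, head) -> score; everything else is dropped."""
    if len(cell) <= K:
        return cell
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

cell = {("NP", "man"): 1e-4, ("NP", "telescope"): 2e-7, ("S", "saw"): 3e-6}
print(prune_cell(cell, 2))   # keeps the two best hypotheses for this span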

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve statistical fit of the grammar:
  Parent annotation [Johnson '98]
  Head lexicalization [Collins '99, Charniak '00]
  Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Diagram: a parse tree over "He was right" whose nodes carry hidden subcategory variables X1 ... X7]

Can learn with EM, like Forward-Backward for HMMs: the forward/backward quantities become outside/inside scores.

Automatic Annotation Induction

• Advantages:
  • Automatically learned: label all nodes with latent variables, with the same number k of subcategories for all categories
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7

Refinement of the DT tag: DT is split into DT-1, DT-2, DT-3, DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories;
start with two (per non-terminal), then keep splitting,
initializing each EM run with the output of the last.

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:
  loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)
• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7

Hierarchical Pruning

coarse:         … QP NP VP …
split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight: … (and so on) …

Parse multiple times with grammars at different levels of granularity.
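One way to sketch the coarse-to-fine idea: after parsing with the coarser grammar, a refined symbol is allowed in a cell only if its coarse projection did well enough there. This is illustrative only; the real system thresholds on bracket posterior probabilities rather than raw scores:

def allowed_refined_symbols(coarse_chart, projection, threshold):
    """coarse_chart[(i, j)] maps coarse symbols to scores from the previous pass;
    projection maps refined symbol -> coarse symbol (e.g. 'NP-3' -> 'NP').
    Returns, per span, the set of refined symbols the finer pass may use."""
    allowed = {}
    for span, scores in coarse_chart.items():
        ok = {c for c, s in scores.items() if s >= threshold}
        allowed[span] = {r for r, c in projection.items() if c in ok}
    return allowed

coarse = {(0, 2): {"NP": 0.02, "VP": 1e-9}}
proj = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP"}
print(allowed_refined_symbols(coarse, proj, 1e-6))   # {(0, 2): {'NP-1', 'NP-2'}}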

Bracket Posteriors

[Figure: bracket posteriors at successive refinement/pruning stages; parsing time falls from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]


Probability of a Parse: A Probabilistic Context-Free Grammar (PCFG)

S  ⇒ NP VP      1.0
VP ⇒ Vi         0.4
VP ⇒ Vt NP      0.4
VP ⇒ VP PP      0.2
NP ⇒ DT NN      0.3
NP ⇒ NP PP      0.7
PP ⇒ P NP       1.0
Vi ⇒ sleeps     1.0
Vt ⇒ saw        1.0
NN ⇒ man        0.7
NN ⇒ woman      0.2
NN ⇒ telescope  0.1
DT ⇒ the        1.0
IN ⇒ with       0.5
IN ⇒ in         0.5

• The probability of a tree t built with rules α1 → β1, α2 → β2, ..., αn → βn is
  p(t) = q(α1 → β1) · q(α2 → β2) · ... · q(αn → βn)   (the product over the n rules used),
  where q(α → β) is the probability for rule α → β.


Two example trees:

t1: "The man sleeps"
    (S (NP (DT The) (NN man)) (VP (Vi sleeps)))
    p(t1) = 1.0 · 0.3 · 1.0 · 0.7 · 0.4 · 1.0 = 0.084

t2: "The man saw the woman with the telescope"
    (S (NP (DT The) (NN man))
       (VP (VP (Vt saw) (NP (DT the) (NN woman)))
           (PP (IN with) (NP (DT the) (NN telescope)))))
    p(t2) = 1.0 · 0.3 · 1.0 · 0.7 · 0.2 · 0.4 · 1.0 · 0.3 · 1.0 · 0.2 · 1.0 · 0.5 · 0.3 · 1.0 · 0.1
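The arithmetic for p(t1) is just the product of the probabilities of the rules used; as a quick check:

from math import prod

# Rules used in t1 = (S (NP (DT The) (NN man)) (VP (Vi sleeps))), with q() from the grammar above
t1_rule_probs = [1.0,   # S  -> NP VP
                 0.3,   # NP -> DT NN
                 1.0,   # DT -> the
                 0.7,   # NN -> man
                 0.4,   # VP -> Vi
                 1.0]   # Vi -> sleeps
print(prod(t1_rule_probs))   # ≈ 0.084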

PCFGs Learning and Inference

Model: the probability of a tree t with n rules αi → βi, i = 1..n, is the product of the rule probabilities q(αi → βi).

Learning: read the rules off of labeled sentences, use ML estimates for the probabilities, and use all of our standard smoothing tricks.

Inference: for an input sentence s, define T(s) to be the set of trees whose yield is s (whose leaves, read left to right, match the words in s).
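A minimal sketch of the learning step (relative-frequency estimates, no smoothing), assuming trees are given as nested (label, child, ...) tuples:

from collections import Counter, defaultdict

def read_rules(tree, rules):
    """Collect rule counts from a tree given as (label, child, child, ...) with string leaves."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        rules[(label, (children[0],))] += 1          # lexical rule, e.g. NN -> man
        return
    rules[(label, tuple(c[0] for c in children))] += 1
    for c in children:
        read_rules(c, rules)

def mle_pcfg(treebank):
    """Relative-frequency (ML) estimates q(A -> beta) = count(A -> beta) / count(A)."""
    counts = Counter()
    for t in treebank:
        read_rules(t, counts)
    lhs_totals = defaultdict(float)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

t1 = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("Vi", "sleeps")))
print(mle_pcfg([t1])[("S", ("NP", "VP"))])   # 1.0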

Grammar Transforms

51

Chomsky Normal Form

• All rules are of the form X → Y Z or X → w, with X, Y, Z ∈ N and w ∈ Σ
• A transformation to this form doesn't change the weak generative capacity of a CFG
  • That is, it recognizes the same language
  • But maybe with different trees
• Empties and unaries are removed recursively
• n-ary rules are divided by introducing new nonterminals (n > 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

• You should think of this as a transformation for efficient parsing
• With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
• In practice, full Chomsky Normal Form is a pain
  • Reconstructing n-aries is easy
  • Reconstructing unaries/empties is trickier
• Binarization is crucial for cubic time CFG parsing
• The rest isn't necessary; it just makes the algorithms cleaner and a bit quicker
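The n-ary-to-binary step can be sketched directly; the naming scheme below (intermediate symbols like VP_V for VP → V NP PP) mirrors the one used in the transformed grammar above:

def binarize(lhs, rhs):
    """Split an n-ary rule LHS -> X1 X2 ... Xn (n > 2) into binary rules,
    introducing intermediate symbols named after the material already consumed."""
    rules = []
    current = lhs
    for i, sym in enumerate(rhs[:-2]):
        new = f"{lhs}_{'_'.join(rhs[:i + 1])}"   # e.g. VP_V for VP -> V NP PP
        rules.append((current, (sym, new)))
        current = new
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

print(binarize("VP", ("V", "NP", "PP")))
# [('VP', ('V', 'VP_V')), ('VP_V', ('NP', 'PP'))]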

An example before binarization…
(ROOT (S (NP (N people))
         (VP (V fish) (NP (N tanks)) (PP (P with) (NP (N rods))))))

After binarization…
(ROOT (S (NP (N people))
         (VP (V fish) (VP_V (NP (N tanks)) (PP (P with) (NP (N rods)))))))

Treebank empties and unaries: the tree for "Atone" at different levels of normalization

PTB Tree:   (ROOT (S-HLN (NP-SUBJ (-NONE- e)) (VP (VB Atone))))
NoFuncTags: (ROOT (S (NP (-NONE- e)) (VP (VB Atone))))
NoEmpties:  (ROOT (S (VP (VB Atone))))
NoUnaries:  high attachment (ROOT (S Atone))  vs.  low attachment (ROOT (VB Atone))

Parsing

66

Constituency Parsing

fish people fish tanks

PCFG rule probabilities:
Rule          Prob
S → NP VP     θ0
NP → NP NP    θ1
…
N → fish      θ42
N → people    θ43
V → fish      θ44
…

[Diagram: one parse of "fish people fish tanks": (S (NP (NP (N fish)) (NP (N people))) (VP (V fish) (NP (N tanks))))]

Cocke-Kasami-Younger (CKY) Constituency Parsing

Viterbi (max) scores for "fish people fish tanks", e.g. for the cells over "people" and "fish":
people: NP 0.35, V 0.1, N 0.5
fish:   VP 0.06, NP 0.14, V 0.6, N 0.2

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

• Unaries can be incorporated into the algorithm
  • Messy, but doesn't increase algorithmic complexity
• Empties can be incorporated
  • Use fenceposts
  • Doesn't increase complexity; essentially like unaries
• Binarization is vital
  • Without binarization, you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k and rules X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A -> words[i] in grammar
        score[i][i+1][A] = P(A -> words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A -> B in grammar
          prob = P(A -> B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A -> B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A -> B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
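The pseudocode above translates almost line for line into a runnable sketch; here the chart is a dict keyed by span rather than a 3-d array, and the grammar is the one used in the walkthrough:

from collections import defaultdict

def cky(words, lex, unary, binary):
    """Probabilistic CKY with unary closure.
    lex: {(A, word): p}   unary: {(A, B): p}   binary: {(A, (B, C)): p}"""
    n = len(words)
    score = defaultdict(dict)                     # score[(i, j)][A] = best probability for A over (i, j)

    def apply_unaries(cell):
        added = True
        while added:                              # repeatedly relax unary rules A -> B, as in the pseudocode
            added = False
            for (A, B), p in unary.items():
                if B in cell and cell[B] * p > cell.get(A, 0.0):
                    cell[A] = cell[B] * p
                    added = True

    for i, w in enumerate(words):                 # lexical step
        cell = score[(i, i + 1)]
        for (A, word), p in lex.items():
            if word == w and p > cell.get(A, 0.0):
                cell[A] = p
        apply_unaries(cell)

    for span in range(2, n + 1):                  # binary step per span, then unaries
        for begin in range(0, n - span + 1):
            end = begin + span
            cell = score[(begin, end)]
            for split in range(begin + 1, end):
                left, right = score[(begin, split)], score[(split, end)]
                for (A, (B, C)), p in binary.items():
                    if B in left and C in right:
                        prob = left[B] * right[C] * p
                        if prob > cell.get(A, 0.0):
                            cell[A] = prob
            apply_unaries(cell)
    return score

lex = {("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
       ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0}
unary = {("NP", "N"): 0.7, ("VP", "V"): 0.1, ("S", "VP"): 0.1}
binary = {("S", ("NP", "VP")): 0.9, ("VP", ("V", "NP")): 0.5, ("VP", ("V", "VP_V")): 0.3,
          ("VP", ("V", "PP")): 0.1, ("VP_V", ("NP", "PP")): 1.0,
          ("NP", ("NP", "NP")): 0.1, ("NP", ("NP", "PP")): 0.2, ("PP", ("P", "NP")): 1.0}

chart = cky("fish people fish tanks".split(), lex, unary, binary)
print(chart[(0, 4)].get("S"))   # ≈ 0.00018522, matching the worked chart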

The grammar: binary, no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

[Chart layout: cells score[i][j] for all spans 0 ≤ i < j ≤ 4 (score[0][1], score[1][2], …, score[0][4]), drawn over fenceposts 0–4 for the words: fish people fish tanks]

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i = 0; i < #(words); i++
  for A in nonterms
    if A -> words[i] in grammar
      score[i][i+1][A] = P(A -> words[i])

After the lexical step:
[0,1] fish:   N→fish 0.2,  V→fish 0.6
[1,2] people: N→people 0.5, V→people 0.1
[2,3] fish:   N→fish 0.2,  V→fish 0.6
[3,4] tanks:  N→tanks 0.2, V→tanks 0.1
Fenceposts 0–4 over the words: fish people fish tanks. Grammar:
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

After the unary closure on the diagonal:
[0,1] fish:   N→fish 0.2,  V→fish 0.6,  NP→N 0.14, VP→V 0.06, S→VP 0.006
[1,2] people: N→people 0.5, V→people 0.1, NP→N 0.35, VP→V 0.01, S→VP 0.001
[2,3] fish:   N→fish 0.2,  V→fish 0.6,  NP→N 0.14, VP→V 0.06, S→VP 0.006
[3,4] tanks:  N→tanks 0.2, V→tanks 0.1, NP→N 0.14, VP→V 0.03, S→VP 0.003
Fenceposts 0–4 over the words: fish people fish tanks. Grammar:
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

After the binary step for spans of length 2:
[0,1] fish:   N→fish 0.2,  V→fish 0.6,  NP→N 0.14, VP→V 0.06, S→VP 0.006
[1,2] people: N→people 0.5, V→people 0.1, NP→N 0.35, VP→V 0.01, S→VP 0.001
[2,3] fish:   N→fish 0.2,  V→fish 0.6,  NP→N 0.14, VP→V 0.06, S→VP 0.006
[3,4] tanks:  N→tanks 0.2, V→tanks 0.1, NP→N 0.14, VP→V 0.03, S→VP 0.003
[0,2]: NP→NP NP 0.0049,  VP→V NP 0.105, S→NP VP 0.00126
[1,3]: NP→NP NP 0.0049,  VP→V NP 0.007, S→NP VP 0.0189
[2,4]: NP→NP NP 0.00196, VP→V NP 0.042, S→NP VP 0.00378
Fenceposts 0–4 over the words: fish people fish tanks. Grammar:
S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 45: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

PCFGs Learning and Inference

Model The probability of a tree t with n rules αi βi i = 1n

Learning Read the rules off of labeled sentences use ML estimates for

probabilities

and use all of our standard smoothing tricks

Inference For input sentence s define T(s) to be the set of trees whose yield is s

(whole leaves read left to right match the words in s)

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 46: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Grammar Transforms

51

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

• Unaries can be incorporated into the algorithm: messy, but doesn't increase algorithmic complexity
• Empties can be incorporated: use fenceposts; doesn't increase complexity, essentially like unaries
• Binarization is vital: without binarization you don't get parsing cubic in the length of the sentence and in the number of nonterminals in the grammar
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there

A Recursive Parser

bestScore(X, i, j, s)
  if (j == i)
    return q(X → s[i])
  else
    return max over k, X → Y Z of
      q(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k+1, j, s)

function CKY(words, grammar) returns [most_probable_parse, prob]
  score = new double[#(words)+1][#(words)+1][#(nonterms)]
  back = new Pair[#(words)+1][#(words)+1][#(nonterms)]
  for i = 0; i < #(words); i++
    for A in nonterms
      if A → words[i] in grammar
        score[i][i+1][A] = P(A → words[i])
    // handle unaries
    boolean added = true
    while added
      added = false
      for A, B in nonterms
        if score[i][i+1][B] > 0 && A → B in grammar
          prob = P(A → B) * score[i][i+1][B]
          if prob > score[i][i+1][A]
            score[i][i+1][A] = prob
            back[i][i+1][A] = B
            added = true

The CKY algorithm (1960/1965) … extended to unaries

  for span = 2 to #(words)
    for begin = 0 to #(words) - span
      end = begin + span
      for split = begin+1 to end-1
        for A, B, C in nonterms
          prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = new Triple(split, B, C)
      // handle unaries
      boolean added = true
      while added
        added = false
        for A, B in nonterms
          prob = P(A → B) * score[begin][end][B]
          if prob > score[begin][end][A]
            score[begin][end][A] = prob
            back[begin][end][A] = B
            added = true
  return buildTree(score, back)

The CKY algorithm (1960/1965) … extended to unaries
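The pseudocode above translates almost line for line into the following runnable Python sketch, shown here with the toy grammar used in the worked example that follows; the dict-based grammar encoding and the names cky / close_unaries are our own choices, not part of the slides.

from collections import defaultdict

# Toy grammar from the slides, written as dicts of rule probabilities.
BINARY = {                                     # A -> B C
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
    ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}
UNARY = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}     # A -> B
LEXICAL = {                                    # A -> w
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2,
    ("N", "rods"): 0.1, ("V", "people"): 0.1, ("V", "fish"): 0.6,
    ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}

def cky(words):
    n = len(words)
    score = defaultdict(float)        # (begin, end, A) -> best probability
    back = {}                         # (begin, end, A) -> backpointer

    def close_unaries(b, e):          # the "handle unaries" loop
        added = True
        while added:
            added = False
            for (a, child), p in UNARY.items():
                prob = p * score[b, e, child]
                if prob > score[b, e, a]:
                    score[b, e, a], back[b, e, a] = prob, child
                    added = True

    for i, w in enumerate(words):     # lexical step
        for (a, word), p in LEXICAL.items():
            if word == w:
                score[i, i + 1, a] = p
        close_unaries(i, i + 1)

    for span in range(2, n + 1):      # binary step, smallest spans first
        for begin in range(n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (a, b, c), p in BINARY.items():
                    prob = score[begin, split, b] * score[split, end, c] * p
                    if prob > score[begin, end, a]:
                        score[begin, end, a] = prob
                        back[begin, end, a] = (split, b, c)
            close_unaries(begin, end)
    return score, back

if __name__ == "__main__":
    score, back = cky("fish people fish tanks".split())
    print(round(score[0, 4, "S"], 8))   # 0.00018522, as in the worked chart below

The work is cubic in sentence length (spans times split points) times the number of grammar rules, which is exactly why binarization matters.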

The grammar: Binary, no epsilons

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

[The empty CKY chart for "fish people fish tanks" (fenceposts 0–4): width-1 cells score[0][1], score[1][2], score[2][3], score[3][4]; width-2 cells score[0][2], score[1][3], score[2][4]; width-3 cells score[0][3], score[1][4]; and the full-span cell score[0][4]]

fish   people   fish   tanks      (chart fenceposts 0–4)

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0

for i = 0; i < #(words); i++
  for A in nonterms
    if A → words[i] in grammar
      score[i][i+1][A] = P(A → words[i])

score[0][1] (fish):   N → fish 0.2,   V → fish 0.6
score[1][2] (people): N → people 0.5, V → people 0.1
score[2][3] (fish):   N → fish 0.2,   V → fish 0.6
score[3][4] (tanks):  N → tanks 0.2,  V → tanks 0.3

fish   people   fish   tanks      (grammar sidebar as above)

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    if score[i][i+1][B] > 0 && A → B in grammar
      prob = P(A → B) * score[i][i+1][B]
      if prob > score[i][i+1][A]
        score[i][i+1][A] = prob
        back[i][i+1][A] = B
        added = true

score[0][1] (fish):   N 0.2, V 0.6, NP→N 0.14, VP→V 0.06, S→VP 0.006
score[1][2] (people): N 0.5, V 0.1, NP→N 0.35, VP→V 0.01, S→VP 0.001
score[2][3] (fish):   N 0.2, V 0.6, NP→N 0.14, VP→V 0.06, S→VP 0.006
score[3][4] (tanks):  N 0.2, V 0.3, NP→N 0.14, VP→V 0.03, S→VP 0.003

fish   people   fish   tanks      (grammar sidebar as above)

prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
if prob > score[begin][end][A]
  score[begin][end][A] = prob
  back[begin][end][A] = new Triple(split, B, C)

(width-1 cells as before)
score[0][2]: NP→NP NP 0.0049,  VP→V NP 0.105,  S→NP VP 0.00126
score[1][3]: NP→NP NP 0.0049,  VP→V NP 0.007,  S→NP VP 0.0189
score[2][4]: NP→NP NP 0.00196, VP→V NP 0.042,  S→NP VP 0.00378

fish   people   fish   tanks      (grammar sidebar as above)

// handle unaries
boolean added = true
while added
  added = false
  for A, B in nonterms
    prob = P(A → B) * score[begin][end][B]
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = B
      added = true

(width-1 cells as before)
score[0][2]: NP→NP NP 0.0049,  VP→V NP 0.105,  S→VP 0.0105
score[1][3]: NP→NP NP 0.0049,  VP→V NP 0.007,  S→NP VP 0.0189
score[2][4]: NP→NP NP 0.00196, VP→V NP 0.042,  S→VP 0.0042

fish   people   fish   tanks      (grammar sidebar as above)

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

(width-1 and width-2 cells as before)
score[0][3]: NP→NP NP 0.0000686, VP→V NP 0.00147, S→NP VP 0.000882

fish   people   fish   tanks      (grammar sidebar as above)

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

(earlier cells as before)
score[1][4]: NP→NP NP 0.0000686, VP→V NP 0.000098, S→NP VP 0.01323

fish   people   fish   tanks      (grammar sidebar as above)

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A → B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

(earlier cells as before)
score[0][4]: NP→NP NP 0.0000009604, VP→V NP 0.00002058, S→NP VP 0.00018522

fish   people   fish   tanks      (grammar sidebar as above; chart complete)

Call buildTree(score, back) to get the best parse
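The slides end with buildTree(score, back). A possible reconstruction routine over back-pointers of the shape produced by the cky sketch above (a child label for unary steps, a (split, B, C) triple for binary steps, nothing for terminals) might look like this; it is an illustrative sketch, not the parser's actual buildTree.

def build_tree(back, words, begin, end, label):
    bp = back.get((begin, end, label))
    if bp is None:                       # terminal: the label covers one word
        return (label, words[begin])
    if isinstance(bp, str):              # unary rewrite: label -> bp
        return (label, build_tree(back, words, begin, end, bp))
    split, left, right = bp              # binary rewrite: label -> left right
    return (label,
            build_tree(back, words, begin, split, left),
            build_tree(back, words, split, end, right))

# build_tree(back, "fish people fish tanks".split(), 0, 4, "S") returns the
# Viterbi parse as a nested tuple rooted in S.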

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0
Tagging Accuracy: 11/11 = 100.0%
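A small sketch of the labeled-bracket arithmetic above, with brackets encoded as (label, start, end) triples; the encoding and the function name are our own, and the multiset handling and length cutoffs of real PARSEVAL scoring are ignored.

def bracket_scores(gold, guess):
    gold, guess = set(gold), set(guess)
    matched = len(gold & guess)                 # brackets shared by both sets
    precision = matched / len(guess)
    recall = matched / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
guess = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
         ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print([round(100 * x, 1) for x in bracket_scores(gold, guess)])   # [42.9, 37.5, 40.0]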

How good are PCFGs?
• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust: usually admit everything, but with low probability
• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a parse, but not so good because the independence assumptions are too strong
• Give a probabilistic language model, but in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences Close Attachment

Structural Preferences Close Attachment

• Example: John was believed to have been shot by Bill
• Low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing), so the two analyses receive the same probability

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
[figure: a tree with an S node expanding as S → NP VP and the NP expanding as NP → DT NN]

Non-Independence I

• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)
[bar chart: relative frequencies of the expansions NP → NP PP, NP → DT NN, and NP → PRP for all NPs, for NPs under S, and for NPs under VP — subject NPs are mostly pronouns (PRP), while NPs under VP are mostly full NPs]

Non-Independence II

• Symptoms of overly strong assumptions:
• Rewrites get used where they don't belong
[example tree: in the PTB this construction is for possessives]

Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states
[two charts over horizontal Markov order 0, 1, 2v, 2, ∞: parsing accuracy (F1 roughly 70–74) and grammar size (roughly 3,000–12,000 symbols)]

Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes (i.e., parent annotation)
Order 1 vs. Order 2
[two charts over vertical Markov order 1, 2v, 2, 3v, 3: parsing accuracy (F1 roughly 72–79) and grammar size (roughly 5,000–25,000 symbols)]

Model    F1    Size
v=h=2v   77.8  7.5K
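A minimal sketch of what order-2 vertical Markovization (parent annotation) does to a tree: every phrasal label is re-labelled with its parent. The nested-tuple tree encoding and the choice to leave POS tags unannotated are our own simplifications.

def parent_annotate(tree, parent="ROOT"):
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])       # preterminal: leave the tag alone
    return (label + "^" + parent,) + tuple(
        parent_annotate(child, label) for child in children)

t = ("S", ("NP", ("N", "people")), ("VP", ("V", "fish"), ("NP", ("N", "tanks"))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('N', 'people')),
#  ('VP^S', ('V', 'fish'), ('NP^VP', ('N', 'tanks'))))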

Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1    Size
Base         77.8  7.5K
UNARY        78.3  8.0K

Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN
• Partial Solution: subdivide the IN tag

Annotation   F1    Size
Previous     78.3  8.0K
SPLIT-IN     80.3  8.1K

Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

F1    Size
80.4  8.1K
80.5  8.1K
81.2  8.5K
81.6  9.0K
81.7  9.1K
81.8  9.3K

Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples: possessive NPs; finite vs. infinite VPs; lexical heads
• Solution: annotate future elements into nodes

Annotation   F1    Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7  10.5K

Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites: contains a verb; is (non)-recursive
• Base NPs [cf. Collins 99]
• Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

[figure: an NP and PP attachment inside a VP, with sites marked v / -v for whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

Lexicalized CKY
[figure: X[h] → Y[h] Z[h'] over span i … h … k … h' … j, e.g. VBD[saw] and NP[her] combining into (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j = i)
    return score(X → s[i])
  else
    return max of
      max over k, w, X → Y Z of score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X → Y Z of score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially, run the O(n^5) CKY
• Remember only a few hypotheses for each span <i,j>
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
• Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
[figure: X[h] → Y[h] Z[h'] over span i … h … k … h' … j]
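A minimal sketch of the per-cell beam idea, assuming each chart cell is a dict from nonterminal label to Viterbi score; the cutoff k and the function name are illustrative assumptions, not the Collins parser's actual pruning code.

import heapq

def prune_cell(cell, k=5):
    """Keep only the k highest-scoring labelled hypotheses for one span."""
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda kv: kv[1]))

print(prune_cell({"NP": 0.2, "VP": 0.05, "S": 0.01, "PP": 0.001, "X": 1e-9}, k=3))
# {'NP': 0.2, 'VP': 0.05, 'S': 0.01}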

Parameter Estimation

A Model from Charniak (1997)

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits
• Manually split categories:
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
• Advantages: fairly compact grammar; linguistic motivations
• Disadvantages: performance leveled out; manually annotated

Learning Latent Annotations
Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
[figure: a tree over "He was right" with latent subcategory variables X1 … X7 at its nodes]
Can learn with EM, like Forward-Backward for HMMs (Forward ↔ Outside, Backward ↔ Inside)

Automatic Annotation Induction
• Advantages: automatically learned; label all nodes with latent variables; same number k of subcategories for all categories
• Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al. '05  86.7

Refinement of the DT tag
[figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, roll back splits which were least useful
[Petrov et al. 06]

Adaptive Splitting
• Evaluate loss in likelihood from removing each split = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
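In symbols (notation ours, not from the slides), the quantity evaluated for each split s of the refined grammar G on training data D is the likelihood ratio

  \Delta_s \;=\; \frac{P(D \mid G \text{ with split } s \text{ reversed})}{P(D \mid G)}

and splits whose ratio is close to 1 (little is lost by merging them back) are the ones rolled back.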

Adaptive Splitting Results
Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results
Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7

Hierarchical Pruning

Parse multiple times with grammars at different levels of granularity:
coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … … …

Bracket Posteriors
[coarse-to-fine parsing times: 1621 min → 111 min → 35 min → 15 min, at 91.2 F1 with no search error]

Page 47: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Chomsky Normal Form

bull All rules are of the form X Y Z or X wbull X Y Z isin N and w isin Σ

bull A transformation to this form doesnrsquot change the weak generative capacity of a CFGbull That is it recognizes the same language

bull But maybe with different trees

bull Empties and unaries are removed recursively

bull n-ary rules are divided by introducing new nonterminals (n gt 2)

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 48: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

S VP

VP V NP

VP V

VP V NP PP

VP V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

S V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT
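A hedged sketch of the split step that precedes each EM run: every symbol is split into k subcategories, the rule mass is divided evenly over the refined combinations, and a little noise breaks the symmetry so EM can move away from the uniform solution. The grammar representation is an assumption for illustration; a real implementation would leave terminal symbols unsplit.

import itertools, random

def split_grammar(rules, k=2, noise=0.01):
    # rules: dict mapping (lhs, rhs_tuple) -> probability
    out = {}
    for (lhs, rhs), p in rules.items():
        for i in range(k):
            for combo in itertools.product(range(k), repeat=len(rhs)):
                new_lhs = f"{lhs}-{i}"
                new_rhs = tuple(f"{sym}-{j}" for sym, j in zip(rhs, combo))
                # divide the mass evenly, then jitter so EM can break symmetry
                out[(new_lhs, new_rhs)] = p / (k ** len(rhs)) * (1 + random.uniform(-noise, noise))
    return out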

Adaptive Splitting

Want to split complex categories more

Idea: split everything, then roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split:

loss = (data likelihood with split reversed) / (data likelihood with split)

bull No loss in accuracy when 50% of the splits are reversed
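In code, the merge step only needs this per-split likelihood ratio; a small sketch follows (the loss_ratio input and the function name are hypothetical):

def splits_to_merge(loss_ratio, merge_fraction=0.5):
    # loss_ratio: dict split_id -> (likelihood with split reversed) / (likelihood with split);
    # values close to 1 mean reversing the split costs almost nothing
    ranked = sorted(loss_ratio, key=loss_ratio.get, reverse=True)   # least useful first
    return ranked[:int(len(ranked) * merge_fraction)]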

Adaptive Splitting Results

Model F1

Previous 88.4

With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser                     F1 (le 40 words)   F1 (all words)

Klein & Manning '03        86.3               85.7

Matsuzaki et al '05        86.7               86.1

Collins '99                88.6               88.2

Charniak & Johnson '05     90.1               89.6

Petrov et al 06            90.2               89.7

Hierarchical Pruning

coarse: ... QP NP VP ...

split in two: ... QP1 QP2 NP1 NP2 VP1 VP2 ...

split in four: ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...

split in eight: ... ... ...

Parse multiple times with grammars at different levels of granularity
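A sketch of the coarse-to-fine loop, assuming a helper posteriors(sentence, grammar, allowed) that computes inside-outside posteriors restricted to refinements of the previously surviving items (the helper and its signature are assumptions, not a particular parser's API):

def coarse_to_fine(sentence, grammars, posteriors, threshold=1e-4):
    # grammars: list ordered from coarsest to finest
    allowed = None                       # no constraint for the first (coarsest) pass
    for g in grammars:
        chart = posteriors(sentence, g, allowed)     # {(i, j, label): posterior}
        allowed = {item for item, p in chart.items() if p > threshold}
    return allowed                       # items that survive the finest pass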

Bracket Posteriors

(figure: bracket posterior charts; parsing times with successively finer pruning: 1621 min, 111 min, 35 min, 15 min [91.2 F1, no search error])

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

V fish

S fish

V tanks

S tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP

NP NP PP

NP PP

NP N

PP P NP

PP P

N people

N fish

N tanks

N rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V NP PP

S V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1    Size
tag splits   82.3  9.7K
POSS-NP      83.1  9.8K
SPLIT-VP     85.7  10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation     F1    Size
Previous       85.7  10.5K
BASE-NP        86.0  11.7K
DOMINATES-V    86.9  14.1K
RIGHT-REC-NP   87.0  15.2K

[Figure: attachment example with NP, VP, PP nodes marked v / -v according to whether they dominate a verb]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads: An Example for NPs

112

Rules to Recover Heads: An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115
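Slides 110–115 survive here only as titles, so as a rough illustration of the idea: a heavily simplified sketch of head-finding rules and headword propagation. The rule table below is a toy approximation covering a few categories, not Collins' actual head table, and it assumes treebank-style trees where only preterminals dominate words.

# Sketch: toy head rules plus headword propagation up the tree.
from nltk.tree import Tree

# For each parent category: search direction and a priority list of child labels.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NP", "PRP"]),
    "VP": ("left",  ["VBD", "VBZ", "VB", "VBN", "VP"]),
    "PP": ("left",  ["IN", "TO"]),
    "S":  ("left",  ["VP"]),
}

def head_child(tree):
    direction, priorities = HEAD_RULES.get(tree.label(), ("left", []))
    children = list(tree) if direction == "left" else list(reversed(tree))
    for wanted in priorities:                 # first, try the priority list
        for child in children:
            if isinstance(child, Tree) and child.label() == wanted:
                return child
    return children[0]                        # otherwise fall back to the first child scanned

def annotate_heads(tree):
    """Return (tree_with_headwords, headword) for a parsed nltk Tree."""
    if len(tree) == 1 and not isinstance(tree[0], Tree):        # preterminal over a word
        word = tree[0]
        return Tree(f"{tree.label()}[{word}]", [word]), word
    new_children, head_words = [], {}
    for child in tree:
        new_child, hw = annotate_heads(child)
        new_children.append(new_child)
        head_words[id(child)] = hw
    hw = head_words[id(head_child(tree))]
    return Tree(f"{tree.label()}[{hw}]", new_children), hw

t = Tree.fromstring("(S (NP (DT the) (NN boy)) (VP (VBD saw) (NP (PRP her))))")
print(annotate_heads(t)[0])
# (S[saw] (NP[boy] (DT[the] the) (NN[boy] boy)) (VP[saw] (VBD[saw] saw) (NP[her] (PRP[her] her))))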

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] over span (i, j) is built from Y[h] over (i, k) and Z[h'] over (k, j), with i ≤ h < k ≤ h' < j; the lexicalized rule can be written (VP → VBD[saw] NP[her]) or equivalently (VP → VBD NP)[saw]]

bestScore(X, i, j, h)
  if (j == i)
    return score(X, s[i])
  else
    return the larger of
      max over k, w, X→YZ of  score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, X→YZ of  score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
• Essentially run the O(n^5) CKY
• Remember only a few hypotheses for each span <i, j>
• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
• Keeps things more or less cubic (a beam-pruning sketch follows below)
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: X[h] over span (i, j) split at k into Y[h] and Z[h']]
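A schematic sketch of the per-cell beam, under the assumption that each chart cell stores a map from (category, head) hypotheses to their best scores; the beam size and helper name are illustrative, not Collins' actual code.

# Sketch: keep only the K best (category, head) hypotheses in each chart cell.
import heapq

K = 20  # beam size per span (illustrative value)

def prune_cell(cell, k=K):
    """cell maps (category, head_index) -> best score; keep only the k highest-scoring entries."""
    if len(cell) <= k:
        return cell
    kept = heapq.nlargest(k, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

# Usage inside a CKY-style loop, where chart[(i, j)] is such a dict of hypotheses:
#   ... fill chart[(i, j)] by combining chart[(i, m)] and chart[(m, j)] ...
#   chart[(i, j)] = prune_cell(chart[(i, j)])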

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP    LR    F1
Magerman 95          84.9  84.6  84.7
Collins 96           86.3  85.8  86.0
Klein & Manning 03   86.9  85.7  86.3
Charniak 97          87.4  87.5  87.4
Collins 99           88.7  88.6  88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: the tree for "He was right" with latent subcategory variables X1 ... X7 at its nodes]

Can learn with EM, like Forward-Backward for HMMs (Forward ≈ Outside, Backward ≈ Inside)
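As a sketch of the "Inside" half of that EM computation over a single observed tree (bracketing and base categories fixed, subcategories latent): the data structures below, binary_probs and lex_probs, are assumed toy parameter tables, not the actual implementation of Matsuzaki et al. or Petrov et al.

# Sketch: inside scores over subcategories for one fixed tree.
import numpy as np

# Assumed toy parameters: binary_probs[(X, Y, Z)] is a (kX, kY, kZ) array of
# P(X_a -> Y_b Z_c); lex_probs[(X, word)] is a (kX,) array of P(word | X_a).
def inside(node, binary_probs, lex_probs):
    """node is either ('X', word) for a preterminal or ('X', left_node, right_node)."""
    label = node[1]
    label = node[0]
    if isinstance(node[1], str):                       # preterminal over a word
        return lex_probs[(label, node[1])]
    left, right = node[1], node[2]
    in_left = inside(left, binary_probs, lex_probs)    # shape (kY,)
    in_right = inside(right, binary_probs, lex_probs)  # shape (kZ,)
    rule = binary_probs[(label, left[0], right[0])]    # shape (kX, kY, kZ)
    # inside[a] = sum_{b,c} P(X_a -> Y_b Z_c) * inside_Y[b] * inside_Z[c]
    return np.einsum("abc,b,c->a", rule, in_left, in_right)

# The tree likelihood is the sum over root subcategories (times any root prior);
# pairing this with an analogous outside pass gives the EM expected counts.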

Automatic Annotation Induction

Label all nodes with latent variables; same number k of subcategories for all categories

• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7

Refinement of the DT tag

[Figure: the DT tag refined into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• Start with two (per non-terminal), then keep splitting
• Initialize each EM run with the output of the last

Adaptive Splitting

• Want to split complex categories more
• Idea: split everything, roll back the splits which were least useful [Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with split reversed) / (data likelihood with split)
• No loss in accuracy when 50% of the splits are reversed
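A schematic sketch of that merge decision: rank splits by how much (log-)likelihood would be lost if they were reversed, and merge back the least useful half. The likelihood tables are assumed inputs here; in the real system they are approximated from inside/outside expected counts, so this is an illustration of the criterion, not Petrov et al.'s implementation.

# Sketch: pick the 50% of splits whose removal costs the least likelihood.
def merge_least_useful(splits, loglik_with, loglik_without, fraction=0.5):
    """
    splits: list of split identifiers.
    loglik_with / loglik_without: assumed dicts mapping a split to the data
    log-likelihood with the split kept vs. reversed.
    Returns the set of splits to merge back.
    """
    # Smaller loss = the split helped the likelihood less, so it is merged first.
    loss = {s: loglik_with[s] - loglik_without[s] for s in splits}
    ranked = sorted(splits, key=lambda s: loss[s])
    n_merge = int(len(ranked) * fraction)
    return set(ranked[:n_merge])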

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7

Hierarchical Pruning

coarse:          ... QP  NP  VP ...
split in two:    ... QP1 QP2   NP1 NP2   VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4   NP1 NP2 NP3 NP4   VP1 VP2 VP3 VP4 ...
split in eight:  ... (and so on) ...

Parse multiple times with grammars at different levels of granularity (a coarse-to-fine pruning sketch follows below)
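A schematic sketch of the pruning mask a coarse pass could hand to a finer pass, assuming bracket posteriors from the coarse grammar are available; the threshold value and function names are illustrative, not the actual Berkeley parser code.

# Sketch: allow a fine constituent only where its coarse projection had enough posterior mass.
THRESHOLD = 1e-4  # illustrative pruning threshold on coarse posteriors

def allowed_mask(coarse_posteriors, projection, fine_labels, threshold=THRESHOLD):
    """
    coarse_posteriors: dict mapping (i, j, coarse_label) -> posterior probability,
        as computed by inside-outside under the coarse grammar (assumed available).
    projection: dict mapping each fine label (e.g. 'NP3') to its coarse label ('NP').
    Returns the set of (i, j, fine_label) items the fine parser may build.
    """
    allowed = set()
    for (i, j, coarse_label), p in coarse_posteriors.items():
        if p < threshold:
            continue
        for fine in fine_labels:
            if projection[fine] == coarse_label:
                allowed.add((i, j, fine))
    return allowed

# The fine CKY pass then skips any (span, label) not in `allowed`, which is what
# makes parsing with the fully split grammar fast (cf. the timings on the next slide).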

Bracket Posteriors

[Figure: bracket posterior heatmaps at successive levels of grammar refinement]

Parsing times with hierarchical pruning: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)






Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

A phrase structure grammar

S NP VP

VP V NP

VP V NP PP

NP NP NP

NP NP PP

NP N

NP e

PP P NP

N people

N fish

N tanks

N rods

V people

V fish

V tanks

P with

Chomsky Normal Form steps

S NP VP

VP V NP

S V NP

VP V VP_V

VP_V NP PP

S V S_V

S_V NP PP

VP V PP

S V PP

NP NP NP

NP NP PP

NP P NP

PP P NP

NP people

NP fish

NP tanks

NP rods

V people

S people

VP people

V fish

S fish

VP fish

V tanks

S tanks

VP tanks

P with

PP with

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)

Order 1 vs. Order 2 (example trees)

[Figure: two charts over vertical Markov order 1 / 2v / 2 / 3v / 3: parsing accuracy (roughly 72-79 F1) and number of grammar symbols (roughly 5,000-25,000).]

Model    F1     Size
v=h=2v   77.8   7.5K
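Order-2 vertical markovization is parent annotation. A minimal sketch on an nltk.Tree (assuming NLTK is installed; NLTK's chomsky_normal_form tree transform exposes similar horzMarkov / vertMarkov options):

import nltk

def parent_annotate(tree, parent="ROOT"):
    """Order-2 vertical markovization: relabel every nonterminal as LABEL^PARENT."""
    if not isinstance(tree, nltk.Tree):
        return tree                       # leave the words alone
    return nltk.Tree("%s^%s" % (tree.label(), parent),
                     [parent_annotate(child, tree.label()) for child in tree])

t = nltk.Tree.fromstring("(S (NP (PRP He)) (VP (VBD was) (ADJP (JJ right))))")
print(parent_annotate(t))
# (S^ROOT (NP^S (PRP^NP He)) (VP^S (VBD^VP was) (ADJP^VP (JJ^ADJP right))))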

Unary Splits

bull Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Solution: mark unary rewrite sites with -U

Tag Splits

bull Problem: Treebank tags are too coarse

bull Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

bull Partial solution:
bull Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

bull UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")

bull UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")

bull TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

bull SPLIT-AUX: mark auxiliary verbs with -AUX [cf Charniak 97]

bull SPLIT-CC: separate "but" and "&" from other conjunctions

bull SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

bull Problem: sometimes the behavior of a category depends on something inside its future yield

bull Examples:
bull Possessive NPs
bull Finite vs. infinitival VPs
bull Lexical heads

bull Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

bull Problem: vanilla PCFGs cannot distinguish attachment heights

bull Solution: mark a property of higher or lower sites:
bull Contains a verb
bull Is (non)-recursive
bull Base NPs [cf Collins 99]
bull Right-recursive NPs

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

[Figure: an NP/VP/PP attachment example with the higher and lower sites marked v / -v for whether they dominate a verb.]

A Fully Annotated Tree

Final Test Set Results

bull Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
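The head-rule tables themselves are in the slide figures; as a stand-in, here is a simplified Collins-style sketch covering only NP and VP, with priority lists that only approximate the real tables:

# Simplified head-finding rules for NP and VP (priorities are approximate).
NP_PRIORITY = ["NN", "NNS", "NNP", "NNPS", "NX", "JJR", "NP"]        # scan right-to-left
VP_PRIORITY = ["VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP", "VP"]  # scan left-to-right

def find_head(label, children):
    """children: the rule's RHS as a list of (category, headword) pairs."""
    if label == "NP":
        for cat in NP_PRIORITY:
            for child in reversed(children):
                if child[0] == cat:
                    return child
        return children[-1]                    # default: rightmost child
    if label == "VP":
        for cat in VP_PRIORITY:
            for child in children:
                if child[0] == cat:
                    return child
        return children[0]                     # default: leftmost child
    return children[0]

print(find_head("VP", [("VBD", "saw"), ("NP", "her"), ("PP", "with")]))  # ('VBD', 'saw')
print(find_head("NP", [("DT", "the"), ("JJ", "red"), ("NN", "rug")]))    # ('NN', 'rug')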

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Figure: a span i..j built as X[h] → Y[h] Z[h'] with split point k and head word h, e.g. the rule VP → VBD[saw] NP[her], i.e. (VP → VBD NP)[saw].]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return the maximum of
      max over splits k and rules X[h] -> Y[h] Z[w]:
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over splits k and rules X[h] -> Y[w] Z[h]:
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]:
bull Essentially run the O(n^5) CKY
bull Remember only a few hypotheses for each span <i,j>
bull If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
bull Keeps things more or less cubic
bull Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
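A minimal sketch of the per-span beam idea (the actual Collins beam also thresholds hypotheses relative to the best item in the cell, in addition to the punctuation constraints above):

import heapq

def prune_cell(cell, K=10):
    """cell: dict mapping (category, headword) -> score; keep only the K best."""
    return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

cell = {("NP", "tanks"): 0.03, ("NP", "fish"): 0.001,
        ("VP", "fish"): 0.0004, ("S", "fish"): 0.00004}
print(prune_cell(cell, K=2))   # only the two highest-scoring hypotheses survive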

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis/Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categories:
bull NP: subject vs. object
bull DT: determiners vs. demonstratives
bull IN: sentential vs. prepositional

bull Advantages:
bull Fairly compact grammar
bull Linguistic motivations

bull Disadvantages:
bull Performance leveled out
bull Manually annotated

Learning Latent Annotations

Latent Annotations:

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

[Figure: a bracketed tree over "He was right" whose nodes X1 ... X7 carry latent subcategory variables.]

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside passes)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement: repeatedly learn more fine-grained subcategories:

bull start with two (per non-terminal), then keep splitting

bull initialize each EM run with the output of the last
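A sketch of a single split step under these assumptions (the division of the rule mass by 4 and the 1% perturbation are illustrative choices; renormalization and the actual re-estimation are left to EM):

import itertools, random

def split_grammar(rules, rng=random.Random(0)):
    """One split step: every symbol X becomes X-0 / X-1, every binary rule
    A -> B C is copied for all 2x2x2 subcategory combinations, and each copy
    gets a tiny random perturbation so EM can break the symmetry."""
    new_rules = {}
    for (A, B, C), p in rules.items():
        for a, b, c in itertools.product((0, 1), repeat=3):
            noise = 1.0 + 0.01 * (rng.random() - 0.5)
            new_rules["%s-%d" % (A, a), "%s-%d" % (B, b), "%s-%d" % (C, c)] = p / 4 * noise
    return new_rules

print(len(split_grammar({("S", "NP", "VP"): 0.9})))   # 8 refined copies of the rule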

Adaptive Splitting

Want to split complex categories more

Idea: split everything, then roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split:

loss(split) = P(data | split reversed) / P(data | split kept)

bull No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3               85.7
Matsuzaki et al. '05      86.7               86.1
Collins '99               88.6               88.2
Charniak & Johnson '05    90.1               89.6
Petrov et al. 06          90.2               89.7

Hierarchical Pruning

coarse:          ... QP NP VP ...
split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
split in eight:  ...

Parse multiple times with grammars at different levels of granularity
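A conceptual sketch of the pruning step between two passes: cells whose coarse-symbol posterior falls below a threshold are closed to all of that symbol's refined subcategories in the next, finer pass (symbol names and the threshold here are illustrative):

def allowed_refined(coarse_posteriors, projection, threshold=1e-4):
    """coarse_posteriors: (begin, end, coarse_symbol) -> posterior probability.
       projection: refined_symbol -> coarse_symbol.
       Returns the (begin, end, refined_symbol) items the next pass may build."""
    allowed = set()
    for (begin, end, coarse), post in coarse_posteriors.items():
        if post >= threshold:
            for refined, proj in projection.items():
                if proj == coarse:
                    allowed.add((begin, end, refined))
    return allowed

posteriors = {(0, 2, "NP"): 0.8, (1, 3, "VP"): 2e-6}     # from the coarser pass
projection = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "VP-2": "VP"}
print(sorted(allowed_refined(posteriors, projection)))
# only the NP-1 / NP-2 items over span (0, 2) survive the threshold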

Bracket Posteriors

Parsing time at successive pruning levels: 1621 min, 111 min, 35 min, 15 min [91.2 F1] (no search error)

Page 57: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Chomsky Normal Form

bull You should think of this as a transformation for efficient parsing

bull With some extra book-keeping in symbol names you can even reconstruct the same trees with a detransform

bull In practice full Chomsky Normal Form is a painbull Reconstructing n-aries is easy

bull Reconstructing unariesempties is trickier

bull Binarization is crucial for cubic time CFG parsing

bull The rest isnrsquot necessary it just makes the algorithms cleaner and a bit quicker

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")

• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")

• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]

• SPLIT-CC: separate "but" and "&" from other conjunctions

• SPLIT-%: "%" gets its own tag

F1 / Size (cumulative, one row per split above, in order):
80.4   8.1K
80.5   8.1K
81.2   8.5K
81.6   9.0K
81.7   9.1K
81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads

• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites
  • Contains a verb
  • Is (non)-recursive
    • Base NPs [cf. Collins 99]
    • Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

[Tree diagram: a higher attachment site under VP (dominates a verb, v) vs. a lower NP-internal site (-v)]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads: An Example for NPs

112

Rules to Recover Heads: An Example for VPs

113
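Since the head-rule tables are shown only as figures here, the following Python sketch gives the flavor of head percolation, with deliberately simplified priority lists for NP and VP (the real Collins tables are longer and differ in detail):

# Sketch: simplified head-finding rules.
# For an NP, search the children right-to-left for a nominal tag;
# for a VP, search left-to-right for a verbal tag; otherwise default
# to the leftmost child.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NNPS", "NP", "PRP"]),
    "VP": ("left",  ["VBD", "VBN", "VBZ", "VBP", "VB", "MD", "VP"]),
}

def find_head(parent, children):
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    indices = range(len(children)) if direction == "left" else range(len(children) - 1, -1, -1)
    for wanted in priorities:
        for i in indices:
            if children[i] == wanted:
                return i
    return 0   # default: the leftmost child

print(find_head("VP", ["VBD", "NP", "PP"]))   # 0 -> the VBD is the head child
print(find_head("NP", ["DT", "JJ", "NN"]))    # 2 -> the NN is the head child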

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: building X[h] over span (i, j) from Y[h] over (i, k) and Z[h'] over (k, j); h is the head position X inherits, h' the head of the other child]

(VP → VBD[saw] NP[her])

(VP → VBD NP)[saw]

bestScore(X, i, j, h)
  if (j == i)
    return score(X -> s[i])
  else
    return the maximum, over split points k and rules X -> Y Z, of the larger of:

      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
        (the head word h comes from the left child; w ranges over head positions in the right child)

      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)
        (the head word h comes from the right child; w ranges over head positions in the left child)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially, run the O(n⁵) CKY
  • Remember only a few hypotheses for each span <i, j>
  • If we keep K hypotheses at each span, then we do at most O(nK²) work per span (why?)
  • Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

[Diagram: combining Y[h] over (i, k) and Z[h'] over (k, j) into X[h] over (i, j); a tiny beam-pruning sketch follows]
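A minimal sketch of the per-cell beam, with made-up scores (my illustration): each chart cell keeps only its K best (category, head) hypotheses, so combining two cells costs at most K² rule applications per split point.

import heapq

# A cell maps hypotheses (label, head word) to scores;
# prune_cell keeps only the K highest-scoring entries.
def prune_cell(cell, K):
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

cell = {("NP", "tanks"): 0.042, ("VP", "fish"): 0.105,
        ("S", "fish"): 0.0042, ("NP", "fish"): 0.0049}
print(prune_cell(cell, K=2))   # keeps the two best: VP[fish] and NP[tanks]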

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis/Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits

• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional

• Advantages
  • Fairly compact grammar
  • Linguistic motivations

• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:

• Brackets are known

• Base categories are known

• Hidden variables for subcategories

[Tree diagram: latent subcategory variables X1 ... X7 on the nodes of a parse of "He was right"]

Can learn with EM, like Forward-Backward for HMMs (Forward/Outside and Backward/Inside probabilities)

Automatic Annotation Induction

• Advantages
  • Automatically learned

[Approach: label all nodes with latent variables, with the same number k of subcategories for all categories]

• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7

Refinement of the DT tag

[Diagram: the DT tag refined into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last

[Diagram: hierarchical binary splitting of DT into progressively finer subcategories; a tiny sketch of the splitting bookkeeping follows below]
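A toy sketch of just the splitting bookkeeping (my illustration; the EM re-estimation that gives each subcategory its own rule probabilities is elided): every round doubles the number of subcategories of each symbol, and each new subcategory records which one it was split from.

def split_symbols(symbols):
    """One split round: each subcategory spawns two children."""
    return [s + suffix for s in symbols for suffix in ("-0", "-1")]

symbols = ["DT"]
for round_no in range(1, 4):
    symbols = split_symbols(symbols)
    print("after round", round_no, ":", symbols)
# after round 1 : ['DT-0', 'DT-1']
# after round 2 : ['DT-0-0', 'DT-0-1', 'DT-1-0', 'DT-1-1']
# after round 3 : eight subcategories, and so on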

Adaptive Splitting

Want to split complex categories more

Idea: split everything, then roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

  loss = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …

split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight:  … and so on …

Parse multiple times with grammars at different levels of granularity
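A minimal sketch of the pruning step with toy numbers (the posteriors, projection map, and threshold are all illustrative assumptions): after parsing with a coarser grammar, a span keeps only those refined symbols whose coarse projection received non-negligible posterior probability.

# coarse_posterior holds posteriors from the coarser pass; projection maps
# each refined symbol back to its coarse symbol. Both are toy data.
coarse_posterior = {
    (0, 2, "NP"): 0.90, (0, 2, "VP"): 1e-7,
    (2, 4, "VP"): 0.80, (2, 4, "NP"): 0.02,
}
projection = {"NP-1": "NP", "NP-2": "NP", "VP-1": "VP", "VP-2": "VP"}
THRESHOLD = 1e-4

def allowed_fine_symbols(i, j):
    """Refined symbols still allowed in cell (i, j) after pruning."""
    return [fine for fine, coarse in projection.items()
            if coarse_posterior.get((i, j, coarse), 0.0) > THRESHOLD]

print(allowed_fine_symbols(0, 2))   # ['NP-1', 'NP-2']  (VP-1, VP-2 pruned)
print(allowed_fine_symbols(2, 4))   # all four survive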

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Page 58: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

NN

An example before binarizationhellip

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 59: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

VP_V

ROOT

S

After binarizationhellip

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 60: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Treebank empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Parsing

66

Constituency Parsing

fish people fish tanks

Rule Prob θi

S NP VP θ0

NP NP NP θ1

hellip

N fish θ42

N people θ43

V fish θ44

hellip

PCFG

N N V N

VP

NPNP

S

Cocke-Kasami-Younger (CKY) Constituency Parsing

fish people fish tanks

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories:
  • NP: subject vs object
  • DT: determiners vs demonstratives
  • IN: sentential vs prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

• Brackets are known
• Base categories are known
• Hidden variables for subcategories

[Figure: parse tree over "He was right" with latent subcategory variables X1 ... X7 at its nodes]

Can learn with EM, like Forward-Backward for HMMs (Forward ~ Outside, Backward ~ Inside)

Automatic Annotation Induction

• Advantages:
  • Automatically learned:
    Label all nodes with latent variables
    Same number k of subcategories for all categories
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                 F1
Klein & Manning '03   86.3
Matsuzaki et al '05   86.7

Refinement of the DT tag

[Figure: DT refined into DT-1, DT-2, DT-3, DT-4 by hierarchical binary splitting]

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting

Want to split complex categories more.
Idea: split everything, roll back the splits which were least useful. [Petrov et al. 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

      loss(split) = P(data | grammar with split reversed) / P(data | grammar with split)

• No loss in accuracy when 50% of the splits are reversed; a small ranking sketch follows below
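As a small illustration of this criterion, suppose we already have the likelihood ratio for every candidate split; ranking them and rolling back the half with ratios closest to 1 is then a one-liner. The split names and numbers below are invented.

def splits_to_merge(split_likelihood_ratios, fraction=0.5):
    """split_likelihood_ratios: {split: P(data | split reversed) / P(data | split)}.
    Ratios near 1 mean reversing the split costs almost nothing, so merge those."""
    ranked = sorted(split_likelihood_ratios, key=split_likelihood_ratios.get, reverse=True)
    return ranked[: int(fraction * len(ranked))]

ratios = {"NP-1/NP-2": 0.999, "VP-1/VP-2": 0.62, "DT-1/DT-2": 0.97, ",-1/,-2": 1.0}
print(splits_to_merge(ratios))   # the two least useful splits get rolled back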

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al '05      86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al 06          90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … (and so on) …

Parse multiple times with grammars at different levels of granularity (a pruning sketch follows below)
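A small sketch of the pruning step between two passes, assuming bracket posteriors from the coarser grammar are already available: a refined item enters the next chart only if its coarse projection had non-negligible posterior. The data structures, projection function, and threshold are illustrative.

def allowed_after_pruning(coarse_posteriors, projection, threshold=1e-4):
    """coarse_posteriors: {(span, coarse_label): posterior} from the previous pass.
    projection maps a refined label such as 'NP-3' to its coarse label 'NP'."""
    def allowed(span, refined_label):
        return coarse_posteriors.get((span, projection(refined_label)), 0.0) >= threshold
    return allowed

# Toy usage: NP over span (2, 4) survived the coarse pass, QP did not.
posteriors = {((2, 4), "NP"): 0.63, ((2, 4), "QP"): 1e-7}
allowed = allowed_after_pruning(posteriors, projection=lambda label: label.split("-")[0])
print(allowed((2, 4), "NP-3"))   # True  -> keep refining this item
print(allowed((2, 4), "QP-1"))   # False -> pruned before the finer pass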

Bracket Posteriors

[Figure: bracket posteriors at increasing levels of grammar refinement; with hierarchical pruning, parsing time falls from 1621 min to 111 min to 35 min to 15 min, at 91.2 F1 with no search error]

Page 64: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Viterbi (Max) Scores

people fish

NP 035V 01N 05

VP 006NP 014V 06N 02

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammar (binary, no epsilons)

S → NP VP 0.9
S → VP 0.1
VP → V NP 0.5
VP → V 0.1
VP → V VP_V 0.3
VP → V PP 0.1
VP_V → NP PP 1.0
NP → NP NP 0.1
NP → NP PP 0.2
NP → N 0.7
PP → P NP 1.0
N → people 0.5
N → fish 0.2
N → tanks 0.2
N → rods 0.1
V → people 0.1
V → fish 0.6
V → tanks 0.3
P → with 1.0
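For readers who want to run the algorithm, here is a compact Python version of the CKY-with-unaries pseudocode above, using exactly this toy grammar. It is a sketch for illustration, not the original course code.

binary = {          # A -> B C : prob
    ("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.5, ("VP", "V", "VP_V"): 0.3,
    ("VP", "V", "PP"): 0.1, ("VP_V", "NP", "PP"): 1.0, ("NP", "NP", "NP"): 0.1,
    ("NP", "NP", "PP"): 0.2, ("PP", "P", "NP"): 1.0,
}
unary = {("S", "VP"): 0.1, ("VP", "V"): 0.1, ("NP", "N"): 0.7}   # A -> B : prob
lexicon = {                                                      # A -> word : prob
    ("N", "people"): 0.5, ("N", "fish"): 0.2, ("N", "tanks"): 0.2, ("N", "rods"): 0.1,
    ("V", "people"): 0.1, ("V", "fish"): 0.6, ("V", "tanks"): 0.3, ("P", "with"): 1.0,
}

def cky(words):
    n = len(words)
    score = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    def apply_unaries(i, j):
        added = True
        while added:                       # closure over unary chains
            added = False
            for (A, B), q in unary.items():
                p = q * score[i][j].get(B, 0.0)
                if p > score[i][j].get(A, 0.0):
                    score[i][j][A], back[i][j][A] = p, B
                    added = True

    for i, w in enumerate(words):          # lexicon + unaries on the diagonal
        for (A, word), q in lexicon.items():
            if word == w:
                score[i][i + 1][A] = q
        apply_unaries(i, i + 1)

    for span in range(2, n + 1):           # binary rules, then unaries, per span
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (A, B, C), q in binary.items():
                    p = score[begin][split].get(B, 0.0) * score[split][end].get(C, 0.0) * q
                    if p > score[begin][end].get(A, 0.0):
                        score[begin][end][A] = p
                        back[begin][end][A] = (split, B, C)
            apply_unaries(begin, end)

    def build(A, i, j):                    # follow backpointers to the best tree
        bp = back[i][j].get(A)
        if bp is None:
            return (A, words[i])
        if isinstance(bp, str):            # unary backpointer
            return (A, build(bp, i, j))
        split, B, C = bp
        return (A, build(B, i, split), build(C, split, j))

    return score[0][n].get("S", 0.0), build("S", 0, n)

prob, tree = cky("fish people fish tanks".split())
print(prob)   # about 0.00018522
print(tree)

On the sentence used in the walkthrough below, this reproduces the cell values shown there and ends with S built from NP and VP over the whole span with probability about 0.00018522.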

The chart for the sentence "fish people fish tanks" has a cell score[i][j] for every span 0 <= i < j <= 4. The walkthrough below lists the best score found for each nonterminal in each cell as the loops above fill the chart.

Step 1, lexicon (the A → words[i] loop fills the diagonal cells):
  [0,1] fish:   N → fish 0.2,   V → fish 0.6
  [1,2] people: N → people 0.5, V → people 0.1
  [2,3] fish:   N → fish 0.2,   V → fish 0.6
  [3,4] tanks:  N → tanks 0.2,  V → tanks 0.3

Step 2, unaries applied to the diagonal cells:
  [0,1]: NP → N 0.14, VP → V 0.06, S → VP 0.006
  [1,2]: NP → N 0.35, VP → V 0.01, S → VP 0.001
  [2,3]: NP → N 0.14, VP → V 0.06, S → VP 0.006
  [3,4]: NP → N 0.14, VP → V 0.03, S → VP 0.003

Step 3, binary rules over the spans of length 2:
  [0,2]: NP → NP NP 0.0049,  VP → V NP 0.105,  S → NP VP 0.00126
  [1,3]: NP → NP NP 0.0049,  VP → V NP 0.007,  S → NP VP 0.0189
  [2,4]: NP → NP NP 0.00196, VP → V NP 0.042,  S → NP VP 0.00378

Step 4, unaries over the spans of length 2 (S → VP overtakes S → NP VP in two cells):
  [0,2]: S → VP 0.0105
  [1,3]: S → NP VP 0.0189 (unchanged)
  [2,4]: S → VP 0.0042

Step 5, spans of length 3:
  [0,3]: NP → NP NP 0.0000686, VP → V NP 0.00147,  S → NP VP 0.000882
  [1,4]: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

Step 6, the full span:
  [0,4]: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Call buildTree(score, back) to get the best parse.

Evaluating constituency parsing

Gold standard brackets: S-(0,11), NP-(0,2), VP-(2,9), VP-(3,9), NP-(4,6), PP-(6,9), NP-(7,9), NP-(9,10)
Candidate brackets: S-(0,11), NP-(0,2), VP-(2,10), VP-(3,10), NP-(4,6), PP-(6,10), NP-(7,10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall: 3/8 = 37.5%
LP/LR F1: 40.0%
Tagging Accuracy: 11/11 = 100.0%
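The same numbers can be reproduced with a few lines of Python. This is a PARSEVAL-style sketch over labeled spans, not the official evalb tool.

def bracket_scores(gold, candidate):
    gold, candidate = set(gold), set(candidate)
    matched = len(gold & candidate)
    lp = matched / len(candidate)          # labeled precision
    lr = matched / len(gold)               # labeled recall
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

gold = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 9), ("VP", 3, 9),
        ("NP", 4, 6), ("PP", 6, 9), ("NP", 7, 9), ("NP", 9, 10)}
cand = {("S", 0, 11), ("NP", 0, 2), ("VP", 2, 10), ("VP", 3, 10),
        ("NP", 4, 6), ("PP", 6, 10), ("NP", 7, 10)}
print(bracket_scores(gold, cand))   # about (0.429, 0.375, 0.4)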

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  • (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

(example parse trees shown as figures on the original slides)

A Case of Coordination Ambiguity

(example parse trees shown as figures on the original slides)

Structural Preferences: Close Attachment

• Example: "John was believed to have been shot by Bill."
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.

(figure: a tree fragment with S → NP VP and NP → DT NN, illustrating that only the NP label links the NP's inside and outside)

Non-Independence I

• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).

(bar chart: the relative frequencies of the expansions NP → NP PP, NP → DT NN, and NP → PRP differ sharply between all NPs, NPs under S, and NPs under VP)

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong.

(figure annotation: in the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing


Horizontal Markovization

• Horizontal Markovization merges states.

(figures: parsing F1 stays roughly in the 70-74 range as the horizontal Markov order varies over 0, 1, 2v, 2, ∞, while the number of grammar symbols grows from under 3,000 toward 12,000)

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation).

(figures: example trees at order 1 vs. order 2; parsing F1 rises roughly from 72 to 79 as the vertical order varies over 1, 2v, 2, 3v, 3, while the number of symbols grows toward 25,000)

Model   | F1   | Size
v=h=2v  | 77.8 | 7.5K
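A small Python sketch of what the two Markovizations do to symbols; the symbol-naming conventions here are made up for illustration, not the ones used in the cited parsers.

def parent_annotate(label, parent):
    """Vertical order 2: a symbol remembers its parent (e.g. NP under VP becomes NP^VP)."""
    return "%s^%s" % (label, parent)

def horizontal_binarize(lhs, rhs, h):
    """Right-branching binarization whose intermediate symbols remember only the last
    h sisters already generated (horizontal Markov order h)."""
    rules = []
    current = lhs
    for i in range(len(rhs) - 2):
        memory = rhs[max(0, i + 1 - h):i + 1] if h > 0 else []
        new_sym = "@%s|%s" % (lhs, "-".join(memory))
        rules.append((current, (rhs[i], new_sym)))
        current = new_sym
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

print(parent_annotate("NP", "VP"))                                 # NP^VP
for rule in horizontal_binarize("NP", ["DT", "JJ", "JJ", "JJ", "NN"], 1):
    print(rule)
# with h=1 every position after a JJ reuses the same state @NP|JJ, so the
# intermediate symbols merge; with a large h they would all stay distinct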

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation | F1   | Size
Base       | 77.8 | 7.5K
UNARY      | 78.3 | 8.0K

Tag Splits

• Problem: Treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation | F1   | Size
Previous   | 78.3 | 8.0K
SPLIT-IN   | 80.3 | 8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those") (F1 80.4, size 8.1K)
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very") (F1 80.5, size 8.1K)
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP) (F1 81.2, size 8.5K)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97] (F1 81.6, size 9.0K)
• SPLIT-CC: separate "but" and "&" from other conjunctions (F1 81.7, size 9.1K)
• SPLIT-%: "%" gets its own tag (F1 81.8, size 9.3K)

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes.

Annotation | F1   | Size
tag splits | 82.3 | 9.7K
POSS-NP    | 83.1 | 9.8K
SPLIT-VP   | 85.7 | 10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower attachment sites:
  • Contains a verb
  • Is (non-)recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation    | F1   | Size
Previous      | 85.7 | 10.5K
BASE-NP       | 86.0 | 11.7K
DOMINATES-V   | 86.9 | 14.1K
RIGHT-REC-NP  | 87.0 | 15.2K

(figure: an NP / VP / PP attachment configuration with the lower site marked v or -v according to whether it dominates a verb)

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers.

Parser              | LP   | LR   | F1
Magerman 95         | 84.9 | 84.6 | 84.7
Collins 96          | 86.3 | 85.8 | 86.0
Klein & Manning 03  | 86.9 | 85.7 | 86.3
Charniak 97         | 87.4 | 87.5 | 87.4
Collins 99          | 88.7 | 88.6 | 88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

(the head tables, headword-annotated trees, and worked example on these slides are not reproduced in this text; a sketch of head finding follows)
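The following Python sketch only illustrates the general head-finding scheme; the priority lists are simplified stand-ins, not Collins' actual head-percolation tables.

# Illustrative head-finding rules (the priority lists are made up for this sketch).
HEAD_RULES = {
    "NP": ("rightmost", ["NN", "NNS", "NNP", "NNPS", "NP"]),
    "VP": ("leftmost",  ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
    "S":  ("leftmost",  ["VP", "S"]),
    "PP": ("leftmost",  ["IN", "TO"]),
}

def find_head(parent, children):
    """Return the index of the head child of `parent`, given its children's labels."""
    direction, priorities = HEAD_RULES.get(parent, ("leftmost", []))
    order = range(len(children)) if direction == "leftmost" else range(len(children) - 1, -1, -1)
    for label in priorities:               # scan by priority, then by direction
        for i in order:
            if children[i] == label:
                return i
    return 0 if direction == "leftmost" else len(children) - 1   # fallback

print(find_head("VP", ["VBD", "NP", "PP"]))   # 0: the VBD ("saw") heads the VP
print(find_head("NP", ["DT", "JJ", "NN"]))    # 2: the NN heads the NP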

Lexicalized CKY

(figure: a span [i, j] headed by h is built from Y[h] over [i, k] and Z[h'] over [k, j]; e.g. VP[saw] → VBD[saw] NP[her], or, with the rule itself lexicalized, (VP → VBD NP)[saw])

bestScore(X, i, j, h)
  if (j = i)
    return score(X, s[i])
  else
    return the larger of
      max over k, w, and rules X -> Y Z of
        score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over k, w, and rules X -> Y Z of
        score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs


Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY.
  • Remember only a few hypotheses for each span <i, j>.
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?), which keeps things more or less cubic (a small sketch follows).
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
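A minimal sketch of per-cell beam pruning in Python; the chart-cell representation is assumed, and the real Collins beam also uses relative-score thresholds and the punctuation rules mentioned above.

import heapq

def prune_cell(cell, K):
    """Keep only the K highest-scoring hypotheses for one span <i, j>.

    cell: dict mapping (label, headword) -> score
    """
    if len(cell) <= K:
        return cell
    kept = heapq.nlargest(K, cell.items(), key=lambda kv: kv[1])
    return dict(kept)

cell = {("NP", "man"): 0.02, ("NP", "the"): 1e-9, ("S", "saw"): 3e-4, ("VP", "saw"): 0.01}
print(prune_cell(cell, 2))   # keeps the NP[man] and VP[saw] hypotheses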

Parameter Estimation

A Model from Charniak (1997)

Other Details

(the generative model and its estimation details are given as equations and figures on the original slides and are not reproduced in this text)

Final Test Set Results

Parser              | LP   | LR   | F1
Magerman 95         | 84.9 | 84.6 | 84.7
Collins 96          | 86.3 | 85.8 | 86.0
Klein & Manning 03  | 86.9 | 85.7 | 86.3
Charniak 97         | 87.4 | 87.5 | 87.4
Collins 99          | 88.7 | 88.6 | 88.6

Analysis / Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits

• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

(figure: a tree over "He was right" whose nodes carry hidden subcategory variables X1 … X7)

Can learn with EM, like Forward-Backward for HMMs: Forward corresponds to Outside and Backward to Inside.

Automatic Annotation Induction

Label all nodes with latent variables; the same number k of subcategories for all categories.

• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model               | F1
Klein & Manning '03 | 86.3
Matsuzaki et al '05 | 86.7

Refinement of the DT tag

(figure: DT is split into subcategories DT-1, DT-2, DT-3, DT-4)

Hierarchical refinement: repeatedly learn more fine-grained subcategories.
• Start with two subcategories per non-terminal, then keep splitting (a small sketch of the split step follows).
• Initialize each EM run with the output of the last.
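A minimal Python sketch of the split step that this hierarchy relies on. The perturbation constant and the naming scheme are assumptions; the merge step and the EM re-estimation itself are not shown.

import itertools, random

def split_grammar(binary_rules, noise=0.01, seed=0):
    """binary_rules: dict (A, B, C) -> prob. Returns a refined dict over split symbols
    A_0, A_1, etc. Each refined parent keeps roughly the original rule mass: the four
    child combinations per refined parent each start near prob / 4, plus a little
    random noise to break symmetry before EM."""
    rng = random.Random(seed)
    refined = {}
    for (A, B, C), p in binary_rules.items():
        for a, b, c in itertools.product((0, 1), repeat=3):
            perturbed = (p / 4.0) * (1.0 + noise * rng.uniform(-1, 1))
            refined[("%s_%d" % (A, a), "%s_%d" % (B, b), "%s_%d" % (C, c))] = perturbed
    return refined

refined = split_grammar({("S", "NP", "VP"): 0.9})
print(len(refined))   # 8 refined versions of S -> NP VP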


Page 65: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Extended CKY parsing

bull Unaries can be incorporated into the algorithmbull Messy but doesnrsquot increase algorithmic complexity

bull Empties can be incorporatedbull Use fenceposts

bull Doesnrsquot increase complexity essentially like unaries

bull Binarization is vitalbull Without binarization you donrsquot get parsing cubic in the length of the

sentence and in the number of nonterminals in the grammar

bull Binarization may be an explicit transformation or implicit in how the parser works (Early-style dotted rules) but itrsquos always there

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 66: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Recursive Parser

bestScore(Xijs)

if (j == i)

return q(X-gts[i])

else

return max q(X-gtYZ)

bestScore(Yiks)

bestScore(Zk+1js)

kX-gtYZ

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 67: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

function CKY(words grammar) returns [most_probable_parseprob]

score = new double[(words)+1][(words)+1][(nonterms)]

back = new Pair[(words)+1][(words)+1][nonterms]]

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if prob gt score[i][i+1][A]

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

The CKY algorithm (19601965)hellip extended to unaries

for span = 2 to (words)

for begin = 0 to (words)- span

end = begin + span

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

return buildTree(score back)

The CKY algorithm (19601965)hellip extended to unaries

The grammarBinary no epsilons

S NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks 02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
  (a word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment

• Example: John was believed to have been shot by Bill
• The low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing), so the two analyses receive the same probability.

PCFGs and Independence
• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.
  (Figure: a tree with S → NP VP and NP → DT NN, illustrating the inside/outside split at an NP node.)

Non-Independence I
• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects); a small corpus-counting sketch follows the table below.

                NP → NP PP   NP → DT NN   NP → PRP
All NPs             11%           9%          6%
NPs under S          9%           9%         21%
NPs under VP        23%           7%          4%
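This kind of non-independence is easy to see in data. The sketch below is an illustration, not part of the lecture: it uses NLTK's bundled 10% sample of the Penn Treebank (it assumes nltk is installed and nltk.download('treebank') has been run), so the percentages it prints will not match the full-WSJ figures above exactly, but the subject/object asymmetry for NP → PRP shows up clearly.

    from collections import Counter
    from nltk.corpus import treebank
    from nltk.tree import ParentedTree, Tree

    counts = Counter()   # (parent category, NP expansion) -> frequency
    for sent in treebank.parsed_sents():
        ptree = ParentedTree.convert(sent)
        for np in ptree.subtrees(lambda t: t.label().split("-")[0] == "NP"):
            parent = np.parent()
            plabel = parent.label().split("-")[0] if parent is not None else "ROOT"
            rhs = " ".join(c.label().split("-")[0] for c in np if isinstance(c, Tree))
            counts[(plabel, rhs)] += 1

    for plabel in ("S", "VP"):
        total = sum(c for (p, _), c in counts.items() if p == plabel)
        print(plabel, "NP -> PRP:", f"{counts[(plabel, 'PRP')] / total:.1%}")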

Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong.
  • (In the PTB, this construction is for possessives.)

Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states (a small binarization sketch follows).
(Plots: parsing F1, roughly 70-74, and grammar size in symbols, roughly 3,000-12,000, as a function of horizontal Markov order 0, 1, 2v, 2, inf.)
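To make the idea concrete, here is a tiny Python sketch (with our own illustrative naming scheme for the intermediate symbols, not the lecture's exact notation) of how an n-ary rule is binarized while remembering only the last h sibling symbols; as h shrinks, more of the intermediate states collapse together.

    def binarize(lhs, rhs, h=1):
        # Binarize lhs -> rhs, keeping at most h previous siblings in state names.
        if len(rhs) <= 2:
            return [(lhs, tuple(rhs))]
        rules, prev = [], lhs
        for i in range(len(rhs) - 2):
            seen = rhs[max(0, i + 1 - h):i + 1]        # last h generated siblings
            new = f"@{lhs}->_{'_'.join(seen)}"         # intermediate symbol
            rules.append((prev, (rhs[i], new)))
            prev = new
        rules.append((prev, (rhs[-2], rhs[-1])))
        return rules

    # With h=1 the state after generating NP forgets the earlier VBD;
    # with h=0 all intermediate states of a VP rule merge into "@VP->_".
    print(binarize("VP", ["VBD", "NP", "PP", "PP"], h=1))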

Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation). (Example trees: Order 1 vs. Order 2; a small parent-annotation sketch follows the table.)
(Plots: parsing F1, roughly 72-79, and grammar size in symbols, roughly 5,000-25,000, as a function of vertical Markov order 1, 2v, 2, 3v, 3.)

Model     F1     Size
v=h=2v    77.8   7.5K
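A minimal sketch of order-2 vertical Markovization (parent annotation) over an NLTK-style tree; in practice the grammars discussed here treat POS tags separately (that is the TAG-PA split below), but for illustration every node simply gets its parent's label appended.

    from nltk.tree import Tree

    def parent_annotate(tree, parent="ROOT"):
        if isinstance(tree, str):          # leave the words themselves untouched
            return tree
        label = tree.label()
        return Tree(f"{label}^{parent}", [parent_annotate(c, label) for c in tree])

    t = Tree.fromstring("(S (NP (PRP They)) (VP (VBD saw) (NP (DT the) (NN fish))))")
    print(parent_annotate(t))
    # (S^ROOT (NP^S (PRP^NP They)) (VP^S (VBD^VP saw) (NP^VP (DT^NP the) (NN^NP fish))))

The NP under S becomes NP^S and the NP under VP becomes NP^VP, which is exactly the distinction the expansion statistics above call for.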

Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits
• Problem: treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits
                                                                      F1     Size
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")         80.4   8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")       80.5   8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)    81.2   8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]         81.6   9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions            81.7   9.1K
• SPLIT-%: "%" gets its own tag                                       81.8   9.3K

Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinitival VPs
  • Lexical heads
• Solution: annotate future elements into nodes (a small sketch of the possessive case follows the table).

Annotation   F1     Size
tag splits   82.3    9.7K
POSS-NP      83.1    9.8K
SPLIT-VP     85.7   10.5K
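As a small illustration of the POSS-NP idea (a sketch under our own labeling convention, not the exact annotation used in the lecture): relabel any NP whose last child is a possessive POS marker, since such NPs expand very differently from ordinary ones.

    from nltk.tree import Tree

    def mark_possessive_nps(tree):
        if isinstance(tree, str):
            return tree
        children = [mark_possessive_nps(c) for c in tree]
        label = tree.label()
        if label == "NP" and children and isinstance(children[-1], Tree) and children[-1].label() == "POS":
            label = "NP-POSS"          # hypothetical label, just for illustration
        return Tree(label, children)

    t = Tree.fromstring("(NP (NP (NNP John) (POS 's)) (NN dog))")
    print(mark_possessive_nps(t))      # (NP (NP-POSS (NNP John) (POS 's)) (NN dog))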

Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs
(Figure: an NP/VP/PP attachment configuration with the verb-dominating region marked "v" and the other marked "-v".)

Annotation      F1     Size
Previous        85.7   10.5K
BASE-NP         86.0   11.7K
DOMINATES-V     86.9   14.1K
RIGHT-REC-NP    87.0   15.2K

A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers.

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs
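The head-rule tables themselves did not survive the export, so here is a heavily abridged sketch of the general mechanism: each category has a search direction and a priority list of child categories, and the first match is the head child. The two priority lists below are simplified stand-ins in the spirit of the Collins rules, not the actual tables from these slides.

    # direction to scan the children in, plus an (abridged, illustrative) priority list
    HEAD_RULES = {
        "NP": ("right-to-left", ["NN", "NNS", "NNP", "NNPS", "NP", "CD", "JJ"]),
        "VP": ("left-to-right", ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "MD", "VP"]),
    }

    def head_child_index(label, children):
        direction, priority = HEAD_RULES.get(label, ("left-to-right", []))
        order = range(len(children)) if direction == "left-to-right" else range(len(children) - 1, -1, -1)
        for cat in priority:               # first priority category found wins
            for i in order:
                if children[i] == cat:
                    return i
        return 0 if direction == "left-to-right" else len(children) - 1

    print(head_child_index("NP", ["DT", "JJ", "NN"]))    # 2: the NN heads the NP
    print(head_child_index("VP", ["VBD", "NP", "PP"]))   # 0: the VBD heads the VP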

Adding Headwords to Trees

Lexicalized CFGs in Chomsky Normal Form

Example

Lexicalized CKY

(Chart item: X[h] over span [i, j] is built from Y[h] over [i, k] and Z[h'] over [k, j]; e.g., (VP → VBD[saw] NP[her]) gives (VP → VBD NP)[saw].)

bestScore(X, i, j, h)
  if (j = i)
    return score(X, s[i])
  else
    return max of
      max over split points k and head words w, for rules X → Y Z:
        score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      max over split points k and head words w, for rules X → Y Z:
        score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]:
  • Essentially, run the O(n^5) CKY.
  • Remember only a few hypotheses for each span <i, j>.
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?).
  • Keeps things more or less cubic.
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
(A small per-cell beam sketch follows.)
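A per-cell beam is simple to state in code; this sketch is illustrative only (not the Collins parser's actual data structures) and keeps just the K highest-scoring labelled hypotheses for a span before the parser moves on.

    def prune_cell(cell, K=10):
        # cell maps a (possibly lexicalized) label to its best score for this span
        kept = sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:K]
        return dict(kept)

    print(prune_cell({"NP": 0.0049, "VP": 0.105, "S": 0.0105}, K=2))
    # {'VP': 0.105, 'S': 0.0105}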

Parameter Estimation

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6

Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar
• Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  • Parent annotation [Johnson '98]
  • Head lexicalization [Collins '99, Charniak '00]
  • Automatic clustering

Manual Splits
• Manually split categories:
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages:
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages:
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories
(Example tree: hidden subcategory variables X1 ... X7 over the bracketing of "He was right".)
Can learn with EM, like Forward-Backward for HMMs: Forward/Outside and Backward/Inside passes.

Automatic Annotation Induction
• Label all nodes with latent variables; the same number k of subcategories for all categories.
• Advantages:
  • Automatically learned
• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag: DT is split into DT-1, DT-2, DT-3, DT-4.

Hierarchical refinement: repeatedly learn more fine-grained subcategories of DT:
• start with two subcategories per non-terminal, then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting [Petrov et al. 06]
• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful.
• Evaluate the loss in likelihood from removing each split as the ratio of the data likelihood with the split reversed to the data likelihood with the split.
• No loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03            86.3              85.7
Matsuzaki et al. '05           86.7              86.1
Collins '99                    88.6              88.2
Charniak & Johnson '05         90.1              89.6
Petrov et al. 06               90.2              89.7

Hierarchical Pruning
• Parse multiple times with grammars at different levels of granularity:
  coarse:          ... QP  NP  VP ...
  split in two:    ... QP1 QP2 NP1 NP2 VP1 VP2 ...
  split in four:   ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
  split in eight:  ...

Bracket Posteriors
(Coarse-to-fine parsing times: 1621 min → 111 min → 35 min → 15 min, at 91.2 F1 with no search error.)

Page 71: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for i=0 ilt(words) i++

for A in nonterms

if A -gt words[i] in grammar

score[i][i+1][A] = P(A -gt words[i])

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

For longer spans, try every split point (grammar as above):

for split = begin+1 to end-1
  for A, B, C in nonterms
    prob = score[begin][split][B] * score[split][end][C] * P(A->B C)
    if prob > score[begin][end][A]
      score[begin][end][A] = prob
      back[begin][end][A] = new Triple(split, B, C)

New span-3 cell:
[0,3] fish people fish: NP → NP NP 0.0000686, VP → V NP 0.00147, S → NP VP 0.000882

Same loop for the other span-3 cell (grammar as above):

[1,4] people fish tanks: NP → NP NP 0.0000686, VP → V NP 0.000098, S → NP VP 0.01323

And finally the full span (grammar as above):

[0,4] fish people fish tanks: NP → NP NP 0.0000009604, VP → V NP 0.00002058, S → NP VP 0.00018522

Once the chart is complete, call buildTree(score, back) to recover the best parse by following the backpointers from S over [0,4].
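
To tie the pseudocode fragments above together, here is a compact runnable sketch of the whole probabilistic CKY loop in Python. The grammar and sentence are the toy ones from the chart above; the data-structure layout and helper names are my own, not from the original pseudocode. It interleaves the binary step and the unary closure in the same way, and on this grammar it should reproduce the chart values shown (e.g. S with probability 0.00018522 over the full span).

from collections import defaultdict

# Toy grammar from the example: (parent, children) -> probability
binary = {('S', ('NP', 'VP')): 0.9, ('VP', ('V', 'NP')): 0.5,
          ('VP', ('V', 'VP_V')): 0.3, ('VP', ('V', 'PP')): 0.1,
          ('VP_V', ('NP', 'PP')): 1.0, ('NP', ('NP', 'NP')): 0.1,
          ('NP', ('NP', 'PP')): 0.2, ('PP', ('P', 'NP')): 1.0}
unary = {('S', ('VP',)): 0.1, ('VP', ('V',)): 0.1, ('NP', ('N',)): 0.7}
lexicon = {('N', 'people'): 0.5, ('N', 'fish'): 0.2, ('N', 'tanks'): 0.2,
           ('N', 'rods'): 0.1, ('V', 'people'): 0.1, ('V', 'fish'): 0.6,
           ('V', 'tanks'): 0.3, ('P', 'with'): 1.0}

def cky(words):
    n = len(words)
    score = defaultdict(float)   # (begin, end, label) -> best probability
    back = {}                    # (begin, end, label) -> backpointer

    def close_unaries(b, e):     # the "handle unaries" loop from the slides
        added = True
        while added:
            added = False
            for (a, (child,)), p in unary.items():
                prob = p * score[b, e, child]
                if prob > score[b, e, a]:
                    score[b, e, a] = prob
                    back[b, e, a] = child
                    added = True

    for i, w in enumerate(words):            # lexical rules on the diagonal
        for (a, word), p in lexicon.items():
            if word == w:
                score[i, i + 1, a] = p
        close_unaries(i, i + 1)

    for span in range(2, n + 1):             # longer spans, every split point
        for begin in range(0, n - span + 1):
            end = begin + span
            for split in range(begin + 1, end):
                for (a, (b_sym, c_sym)), p in binary.items():
                    prob = score[begin, split, b_sym] * score[split, end, c_sym] * p
                    if prob > score[begin, end, a]:
                        score[begin, end, a] = prob
                        back[begin, end, a] = (split, b_sym, c_sym)
            close_unaries(begin, end)
    return score, back

score, back = cky("fish people fish tanks".split())
print(score[0, 4, 'S'])   # expected: 0.00018522 (up to floating-point noise)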

Evaluating constituency parsing

Gold standard brackets: S-(0:11), NP-(0:2), VP-(2:9), VP-(3:9), NP-(4:6), PP-(6:9), NP-(7:9), NP-(9:10)
Candidate brackets:     S-(0:11), NP-(0:2), VP-(2:10), VP-(3:10), NP-(4:6), PP-(6:10), NP-(7:10)

Labeled Precision: 3/7 = 42.9%
Labeled Recall:    3/8 = 37.5%
LP/LR F1:          40.0%
Tagging Accuracy:  11/11 = 100.0%
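
As a check on the arithmetic above, here is a small sketch in Python that scores labeled brackets; the bracket sets are the ones from this example and the function name is mine.

def bracket_prf(gold, cand):
    # gold, cand: sets of (label, start, end) brackets
    matched = len(gold & cand)
    p = matched / len(cand)
    r = matched / len(gold)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

gold = {('S', 0, 11), ('NP', 0, 2), ('VP', 2, 9), ('VP', 3, 9),
        ('NP', 4, 6), ('PP', 6, 9), ('NP', 7, 9), ('NP', 9, 10)}
cand = {('S', 0, 11), ('NP', 0, 2), ('VP', 2, 10), ('VP', 3, 10),
        ('NP', 4, 6), ('PP', 6, 10), ('NP', 7, 10)}

p, r, f1 = bracket_prf(gold, cand)
print(f"LP={p:.1%}  LR={r:.1%}  F1={f1:.1%}")   # LP=42.9%  LR=37.5%  F1=40.0%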

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1
• Robust
  • Usually admit everything, but with low probability
• Partial solution for grammar ambiguity
  • A PCFG gives some idea of the plausibility of a parse
  • But not so good, because the independence assumptions are too strong
• Give a probabilistic language model
  • But in the simple case it performs worse than a trigram model
• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

Weaknesses
• Lack of sensitivity to structural frequencies
• Lack of sensitivity to lexical information
• (A word is independent of the rest of the tree given its POS)

A Case of PP Attachment Ambiguity

A Case of Coordination Ambiguity

Structural Preferences: Close Attachment

Structural Preferences: Close Attachment
• Example: John was believed to have been shot by Bill
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing)
• The two analyses therefore receive the same probability
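
The tie can be stated directly: under a PCFG, a tree's probability depends only on how many times each rule is used, so in standard notation

P(t) = \prod_{r \in R} P(r)^{\mathrm{count}(r,\, t)}

and since the two attachment analyses use exactly the same multiset of rules, P(t_low) = P(t_high); nothing in the model can prefer one over the other.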

PCFGs and Independence
• The symbols in a PCFG define independence assumptions
• At any node, the material inside that node is independent of the material outside that node, given the label of that node
• Any information that statistically connects behavior inside and outside a node must flow through that node's label
(figure: an S → NP VP tree with the subject NP expanded by NP → DT NN)

Non-Independence I
• The independence assumptions of a PCFG are often too strong
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e. subjects vs. objects)

Expansion    All NPs    NPs under S    NPs under VP
NP PP        11%        9%             23%
DT NN        9%         9%             7%
PRP          6%         21%            4%

Non-Independence II
• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong
(example figure not preserved; caption: in the PTB this construction is for possessives)

Advanced Unlexicalized Parsing

Horizontal Markovization
• Horizontal Markovization merges states

(two plots, not preserved: parsing F1 (roughly 70-74) and number of grammar symbols (0-12,000), each plotted against horizontal Markov order 0, 1, 2v, 2, inf)

Vertical Markovization
• Vertical Markov order: rewrites depend on past k ancestor nodes
  (i.e. parent annotation)
• Order 1 vs. Order 2 (example trees not preserved)

(two plots, not preserved: parsing F1 (roughly 72-79) and number of grammar symbols (0-25,000), each plotted against vertical Markov order 1, 2v, 2, 3v, 3)

Model     F1     Size
v=h=2v    77.8   7.5K
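
To make the two kinds of Markovization concrete, here is a small illustrative Python sketch using a toy tree representation of my own (not the tree format used in the experiments). Vertical order 2 is plain parent annotation; horizontal order 1 means the intermediate symbols introduced during binarization remember only the previous sibling, which is what merges states.

def parent_annotate(tree, parent=None):
    # Vertical Markov order 2: every nonterminal also records its parent.
    if isinstance(tree, str):                     # a word
        return tree
    label, kids = tree
    new_label = label if parent is None else f"{label}^{parent}"
    return (new_label, [parent_annotate(k, label) for k in kids])

print(parent_annotate(("S", [("NP", ["he"]), ("VP", [("V", ["fish"])])])))
# ('S', [('NP^S', ['he']), ('VP^S', [('V^VP', ['fish'])])])

def binarize(label, children, horz=1):
    # Right-factor an n-ary rule into binary rules; each intermediate symbol
    # remembers only the last `horz` siblings already generated.
    if len(children) <= 2:
        return [(label, tuple(children))]
    rules = []
    prev = label
    generated = []
    for child in children[:-2]:
        generated.append(child)
        memory = generated[-horz:] if horz else []
        intermediate = f"@{label}|<{'-'.join(memory)}>"
        rules.append((prev, (child, intermediate)))
        prev = intermediate
    rules.append((prev, tuple(children[-2:])))
    return rules

for rule in binarize("NP", ["DT", "JJ", "JJ", "NN"], horz=1):
    print(rule)
# ('NP', ('DT', '@NP|<DT>'))
# ('@NP|<DT>', ('JJ', '@NP|<JJ>'))
# ('@NP|<JJ>', ('JJ', 'NN'))

With horz=2 the second intermediate symbol would instead be @NP|<DT-JJ>, so fewer states are shared; with horz=0 all intermediate symbols collapse to @NP|<>, the most aggressive merging.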

Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
• Solution: mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits
• Problem: Treebank tags are too coarse
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after) and true prepositions (in, of, to) are all tagged IN
• Partial Solution:
  • Subdivide the IN tag

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

(cumulative results, one row per split in the order listed above)
F1     Size
80.4   8.1K
80.5   8.1K
81.2   8.5K
81.6   9.0K
81.7   9.1K
81.8   9.3K

Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights
• Solution: mark a property of higher or lower sites:
  • Contains a verb
  • Is (non)-recursive
  • Base NPs [cf. Collins 99]
  • Right-recursive NPs

Annotation     F1     Size
Previous       85.7   10.5K
BASE-NP        86.0   11.7K
DOMINATES-V    86.9   14.1K
RIGHT-REC-NP   87.0   15.2K

(figure not preserved: attachment example with NP, VP, PP nodes marked v / -v for whether they dominate a verb)

A Fully Annotated Tree

Final Test Set Results
• Beats "first generation" lexicalized parsers

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Lexicalised PCFGs

Heads in Context-Free Rules

Heads

Rules to Recover Heads: An Example for NPs

Rules to Recover Heads: An Example for VPs

Adding Headwords to Trees

Adding Headwords to Trees
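
The actual head tables are in the figures on these slides (and in Collins' thesis); purely as an illustration of how they are applied, here is a hedged Python sketch with simplified, made-up priority lists. Each category scans its children in a direction, takes the first child whose label is on its priority list, and percolates that child's head word upward.

# Simplified stand-in head rules, NOT the real Collins (1999) tables:
# category -> (scan direction, priority list of child labels)
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NNP", "NP", "JJ"]),
    "VP": ("left",  ["VBD", "VBZ", "VBP", "VB", "VP"]),
    "PP": ("left",  ["IN", "TO"]),
    "S":  ("left",  ["VP", "S"]),
}

def find_head(parent, child_labels):
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    order = list(range(len(child_labels)))
    if direction == "right":
        order.reverse()
    for wanted in priorities:          # priority list first, then position
        for i in order:
            if child_labels[i] == wanted:
                return i
    return order[0]                    # fallback: leftmost (or rightmost) child

def lexicalize(tree):
    # tree = (label, [children]) or (tag, word) at preterminals;
    # returns (label, headword, annotated children)
    label, kids = tree
    if isinstance(kids, str):          # preterminal: the word is its own head
        return (label, kids, kids)
    annotated = [lexicalize(k) for k in kids]
    h = find_head(label, [k[0] for k in annotated])
    return (label, annotated[h][1], annotated)

t = ("S", [("NP", [("DT", "the"), ("NN", "boy")]),
           ("VP", [("VBD", "saw"), ("NP", [("PRP", "her")])])])
print(lexicalize(t)[1])   # 'saw'  (the S node is headed by the VP's verb)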

Lexicalized CFGs in Chomsky Normal Form

Example

Lexicalized CKY

(figure: X[h] spanning [i, j] built from Y[h] over [i, k] and Z[h'] over [k, j]; e.g. (VP → VBD[saw] NP[her]) yields (VP → VBD NP)[saw])

bestScore(X, i, j, h)
  if (j = i)
    return score(X, s[i])
  else
    return the max, over split points k and rules X -> Y Z, of
      score(X[h] -> Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)
      score(X[h] -> Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)

(here w ranges over the candidate head words of the non-head child)

Parsing with Lexicalized CFGs

Pruning with Beams
• The Collins parser prunes with per-cell beams [Collins 99]
  • Essentially run the O(n^5) CKY
  • Remember only a few hypotheses for each span <i, j>
  • If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  • Keeps things more or less cubic
• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
(figure: X[h] over [i, j] built from Y[h] over [i, k] and Z[h'] over [k, j])
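
A small sketch of the per-cell beam itself (the data layout and names are mine): after a span is filled, keep only its K best labeled hypotheses before it is used to build larger spans. Keeping K hypotheses per child cell is what bounds the work per span at O(nK^2).

import heapq

def prune_cell(cell, K=10):
    # cell: dict mapping (label, headword) -> best score for this span.
    # Keep only the K highest-scoring hypotheses (a per-cell beam).
    if len(cell) <= K:
        return cell
    return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

# Example: a crowded span shrinks to its 2 best hypotheses.
span_cell = {("NP", "boy"): 1e-4, ("NP", "the"): 1e-7,
             ("S", "saw"): 3e-5, ("VP", "saw"): 2e-9}
print(prune_cell(span_cell, K=2))   # {('NP', 'boy'): 0.0001, ('S', 'saw'): 3e-05}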

Parameter Estimation

A Model from Charniak (1997)

A Model from Charniak (1997)

Other Details

Final Test Set Results

Parser                LP     LR     F1
Magerman 95           84.9   84.6   84.7
Collins 96            86.3   85.8   86.0
Klein & Manning 03    86.9   85.7   86.3
Charniak 97           87.4   87.5   87.4
Collins 99            88.7   88.6   88.6

Analysis/Evaluation (Method 2)

Dependency Accuracies

Strengths and Weaknesses of Modern Parsers

Modern Parsers

The Game of Designing a Grammar

Annotation refines base treebank symbols to improve statistical fit of the grammar:
• Parent annotation [Johnson '98]
• Head lexicalization [Collins '99, Charniak '00]
• Automatic clustering

Manual Splits
• Manually split categories
  • NP: subject vs. object
  • DT: determiners vs. demonstratives
  • IN: sentential vs. prepositional
• Advantages
  • Fairly compact grammar
  • Linguistic motivations
• Disadvantages
  • Performance leveled out
  • Manually annotated

Learning Latent Annotations

Latent Annotations:
• Brackets are known
• Base categories are known
• Hidden variables for subcategories

(figure: the tree for "He was right" with latent subcategory variables X1 ... X7 at its nodes)

Can learn with EM, like Forward-Backward for HMMs (Forward ~ Outside, Backward ~ Inside)
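
For reference, the Inside/Outside quantities alluded to here are the tree analogues of Backward/Forward. In standard PCFG notation (my notation; for latent annotations the same recursions run over the subcategory symbols), with spans written between fence-posts i and j:

\beta_A(i, j) = P(A \Rightarrow^* w_{i+1} \cdots w_j) = \sum_{A \to B\,C} \sum_{k=i+1}^{j-1} P(A \to B\,C)\, \beta_B(i, k)\, \beta_C(k, j), \qquad \beta_A(i, i+1) = P(A \to w_{i+1})

\alpha_A(i, j) = \sum_{B \to C\,A} \sum_{k < i} P(B \to C\,A)\, \beta_C(k, i)\, \alpha_B(k, j) \;+\; \sum_{B \to A\,C} \sum_{k > j} P(B \to A\,C)\, \beta_C(j, k)\, \alpha_B(i, k), \qquad \alpha_S(0, n) = 1

The E-step's expected rule counts are proportional to \alpha_A(i,j)\, P(A \to B\,C)\, \beta_B(i,k)\, \beta_C(k,j), summed over spans and split points and normalized by the sentence probability \beta_S(0,n).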

Automatic Annotation Induction
• Advantages
  • Automatically learned:
    label all nodes with latent variables,
    same number k of subcategories for all categories
• Disadvantages
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag
(figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4)

Hierarchical refinement: repeatedly learn more fine-grained subcategories
• start with two (per non-terminal), then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting
• Want to split complex categories more
• Idea: split everything, roll back the splits which were least useful
[Petrov et al. 06]

Adaptive Splitting
• Evaluate the loss in likelihood from removing each split: the data likelihood with the split reversed, relative to the data likelihood with the split (see the formula below)
• No loss in accuracy when 50% of the splits are reversed
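
Written out, the quantity in the bullet above is a likelihood ratio for undoing one split:

\Delta_{\text{split}} \;=\; \frac{P(\text{data} \mid \text{grammar with the split reversed})}{P(\text{data} \mid \text{grammar with the split})}

Splits whose ratio is close to 1 contribute almost nothing, and merging back the least useful half of them costs no accuracy, as the next slide's numbers show.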

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. 06          90.2              89.7

Hierarchical Pruning

coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  … (and so on) …

Parse multiple times with grammars at different levels of granularity; a sketch of the pruning step follows.
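
A small hedged sketch of the pruning step that connects the passes (symbol names, the projection function, and the threshold are illustrative assumptions, not the actual parser's code): an item of the finer grammar is only allowed if its projection into the previous, coarser grammar had a bracket posterior above a threshold.

def prune_with_coarse_posteriors(fine_items, coarse_posteriors, project, threshold=1e-4):
    # fine_items: (begin, end, fine_symbol) items the finer grammar could build
    # coarse_posteriors: {(begin, end, coarse_symbol): posterior from the coarser pass}
    # project: maps a fine symbol (e.g. "NP3") to its coarser symbol (e.g. "NP")
    return [(b, e, s) for (b, e, s) in fine_items
            if coarse_posteriors.get((b, e, project(s)), 0.0) >= threshold]

# Toy example: NP3 over span (0, 2) survives only because its projection NP did.
coarse = {(0, 2, "NP"): 0.8, (1, 3, "VP"): 1e-7}
fine = [(0, 2, "NP3"), (1, 3, "VP2")]
print(prune_with_coarse_posteriors(fine, coarse, project=lambda s: s.rstrip("0123456789")))
# [(0, 2, 'NP3')]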

Bracket Posteriors

Parsing times: 1621 min → 111 min → 35 min → 15 min [91.2 F1] (no search error)

Page 72: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06

N people 05V people 01

N fish 02V fish 06

N tanks 02V tanks 01

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

if score[i][i+1][B] gt 0 ampamp A-gtB in grammar

prob = P(A-gtB)score[i][i+1][B]

if(prob gt score[i][i+1][A])

score[i][i+1][A] = prob

back[i][i+1][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 73: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if (prob gt score[begin][end][A])

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 74: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S NP VP000126

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S NP VP000378

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

handle unaries

boolean added = true

while added

added = false

for A B in nonterms

prob = P(A-gtB)score[begin][end][B]

if prob gt score[begin][end][A]

score[begin][end][A] = prob

back[begin][end][A] = B

added = true

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 75: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs?

• Penn WSJ parsing accuracy: about 73% LP/LR F1

• Robust

• Usually admit everything, but with low probability

• Partial solution for grammar ambiguity

• A PCFG gives some idea of the plausibility of a parse

• But not so good, because the independence assumptions are too strong

• Gives a probabilistic language model

• But in the simple case it performs worse than a trigram model

• The problem seems to be that PCFGs lack the lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences: Close Attachment

93

Structural Preferences: Close Attachment

bull Example John was believed to have been shot by Bill

• The low attachment analysis (Bill does the shooting) contains the same rules as the high attachment analysis (Bill does the believing)

• The two analyses therefore receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

[Figure: a tree with an NP node under S and VP; the rules S → NP VP and NP → DT NN apply above and inside it, and only the NP label links the material inside the node to the material outside.]

Non-Independence I

• The independence assumptions of a PCFG are often too strong

• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects)

[Figure: bar charts of how NPs expand (NP PP vs. DT NN vs. PRP) for all NPs, for NPs under S, and for NPs under VP; subject NPs (under S) are far more often pronouns, while NPs under VP far more often expand with a PP.]

Non-Independence II

• Symptoms of overly strong assumptions:

• Rewrites get used where they don't belong

• (In the PTB, this construction is used for possessives)

Advanced Unlexicalized Parsing

99

Horizontal Markovization

• Horizontal Markovization merges states

[Figure: parsing accuracy (roughly 70–74) and number of grammar symbols (roughly 0–12,000) plotted against horizontal Markov order (0, 1, 2v, 2, ∞).]

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation)

[Figure: example trees for order 1 vs. order 2 (parent-annotated), plus plots of parsing accuracy (roughly 72–79) and number of grammar symbols (0–25,000) against vertical Markov order (1, 2v, 2, 3v, 3).]

Model F1 Size

v=h=2v 77.8 7.5K
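To make parent annotation concrete, here is a small sketch (the tuple tree encoding and the function name are illustrative assumptions, not from the slides); in practice POS tags are often handled separately (cf. TAG-PA below).

# Vertical Markov order 2: re-label every non-terminal with its parent's label.
def parent_annotate(tree, parent=None):
    label, *children = tree                      # tree = (label, child1, child2, ...)
    new_label = f"{label}^{parent}" if parent is not None else label
    new_children = [child if isinstance(child, str) else parent_annotate(child, label)
                    for child in children]
    return (new_label, *new_children)

tree = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(tree))
# ('S', ('NP^S', ('PRP^NP', 'He')), ('VP^S', ('VBD^VP', 'was'), ('ADJP^VP', ('JJ^ADJP', 'right'))))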

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 77.8 7.5K

UNARY 78.3 8.0K

Solution: mark unary rewrite sites with -U

Tag Splits

• Problem: Treebank tags are too coarse

• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN

• Partial solution: subdivide the IN tag

Annotation F1 Size

Previous 78.3 8.0K

SPLIT-IN 80.3 8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")

• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")

• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)

• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]

• SPLIT-CC: separate "but" and "&" from other conjunctions

• SPLIT-%: "%" gets its own tag

Annotation F1 Size

UNARY-DT 80.4 8.1K

UNARY-RB 80.5 8.1K

TAG-PA 81.2 8.5K

SPLIT-AUX 81.6 9.0K

SPLIT-CC 81.7 9.1K

SPLIT-% 81.8 9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield

• Examples:

• Possessive NPs

• Finite vs. infinite VPs

• Lexical heads

• Solution: annotate future elements into nodes

Annotation F1 Size

Tag splits 82.3 9.7K

POSS-NP 83.1 9.8K

SPLIT-VP 85.7 10.5K

Distance / Recursion Splits

• Problem: vanilla PCFGs cannot distinguish attachment heights

• Solution: mark a property of higher or lower sites:

• Contains a verb

• Is (non)-recursive

• Base NPs [cf. Collins 99]

• Right-recursive NPs

Annotation F1 Size

Previous 85.7 10.5K

BASE-NP 86.0 11.7K

DOMINATES-V 86.9 14.1K

RIGHT-REC-NP 87.0 15.2K

[Figure: tree fragment with NP, VP, and PP attachment sites marked v / -v according to whether they dominate a verb.]

A Fully Annotated Tree

Final Test Set Results

• Beats "first generation" lexicalized parsers

Parser LP LR F1

Magerman 95 84.9 84.6 84.7

Collins 96 86.3 85.8 86.0

Klein & Manning 03 86.9 85.7 86.3

Charniak 97 87.4 87.5 87.4

Collins 99 88.7 88.6 88.6

Lexicalised PCFGs

109

Heads in Context-Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113
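The head rules for NPs and VPs are table lookups; as a sketch of the mechanism: for each parent category, scan the children for the first label on a priority list, in a fixed direction. The lists below are simplified illustrative fragments, not the full Collins (1999) table.

HEAD_RULES = {
    # parent: (scan direction, labels in priority order) -- illustrative fragments only
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NNPS", "NP", "JJ"]),
    "VP": ("left-to-right", ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
    "PP": ("left-to-right", ["IN", "TO"]),
    "S":  ("left-to-right", ["VP", "S"]),
}

def find_head_child(parent, children):
    """Return the index of the head child, given the list of child labels."""
    direction, priorities = HEAD_RULES.get(parent, ("left-to-right", []))
    order = list(range(len(children)))
    if direction == "right-to-left":
        order.reverse()
    for label in priorities:                    # first priority label found wins
        for i in order:
            if children[i] == label:
                return i
    return order[0]                             # default: first child in scan order

print(find_head_child("NP", ["DT", "JJ", "NN"]))    # 2  (the NN heads the NP)
print(find_head_child("VP", ["VBD", "NP", "PP"]))   # 0  (the VBD heads the VP)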

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j), where h and h' are the head positions.]

(VP → VBD[saw] NP[her])    (VP → VBD NP)[saw]

bestScore(X, i, j, h)

  if (j == i)

    return score(X → s[i])

  else

    return the max, over split points k, other head words w, and rules X → Y Z, of:

      score(X[h] → Y[h] Z[w]) * bestScore(Y, i, k, h) * bestScore(Z, k, j, w)    (head from the left child)

      score(X[h] → Y[w] Z[h]) * bestScore(Y, i, k, w) * bestScore(Z, k, j, h)    (head from the right child)

Parsing with Lexicalized CFGs

119

Pruning with Beams

• The Collins parser prunes with per-cell beams [Collins 99]

• Essentially, run the O(n^5) CKY

• Remember only a few hypotheses for each span <i, j>

• If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)

• Keeps things more or less cubic

• Also, certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

A small sketch of the per-cell beam idea follows.
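This is illustrative only; the actual Collins parser also tracks head positions and applies the punctuation constraints mentioned above.

import heapq

K = 10  # beam size per span

def prune_cell(hypotheses, k=K):
    """hypotheses: dict (label, head) -> score; keep only the k best entries."""
    return dict(heapq.nlargest(k, hypotheses.items(), key=lambda item: item[1]))

def combine(left_cell, right_cell, rule_scores):
    """Combine two pruned cells: at most K * K candidate pairs per split point."""
    out = {}
    for (ylabel, yhead), yscore in left_cell.items():
        for (zlabel, zhead), zscore in right_cell.items():
            # rule_scores yields (parent label, chosen head, rule probability) triples
            for xlabel, head, p in rule_scores(ylabel, yhead, zlabel, zhead):
                cand = p * yscore * zscore
                if cand > out.get((xlabel, head), 0.0):
                    out[(xlabel, head)] = cand
    return prune_cell(out)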

[Diagram: X[h] over span (i, j) built from Y[h] over (i, k) and Z[h'] over (k, j).]

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 84.9 84.6 84.7

Collins 96 86.3 85.8 86.0

Klein & Manning 03 86.9 85.7 86.3

Charniak 97 87.4 87.5 87.4

Collins 99 88.7 88.6 88.6

Analysis / Evaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to improve the statistical fit of the grammar:

Parent annotation [Johnson '98]

Head lexicalization [Collins '99, Charniak '00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

• Manually split categories:

• NP: subject vs. object

• DT: determiners vs. demonstratives

• IN: sentential vs. prepositional

• Advantages:

• Fairly compact grammar

• Linguistic motivations

• Disadvantages:

• Performance leveled out

• Manually annotated

Learning Latent Annotations

• Brackets are known

• Base categories are known

• Hidden variables for subcategories

[Figure: the parse of "He was right" with each node carrying a latent subcategory variable X1 ... X7.]

• Can learn with EM, like Forward-Backward for HMMs (here Forward/Backward become Outside/Inside)

Automatic Annotation Induction

bull Advantages

bull Automatically learned

• Label all nodes with latent variables

• Same number k of subcategories for all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein & Manning '03 86.3

Matsuzaki et al. '05 86.7

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea: split everything, then roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 88.4

With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser F1 (≤ 40 words) F1 (all words)

Klein & Manning '03 86.3 85.7

Matsuzaki et al. '05 86.7 86.1

Collins '99 88.6 88.2

Charniak & Johnson '05 90.1 89.6

Petrov et al. 06 90.2 89.7

Hierarchical Pruning

coarse: … QP NP VP …

split in two: … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four: … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight: … (and so on) …

Parse multiple times with grammars at different levels of granularity
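One way to picture the coarse-to-fine idea in code (a conceptual sketch with hypothetical data structures and threshold, not the actual Berkeley-parser implementation): items that the coarser grammar finds implausible are never built with the finer grammar.

THRESHOLD = 1e-4

def coarse_symbol(fine_label):
    # map a fine symbol to its coarser "parent" symbol, e.g. NP3 -> NP
    return fine_label.rstrip("0123456789")

def allowed(span, fine_label, coarse_posteriors):
    """coarse_posteriors: dict (span, coarse_label) -> posterior from the previous pass."""
    return coarse_posteriors.get((span, coarse_symbol(fine_label)), 0.0) >= THRESHOLD

# During the fine pass, CKY simply skips any (span, label) for which allowed(...) is
# False, so each successively split grammar only explores items the coarser grammar
# found plausible.
coarse_posteriors = {((0, 2), "NP"): 0.7, ((0, 2), "VP"): 2e-6}
print(allowed((0, 2), "NP3", coarse_posteriors))  # True
print(allowed((0, 2), "VP1", coarse_posteriors))  # False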

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Page 76: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 77: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

for split = begin+1 to end-1

for ABC in nonterms

prob=score[begin][split][B]score[split][end][C]P(A-gtBC)

if prob gt score[begin][end][A]

score[begin]end][A] = prob

back[begin][end][A] = new Triple(splitBC)

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 78: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

N fish 02V fish 06NP N 014VP V 006S VP 0006

N people 05V people 01NP N 035VP V 001S VP 0001

N fish 02V fish 06NP N 014VP V 006S VP 0006

N tanks 02V tanks 01NP N 014VP V 003S VP 0003

NP NP NP00049

VP V NP0105

S VP00105

NP NP NP00049

VP V NP0007

S NP VP00189

NP NP NP000196

VP V NP0042

S VP00042

NP NP NP00000686

VP V NP000147

S NP VP0000882

NP NP NP00000686

VP V NP0000098

S NP VP001323

NP NP NP

00000009604VP V NP

000002058S NP VP

000018522

0

1

2

3

4

1 2 3 4fish people fish tanksS NP VP 09

S VP 01

VP V NP 05

VP V 01

VP V VP_V 03

VP V PP 01

VP_V NP PP 10

NP NP NP 01

NP NP PP 02

NP N 07

PP P NP 10

N people 05

N fish 02

N tanks02

N rods 01

V people 01

V fish 06

V tanks 03

P with 10

Call buildTree(score back) to get the best parse

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 79: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Evaluating constituency parsing

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 80: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Evaluating constituency parsing

Gold standard brackets S-(011) NP-(02) VP-(29) VP-(39) NP-(46) PP-(6-9) NP-(79) NP-(910)

Candidate brackets S-(011) NP-(02) VP-(210) VP-(310) NP-(46) PP-(6-10) NP-(710)

Labeled Precision 37 = 429

Labeled Recall 38 = 375

LPLR F1 400

Tagging Accuracy 1111 = 1000

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea: split everything, roll back the splits which were least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate the loss in likelihood from removing each split:

loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)

bull No loss in accuracy when 50% of the splits are reversed
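Put together with the hierarchical refinement above, one round of a split-merge procedure might look like the following sketch; split_in_two, run_em, likelihood_loss and merge are hypothetical stand-ins for the real grammar operations:

    def split_merge_round(grammar, treebank, split_in_two, run_em, likelihood_loss, merge):
        # Split every subcategory in two, re-fit with EM (warm-started from the
        # previous grammar), then roll back the half of the new splits whose
        # reversal costs the least data likelihood.
        grammar = split_in_two(grammar)
        grammar = run_em(grammar, treebank)
        losses = likelihood_loss(grammar, treebank)    # split -> loss if reversed
        least_useful = sorted(losses, key=losses.get)[: len(losses) // 2]
        return merge(grammar, least_useful)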

Adaptive Splitting Results

Model F1

Previous 88.4
With 50% Merging 89.5

Number of Phrasal Subcategories

Final Results

Parser  F1 (≤ 40 words)  F1 (all words)
Klein & Manning '03  86.3  85.7
Matsuzaki et al '05  86.7  86.1
Collins '99  88.6  88.2
Charniak & Johnson '05  90.1  89.6
Petrov et al 06  90.2  89.7

Hierarchical Pruning

coarse:        … QP NP VP …

split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …

split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …

split in eight: … …

Parse multiple times with grammars at different levels of granularity
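A sketch of the pruning decision between granularity levels (the threshold is illustrative): after parsing with a coarser grammar, a (span, symbol) pair survives only if its posterior clears a threshold, and the next, finer pass may build only the refined versions of the surviving symbols over each span.

    def allowed_after_coarse_pass(posteriors, threshold=1e-4):
        # posteriors maps (i, j, coarse symbol) -> posterior probability from the
        # coarser pass; the finer pass is restricted to the surviving items.
        return {(i, j, sym) for (i, j, sym), p in posteriors.items() if p >= threshold}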

Bracket Posteriors

1621 min

111 min

35 min

15 min [91.2 F1] (no search error)

Page 81: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

How good are PCFGs

bull Penn WSJ parsing accuracy about 73 LPLR F1

bull Robust

bull Usually admit everything but with low probability

bull Partial solution for grammar ambiguity

bull A PCFG gives some idea of the plausibility of a parse

bull But not so good because the independence assumptions are

too strong

bull Give a probabilistic language model

bull But in the simple case it performs worse than a trigram model

bull The problem seems to be that PCFGs lack the

lexicalization of a trigram model

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 82: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Weaknesses of PCFGs

87

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 83: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Weaknesses

bull Lack of sensitivity to structural frequencies

bull Lack of sensitivity to lexical information

bull (A word is independent of the rest of the tree given its POS)

88

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 84: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Case of PP Attachment Ambiguity

89

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 85: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

90

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 86: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Case of Coordination Ambiguity

91

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 87: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

92

Structural Preferences Close Attachment

93

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

(figure: the DT tag split into subcategories DT-1, DT-2, DT-3, DT-4)

Hierarchical refinement: repeatedly learn more fine-grained subcategories:
• start with two subcategories per non-terminal, then keep splitting
• initialize each EM run with the output of the last

Adaptive Splitting [Petrov et al. 06]

• Want to split complex categories more.
• Idea: split everything, then roll back the splits which were least useful.

Adaptive Splitting

• Evaluate the loss in likelihood from removing each split:

    loss(split) = (data likelihood with the split reversed) / (data likelihood with the split)

• No loss in accuracy when 50% of the splits are reversed (a sketch of the full training loop follows).
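The overall split/EM/merge training loop can be sketched as follows; every helper name here is a hypothetical placeholder rather than a real API, and the exact procedure is in Petrov et al. 06:

    def train_latent_grammar(treebank, rounds=6):
        grammar = one_subcategory_per_symbol(treebank)
        for _ in range(rounds):
            grammar = split_each_symbol_in_two(grammar, jitter=0.01)   # break symmetry with small noise
            grammar = run_em(grammar, treebank)                        # fit the latent subcategories
            grammar = merge_least_useful_splits(grammar, treebank, fraction=0.5)
        return grammar

The merge step uses exactly the likelihood-loss criterion above to decide which half of the splits to reverse.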

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                    F1 (<= 40 words)   F1 (all words)
Klein & Manning '03       86.3               85.7
Matsuzaki et al. '05      86.7               86.1
Collins '99               88.6               88.2
Charniak & Johnson '05    90.1               89.6
Petrov et al. 06          90.2               89.7

Hierarchical Pruning

Parse multiple times, with grammars at different levels of granularity:

coarse:           ... QP  NP  VP ...
split in two:     ... QP1 QP2   NP1 NP2   VP1 VP2 ...
split in four:    ... QP1 QP2 QP3 QP4   NP1 NP2 NP3 NP4   VP1 VP2 VP3 VP4 ...
split in eight:   ... ... ...
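A sketch of the coarse-to-fine idea; posterior() is a hypothetical helper standing in for an inside-outside pass that respects the current pruning constraints, not a real API:

    def coarse_to_fine_chart(sentence, grammars, threshold=1e-4):
        """grammars: ordered from coarsest to finest."""
        allowed = None                                   # no constraints at the coarsest level
        for g in grammars:
            post = posterior(g, sentence, allowed)       # bracket posteriors under grammar g
            allowed = {(i, j, X) for (i, j, X), p in post.items() if p > threshold}
            # the next, finer grammar only builds items whose projection
            # back to this level is still in `allowed`
        return allowed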

Bracket Posteriors

(figure; parsing times at successive levels of the coarse-to-fine hierarchy: 1621 min, 111 min, 35 min, 15 min [91.2 F1], with no search error)

Structural Preferences: Close Attachment

• Example: "John was believed to have been shot by Bill."
• The low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.

PCFGs and Independence

• The symbols in a PCFG define independence assumptions:
  • At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  • Any information that statistically connects behavior inside and outside a node must flow through that node's label.

(figure: an S expanding as S -> NP VP, with the NP expanding as NP -> DT NN)
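Equivalently, in symbols (a standard identity rather than anything specific to these slides):

\[
P(T) \;=\; \prod_{(A \to \beta)\,\in\,T} P(A \to \beta \mid A),
\]

so nothing outside an NP can influence how that NP expands unless the information is encoded in the NP's label itself, which is exactly what the annotations in the following slides do.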

Non-Independence I

• The independence assumptions of a PCFG are often too strong.
• Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects):

                 NP -> NP PP   NP -> DT NN   NP -> PRP
  All NPs        11%           9%            6%
  NPs under S     9%           9%           21%
  NPs under VP   23%           7%            4%

Non-Independence II

• Symptoms of overly strong assumptions:
  • Rewrites get used where they don't belong.

(figure: in the PTB, this construction is for possessives)

Advanced Unlexicalized Parsing


Horizontal Markovization

• Horizontal Markovization merges states: the symbols introduced by binarization remember only a limited window of the siblings generated so far, rather than the whole right-hand side.

(figures: F1 stays in roughly the 70-74 range while the number of grammar symbols ranges from about 3,000 to 12,000 as the horizontal Markov order varies over 0, 1, 2v, 2, inf)

Vertical Markovization

• Vertical Markov order: rewrites depend on the past k ancestor nodes (i.e., parent annotation).

(figures: Order 1 vs. Order 2 example trees; F1 ranges roughly 72-79 while the number of grammar symbols grows to about 25,000 as the vertical Markov order varies over 1, 2v, 2, 3v, 3)

Model     F1     Size
v=h=2v    77.8   7.5K
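A minimal sketch of the order-2 case (parent annotation) applied to treebank trees, assuming trees are nested (label, children...) tuples; this is illustrative and not the exact preprocessing behind the reported numbers:

    def parent_annotate(tree, parent=None):
        """Append the parent's label to every non-terminal (vertical Markov order 2)."""
        label, children = tree[0], tree[1:]
        new_label = f"{label}^{parent}" if parent is not None else label
        new_children = [child if isinstance(child, str) else parent_annotate(child, label)
                        for child in children]
        return (new_label, *new_children)

    t = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
    print(parent_annotate(t))
    # ('S', ('NP^S', ('PRP^NP', 'He')), ('VP^S', ('VBD^VP', 'was'), ('ADJP^VP', ('JJ^ADJP', 'right'))))

Reading rules off the annotated trees then makes each rewrite depend on its parent category, at the cost of a larger symbol set, as the figures above indicate.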

Unary Splits

• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K

Tag Splits

• Problem: Treebank tags are too coarse.
• Example: SBAR sentential complementizers (that, whether, if), subordinating conjunctions (while, after), and true prepositions (in, of, to) are all tagged IN.
• Partial solution: subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K

Other Tag Splits

• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
• SPLIT-CC: separate "but" and "&" from other conjunctions
• SPLIT-%: "%" gets its own tag

Annotation   F1     Size
UNARY-DT     80.4   8.1K
UNARY-RB     80.5   8.1K
TAG-PA       81.2   8.5K
SPLIT-AUX    81.6   9.0K
SPLIT-CC     81.7   9.1K
SPLIT-%      81.8   9.3K

Yield Splits

• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  • Possessive NPs
  • Finite vs. infinite VPs
  • Lexical heads
• Solution: annotate future elements into nodes.

Annotation   F1     Size
tag splits   82.3   9.7K
POSS-NP      83.1   9.8K
SPLIT-VP     85.7   10.5K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 89: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Structural Preferences Close Attachment

bull Example John was believed to have been shot by Bill

bull Low attachment analysis (Bill does the shooting) contains same rules as high attachment analysis (Bill does the believing)bull Two analyses receive the same probability

94

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 90: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

PCFGs and Independence

bull The symbols in a PCFG define independence assumptions

bull At any node the material inside that node is independent of the material outside that node given the label of that node

bull Any information that statistically connects behavior inside and outside a node must flow through that nodersquos label

NP

S

VP

S NP VP

NP DT NN

NP

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 91: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Non-Independence I

bull The independence assumptions of a PCFG are often too strong

bull Example the expansion of an NP is highly dependent on the parent of the NP (ie subjects vs objects)

119

6

NP PP DT NN PRP

9 9

21

NP PP DT NN PRP

74

23

NP PP DT NN PRP

All NPs NPs under S NPs under VP

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 92: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Non-Independence II

bull Symptoms of overly strong assumptionsbull Rewrites get used where they donrsquot belong

In the PTB this

construction is

for possessives

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 93: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Advanced Unlexicalized Parsing

99

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 94: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Horizontal Markovization

bull Horizontal Markovization Merges States

70

71

72

73

74

0 1 2v 2 inf

Horizontal Markov Order

0

3000

6000

9000

12000

0 1 2v 2 inf

Horizontal Markov Order

Sym

bo

ls

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 95: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Vertical Markovization

bull Vertical Markov order rewrites depend on past k ancestor nodes

(ie parent annotation)

Order 1 Order 2

7273747576777879

1 2v 2 3v 3

Vertical Markov Order

0

5000

10000

15000

20000

25000

1 2v 2 3v 3

Vertical Markov Order

Sym

bo

ls

Model F1 Size

v=h=2v 778 75K

Unary Splits

bull Problem unary rewrites are used to transmute categories so a high-probability rule can be used

Annotation F1 Size

Base 778 75K

UNARY 783 80K

Solution Mark unary rewrite sites with -U

Tag Splits

bull Problem Treebank tags are too coarse

bull Example SBAR sentential complementizers (that whether if) subordinating conjunctions (while after) and true prepositions (in of to) are all tagged IN

bull Partial Solutionbull Subdivide the IN tag

Annotation F1 Size

Previous 783 80K

SPLIT-IN 803 81K

Other Tag Splits

bull UNARY-DT mark demonstratives as DT^U (ldquothe Xrdquo vs ldquothoserdquo)

bull UNARY-RB mark phrasal adverbs as RB^U (ldquoquicklyrdquo vs ldquoveryrdquo)

bull TAG-PA mark tags with non-canonical parents (ldquonotrdquo is an RB^VP)

bull SPLIT-AUX mark auxiliary verbs with ndashAUX [cf Charniak 97]

bull SPLIT-CC separate ldquobutrdquo and ldquoamprdquo from other conjunctions

bull SPLIT- ldquordquo gets its own tag

F1 Size

804 81K

805 81K

812 85K

816 90K

817 91K

818 93K

Yield Splits

bull Problem sometimes the behavior of a category depends on something inside its future yield

bull Examplesbull Possessive NPs

bull Finite vs infinite VPs

bull Lexical heads

bull Solution annotate future elements into nodes

Annotation F1 Size

tag splits 823 97K

POSS-NP 831 98K

SPLIT-VP 857 105K

Distance Recursion Splits

bull Problem vanilla PCFGs cannot distinguish attachment heights

bull Solution mark a property of higher or lower sitesbull Contains a verb

bull Is (non)-recursive

bull Base NPs [cf Collins 99]

bull Right-recursive NPs

Annotation F1 Size

Previous 857 105K

BASE-NP 860 117K

DOMINATES-V 869 141K

RIGHT-REC-NP 870 152K

NP

VP

PP

NP

v

-v

A Fully Annotated Tree

Final Test Set Results

bull Beats ldquofirst generationrdquo lexicalized parsers

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

Lexicalised PCFGs

109

Heads in Comtext Free Rules

110

Heads

111

Rules to Recover Heads An Example for NPs

112

Rules to Recover Heads An Example for VPs

113

Adding Headwords to Trees

114

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

• Advantages:
  • Automatically learned: label all nodes with latent variables, same number k of subcategories for all categories

• Disadvantages:
  • Grammar gets too large
  • Most categories are oversplit while others are undersplit

Model                  F1
Klein & Manning '03    86.3
Matsuzaki et al. '05   86.7

Refinement of the DT tag

[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical refinement: repeatedly learn more fine-grained subcategories:
  • start with two (per non-terminal), then keep splitting
  • initialize each EM run with the output of the last

[Figure: hierarchical binary splitting of the DT subcategories]

Adaptive Splitting [Petrov et al. 06]

• Want to split complex categories more
• Idea: split everything, roll back splits which were least useful

Adaptive Splitting

• Evaluate loss in likelihood from removing each split:

    loss = (data likelihood with split reversed) / (data likelihood with split)

• No loss in accuracy when 50% of the splits are reversed

Adaptive Splitting Results

Model              F1
Previous           88.4
With 50% Merging   89.5

Number of Phrasal Subcategories

Final Results

Parser                   F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03      86.3              85.7
Matsuzaki et al. '05     86.7              86.1
Collins '99              88.6              88.2
Charniak & Johnson '05   90.1              89.6
Petrov et al. 06         90.2              89.7

Hierarchical Pruning

Parse multiple times with grammars at different levels of granularity:

  coarse:          … QP  NP  VP …
  split in two:    … QP1 QP2  NP1 NP2  VP1 VP2 …
  split in four:   … QP1 QP2 QP3 QP4  NP1 NP2 NP3 NP4  VP1 VP2 VP3 VP4 …
  split in eight:  … … … … … … … … … … … … … … … … …
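
The pruning criterion behind this coarse-to-fine scheme, written as a sketch in standard inside/outside notation (my notation, not the slide's): an item A over span (i, j) is dropped from the finer pass when the posterior of its coarse projection π(A) falls below a threshold ε:

\frac{O_{\text{coarse}}(\pi(A), i, j)\; I_{\text{coarse}}(\pi(A), i, j)}{I_{\text{coarse}}(\mathrm{ROOT}, 0, n)} \;<\; \varepsilon
\quad\Longrightarrow\quad \text{prune } (A, i, j) \text{ at the finer level}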

Bracket Posteriors

1621 min
111 min
35 min
15 min [91.2 F1] (no search error)


Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 109: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Adding Headwords to Trees

115

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 110: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Lexicalized CFGs in Chomsky Normal Form

116

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 111: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Example

117

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 112: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Lexicalized CKY

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

(VP-gt VBD[saw] NP[her])

(VP-gtVBDNP)[saw]

bestScore(Xijh)

if (j = i)

return score(Xs[i])

else

return

max max score(X[h]-gtY[h]Z[w])

bestScore(Yikh)

bestScore(Zkjw)

max score(X[h]-gtY[w]Z[h])

bestScore(Yikw)

bestScore(Zkjh)

kh

X-gtYZ

kh

X-gtYZ

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 113: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Parsing with Lexicalized CFGs

119

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 114: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Pruning with Beams

bull The Collins parser prunes with per-cell beams [Collins 99]bull Essentially run the O(n5) CKY

bull Remember only a few hypotheses for each span ltijgt

bull If we keep K hypotheses at each span then we do at most O(nK2) work per span (why)

bull Keeps things more or less cubic

bull Also certain spans are forbidden entirely on the basis of punctuation (crucial for speed)

Y[h] Z[hrsquo]

X[h]

i h k hrsquo j

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 115: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Parameter Estimation

121

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 116: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Model from Charniak (1997)

122

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 117: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

A Model from Charniak (1997)

123

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 118: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Other Details

124

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 119: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Final Test Set Results

Parser LP LR F1

Magerman 95 849 846 847

Collins 96 863 858 860

Klein amp Manning 03 869 857 863

Charniak 97 874 875 874

Collins 99 887 886 886

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 120: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

AnalysisEvaluation (Method 2)

126

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 121: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Dependency Accuracies

127

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 122: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Strengths and Weaknesses of Modern Parsers

128

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)

Page 123: Statistical Natural Language Parsingmausam/courses/csl772/autumn2014/lectures/1… · Statistical parsing applications Statistical parsers are now robust and widely used in larger

Modern Parsers

129

Annotation refines base treebank symbols to

improve statistical fit of the grammar

Parent annotation [Johnson rsquo98]

Head lexicalization [Collins rsquo99 Charniak rsquo00]

Automatic clustering

The Game of Designing a Grammar

Manual Splits

bull Manually split categoriesbull NP subject vs object

bull DT determiners vs demonstratives

bull IN sentential vs prepositional

bull Advantagesbull Fairly compact grammar

bull Linguistic motivations

bull Disadvantagesbull Performance leveled out

bull Manually annotated

ForwardOutside

Learning Latent AnnotationsLatent Annotations

bull Brackets are known

bull Base categories are known

bull Hidden variables for subcategories

X1

X2 X7X4

X5 X6X3

He was right

Can learn with EM like Forward-Backward for HMMs BackwardInside

Automatic Annotation Induction

bull Advantages

bull Automatically learned

Label all nodes with latent variables

Same number k of subcategoriesfor all categories

bull Disadvantages

bull Grammar gets too large

bull Most categories are oversplit while others are undersplit

Model F1

Klein amp Manning rsquo03 863

Matsuzaki et al rsquo05 867

Refinement of the DT tag

DT

DT-1 DT-2 DT-3 DT-4

Hierarchical refinement Repeatedly learn more fine-grained subcategories

start with two (per non-terminal) then keep splitting

initialize each EM run with the output of the last

DT

Adaptive Splitting

Want to split complex categories more

Idea split everything roll back splits which were

least useful

[Petrov et al 06]

Adaptive Splitting

bull Evaluate loss in likelihood from removing each split =

Data likelihood with split reversed

Data likelihood with split

bull No loss in accuracy when 50 of the splits are reversed

Adaptive Splitting Results

Model F1

Previous 884

With 50 Merging 895

Number of Phrasal Subcategories

Final Results

F1

le 40 words

F1

all wordsParser

Klein amp Manning rsquo03 863 857

Matsuzaki et al rsquo05 867 861

Collins rsquo99 886 882

Charniak amp Johnson rsquo05 901 896

Petrov et al 06 902 897

Hierarchical Pruning

hellip QP NP VP hellipcoarse

split in two hellip QP1 QP2 NP1 NP2 VP1 VP2 hellip

hellip QP1 QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 hellipsplit in four

split in eight hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip hellip

Parse multiple times with grammars at different levels of granularity

Bracket Posteriors

1621 min

111 min

35 min

15 min [912 F1](no search error)
