
Log-Linear Models with Structured Outputs

(continued)

Introduction to Natural Language Processing
Computer Science 585 — Fall 2009

University of Massachusetts Amherst

David Smith

Overview

• What computations do we need?

• Smoothing log-linear models

• MEMMs vs. CRFs again

• Action-based parsing and dependency parsing

Recipe for Conditional Training of p(y | x)

1. Gather constraints/features from training data

2. Initialize all parameters to zero

3. Classify training data with current parameters; calculate expectations

4. Gradient is observed minus expected feature counts:
∂F/∂λ_k = Σ_i [ f_k(x_i, y_i) − Σ_y p(y | x_i, λ) f_k(x_i, y) ]

5. Take a step in the direction of the gradient

6. Repeat from 3 until convergence
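A minimal Python sketch of steps 3 and 4, assuming a candidate label set `labels` and a hypothetical feature function f(x, y) that returns a dict of feature counts; the gradient is observed minus expected counts.

```python
import math
from collections import defaultdict

def conditional_gradient(data, labels, f, weights):
    """Steps 3-4: classify the training data with the current weights and
    accumulate observed minus expected feature counts, i.e. the gradient
    of the conditional log-likelihood."""
    grad = defaultdict(float)
    for x, y_true in data:
        # unnormalized log-linear scores for every candidate label
        scores = {y: sum(weights.get(k, 0.0) * v for k, v in f(x, y).items())
                  for y in labels}
        z = sum(math.exp(s) for s in scores.values())
        for k, v in f(x, y_true).items():          # observed counts
            grad[k] += v
        for y, s in scores.items():                # expected counts
            p = math.exp(s) / z
            for k, v in f(x, y).items():
                grad[k] -= p * v
    return grad
```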


Where have we seen expected counts before?

EM!

Gradient-Based Training

• λ <- λ + rate * Gradient(F)

• After all training examples? (batch)

• After every example? (on-line)

• Use second derivative?

• A big field: numerical optimization
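The batch and on-line schedules above differ only in how often the weights are updated; a hedged sketch, reusing the hypothetical conditional_gradient helper from the earlier sketch:

```python
def train_batch(data, labels, f, weights, rate=0.1, epochs=20):
    # Batch: one gradient step per pass over all training examples.
    for _ in range(epochs):
        for k, g in conditional_gradient(data, labels, f, weights).items():
            weights[k] = weights.get(k, 0.0) + rate * g
    return weights

def train_online(data, labels, f, weights, rate=0.1, epochs=20):
    # On-line: a (noisier) gradient step after every single example.
    for _ in range(epochs):
        for example in data:
            for k, g in conditional_gradient([example], labels, f, weights).items():
                weights[k] = weights.get(k, 0.0) + rate * g
    return weights
```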

Overfitting

• If we have too many features, we can choose weights to model the training data perfectly.

• If we have a feature that only appears in spam training, not ham training, it will get weight ∞ to maximize p(spam | feature) at 1.

• These behaviors overfit the training data and will probably do poorly on test data.

Solutions to Overfitting

• Throw out rare features.

• Require every feature to occur > 4 times, and > 0 times with ling, and > 0 times with spam.

• Only keep, e.g., 1000 features.

• Add one at a time, always greedily picking the one that most improves performance on held-out data.

• Smooth the observed feature counts.

• Smooth the weights by using a prior.

• max p(λ | data) = max p(λ, data) = max p(λ) p(data | λ)

• decree p(λ) to be high when most weights are close to 0

Smoothing with Priors

• What if we had a prior expectation that parameter values wouldn't be very large?

• We could then balance evidence suggesting large (or infinite) parameters against our prior expectation.

• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite)

• We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

log P(y, λ | x)  =  log P(λ)  +  log P(y | x, λ)
  (posterior)        (prior)       (likelihood)

Parsing as Structured Prediction

Shift-reduce parsing

Stack               Input remaining     Action
()                  Book that flight    shift
(Book)              that flight         reduce, Verb → book   (Choice #1 of 2)
(Verb)              that flight         shift
(Verb that)         flight              reduce, Det → that
(Verb Det)          flight              shift
(Verb Det flight)                       reduce, Noun → flight
(Verb Det Noun)                         reduce, NOM → Noun
(Verb Det NOM)                          reduce, NP → Det NOM
(Verb NP)                               reduce, VP → Verb NP
(VP)                                    reduce, S → VP
(S)                                     SUCCESS!

Ambiguity may lead to the need for backtracking.

Train log-linear model of p(action | context)
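A sketch of what p(action | context) might condition on; the feature names and stack/buffer encoding are illustrative, not a fixed recipe from the slides.

```python
def action_features(stack, buffer):
    """Illustrative context features for a shift-reduce action classifier."""
    feats = {}
    if stack:
        feats["top=" + stack[-1]] = 1.0
    if len(stack) > 1:
        feats["second=" + stack[-2]] = 1.0
    if buffer:
        feats["next=" + buffer[0]] = 1.0
    if stack and buffer:
        feats["top+next=" + stack[-1] + "_" + buffer[0]] = 1.0
    return feats

# Each oracle step of the parse above, e.g.
#   action_features(["Verb", "Det"], ["flight"])  with gold action "shift",
# becomes one training example for the conditional log-linear classifier.
```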

Word Dependency Parsing

Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.

Part-of-speech tagging gives the POS-tagged sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September .
PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

Word dependency parsing gives the dependency-parsed sentence:
[Figure: dependency tree over the sentence, with arcs labeled ROOT, SUBJ, S-COMP, SPEC, MOD, COMP]

slide adapted from Yuji Matsumoto
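One common way to represent this output structure is a head index (0 = ROOT) plus an arc label per word. The arcs below are illustrative for a few words of the sentence, not the exact gold parse drawn on the slide.

```python
# A dependency parse as parallel arrays, 1-based word indices, 0 = ROOT.
words  = ["He",   "reckons", "the",  "deficit", "will", "narrow"]
heads  = [ 2,      0,         4,      6,         6,      2      ]
labels = ["SUBJ", "ROOT",    "SPEC", "SUBJ",    "MOD",  "S-COMP"]

def arcs(words, heads, labels):
    # Yield (head_word, dependent_word, label) triples.
    for i, (h, lab) in enumerate(zip(heads, labels), start=1):
        yield ("ROOT" if h == 0 else words[h - 1], words[i - 1], lab)
```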

Great ideas in NLP: Log-linear models
(Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models:

p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …

Each choice depends on a limited part of the history. But which dependencies should we allow, given limited training data? What if they're all worthwhile? p(D | A,B,C)? … p(D | A,B) * p(C | A,B,D)?

Solution: Log-linear (max-entropy) modeling. Throw them all in:

(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …

Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
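A brute-force sketch of the formula above: each potential Φ is the exponential of the weights on its active features, and Z is the sum of the unnormalized score over all assignments (only feasible for tiny examples; the function and variable names are illustrative).

```python
import math

def potential(weights, active_feats):
    # Phi(...) = exp(sum of weights on that factor's active features)
    return math.exp(sum(weights.get(f, 0.0) for f in active_feats))

def prob(assignment, weights, factors, all_assignments):
    """p(assignment) = (1/Z) * product over factors of Phi(factor), where
    factors(a) lists the active-feature sets, one per factor Phi."""
    def unnorm(a):
        score = 1.0
        for active in factors(a):
            score *= potential(weights, active)
        return score
    z = sum(unnorm(a) for a in all_assignments)   # brute-force normalizer
    return unnorm(assignment) / z
```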

How about structured outputs?

• Log-linear models are great for n-way classification.

• Also good for predicting sequences ("find preferred tags": v a n), but to allow fast dynamic programming, only use n-gram features.

• Also good for dependency parsing ("… find preferred links …"), but to allow fast dynamic programming or MST parsing, only use single-edge features.
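In the sequence case, restricting the model to n-gram features means the score decomposes over adjacent tag pairs, which is what makes Viterbi dynamic programming possible. A hedged sketch with bigram features (feature names are illustrative):

```python
def sequence_score(weights, words, tags):
    # Score decomposes over positions: one tag-word feature and one
    # tag-bigram feature per position, so Viterbi can maximize it.
    score, prev = 0.0, "<s>"
    for word, tag in zip(words, tags):
        score += weights.get(("emit", tag, word), 0.0)
        score += weights.get(("trans", prev, tag), 0.0)
        prev = tag
    return score

# e.g. sequence_score(w, ["find", "preferred", "tags"], ["v", "a", "n"])
```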

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou
V   A     A       A       N   J N      V        C
byl jasn  stud    dubn    den a hodi   odbí     třin      (stems)
"It was a bright cold day in April and the clocks were striking thirteen"

Is this a good edge (jasný – den)? Yes, lots of green; its features include:
• jasný den ("bright day")
• jasný N ("bright NOUN")
• A N
• A N, preceding the conjunction

How about the competing edge (jasný – hodiny)? Not as good, lots of red:
• jasný hodiny ("bright clocks") ... undertrained ...
• jasn hodi ("bright clock", stems only)
• A_plural N_singular
• A N where N follows a conjunction

Which edge is better, "bright day" or "bright clocks"?
Score of an edge e = θ ⋅ features(e), where θ is our current weight vector.
Standard algorithms then find the valid parse with the maximum total score.
• can't have both (one parent per word)
• can't have both (no crossing links)
• can't have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.
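A hedged sketch of edge-factored scoring, with illustrative features in the spirit of the slide (word pair, stem pair, coarse POS pair); the token representation is an assumption.

```python
def edge_features(sentence, head, child):
    # sentence[i] is assumed to be a dict with "form", "stem", "pos".
    h, c = sentence[head], sentence[child]
    return [("word-pair", h["form"], c["form"]),
            ("stem-pair", h["stem"], c["stem"]),
            ("pos-pair",  h["pos"],  c["pos"])]

def edge_score(theta, sentence, head, child):
    # Score of an edge e = theta . features(e)
    return sum(theta.get(f, 0.0) for f in edge_features(sentence, head, child))
```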

Finding Highest-Scoring Parse

The cat in the hat wore a stovepipe. ROOT
[Figure: the same dependency graph, vertically stretched so that each subtree is a linguistic constituent (here a noun phrase)]

Convert to a context-free grammar (CFG), then use dynamic programming.
The CKY algorithm for CFG parsing is O(n³). Unfortunately, it is O(n⁵) in this case:
to score the "cat wore" link, it is not enough to know this constituent is an NP; we must know it is rooted at "cat",
so the nonterminal set grows by O(n): {NPthe, NPcat, NPhat, ...},
and CKY's "grammar constant" is no longer constant.
Solution: Use a different decomposition (Eisner 1996). Back to O(n³).

Spans vs. constituents

Two kinds of substring:
» Constituent of the tree: links to the rest only through its headword (root).
» Span of the tree: links to the rest only through its endwords.

The cat in the hat wore a stovepipe. ROOT

Decomposing a tree into spans:
The cat in the hat wore a stovepipe. ROOT
= The cat + cat in the hat wore + wore a stovepipe. ROOT
cat in the hat wore = cat in + in the hat wore
in the hat wore = in the hat + hat wore

Finding Highest-Scoring Parse

Convert to a CFG, then use dynamic programming: CKY is O(n³), unfortunately O(n⁵) in this case; the Eisner (1996) decomposition brings it back to O(n³).
Can play the usual tricks for dynamic programming parsing:
• further refining the constituents or spans
• allowing the probability model to keep track of even more internal information
• A*, best-first, coarse-to-fine
• training by EM etc., which requires "outside" probabilities of constituents, spans, or links

Hard Constraints on Valid Trees

Score of an edge e = θ ⋅ features(e), where θ is our current weight vector; standard algorithms find the valid parse with the maximum total score.
• can't have both (one parent per word)
• can't have both (no crossing links)
• can't have all three (no cycles)
Thus, an edge may lose (or win) because of a consensus of other edges.

Non-Projective Parses

The "projectivity" restriction (no crossing links): do we really want it?

I 'll give a talk tomorrow on bootstrapping   ROOT
The subtree rooted at "talk" is a discontiguous noun phrase: occasional non-projectivity in English.

ista meam norit gloria canitiem   ROOT
that-NOM my-ACC may-know glory-NOM going-gray-ACC
"That glory may-know my going-gray" (i.e., it shall last till I go gray)
Frequent non-projectivity in Latin, etc.

Finding highest-scoring non-projective tree

Consider the sentence "John saw Mary". The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree, which may be non-projective, in time O(n²).

[Figure: fully connected weighted directed graph over root, John, saw, Mary (left) and its maximum-weight spanning tree (right)]

slide thanks to Dragomir Radev

Every node selects its best parent; if there are cycles, contract them and repeat.
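A hedged sketch of that first pass only: each non-root node greedily picks its best parent, and we check for a cycle; the contraction-and-repeat step of Chu-Liu-Edmonds is omitted. score(h, m) is an assumed edge-scoring function (e.g., θ ⋅ features as above), and node 0 is ROOT.

```python
def greedy_parents(score, n):
    """First step of Chu-Liu-Edmonds: every non-root node picks its
    highest-scoring incoming edge.  Cycle contraction, the recursive
    step, is not shown here."""
    parent = [None] * (n + 1)
    for m in range(1, n + 1):
        parent[m] = max((h for h in range(n + 1) if h != m),
                        key=lambda h: score(h, m))
    return parent

def find_cycle(parent):
    # Follow parent pointers from each node; report a cycle if we
    # revisit a node on the current path before reaching ROOT (node 0).
    for start in range(1, len(parent)):
        seen, m = set(), start
        while m != 0 and m not in seen:
            seen.add(m)
            m = parent[m]
        if m in seen:
            return m  # a node that lies on a cycle
    return None
```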

Summing over all non-projective trees

How about the total weight Z of all trees? How about outside probabilities or gradients? These can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

40

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need!

40

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need!

O(n3) time!

Building the Kirchhoff (Laplacian) Matrix

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

N.B.: This allows multiple children of root, but see Koo et al. 2007.
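A minimal numpy sketch of that recipe. It assumes scores[h, m] already holds the exponentiated edge score exp(θ ⋅ features(h → m)) with index 0 as ROOT, so the determinant is the Z summed over all spanning trees rooted at ROOT.

```python
import numpy as np

def partition_function(scores):
    """Sum over all spanning trees rooted at node 0, via the Matrix-Tree
    Theorem.  scores[h, m] is the exponentiated score of edge h -> m;
    row/column 0 is ROOT; diagonal entries are ignored."""
    w = scores.copy()
    np.fill_diagonal(w, 0.0)
    # Kirchhoff (Laplacian) matrix: negate edge scores, put column sums
    # (total incoming weight of each child) on the diagonal.
    k = -w
    np.fill_diagonal(k, w.sum(axis=0))
    # Strike the root row and column, take the determinant.
    return np.linalg.det(k[1:, 1:])
```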

Why Should This Work?

Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.
Clear for the 1×1 matrix; use induction.
Undirected case first; special root cases for the directed version.