Dependency Parsing by Belief Propagation


1

David A. Smith (JHU → UMass Amherst)

Jason Eisner (Johns Hopkins University)

Dependency Parsing by Belief Propagation

2

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

3

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

4


Word Dependency Parsing

He reckons the current account deficit will narrow to only 1.8 billion in September.

Raw sentence

Part-of-speech tagging

He reckons the current account deficit will narrow to only 1.8 billion in September.

PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

POS-tagged sentence

Word dependency parsing

slide adapted from Yuji Matsumoto

Word dependency parsed sentence

He reckons the current account deficit will narrow to only 1.8 billion in September .

(Arc labels in the parse figure: ROOT, SUBJ, S-COMP, SPEC, MOD, COMP)

5

What does parsing have to do with belief propagation?

loopy belief propagation

6

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

7

Great ideas in NLP: Log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models.

p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …

Each choice depends on a limited part of the history.

But which dependencies to allow? What if they’re all worthwhile?

p(D | A,B,C)?

… p(D | A,B) * p(C | A,B,D)?

8

Great ideas in NLP: Log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models.

Solution: Log-linear (max-entropy) modeling

Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.

p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …

Which dependencies to allow? (given limited training data)

(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …

Throw them all in!
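To make the log-linear recipe concrete, here is a minimal, hypothetical sketch (not from the talk or the paper): candidate outputs fire overlapping binary features, each feature has a weight, and the model multiplies exp(weight) for every firing feature and then normalizes by Z. The feature names and weights are invented for illustration.

```python
import math

# Hypothetical binary features and weights: overlapping features are allowed.
weights = {"phi_AB": 0.7, "phi_BC": -0.3, "phi_ABD": 1.2}

def score(firing_features):
    return sum(weights.get(f, 0.0) for f in firing_features)

def log_linear_probs(candidates):
    """candidates: {output name: list of features that fire on it}.
    Returns p(y) = exp(score(y)) / Z over just these candidate outputs."""
    expscores = {y: math.exp(score(f)) for y, f in candidates.items()}
    Z = sum(expscores.values())          # the (1/Z) normalizer from the slide
    return {y: v / Z for y, v in expscores.items()}

print(log_linear_probs({"y1": ["phi_AB", "phi_BC"], "y2": ["phi_ABD"]}))
```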

9

Log-linear models: great for n-way classification. Also good for predicting sequences.

Also good for dependency parsing


How about structured outputs?

but to allow fast dynamic programming, only use n-gram features

but to allow fast dynamic programming or MST parsing, only use single-edge features

…find preferred links…

find preferred tags

v a n

10

How about structured outputs? But to allow fast dynamic programming or MST parsing, only use single-edge features.

…find preferred links…

11

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

Is this a good edge?

yes, lots of green ...

12

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

Is this a good edge?

jasný den(“bright day”)

13

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

Is this a good edge?

jasný den(“bright day”)

jasný N(“bright NOUN”)

V A A A N J N V C

14

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

Is this a good edge?

jasný den(“bright day”)

jasný N(“bright NOUN”)

V A A A N J N V C

A N

15

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

Is this a good edge?

jasný den(“bright day”)

jasný N(“bright NOUN”)

V A A A N J N V C

A  N preceding a conjunction

16

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

How about this competing edge?

V A A A N J N V C

not as good, lots of red ...

17

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

How about this competing edge?

V A A A N J N V C

jasný hodiny(“bright clocks”)

... undertrained ...

18

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

How about this competing edge?

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

jasný hodiny(“bright clocks”)

... undertrained ...

jasn hodi(“bright clock,”

stems only)

19

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

How about this competing edge?

V A A A N J N V C

jasn hodi(“bright clock,”

stems only)

byl jasn stud dubn den a hodi odbí třin

A:plural  N:singular

jasný hodiny(“bright clocks”)

... undertrained ...

20

jasný hodiny(“bright clocks”)

... undertrained ...

Edge-Factored Parsers (McDonald et al. 2005)

Byl jasný studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

How about this competing edge?

V A A A N J N V C

jasn hodi(“bright clock,”

stems only)

byl jasn stud dubn den a hodi odbí třin

A:plural  N:singular

A  N where N follows a conjunction

21

jasný

Edge-Factored Parsers (McDonald et al. 2005)

Byl studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

Which edge is better? “bright day” or “bright clocks”?

22

jasný

Edge-Factored Parsers (McDonald et al. 2005)

Byl studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

Which edge is better? Score of an edge e = θ ⋅ features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.
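A small sketch, under my own assumptions about the feature templates, of the edge-factored scoring these slides illustrate: each candidate edge fires features such as the word pair, the stem pair, and the tag pair, and its score is the dot product of those features with the current weight vector θ. The `Token` fields and template names are illustrative, not the paper's actual feature set.

```python
from collections import namedtuple

Token = namedtuple("Token", "form stem tag")   # e.g. Token("jasný", "jasn", "A")

def edge_features(sent, head, mod):
    """Illustrative feature templates in the spirit of the slides."""
    h, m = sent[head], sent[mod]
    return [
        f"wordpair={h.form}+{m.form}",          # jasný + den
        f"stempair={h.stem}+{m.stem}",          # jasn + den (stems only)
        f"headword+modtag={h.form}+{m.tag}",    # jasný + N
        f"tagpair={h.tag}+{m.tag}",             # A + N
        f"tagpair+dir={h.tag}+{m.tag}+{'R' if head < mod else 'L'}",
    ]

def edge_score(theta, sent, head, mod):
    # score(e) = theta · features(e)
    return sum(theta.get(f, 0.0) for f in edge_features(sent, head, mod))

# The decoder's job (next slides) is then to find the valid tree that
# maximizes the sum of edge_score over its edges.
```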

23

Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better? Score of an edge e = θ ⋅ features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.

Can’t have both (one parent per word)

Can’t have both (no crossing links)

Can’t have all three (no cycles)

Thus, an edge may lose (or win) because of a consensus of other edges.

24

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

25

Finding Highest-Scoring Parse

The cat in the hat wore a stovepipe. ROOT

Convert to context-free grammar (CFG) Then use dynamic programming

each subtree is a linguistic constituent(here a noun phrase)

(figure: the parse drawn with each subtree as a unit: “The cat”, “in”, “the hat”, “wore”, “a stovepipe”, ROOT)

let’s vertically stretch this graph drawing

26

Finding Highest-Scoring Parse

each subtree is a linguistic constituent(here a noun phrase)

(the same figure, vertically stretched)

so CKY’s “grammar constant” is no longer constant

Convert to context-free grammar (CFG) Then use dynamic programming

The CKY algorithm for CFG parsing is O(n3); unfortunately, it is O(n5) in this case. To score the link between “cat” and “wore”, it is not enough to know this constituent is an NP; we must know it is rooted at “cat”, so the nonterminal set expands by a factor of O(n): {NPthe, NPcat, NPhat, ...}

27

Finding Highest-Scoring Parse

each subtree is a linguistic constituent(here a noun phrase)

(the same figure again)

Convert to context-free grammar (CFG) Then use dynamic programming

The CKY algorithm for CFG parsing is O(n3); unfortunately, it is O(n5) in this case. Solution: use a different decomposition (Eisner 1996), which gets us back to O(n3).
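For concreteness, here is a compact, textbook-style sketch of Eisner-style first-order projective decoding over spans, assuming `score[h][m]` holds the edge scores with index 0 as ROOT. It is not the authors' implementation, and it lets ROOT take multiple children.

```python
import numpy as np

def eisner_decode(score):
    """First-order projective decoding (Eisner 1996) in O(n^3) time.
    score[h][m] = score of the edge h -> m; index 0 is ROOT.
    Returns head[m] for every token m (head[0] is unused)."""
    n = score.shape[0]
    # comp[i][j][d], incomp[i][j][d]: best score of a span i..j.
    # d = 1: the head is i (arc points right); d = 0: the head is j (arc points left).
    comp = np.zeros((n, n, 2)); incomp = np.zeros((n, n, 2))
    b_comp = np.zeros((n, n, 2), dtype=int); b_incomp = np.zeros((n, n, 2), dtype=int)
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # incomplete span: two complete halves plus one new arc between i and j
            vals = [comp[i][k][1] + comp[k + 1][j][0] for k in range(i, j)]
            k = int(np.argmax(vals)) + i
            incomp[i][j][0] = vals[k - i] + score[j][i]   # j is the head of i
            incomp[i][j][1] = vals[k - i] + score[i][j]   # i is the head of j
            b_incomp[i][j][0] = b_incomp[i][j][1] = k
            # complete span: an incomplete span plus a complete span
            vals = [comp[i][k][0] + incomp[k][j][0] for k in range(i, j)]
            k = int(np.argmax(vals)) + i
            comp[i][j][0] = vals[k - i]; b_comp[i][j][0] = k
            vals = [incomp[i][k][1] + comp[k][j][1] for k in range(i + 1, j + 1)]
            k = int(np.argmax(vals)) + i + 1
            comp[i][j][1] = vals[k - i - 1]; b_comp[i][j][1] = k
    head = [0] * n
    def walk_incomp(i, j, d):
        head[i if d == 0 else j] = j if d == 0 else i
        k = b_incomp[i][j][d]
        walk_comp(i, k, 1); walk_comp(k + 1, j, 0)
    def walk_comp(i, j, d):
        if i == j:
            return
        k = b_comp[i][j][d]
        if d == 0:
            walk_comp(i, k, 0); walk_incomp(k, j, 0)
        else:
            walk_incomp(i, k, 1); walk_comp(k, j, 1)
    walk_comp(0, n - 1, 1)
    return head
```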

28

Spans vs. constituents

Two kinds of substring:
» Constituent of the tree: links to the rest only through its headword (root).
» Span of the tree: links to the rest only through its endwords.

The cat in the hat wore a stovepipe. ROOT

The cat in the hat wore a stovepipe. ROOT

Decomposing a tree into spans

(figure: the tree over “The cat in the hat wore a stovepipe. ROOT” decomposed into overlapping spans, e.g. “The cat” + “cat in the hat wore” + “wore a stovepipe. ROOT”, with “cat in the hat wore” built in turn from the smaller spans “cat in” + “in the hat” + “hat wore”)

30

Finding Highest-Scoring Parse

Convert to context-free grammar (CFG) Then use dynamic programming

The CKY algorithm for CFG parsing is O(n3); unfortunately, it is O(n5) in this case. Solution: use a different decomposition (Eisner 1996), which gets us back to O(n3).

We can play the usual tricks for dynamic-programming parsing: further refining the constituents or spans (allowing the probability model to keep track of even more internal information), A*, best-first, coarse-to-fine, training by EM, etc. These require “outside” probabilities of constituents, spans, or links.

31

Hard Constraints on Valid Trees

Score of an edge e = θ ⋅ features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.

Can’t have both (one parent per word)

Can’t have both (no crossing links)

Can’t have all three (no cycles)

Thus, an edge may lose (or win) because of a consensus of other edges.

32


Non-Projective Parses

Can’t have both (no crossing links)

The “projectivity” restriction. Do we really want it?

I ’ll give a talk tomorrow on bootstrapping   ROOT

The subtree rooted at “talk” is a discontiguous noun phrase.

33

Non-Projective Parses

ista meam norit gloria canitiem   ROOT

I ’ll give a talk tomorrow on bootstrapping   ROOT

that(NOM) my(ACC) may-know glory(NOM) going-gray(ACC)

That glory may-know my going-gray (i.e., it shall last till I go gray)

occasional non-projectivity in English

frequent non-projectivity in Latin, etc.

34

Finding the highest-scoring non-projective tree: consider the sentence “John saw Mary” (left). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective, in time O(n2).

(figure: the fully connected weighted graph over root, John, saw, Mary, and its maximum-weight spanning tree: root → saw (10), saw → John (30), saw → Mary (30))

slide thanks to Dragomir Radev

Every node selects its best parent; if there are cycles, contract them and repeat.
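A sketch of just the first step the slide describes: every non-root node greedily selects its best parent, and we check whether those choices form a cycle. The contraction-and-recursion step that Chu-Liu-Edmonds applies to a detected cycle is deliberately omitted, so this is not a complete implementation; `score[h][m]` is a hypothetical edge-weight matrix with node 0 as the root.

```python
def greedy_parents(score):
    """score[h][m]: weight of the edge h -> m, with node 0 as the root.
    Returns (heads, cycle): the greedily chosen parent of each node, and one
    cycle among those choices (a list of nodes), or None if the result is a tree."""
    n = len(score)
    heads = [0] * n
    for m in range(1, n):                       # every node selects its best parent
        heads[m] = max((h for h in range(n) if h != m), key=lambda h: score[h][m])
    for start in range(1, n):                   # detect a cycle among the chosen edges
        seen, v = [], start
        while v != 0 and v not in seen:
            seen.append(v)
            v = heads[v]
        if v != 0:                              # re-entered a visited node: found a cycle
            return heads, seen[seen.index(v):]  # CLE would contract this cycle,
    return heads, None                          # reweight its edges, and recurse
```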

35

Summing over all non-projective trees

How about total weight Z of all trees? How about outside probabilities or gradients? Can be found in time O(n3) by matrix determinants and inverses (Smith & Smith, 2007).

slide thanks to Dragomir Radev

36

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need!

O(n3) time!

37

Building the Kirchhoff (Laplacian) Matrix

Start from the matrix of edge scores s(i, j) (including a row and column for the root, node 0), then negate the edge scores, put each column's sum on its diagonal, and strike out the root row and column:

$$
K \;=\; \begin{bmatrix}
\sum_{j\neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1)\\
-s(1,2) & \sum_{j\neq 2} s(2,j) & \cdots & -s(n,2)\\
\vdots & \vdots & \ddots & \vdots\\
-s(1,n) & -s(2,n) & \cdots & \sum_{j\neq n} s(n,j)
\end{bmatrix},
\qquad Z = \det K
$$

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

N.B.: This allows multiple children of root, but see Koo et al. 2007.
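A minimal sketch of this recipe, assuming edge log-scores in a matrix `score[h][m]` with index 0 as ROOT: build the Kirchhoff matrix (column sums on the diagonal, negated edge scores off it, root row and column struck out) and take its determinant to get Z. The interface is hypothetical; the construction follows the Matrix-Tree theorem as used by Smith & Smith (2007) and Koo et al. (2007), and like the slide it allows multiple children of the root.

```python
import numpy as np

def log_total_tree_weight(score):
    """log Z, where Z sums the exp-scores of all spanning trees rooted at node 0.
    score[h][m] = log score of the edge h -> m; index 0 is ROOT."""
    n = score.shape[0] - 1                  # number of non-root words
    s = np.exp(score)                       # edge potentials
    np.fill_diagonal(s, 0.0)                # no self-loops
    K = np.zeros((n, n))                    # Kirchhoff matrix, root row/column struck out
    for m in range(1, n + 1):
        K[m - 1, m - 1] = s[:, m].sum()     # column sum: every possible parent of word m
        for h in range(1, n + 1):
            if h != m:
                K[h - 1, m - 1] -= s[h, m]  # negated edge score
    sign, logdet = np.linalg.slogdet(K)     # det K = Z by the Matrix-Tree theorem
    return logdet
```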

38

Why Should This Work? Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.

(equation sketch: expanding det K in the edge score s(1,2) relates K to the Kirchhoff matrix of the graph with nodes 1 and 2 contracted, K({1,2} | {1,2}), mirroring the contraction step of Chu-Liu-Edmonds)

Clear for 1x1 matrix; use induction

Undirected case; special root cases for directed

39

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

40

Exactly Finding the Best Parse

With arbitrary features, runtime blows up. Projective parsing: O(n3) by dynamic programming; non-projective: O(n2) by minimum spanning tree.

but to allow fast dynamic programming or MST parsing, only use single-edge features

…find preferred links…

Adding higher-order features raises the cost:
grandparents: O(n4)
grandparents + sibling bigrams: O(n5)
POS trigrams: O(n3 g6)
sibling pairs (non-adjacent): O(2n), i.e. exponential
NP-hard: any of the above features; soft penalties for crossing links; pretty much anything else!

41

Let’s reclaim our freedom (again!)

Output probability is a product of local factors. Throw in any factors we want! (log-linear model)

How could we find the best parse?
Integer linear programming (Riedel et al., 2006): doesn’t give us probabilities when training or parsing.
MCMC: slow to mix? High rejection rate because of the hard TREE constraint?
Greedy hill-climbing (McDonald & Pereira 2006).

This paper in a nutshell

(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …

None of these exploit the tree structure of parses as the first-order methods do.

42

Let’s reclaim our freedom (again!)

Output probability is a product of local factors. Throw in any factors we want! (log-linear model)

Let local factors negotiate via “belief propagation”: links (and tags) reinforce or suppress one another.

Each iteration takes total time O(n2) or O(n3)

Converges to a pretty good (but approx.) global parse

certain global factors ok too

each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside)

This paper in a nutshell

(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
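To make “local factors negotiate via messages” concrete, here is a generic, textbook sum-product loop over a factor graph whose factors are small enough to tabulate explicitly. This is emphatically not the paper's parser: its global factors (TREE and friends) are never tabulated, and they compute all of their outgoing messages at once with combinatorial algorithms, as later slides explain.

```python
from itertools import product
from collections import defaultdict

def loopy_bp(domains, factors, iters=10):
    """Sum-product BP. domains: {var: list of values}.
    factors: list of (vars, table) where table maps a tuple of values
    (one per variable in vars, in order) to a nonnegative number."""
    msg_vf = defaultdict(lambda: defaultdict(lambda: 1.0))  # (var, factor) -> value -> msg
    msg_fv = defaultdict(lambda: defaultdict(lambda: 1.0))  # (factor, var) -> value -> msg
    for _ in range(iters):
        # factor -> variable: sum out the other variables,
        # weighting each joint assignment by the incoming variable messages
        for fi, (vars_, table) in enumerate(factors):
            for i, v in enumerate(vars_):
                out = defaultdict(float)
                for assign in product(*(domains[u] for u in vars_)):
                    w = table[assign]
                    for j, u in enumerate(vars_):
                        if j != i:
                            w *= msg_vf[(u, fi)][assign[j]]
                    out[assign[i]] += w
                msg_fv[(fi, v)] = out
        # variable -> factor: product of the other factors' messages
        for fi, (vars_, _) in enumerate(factors):
            for v in vars_:
                out = {}
                for x in domains[v]:
                    p = 1.0
                    for fj, (vars2, _) in enumerate(factors):
                        if fj != fi and v in vars2:
                            p *= msg_fv[(fj, v)][x]
                    out[x] = p
                msg_vf[(v, fi)] = out
    # beliefs: normalized product of all incoming factor messages at each variable
    beliefs = {}
    for v, dom in domains.items():
        b = {x: 1.0 for x in dom}
        for fi, (vars_, _) in enumerate(factors):
            if v in vars_:
                for x in dom:
                    b[x] *= msg_fv[(fi, v)][x]
        z = sum(b.values()) or 1.0
        beliefs[v] = {x: b[x] / z for x in dom}
    return beliefs
```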

43

Let’s reclaim our freedom (again!) This paper in a nutshell:

Training with many features: iterative scaling. Each weight in turn is influenced by the others; iterate to achieve globally optimal weights. To train a distribution over trees, use dynamic programming to compute the normalizer Z.

Decoding with many features (new!): belief propagation. Each variable in turn is influenced by the others; iterate to achieve locally consistent beliefs. To decode a distribution over trees, use dynamic programming to compute the messages.

44

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

45

First, a familiar example Conditional Random Field (CRF) for POS tagging


Local factors in a graphical model

……

find preferred tags

v v v

Possible tagging (i.e., assignment to remaining variables)

Observed input sentence (shaded)

46

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v a n

Possible tagging (i.e., assignment to remaining variables)Another possible tagging

Observed input sentence (shaded)

47

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

“Binary” factor that measures the compatibility of 2 adjacent tags (the model reuses the same parameters at this position):

    v  n  a
 v  0  2  1
 n  2  1  0
 a  0  3  1

48

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

“Unary” factor evaluates this tag; its values depend on the corresponding word (here: v 0.2, n 0.2, a 0, i.e. this word can’t be an adjective).

49

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

“Unary” factor evaluates this tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence). Here: v 0.2, n 0.2, a 0.

50

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

“Unary” factor evaluates this tag; there is a different unary factor at each position, e.g. v 0.2, n 0.2, a 0; v 0.3, n 0.02, a 0; v 0.3, n 0, a 0.1.

51

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

The full model multiplies a binary tag-pair factor between each adjacent pair of tags (the v/n/a table above, reused at both positions) and a unary factor at each position (v 0.3, n 0.02, a 0; v 0.3, n 0, a 0.1; v 0.2, n 0.2, a 0). For the tagging “v a n”, p(v a n) is proportional to the product of all the factors’ values on “v a n”.

52

Local factors in a graphical model First, a familiar example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

With the same factors, p(v a n) is proportional to the product of all the factors’ values on “v a n” = … 1 * 3 * 0.3 * 0.1 * 0.2 … (the ellipses stand for factors involving the rest of the sentence).
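A tiny runnable version of the product the slide spells out, using the factor tables shown on these slides; the assignment of the unary tables to the three words is my reading of the figure, and the slide's ellipses (the rest of the sentence) are ignored here.

```python
# Factor tables as shown on the slides (my reading of which table sits under which word).
unary = [
    {"v": 0.3, "n": 0.02, "a": 0.0},   # "find"
    {"v": 0.3, "n": 0.0,  "a": 0.1},   # "preferred"
    {"v": 0.2, "n": 0.2,  "a": 0.0},   # "tags"
]
binary = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,   # rows: left tag, cols: right tag
          ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
          ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}

def unnormalized_p(tags):
    """Product of every factor's value on this tagging (ignoring the elided context)."""
    p = 1.0
    for i, t in enumerate(tags):
        p *= unary[i][t]
    for left, right in zip(tags, tags[1:]):
        p *= binary[(left, right)]
    return p

print(unnormalized_p(["v", "a", "n"]))   # 0.3 * 0.1 * 0.2 * 1 * 3 = 0.018
```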

53

First, a familiar example CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

v a n

Local factors in a graphical model

find preferred links ……

54

First, a familiar example CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links


Local factors in a graphical model

find preferred links ……

tf

ft

ff

Possible parse— encoded as an assignment to these vars

v a n

55

First, a familiar example CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links


Local factors in a graphical model

find preferred links ……

ff

tf

tf

Possible parse— encoded as an assignment to these varsAnother possible parse

v a n

56

First, a familiar example CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

(cycle)


Local factors in a graphical model

find preferred links ……

ft

ttf

Possible parse— encoded as an assignment to these varsAnother possible parseAn illegal parse

v a n

f

57

First, a familiar example CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

(cycle)


Local factors in a graphical model

find preferred links ……

t

tt

Possible parse— encoded as an assignment to these varsAnother possible parseAn illegal parseAnother illegal parse

v a n

t

(multiple parents)

f

f

58

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation


Local factors for parsing

find preferred links ……

Each link variable gets a unary factor, with values such as t 2, f 1 for a plausible link and t 1, f 6 or t 1, f 8 for less plausible ones. As before, the goodness of this link can depend on the entire observed input context; some other links aren’t as good given this input sentence.

But what if the best assignment isn’t a tree??

59

Global factors for parsing So what factors shall we multiply to define parse probability?

Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1

find preferred links ……

ffffff  0
ffffft  0
fffftf  0
…
fftfft  1
…
tttttt  0

60

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1


Global factors for parsing

find preferred links ……

ffffff  0
ffffft  0
fffftf  0
…
fftfft  1
…
tttttt  0

(64 entries, each 0 or 1, over the six links shown)

So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).

Note: McDonald et al. (2005) don’t loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!

optionally require the tree to be projective (no crossing links)

we’re legal!
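For intuition only, here is a brute-force rendering of the hard TREE factor as a 0/1 function of a link assignment, with an optional projectivity check. As the slide stresses, nobody should actually loop over assignments like this; the helper below is hypothetical and purely illustrative.

```python
def tree_factor(links, n, projective=False):
    """links: set of (head, modifier) pairs over words 1..n, with 0 = ROOT.
    Returns 1 if the links form a legal dependency tree, else 0."""
    heads = {}
    for h, m in links:
        if m in heads:
            return 0                          # multiple parents
        heads[m] = h
    if set(heads) != set(range(1, n + 1)):
        return 0                              # some word has no parent
    for m in range(1, n + 1):                 # no cycles: every word must reach ROOT
        seen, h = set(), m
        while h != 0:
            if h in seen:
                return 0
            seen.add(h)
            h = heads[h]
    if projective:                            # optional: no crossing links
        for h1, m1 in links:
            lo1, hi1 = sorted((h1, m1))
            for h2, m2 in links:
                lo2, hi2 = sorted((h2, m2))
                if lo1 < lo2 < hi1 < hi2:
                    return 0
    return 1
```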

61

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent


Local factors for parsing

find preferred links ……

Grandparent factor on two links, a 2×2 table over f/t:

     f  t
  f  1  1
  t  1  3

(value 3 when both links are present)

62

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent no-cross


Local factors for parsing

find preferred links ……

No-cross factor on two links that would cross (here, at the word “by”), a 2×2 table over f/t:

     f  t
  f  1  1
  t  1  0.2

(value 0.2 when both links are present, so crossing is discouraged)

63

Local factors for parsing

find preferred links …… by

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent, no-cross, siblings, hidden POS tags, subcategorization, …

64

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

65

Good to have lots of features, but … nice model; shame about the NP-hardness. Can we approximate?

Machine learning to the rescue! The ML community has given a lot to NLP; in the 2000s, NLP has been giving back to ML, mainly techniques for joint prediction of structures. (Much earlier, speech recognition had HMMs, EM, smoothing, …)


66

Great Ideas in ML: Message Passing


(speech bubbles in the figure: “3 behind you”, “2 behind you”, “1 behind you”, “4 behind you”, “5 behind you”, “1 before you”, “2 before you”, “3 before you”, “4 before you”, “5 before you”, “there’s 1 of me”)

adapted from MacKay (2003) textbook

Count the soldiers

67

Great Ideas in ML: Message Passing


3 behind you; 2 before you; there’s 1 of me.

Belief: must be 2 + 1 + 3 = 6 of us.

(I only see my incoming messages: 2, 1, 3.)

Count the soldiers

adapted from MacKay (2003) textbook

68

Belief: must be 2 + 1 + 3 = 6 of us (incoming messages 2, 1, 3).

Great Ideas in ML: Message Passing


4 behind you; 1 before you; there’s 1 of me.

(I only see my incoming messages: 1, 1, 4.)

Belief: must be 1 + 1 + 4 = 6 of us.

Count the soldiers

adapted from MacKay (2003) textbook

69

Great Ideas in ML: Message Passing


7 here; 3 here; 11 here (= 7 + 3 + 1); 1 of me.

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

70

Great Ideas in ML: Message Passing


3 here; 3 here; 7 here (= 3 + 3 + 1).

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

71

Great Ideas in ML: Message Passing


7 here; 3 here; 11 here (= 7 + 3 + 1).

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

72

Great Ideas in ML: Message Passing


7 here; 3 here; 3 here. Belief: must be 14 of us.

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

73

Great Ideas in ML: Message Passing. Each soldier receives reports from all branches of the tree.


7 here; 3 here; 3 here. Belief: must be 14 of us.

This wouldn’t work correctly with a “loopy” (cyclic) graph.

adapted from MacKay (2003) textbook

74

……

find preferred tags

Great ideas in ML: Forward-Backward

In the CRF, message passing = forward-backward. At the highlighted position, the α message (v 2, n 1, a 7) and the β message (v 3, n 1, a 6) combine with the unary factor (v 0.3, n 0, a 0.1) to give the belief (v 1.8, n 0, a 4.2). Messages are passed between positions through the binary tag-pair table (v/n/a rows: 0 2 1; 2 1 0; 0 3 1); the figure also shows messages (v 7, n 2, a 1) and (v 3, n 6, a 1) elsewhere in the chain.
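A small forward-backward sketch on the three-word chain with the factor tables from the earlier slides (my reading of which unary table belongs to which word). The slide's message values also include factors from the elided context marked “……”, so the numbers this prints will not match the slide exactly.

```python
import numpy as np

TAGS = ["v", "n", "a"]
unary = np.array([[0.3, 0.02, 0.0],    # "find"
                  [0.3, 0.0,  0.1],    # "preferred"
                  [0.2, 0.2,  0.0]])   # "tags"
binary = np.array([[0, 2, 1],          # rows: left tag, cols: right tag (v, n, a)
                   [2, 1, 0],
                   [0, 3, 1]], dtype=float)

n = unary.shape[0]
alpha = np.zeros((n, 3)); beta = np.zeros((n, 3))
alpha[0] = 1.0; beta[-1] = 1.0
for i in range(1, n):                              # forward (alpha) messages
    alpha[i] = (alpha[i - 1] * unary[i - 1]) @ binary
for i in range(n - 2, -1, -1):                     # backward (beta) messages
    beta[i] = binary @ (unary[i + 1] * beta[i + 1])

belief = alpha * unary * beta                      # unnormalized per-position marginals
print(belief / belief.sum(axis=1, keepdims=True))
```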

75

Extend CRF to “skip chain” to capture non-local factor More influences on belief


……

find preferred tags

Great ideas in ML: Forward-Backward

With the extra skip-chain factor, another message (v 3, n 1, a 6) arrives on top of α (v 2, n 1, a 7), β (v 3, n 1, a 6), and the unary (v 0.3, n 0, a 0.1), so the belief becomes (v 5.4, n 0, a 25.2).

76

Extend CRF to “skip chain” to capture non-local factor More influences on belief Graph becomes loopy


……

find preferred tags

Great ideas in ML: Forward-Backward

Same messages as before, but now the graph is loopy: the belief (v 5.4, n 0, a 25.2) comes from α (v 2, n 1, a 7), β (v 3, n 1, a 6), the skip-chain message (v 3, n 1, a 6), and the unary (v 0.3, n 0, a 0.1). Red messages not independent? Pretend they are!

77

Two great tastes that taste great together

You got dynamic programming in my belief propagation!

You got belief propagation in my dynamic programming!

Upcoming attractions …

Upcoming attractions …

78

Loopy Belief Propagation for Parsing

find preferred links ……

Sentence tells word 3, “Please be a verb.” Word 3 tells the 3→7 link, “Sorry, then you probably don’t exist.” The 3→7 link tells the TREE factor, “You’ll have to find another parent for 7.” The TREE factor tells the 10→7 link, “You’re on!” The 10→7 link tells word 10, “Could you please be a noun?” …

79

Higher-order factors (e.g., Grandparent) induce loops Let’s watch a loop around one triangle … Strong links are suppressing or promoting other links …


Loopy Belief Propagation for Parsing

find preferred links ……

80

Higher-order factors (e.g., Grandparent) induce loops Let’s watch a loop around one triangle …

How did we compute the outgoing message to the green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?”


Loopy Belief Propagation for Parsing

find preferred links ……

TREE factor:
ffffff  0
ffffft  0
fffftf  0
…
fftfft  1
…
tttttt  0

81

How did we compute the outgoing message to the green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?”


Loopy Belief Propagation for Parsing

find preferred links ……

TREE factor:
ffffff  0
ffffft  0
fffftf  0
…
fftfft  1
…
tttttt  0

But this is the outside probability of green link!

TREE factor computes all outgoing messages at once (given all incoming messages)

Projective case: total O(n3) time by inside-outside

Non-projective: total O(n3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)

82

How did we compute the outgoing message to the green link? “Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?”


Loopy Belief Propagation for Parsing

But this is the outside probability of green link!

TREE factor computes all outgoing messages at once (given all incoming messages)

Projective case: total O(n3) time by inside-outside

Non-projective: total O(n3) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)

Belief propagation assumes the incoming messages to TREE are independent. So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
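A sketch of the non-projective case, under the same hypothetical `score[h][m]` interface as before: after folding the incoming messages into per-edge potentials, every link's marginal probability, and hence the TREE factor's outgoing message for it, comes from one inverse of the Kirchhoff matrix, following the closed form of Koo et al. (2007) and Smith & Smith (2007).

```python
import numpy as np

def edge_marginals(score):
    """mu[h, m] = probability that edge h -> m appears in the tree, under the
    edge-factored distribution with log-potentials score[h][m] (index 0 = ROOT)."""
    n = score.shape[0] - 1
    s = np.exp(score)
    np.fill_diagonal(s, 0.0)
    K = np.zeros((n, n))                    # the same Kirchhoff matrix as before
    for m in range(1, n + 1):
        K[m - 1, m - 1] = s[:, m].sum()
        for h in range(1, n + 1):
            if h != m:
                K[h - 1, m - 1] -= s[h, m]
    Kinv = np.linalg.inv(K)                 # one O(n^3) inversion yields every marginal
    mu = np.zeros_like(s)
    for m in range(1, n + 1):
        mu[0, m] = s[0, m] * Kinv[m - 1, m - 1]                        # root edges
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = s[h, m] * (Kinv[m - 1, m - 1] - Kinv[m - 1, h - 1])
    return mu   # the TREE factor's outgoing message for link h -> m is ~ (1 - mu, mu)
```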

83

Some connections …

Parser stacking (Nivre & McDonald 2008, Martins et al. 2008)

Global constraints in arc consistency: the ALLDIFFERENT constraint (Régin 1994).

Matching constraint in max-product BP: used for computer vision (Duchi et al., 2006); could be used for machine translation.

As far as we know, our parser is the first use of global constraints in sum-product BP.

84

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

85

Runtimes for each factor type (see paper), per iteration. The totals are additive, not multiplicative!

Factor type        degree   runtime   count    total
Tree               O(n2)    O(n3)     1        O(n3)
Proj. Tree         O(n2)    O(n3)     1        O(n3)
Individual links   1        O(1)      O(n2)    O(n2)
Grandparent        2        O(1)      O(n3)    O(n3)
Sibling pairs      2        O(1)      O(n3)    O(n3)
Sibling bigrams    O(n)     O(n2)     O(n)     O(n3)
NoCross            O(n)     O(n)      O(n2)    O(n3)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g2)     O(n2)    O(n2)
TagTrigram         O(n)     O(ng3)    1        O(n)
TOTAL                                          O(n3)

86

Runtimes for each factor type (the same table as above). Each “global” factor coordinates an unbounded number of variables; standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

87

Experimental Details

Decoding: run several iterations of belief propagation, get the final beliefs at the link variables, and feed them into a first-order parser. This gives the Minimum Bayes Risk tree (minimizes expected error).

Training: BP computes beliefs about each factor too, which gives us gradients for maximum conditional likelihood (as in the forward-backward algorithm).

Features used in experiments: first-order, individual links just as in McDonald et al. 2005; higher-order, Grandparent, Sibling bigrams, NoCross.
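A sketch of the decoding recipe just described, assuming `beliefs[h][m]` holds the final BP belief that link h → m is present and reusing the `eisner_decode` sketch from earlier; this is my reading of the slide, not the released code.

```python
import numpy as np

def mbr_tree(beliefs):
    """beliefs[h][m]: final BP belief that the link h -> m is present.
    Decoding with these marginals as edge scores maximizes the expected number of
    correct edges, i.e. the Minimum Bayes Risk tree for dependency accuracy."""
    return eisner_decode(np.asarray(beliefs))
```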


88

Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

Danish Dutch English

Tree+Link 85.5 87.3 88.6

+NoCross 86.1 88.3 89.1

+Grandparent 86.1 88.6 89.4

+ChildSeq 86.5 88.5 90.1

89

Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

Danish Dutch English

Tree+Link 85.5 87.3 88.6

+NoCross 86.1 88.3 89.1

+Grandparent 86.1 88.6 89.4

+ChildSeq 86.5 88.5 90.1

Best projective parse with all factors: 86.0, 84.5, 90.2 (exact, but slow)

+hill-climbing: 86.1, 87.6, 90.2 (doesn’t fix enough edges)

90

Time vs. Projective Search Error

(charts: projective search error vs. runtime as the number of BP iterations grows, compared with the O(n4) DP and O(n5) DP baselines)

92

Outline

Edge-factored parsing (old): dependency parses; scoring the competing parses with edge features; finding the best parse

Higher-order parsing (new!): throwing in more features via graphical models; finding the best parse via belief propagation; experiments

Conclusions

93

Freedom Regained

Output probability is defined as a product of local and global factors: throw in any factors we want! (log-linear model). Each factor must be fast, but they run independently.

Let local factors negotiate via “belief propagation”: each bit of syntactic structure is influenced by the others. Some factors need combinatorial algorithms to compute their messages fast, e.g. existing parsing algorithms using dynamic programming. Each iteration takes total time O(n3) or even O(n2); see the paper. (Compare reranking or stacking.)

Converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

This paper in a nutshell

94

Future Opportunities

Efficiently modeling more hidden structure: POS tags, link roles, secondary links (DAG-shaped parses)

Beyond dependencies: constituency parsing, traces, lattice parsing

Beyond parsing: alignment, translation; bipartite matching and network flow; joint decoding of parsing and other tasks (IE, MT, reasoning ...)

Beyond text: image tracking and retrieval; social networks

95

thank you