Joint Models with Missing Data for Semi-Supervised Learning (Jason Eisner, NAACL Workshop Keynote – June 2009)
Transcript
Page 1: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

1

Jason Eisner
NAACL Workshop Keynote – June 2009

Joint Models with Missing Data for Semi-Supervised Learning

Page 2: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

2

Outline

1. Why use joint models?

2. Making big joint models tractable: Approximate inference and training by loopy belief propagation

3. Open questions: Semi-supervised training of joint models

Page 3: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

3

The standard story

Task: x → y, modeled by p(y|x)

Semi-sup. learning: Train on many (x,?) and a few (x,y)

Page 4: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

4

Some running examples

Task: x → y, modeled by p(y|x)

Semi-sup. learning: Train on many (x,?) and a few (x,y)

sentence → parse

lemma → morphological paradigm

E.g., in low-resource languages

(with David A. Smith)

(with Markus Dreyer)

Page 5: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

5

Semi-supervised learning: Train on many (x,?) and a few (x,y)

Why would knowing p(x) help you learn p(y|x)?

Shared parameters via joint model, e.g., noisy channel:

p(x,y) = p(y) * p(x|y)

Estimate p(x,y) to have the appropriate marginal p(x)

This affects the conditional distribution p(y|x)

Page 6: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

6

sample of p(x)

Page 7: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

7

E.g., if p(x,y) = ∑c p(x,y,c) = ∑c p(c) p(y|c) p(x|c)   (a joint model, with few params)

For any x, can now recover the cluster c that probably generated it.
A few supervised examples may let us predict y from c.

sample of p(x)
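A minimal sketch of the cluster-then-label idea above, assuming scikit-learn and synthetic 2-D data (the blob data, the cluster-to-label voting rule, and the function names are illustrative, not from the talk): fit a mixture model to the unlabeled x's, then use a handful of labeled pairs to map clusters to labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Unlabeled sample of p(x): two well-separated blobs.
X_unlab = np.vstack([rng.normal(0, 1, (500, 2)),
                     rng.normal(6, 1, (500, 2))])

# A few supervised (x, y) pairs.
X_lab = np.array([[0.2, -0.1], [5.8, 6.3], [6.1, 5.7]])
y_lab = np.array([0, 1, 1])

# 1. Estimate p(x) = sum_c p(c) p(x|c) from unlabeled data alone.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_unlab)

# 2. Use the few labeled examples to map each cluster c to a label y
#    (majority vote within the cluster).
clusters_lab = gmm.predict(X_lab)
cluster_to_label = {c: np.bincount(y_lab[clusters_lab == c]).argmax()
                    for c in np.unique(clusters_lab)}

# 3. Predict y for a new x by first recovering its likely cluster c.
def predict(X_new):
    return np.array([cluster_to_label.get(c, -1) for c in gmm.predict(X_new)])

print(predict(np.array([[0.0, 0.0], [6.0, 6.0]])))  # e.g. [0 1]
```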

Page 8: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

8

Semi-supervised learning: Train on many (x,?) and a few (x,y)

Why would knowing p(x) help you learn p(y|x)?

Picture is misleading: no need to assume a distance metric (as in TSVM, label propagation, etc.)

But we do need to choose a model family for p(x,y)

Shared parameters via joint model, e.g., noisy channel:

p(x,y) = p(y) * p(x|y)

Estimate p(x,y) to have the appropriate marginal p(x)

This affects the conditional distribution p(y|x)
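As a toy numeric illustration of those last two bullets (the tag set, words, and probability tables below are invented): with shared parameters, the conditional p(y|x) implied by the joint model is p(y)p(x|y)/p(x), so re-estimating p(x|y) to better match the observed marginal p(x) also moves p(y|x).

```python
# Hypothetical two-class noisy channel over a tiny vocabulary.
# p(y) is the source model, p(x|y) the channel; both tables are invented.
p_y = {"NOUN": 0.6, "VERB": 0.4}
p_x_given_y = {
    "NOUN": {"duck": 0.7, "run": 0.3},
    "VERB": {"duck": 0.2, "run": 0.8},
}

def posterior(x):
    """p(y|x) proportional to p(y) * p(x|y): the conditional implied by the joint."""
    scores = {y: p_y[y] * p_x_given_y[y][x] for y in p_y}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

print(posterior("duck"))
# Re-estimating p(x|y) (e.g., by EM on unlabeled x's) so that the implied
# marginal p(x) fits the unlabeled data would change these posteriors even
# with no new labeled data -- that is how p(x) information reaches p(y|x).
```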

Page 9: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

9

NLP + ML = ???

Task: x → y

x: structured input (may be only partly observed, so infer x, too)

y: structured output (so we already need joint inference for decoding, e.g., dynamic programming)

p(y|x) model: depends on features of <x,y> (sparse features?), or on features of <x,z,y> where z are latent (so infer z, too)

Page 10: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

10

Each task in a vacuum?

Task1: x1 → y1

Task2: x2 → y2

Task3: x3 → y3

Task4: x4 → y4

Page 11: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

11

Solved tasks help later ones? (e.g., pipeline)

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

Page 12: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

12

Feedback?

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

What if Task3 isn't solved yet and we have little <z2,z3> training data?

Page 13: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

13

Feedback?

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

What if Task3 isn't solved yet and we have little <z2,z3> training data?
Impute <z2,z3> given x1 and y4!

Page 14: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

14

A later step benefits from many earlier ones?

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

Page 15: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

15

A later step benefits from many earlier ones?

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

And conversely?

Page 16: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

16

We end up with a Markov Random Field (MRF)

[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1, Φ2, Φ3, Φ4]

Page 17: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

17

= Variable-centric, not task-centric

[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1, …, Φ5]

p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)
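A minimal sketch of what that factorization means computationally (the binary variables and factor tables below are invented, not from the talk): the unnormalized score of an assignment is the product of the factor values, and Z sums that product over all assignments.

```python
import itertools

# Invented toy factors over binary variables x, z1, z2, z3, y.
def phi1(x, z1):          return 2.0 if x == z1 else 1.0
def phi2(z1, z2):         return 3.0 if z1 == z2 else 0.5
def phi3(x, z1, z2, z3):  return 1.5 if (x + z1 + z2 + z3) % 2 == 0 else 1.0
def phi4(z3, y):          return 2.5 if z3 == y else 1.0
def phi5(y):              return 1.2 if y == 1 else 1.0

def unnormalized(x, z1, z2, z3, y):
    return (phi1(x, z1) * phi2(z1, z2) * phi3(x, z1, z2, z3)
            * phi4(z3, y) * phi5(y))

# Brute-force partition function (feasible only for tiny models;
# avoiding this enumeration is exactly what belief propagation is for).
Z = sum(unnormalized(*assign)
        for assign in itertools.product([0, 1], repeat=5))

def p(x, z1, z2, z3, y):
    return unnormalized(x, z1, z2, z3, y) / Z

print(p(1, 1, 1, 1, 1))
```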

Page 18: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

18

First, a familiar example: a Conditional Random Field (CRF) for POS tagging

18

Familiar MRF example

……

find preferred tags

v v v

Possible tagging (i.e., assignment to remaining variables)

Observed input sentence (shaded)

Page 19: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

19

Familiar MRF example

Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v a n

Another possible tagging

Observed input sentence (shaded)

Page 20: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

20

Familiar MRF example: CRF

……

find preferred tags

A "binary" factor measures the compatibility of 2 adjacent tags:

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

The model reuses the same parameters at this position.

Page 21: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

21

Familiar MRF example: CRF

……

find preferred tags

A "unary" factor evaluates this tag; its values depend on the corresponding word (here it can't be an adjective):

v  0.2
n  0.2
a  0

Page 22: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

22

Familiar MRF example: CRF

……

find preferred tags

v  0.2
n  0.2
a  0

The "unary" factor evaluates this tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence).

Page 23: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

23

Familiar MRF example: CRF

……

find preferred tags

The "unary" factor evaluates this tag; a different unary factor applies at each position:

"find":       v 0.3   n 0.02  a 0
"preferred":  v 0.3   n 0     a 0.1
"tags":       v 0.2   n 0.2   a 0

Page 24: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

24

Familiar MRF example: CRF

……

find preferred tags

Binary factors between adjacent tags and a unary factor at each word:

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

"find":       v 0.3   n 0.02  a 0
"preferred":  v 0.3   n 0     a 0.1
"tags":       v 0.2   n 0.2   a 0

Tagging: v a n

p(v a n) is proportional to the product of all factors' values on "v a n".

Page 25: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

25

Familiar MRF example: CRF

……

find preferred tags

With the same factor tables, for the tagging "v a n":

p(v a n) ∝ … · 1 · 3 · 0.3 · 0.1 · 0.2 · …

(the binary factor on v→a, the binary factor on a→n, and the three unary factors; the "…" stand for factors from the rest of the sentence)

NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in the well-trained supervised case …)
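A minimal sketch of that product in code, using the factor tables shown on these slides (the dictionary layout and function name are mine): the unnormalized score of a tagging is the product of the unary factor at each word and the binary factor on each adjacent tag pair.

```python
# Binary factor between adjacent tags (rows = left tag, cols = right tag).
BINARY = {
    "v": {"v": 0, "n": 2, "a": 1},
    "n": {"v": 2, "n": 1, "a": 0},
    "a": {"v": 0, "n": 3, "a": 1},
}

# Unary factor for each observed word (values depend on the word).
UNARY = {
    "find":      {"v": 0.3, "n": 0.02, "a": 0},
    "preferred": {"v": 0.3, "n": 0,    "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0},
}

def score(words, tags):
    """Unnormalized score: product of all factor values on this tagging."""
    s = 1.0
    for word, tag in zip(words, tags):
        s *= UNARY[word][tag]
    for left, right in zip(tags, tags[1:]):
        s *= BINARY[left][right]
    return s

# p(v a n) is proportional to 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
# (times factors from the rest of the sentence, omitted here).
print(score(["find", "preferred", "tags"], ["v", "a", "n"]))  # ~0.018
```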

Page 26: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

26

Task-centered view of the world

Task1: x → z1

Task2: z1 → z2

Task3: z2 → z3

Task4: z3 → y

Page 27: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

27

= Variable-centered view of the world

[Factor graph: variables x, z1, z2, z3, y connected by factors Φ1, …, Φ5]

p(x, z1, z2, z3, y) = (1/Z) Φ1(x,z1) Φ2(z1,z2) Φ3(x,z1,z2,z3) Φ4(z3,y) Φ5(y)

Page 28: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

28

Variable-centric, not task-centric

Throw in any variables that might help! Model and exploit correlations.

Page 29: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

29

[Diagram: many kinds of variables and relations one could model jointly: lexicon (word types), semantics, sentences, tokens, discourse context, resources; related by entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation]

Page 30: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

30

Back to our (simpler!) running examples

sentence → parse

lemma → morphological paradigm

(with David A. Smith)

(with Markus Dreyer)

Page 31: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

31

Parser projection

sentence → parse  (little direct training data)

translation → parse of translation  (much more training data)

(with David A. Smith)

Page 32: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

32

Parser projection

Auf diese Frage habe ich leider keine Antwort bekommen

I did not unfortunately receive an answer to this question

Page 33: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

33

Parser projection

sentence → parse  (little direct training data)

word-to-word alignment

translation → parse of translation  (much more training data)

Page 34: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

34

Parser projection

Auf diese Frage habe ich leider keine Antwort bekommen

I did not unfortunately receive an answer to this question

NULL

Page 35: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

35

Parser projection

sentence → parse  (little direct training data)

word-to-word alignment

translation → parse of translation  (much more training data)

need an interesting model

Page 36: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

36

Parses are not entirely isomorphic

Auf diese Frage habe ich leider keine Antwort bekommen

I did not unfortunately receive an answer to this question

NULL

monotonic
null
head-swapping
siblings

Page 37: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

37

Dependency Relations

+ “none of the above”

Page 38: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

38

Parser projection

sentence parse

translation

word-to-word alignment

parse of translation

Typical test data (no translation observed):

Page 39: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

39

sentence parse

translation

word-to-word alignment

parse of translation

Small supervised training set (treebank):

Parser projection

Page 40: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

40

Parser projection

sentence parse

translation

word-to-word alignment

parse of translation

Moderate treebank in other language:

Page 41: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

41

sentence parse

translation

word-to-word alignment

parse of translation

Maybe a few gold alignments:

Parser projection

Page 42: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

42

sentence parse

translation

word-to-word alignment

parse of translation

Lots of raw bitext:

Parser projection

Page 43: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

43

sentence parse

translation

word-to-word alignment

parse of translation

Given bitext,

Parser projection

Page 44: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

44

sentence parse

translation

word-to-word alignment

parse of translation

Given bitext, try to impute other variables:

Parser projection

Page 45: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

45

sentence parse

translation

word-to-word alignment

parse of translation

Given bitext, try to impute other variables: now we have more constraints on the parse …

Parser projection

Page 46: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

46

sentence parse

translation

word-to-word alignment

parse of translation

Given bitext, try to impute other variables: now we have more constraints on the parse … which should help us train the parser.

Parser projection

We’ll see how belief propagation naturally handles this.

Page 47: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

47

English does help us impute Chinese parse

中国 在 基本 建设 方面 , 开始 利用 国际 金融 组织 的 贷款 进行 国际性 竞争性 招标 采购

In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement

China: in: infrastructure: construction: area:, : has begun: to utilize: international: financial: organizations: ‘s: loans: to implement: international: competitive: bidding: procurement

Seeing noisy output of an English WSJ parser fixes these Chinese links

The corresponding bad versions found without seeing the English parse

Subject attaches to intervening noun

N P J N N , V V N N N ‘s N V J N N N

Complement verbs swap objects

Page 48: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

48

Which does help us train a monolingual Chinese parser

Page 49: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

49

(Could add a 3rd language …)

sentence → parse

translation → parse of translation

translation′ → parse of translation′

pairwise word-to-word alignments (between each pair of sentences)

Page 50: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

50

world

sentence parse

translation

word-to-word alignment

parse of translation

(Could add world knowledge …)

Page 51: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

51

(Could add bilingual dictionary …)

sentence parse

translation

word-to-word alignment

parse of translation

dict (since incomplete, treat as a partially observed variable)

N

Page 52: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

52

sentence parse

translation parse of

translation

alignment

Auf diese Frage habe ich leider keine Antwort bekommen

I did not unfortunately receive an answer to this question

NULL

Dynamic Markov Random Field. Note: these are structured variables.

Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, …)

Thus, the number of fine-grained variables & factors varies by example (but all examples share a single finite parameter vector)

Page 53: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

53

Back to our running examples:

sentence → parse (with David A. Smith)

lemma → morphological paradigm (with Markus Dreyer)

Page 54: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

54

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

Morphological paradigm

Page 55: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

55

         Present   Past
inf      werfen
1st Sg   werfe     warf
2nd Sg   wirfst    warfst
3rd Sg   wirft     warf
1st Pl   werfen    warfen
2nd Pl   werft     warft
3rd Pl   werfen    warfen

Morphological paradigm

Page 56: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

56

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

Morphological paradigm as MRF

Each factor is a sophisticated weighted FST

Page 57: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

57

         Present   Past
inf      9,393
1st Sg   285       1124
2nd Sg   166       4     (rare!)
3rd Sg   1410      1124
1st Pl   1688      673
2nd Pl   1275      9     (rare!)
3rd Pl   1688      673

# observations per form (fine-grained semi-supervision)

Question: Does joint inference help? (the rare forms are undertrained)

Page 58: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

58

         Present   Past
inf      gelten
1st Sg   gelte     galt
2nd Sg   giltst    galtst (or: galtest)
3rd Sg   gilt      galt
1st Pl   gelten    galten
2nd Pl   geltet    galtet
3rd Pl   gelten    galten

*geltst   giltst
geltet    geltet
galtst    galtest
*gltt     galtet

gelten 'to hold, to apply'

Page 59: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

59

         Present                      Past
inf      abbrechen
1st Sg   abbreche (or: breche ab)     abbrach (or: brach ab)
2nd Sg   abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
3rd Sg   abbricht (or: bricht ab)     abbrach (or: brach ab)
1st Pl   abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
2nd Pl   abbrecht (or: brecht ab)     abbracht (or: bracht ab)
3rd Pl   abbrechen (or: brechen ab)   abbrachen (or: brachen ab)

*abbrachten   abbrachen
abbreche      abbreche
abbracht      abbracht
abbrecht      abbrecht
*atttrachst   abbrachst
*abbrechst    abbrichst
abbricht      abbricht
*abbrachten   abbrachen

abbrechen 'to quit'

Page 60: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

60

         Present    Past
inf      gackern
1st Sg   gackere    gackerte
2nd Sg   gackerst   gackertest
3rd Sg   gackert    gackerte
1st Pl   gackern    gackerten
2nd Pl   gackert    gackertet
3rd Pl   gackern    gackerten

gackern 'to cackle'

*gackrt     gackertet
*gackart    gackertest
gackere     gackere
gackerst    gackerst
gackert     gackert
gackern     gackern
gackert     gackert
gackern     gackern
gackerte    gackerte
gackerte    gackerte
gackerten   gackerten
gackerten   gackerten

Page 61: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

61

         Present   Past
inf      werfen
1st Sg   werfe     warf
2nd Sg   wirfst    warfst
3rd Sg   wirft     warf
1st Pl   werfen    warfen
2nd Pl   werft     warft
3rd Pl   werfen    warfen

werfen 'to throw'

warft      warft
*werfst    wirfst
werft      werft
warfst     warfst

Page 62: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

62

Preliminary results …

Joint inference helps a lot on the rare forms.

It hurts on the others. Can we fix this? (Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)

Page 63: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

63

Outline

1. Why use joint models in NLP?

2. Making big joint models tractable: Approximate inference and training by loopy belief propagation

3. Open questions: Semi-supervised training of joint models

Page 64: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

64

Key Idea! We're using an MRF to coordinate the solutions to several NLP problems.

Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links).

Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with.

Page 65: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

65

Why we need approximate inference

MRFs are great for n-way classification (maxent). Also good for predicting sequences ("… find preferred tags …"): alas, the forward-backward algorithm only allows n-gram features.

Also good for dependency parsing ("… find preferred links …"): alas, our combinatorial algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness).

Page 66: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

66

Great Ideas in ML: Message Passing

Count the soldiers (adapted from MacKay (2003) textbook)

Messages passed along the line in one direction: "1 behind you", "2 behind you", …, "5 behind you"; in the other direction: "1 before you", …, "5 before you"; and each soldier knows "there's 1 of me".

Page 67: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

67

Great Ideas in ML: Message Passing

Count the soldiers (adapted from MacKay (2003) textbook)

A soldier who only sees his incoming messages ("2 before you", "3 behind you", and "there's 1 of me") forms the belief: must be 2 + 1 + 3 = 6 of us.

Page 68: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

68

Count the soldiers (adapted from MacKay (2003) textbook)

The next soldier, seeing "1 before you", "4 behind you", and "there's 1 of me", reaches the same belief: must be 1 + 1 + 4 = 6 of us.

Page 69: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

69

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree: hearing "7 here" and "3 here", and knowing "1 of me", he reports "11 here (= 7 + 3 + 1)" up the remaining branch. (Adapted from MacKay (2003) textbook.)

Page 70: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

70

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree: hearing "3 here" and "3 here", he reports "7 here (= 3 + 3 + 1)". (Adapted from MacKay (2003) textbook.)

Page 71: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

71

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree: hearing "7 here" and "3 here", he reports "11 here (= 7 + 3 + 1)". (Adapted from MacKay (2003) textbook.)

Page 72: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

72

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree: combining "7 here", "3 here", and "3 here" with himself, his belief is: must be 14 of us. (Adapted from MacKay (2003) textbook.)

Page 73: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

73

Great Ideas in ML: Message Passing

Each soldier receives reports from all branches of the tree: combining "7 here", "3 here", and "3 here" with himself gives the belief: must be 14 of us. This wouldn't work correctly with a "loopy" (cyclic) graph. (Adapted from MacKay (2003) textbook.)

Page 74: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

74

Great ideas in ML: Forward-Backward

… find preferred tags …

In the CRF, message passing = forward-backward.

At each position, an α message arrives from the left, a β message arrives from the right, and the belief is their elementwise product with the unary factor, e.g.:

α message:     v 3     n 1    a 6
unary factor:  v 0.3   n 0    a 0.1
β message:     v 2     n 1    a 7
belief:        v 1.8   n 0    a 4.2

Messages are passed through the binary factor table:

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1
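A minimal sketch of forward-backward as message passing on a chain CRF, reusing the factor tables from the earlier tagging slides (the matrix orientation and normalization are my choices, and the particular message numbers shown on the slide come from a different illustrative setting):

```python
import numpy as np

TAGS = ["v", "n", "a"]
T = np.array([[0, 2, 1],     # binary factor: row = left tag, col = right tag
              [2, 1, 0],
              [0, 3, 1]], dtype=float)
U = np.array([[0.3, 0.02, 0.0],   # unary factor at "find"
              [0.3, 0.0,  0.1],   # unary factor at "preferred"
              [0.2, 0.2,  0.0]])  # unary factor at "tags"
n = len(U)

# alpha[i] = message arriving at position i from the left,
# beta[i]  = message arriving at position i from the right.
alpha = [np.ones(3) for _ in range(n)]
beta = [np.ones(3) for _ in range(n)]
for i in range(1, n):
    alpha[i] = (alpha[i - 1] * U[i - 1]) @ T
for i in range(n - 2, -1, -1):
    beta[i] = T @ (beta[i + 1] * U[i + 1])

# Belief at position i = elementwise product of incoming messages and unary.
for i in range(n):
    b = alpha[i] * U[i] * beta[i]
    print(dict(zip(TAGS, b / b.sum())))  # normalized tag marginals
```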

Page 75: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

75

Great ideas in ML: Forward-Backward

… find preferred tags …

Extend the CRF to a "skip chain" to capture a non-local factor: more influences on the belief. Multiplying the previous belief (v 1.8, n 0, a 4.2) by the additional incoming message (v 3, n 1, a 6) gives the new belief (v 5.4, n 0, a 25.2).

Page 76: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

76

Great ideas in ML: Forward-Backward

… find preferred tags …

Extend the CRF to a "skip chain" to capture a non-local factor: more influences on the belief, but the graph becomes loopy. The belief (v 5.4, n 0, a 25.2) is still the product of the incoming messages and the unary factor.

Red messages not independent? Pretend they are!
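A generic sketch of what "pretend they are" looks like as loopy sum-product updates on a small pairwise MRF (the triangle of variables, the factor tables, and the fixed iteration count are invented; this is not the talk's implementation): each variable repeatedly sends each neighbor the product of its unary factor and the messages from its other neighbors, pushed through the pairwise factor, and beliefs are read off at the end.

```python
import numpy as np

# Invented loopy pairwise MRF: 3 variables in a triangle, 3 states each.
unary = {v: np.ones(3) for v in "ABC"}
unary["A"] = np.array([0.3, 0.02, 0.1])
pair = np.array([[1.0, 2.0, 0.5],        # symmetric pairwise table shared
                 [2.0, 1.0, 3.0],        # by all three edges
                 [0.5, 3.0, 1.0]])
edges = [("A", "B"), ("B", "C"), ("A", "C")]

# msg[(i, j)] = current message from variable i to variable j.
msg = {(i, j): np.ones(3) for (i, j) in edges + [(j, i) for (i, j) in edges]}

def neighbors(v):
    return [j for (i, j) in msg if i == v]

for _ in range(20):                      # iterate; hope for convergence
    new = {}
    for (i, j) in msg:
        # Product of i's unary factor and messages from i's *other* neighbors ...
        prod = unary[i].copy()
        for k in neighbors(i):
            if k != j:
                prod *= msg[(k, i)]
        # ... pushed through the pairwise factor toward j, then normalized.
        m = pair @ prod
        new[(i, j)] = m / m.sum()
    msg = new

for v in "ABC":
    b = unary[v].copy()
    for k in neighbors(v):
        b *= msg[(k, v)]
    print(v, b / b.sum())                # approximate marginals ("beliefs")
```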

Page 77: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

77

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

MRF over string-valued variables!

Each factor is a sophisticated weighted FST

Page 78: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

78

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

Each factor is a sophisticated weighted FST

MRF over string-valued variables!

What are these messages? Probability distributions over strings …

Represented by weighted FSAs, constructed by finite-state operations; parameters trainable using finite-state methods.

Warning: the FSAs can get larger and larger; must prune back using k-best or a variational approximation.
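A toy stand-in for such pruned messages (the real messages are weighted FSAs built by finite-state operations; here a message is just a k-best table of strings, and the "factor" is an invented string relation, not real German morphology): composing a message with a factor multiplies out hypotheses, so we truncate back to the k best.

```python
from heapq import nlargest

def compose(message, factor_expand, k=5):
    """Push a {string: weight} message through a string-to-string factor,
    then prune to the k highest-weighted output strings."""
    out = {}
    for s, w in message.items():
        for t, fw in factor_expand(s):
            out[t] = out.get(t, 0.0) + w * fw
    return dict(nlargest(k, out.items(), key=lambda kv: kv[1]))

# Invented "factor": relate a present-tense form to candidate past-tense forms.
def toy_past_factor(form):
    stem = form[:-2] if form.endswith("en") else form
    return [(stem + "te", 0.5), ("ge" + stem + "t", 0.3), (stem, 0.2)]

msg_in = {"werfen": 0.7, "werfe": 0.3}   # pruned distribution over strings
print(compose(msg_in, toy_past_factor, k=3))
```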

Page 79: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

79

Key Idea! We're using an MRF to coordinate the solutions to several NLP problems.

Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently over many fine-grained variables (individual words, tags, links).

Within a factor, use existing fast exact NLP algorithms. These are the "propagators" that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with. We just saw this for morphology; now let's see it for parsing.

Page 80: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

80

Back to simple variables … CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

v a n

Local factors in a graphical model

find preferred links ……

Page 81: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

81

Back to simple variables … CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

81

Local factors in a graphical model

find preferred links ……

tf

ft

ff

Possible parse— encoded as an assignment to these vars

v a n

Page 82: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

82

Back to simple variables … CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

82

Local factors in a graphical model

find preferred links ……

ff

tf

tf

Another possible parse, encoded as a different assignment to these vars

v a n

Page 83: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

83

Back to simple variables … CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

(cycle)

83

Local factors in a graphical model

find preferred links ……

ft

ttf

An illegal parse: this assignment contains a cycle

v a n

f

Page 84: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

84

Back to simple variables … CRF for POS tagging

Now let’s do dependency parsing! O(n2) boolean variables for the possible links

(cycle)

84

Local factors in a graphical model

find preferred links ……

t

tt

Another illegal parse: a word ends up with multiple parents

v a n

t

(multiple parents)

f

f

Page 85: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

104

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation

104

Local factors for parsing

find preferred links ……

t 2f 1

t 1f 2

t 1f 2

t 1f 6

t 1f 3

as before, the goodness of this link can depend on the entire observed input context

t 1f 8

some other links aren't as good given this input

sentence

But what if the best assignment isn’t a tree??

Page 86: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

105

Global factors for parsing

So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation.
A global TREE factor to require that the links form a legal tree: this is a "hard constraint"; the factor is either 0 or 1.

… find preferred links …

TREE factor values over the 6 link variables: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0
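A minimal sketch of these two kinds of factors over the O(n²) link variables (the toy scores, helper names, and brute-force enumeration are mine; real systems use the combinatorial algorithms discussed on the surrounding slides): the TREE factor simply multiplies the score by 0 or 1.

```python
import itertools

words = ["find", "preferred", "links"]          # words 1..n, plus root 0
n = len(words)
links = [(h, m) for h in range(n + 1) for m in range(1, n + 1) if h != m]

# Invented unary factor per link: value when the link is "t" ("f" gets 1.0).
unary_t = {link: 2.0 if link in {(0, 1), (1, 3), (3, 2)} else 0.5
           for link in links}

def tree_factor(assignment):
    """Hard constraint: 1 if the 'true' links form a tree rooted at 0, else 0."""
    parents = {}
    for (h, m), on in assignment.items():
        if on:
            if m in parents:
                return 0.0                      # multiple parents
            parents[m] = h
    if len(parents) != n:
        return 0.0                              # some word has no parent
    for m in range(1, n + 1):                   # every word must reach the root
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return 0.0                      # cycle
            seen.add(node)
            node = parents[node]
    return 1.0

def score(assignment):
    s = tree_factor(assignment)
    for link, on in assignment.items():
        s *= unary_t[link] if on else 1.0
    return s

# Brute force over all 2^|links| assignments -- only sane for toy sentences.
best = max((dict(zip(links, bits)) for bits in
            itertools.product([False, True], repeat=len(links))), key=score)
print([link for link, on in best.items() if on], score(best))
```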

Page 87: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

106

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1

106

Global factors for parsing

find preferred links ……

TREE factor values over the 6 link variables (64 entries, each 0/1): ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0

So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).

Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!

optionally require the tree to be projective (no crossing links)

we're legal!

Page 88: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

107

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent

107

Local factors for parsing

find preferred links ……

Grandparent factor on two links (f/t × f/t):

     f   t
f    1   1
t    1   3

Page 89: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

108

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent no-cross

108

Local factors for parsing

… find preferred links … by …

No-cross factor on two links (f/t × f/t):

     f   t
f    1   1
t    1   0.2

Page 90: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

109

Local factors for parsing

find preferred links …… by

So what factors shall we multiply to define parse probability? Unary factors to evaluate each link in isolation Global TREE factor to require that the links form a legal tree

this is a “hard constraint”: factor is either 0 or 1 Second-order effects: factors on 2 variables

grandparent, no-cross, coordination with other parse & alignment, hidden POS tags, siblings, subcategorization, …

Page 91: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

110

Exactly Finding the Best Parse

With only single-edge features, exact parsing is fast: projective parsing is O(n³) by dynamic programming; non-projective is O(n²) by minimum spanning tree.

With arbitrary features, runtime blows up ("… find preferred links …"):
grandparents: O(n⁴)
grandparents + sibling bigrams: O(n⁵)
POS trigrams: O(n³ g⁶)
sibling pairs (non-adjacent): O(2ⁿ)
any of the above features, soft penalties for crossing links, or pretty much anything else: NP-hard

Page 92: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

111

Two great tastes that taste great together

You got dynamic programming in my belief propagation!

You got belief propagation in my dynamic programming!

Page 93: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

113

Loopy Belief Propagation for Parsing

… find preferred links …

The sentence tells word 3, "Please be a verb."
Word 3 tells the 3→7 link, "Sorry, then you probably don't exist."
The 3→7 link tells the TREE factor, "You'll have to find another parent for 7."
The TREE factor tells the 10→7 link, "You're on!"
The 10→7 link tells word 10, "Could you please be a noun?" …

Page 94: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

114

Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle … Strong links are suppressing or promoting other links …

Loopy Belief Propagation for Parsing

… find preferred links …

Page 95: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

115

Higher-order factors (e.g., Grandparent) induce loops. Let's watch a loop around one triangle …

How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"

Loopy Belief Propagation for Parsing

… find preferred links …

TREE factor: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0

Page 96: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

116

How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"

Loopy Belief Propagation for Parsing

… find preferred links …

But this is the outside probability of the green link!

The TREE factor computes all outgoing messages at once (given all incoming messages).
Projective case: total O(n³) time by inside-outside.
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).
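A rough numpy sketch of the non-projective computation, i.e., edge marginals via the Matrix-Tree theorem in the style of Smith & Smith (2007) and Koo et al. (2007) (the multiple-children-of-root convention and all variable names are my simplification): build the Laplacian from edge weights and read the marginals off its inverse.

```python
import numpy as np

def edge_marginals(root_w, edge_w):
    """root_w[m]: weight of the root -> word m link.
    edge_w[h, m]: weight of the word h -> word m link (diagonal ignored).
    Returns (root marginals, edge marginals) under p(tree) ∝ product of weights."""
    n = len(root_w)
    A = edge_w * (1 - np.eye(n))                 # zero out self-links
    L = np.diag(root_w + A.sum(axis=0)) - A      # Laplacian: L[m,m] = weight into m
    Linv = np.linalg.inv(L)                      # note: log Z = log det(L)
    root_marg = root_w * np.diag(Linv)
    edge_marg = A * (np.diag(Linv)[None, :] - Linv.T)
    return root_marg, edge_marg

# Toy 3-word sentence with invented exp(score) edge weights.
root_w = np.array([2.0, 0.5, 0.5])
edge_w = np.array([[0.0, 3.0, 1.0],
                   [1.0, 0.0, 2.0],
                   [0.5, 1.0, 0.0]])
r, e = edge_marginals(root_w, edge_w)
print(r)                    # P(root -> m is in the tree), for each word m
print(e)                    # P(h -> m is in the tree), for each word pair
print(r + e.sum(axis=0))    # each word has exactly one parent: sums to 1
```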

Page 97: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

117

How did we compute the outgoing message to the green link? "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"

Loopy Belief Propagation for Parsing

But this is the outside probability of the green link!

The TREE factor computes all outgoing messages at once (given all incoming messages).
Projective case: total O(n³) time by inside-outside.
Non-projective: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007).

Belief propagation assumes the incoming messages to TREE are independent, so outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).

Page 98: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

118

Some interesting connections …

Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
Global constraints in arc consistency: the ALLDIFFERENT constraint (Régin 1994)
The matching constraint in max-product BP, for computer vision (Duchi et al., 2006); could be used for machine translation

As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing.

Page 99: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

119

Runtimes for each factor type (see paper), per iteration:

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)

Additive, not multiplicative!

Page 100: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

120

Runtimes for each factor type (see paper), per iteration:

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(n)
TOTAL                                          O(n³)

Additive, not multiplicative!

Each "global" factor coordinates an unbounded number of variables. Standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

Page 101: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

121

Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

                Danish   Dutch   English
Tree+Link       85.5     87.3    88.6
+NoCross        86.1     88.3    89.1
+Grandparent    86.1     88.6    89.4
+ChildSeq       86.5     88.5    90.1

Page 102: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

122

Dependency Accuracy: the extra, higher-order features help! (non-projective parsing)

                                         Danish   Dutch   English
Tree+Link                                85.5     87.3    88.6
+NoCross                                 86.1     88.3    89.1
+Grandparent                             86.1     88.6    89.4
+ChildSeq                                86.5     88.5    90.1
Best projective parse with all factors   86.0     84.5    90.2    (exact, slow)
+hill-climbing                           86.1     87.6    90.2    (doesn't fix enough edges)

Page 103: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

123

Time vs. Projective Search Error

[Plots: projective search error vs. number of BP iterations, compared with an O(n⁴) DP baseline and an O(n⁵) DP baseline]

Page 104: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

125

Summary of MRF parsing by BP

Output probability is defined as a product of local and global factors: throw in any factors we want! (log-linear model). Each factor must be fast, but they run independently.

Let local factors negotiate via "belief propagation": each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming. Each iteration takes total time O(n³) or even O(n²); see paper. Compare reranking or stacking.

Converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

Page 105: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

126

Outline

1. Why use joint models in NLP?

2. Making big joint models tractable: Approximate inference and training by loopy belief propagation

3. Open questions: Semi-supervised training of joint models

Page 106: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

127

Training with missing data is hard! Semi-supervised learning of HMMs or PCFGs: ouch!

Merialdo: just stick with the small supervised training set; adding unsupervised data tends to hurt. A stronger model helps (McClosky et al. 2007, Cohen et al. 2009).

So maybe some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.). Naïve Bayes would be okay …

Variables with unknown values can't hurt you: they have no influence on training or decoding. But they can't help you, either! And the independence assumptions are flaky. So I'd like to keep discussing joint models …

Page 107: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

128

Case #1: Missing data that you can’t impute

sentence parse

translation

word-to-wordalignment

parse oftranslation

Treat like multi-task learning? Shared features between 2 tasks: parse Chinese vs. parse Chinese with English translation. Or 3 tasks: parse Chinese with inferred English gist vs. parse Chinese with English translation vs. parse English gist derived from English (supervised).

Page 108: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

129

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

Case #2: Missing data you can impute, but maybe badly

Each factor is a sophisticated weighted FST

Page 109: 1 1 Jason Eisner NAACL Workshop Keynote – June 2009 Joint Models with Missing Data for Semi-Supervised Learning.

130

inf xyz

1st Sg

2nd Sg

3rd Sg

1st Pl

2nd Pl

3rd Pl

Present Past

Case #2: Missing data you can impute, but maybe badly

Each factor is a sophisticated weighted FST

This is where simple cases of EM go wrong.

Could reduce to case #1 and throw away these variables.

Or: damp messages from imputed variables to the extent you're not confident in them. Requires confidence estimation (cf. strapping). Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence. But could train a confidence estimator on supervised data to pay attention to all sorts of things! Correspondingly, scale up features for related missing-data tasks, since the damped data are "partially missing".
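The slide doesn't commit to a damping rule, so here is one hypothetical version as a sketch (both the convex-combination rule and the entropy-based confidence score are placeholders, not the talk's method): blend each message from an imputed variable toward the uninformative uniform message according to an estimated confidence.

```python
import numpy as np

def damp_message(message, confidence):
    """Blend a BP message toward uniform: confidence=1 trusts it fully,
    confidence=0 makes the imputed variable uninformative."""
    message = np.asarray(message, dtype=float)
    message = message / message.sum()
    uniform = np.full_like(message, 1.0 / len(message))
    return confidence * message + (1.0 - confidence) * uniform

# Hypothetical crude confidence: low entropy of the node's belief -> high confidence.
def confidence_from_entropy(belief):
    belief = np.asarray(belief, dtype=float)
    belief = belief / belief.sum()
    entropy = -np.sum(belief * np.log(belief + 1e-12))
    return float(np.exp(-entropy))

msg = [0.7, 0.2, 0.1]
print(damp_message(msg, confidence_from_entropy(msg)))
```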

