
From Baby Steps to Leapfrog:

How “Less is More” in Unsupervised Dependency Parsing

Valentin I. Spitkovsky

with Hiyan Alshawi (Google Inc.) and Daniel Jurafsky (Stanford University)

Spitkovsky et al. (Stanford & Google) From Baby Steps to Leapfrog NAACL HLT (2010-06-04) 1 / 30

Overview

Idea: (At Least) Two Axes worth Scaffolding

Model (or Algorithmic) Complexity [classic NLP]

— word alignment (unsupervised), e.g., IBM models 1–5 (Brown et al., 1993)

— parsing (supervised), e.g., “coarse-to-fine” grammars (Charniak and Johnson, 2005; Petrov, 2009)

Data (or Problem / Task) Complexity [rare in NLP]

— reinforcement learning, e.g., robot navigation (Singh, 1992; Sanger, 1994)

— closest in NLP: cautious named entity classification (Collins and Singer, 1999; Yarowsky, 1995)

Overview

Outline: Three Data-Complexity-Aware Techniques

Baby Steps: scaffolding on data complexity — iterative, requires no initialization

Less is More: filtering by data complexity — batch, capable of using a good initializer

Leapfrog: a combination (best of both worlds) — intended as an efficiency hack (but performs best)

The Problem

Problem: Unsupervised Learning of Parsing

Input: Raw Text (Sentences, Tokens and POS-tags)

... By most measures, the nation’s industrial sector is now growing very slowly — if at all. Factory payrolls fell in September. So did the Federal Reserve ...

NN      NNS      VBD  IN  NN        ♦
Factory payrolls fell in September  .

Output: Syntactic Structures (and a Probabilistic Grammar)

The Problem

Motivation: Unsupervised (Dependency) Parsing

Insert your favorite reason(s) why you’d like to parse anything in the first place...

... adjust for any data without reference treebanks:
— i.e., exotic languages and/or genres (e.g., legal).

Potential applications:
◮ machine translation — word alignment, phrase extraction, reordering;
◮ web search — retrieval, query refinement;
◮ question answering, speech recognition, etc.

State-of-the-Art

State-of-the-Art: Directed Dependency Accuracy

42.2% on Section 23 (all sentences) of WSJ (Cohen and Smith, 2009)

31.7% for the (right-branching) baseline (Klein and Manning, 2004)

Scoring example:

NN      NNS      VBD  IN  NN        ♦
Factory payrolls fell in September  .

Directed Score: 3/5 = 60% (baseline: 2/5 = 40%);
Undirected Score: 4/5 = 80% (baseline: 4/5 = 80%).
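The directed and undirected scores above can be computed with a small helper. A minimal sketch; the head assignments below are illustrative (chosen so the toy example reproduces the 3/5 directed and 4/5 undirected scores), not the actual gold-standard parse of the sentence:

```python
def dep_accuracy(gold, guess):
    """Directed and undirected dependency accuracy.

    gold, guess: dicts mapping each token index (1-based) to the
    index of its head (0 = root).
    """
    assert gold.keys() == guess.keys()
    n = len(gold)
    # Directed: the arc must point from the right head to the right dependent.
    directed = sum(guess[d] == gold[d] for d in gold)
    # Undirected: credit an arc that links the right pair, in either direction.
    gold_edges = {frozenset((d, h)) for d, h in gold.items()}
    undirected = sum(frozenset((d, h)) in gold_edges for d, h in guess.items())
    return directed / n, undirected / n

# Illustrative head maps for the five scored tokens of
# "Factory payrolls fell in September ." (indices 1..5, 0 = root).
gold  = {1: 2, 2: 3, 3: 0, 4: 3, 5: 4}
guess = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}
print(dep_accuracy(gold, guess))  # (0.6, 0.8)
```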

State-of-the-Art

State-of-the-Art: A Brief History

1992 — word classes (Carroll and Charniak)
1998 — greedy linkage via mutual information (Yuret)
2001 — iterative re-estimation with EM (Paskin)
2004 — right-branching baseline; valence (DMV) (Klein and Manning)
2004 — annealing techniques (Smith and Eisner)
2005 — contrastive estimation (Smith and Eisner)
2006 — structural biasing (Smith and Eisner)
2007 — common cover link representation (Seginer)
2008 — logistic normal priors (Cohen et al.)
2009 — lexicalization and smoothing (Headden et al.)
2009 — soft parameter tying (Cohen and Smith)

State-of-the-Art

State-of-the-Art: Dependency Model with Valence

a head-outward model, with word classes and valence/adjacency (Klein and Manning, 2004)

[diagram: a head h generates its arguments a1, a2, ... outward in each direction, then STOP]

P(t_h) = \prod_{dir \in \{L,R\}} P_{STOP}(c_h, dir, \overbrace{1_{n=0}}^{adj}) \prod_{i=1}^{n} P(t_{a_i}) \, P_{ATTACH}(c_h, dir, c_{a_i}) \, \bigl(1 - P_{STOP}(c_h, dir, \overbrace{1_{i=1}}^{adj})\bigr), \qquad n = |args(h, dir)|
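This product can be evaluated for a single head with a few lines of code. A sketch under stated assumptions: the dictionary keying of the stop and attach tables, and the function name, are my own illustration, not Klein and Manning's data structures:

```python
def head_outward_prob(p_stop, p_attach, c_h, args, subtree_probs):
    """P(t_h) under a DMV-style head-outward model (sketch).

    p_stop[(c, dir, adj)] : prob. of generating STOP; adj=True iff no
                            argument has been generated yet in `dir`.
    p_attach[(c, dir, a)] : prob. of head class c attaching class a.
    args[dir]             : argument classes of the head, per direction.
    subtree_probs[dir]    : P(t_a) for each argument's own subtree.
    """
    total = 1.0
    for d in ('L', 'R'):
        n = len(args[d])
        for i, a in enumerate(args[d]):
            total *= 1 - p_stop[(c_h, d, i == 0)]  # decide to continue (adjacent iff first arg)
            total *= p_attach[(c_h, d, a)]         # choose the argument's class
            total *= subtree_probs[d][i]           # recurse: the argument's subtree P(t_a)
        total *= p_stop[(c_h, d, n == 0)]          # finally generate STOP in this direction
    return total
```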

State-of-the-Art

State-of-the-Art: Unsupervised Learning Engine

EM, via inside-outside re-estimation (Baker, 1979)

[diagram: an inside-outside chart over words w_1 ... w_m, with outside (α) and inside (β) probabilities for a nonterminal N^j spanning w_p ... w_q (Manning and Schütze, 1999); treated here as a black box]

State-of-the-Art

State-of-the-Art: The Standard Corpus

Training: WSJ10 (Klein, 2005)

◮ The Wall Street Journal section of the Penn Treebank Project (Marcus et al., 1993)
◮ ... stripped of punctuation, etc.
◮ ... filtered down to sentences left with no more than 10 POS tags;
◮ ... and converted to reference dependencies using “head percolation rules” (Collins, 1999).

Evaluation: Section 23 of WSJ∞ (all sentences).
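The WSJ10 preprocessing above (strip punctuation, keep sentences left with at most 10 POS tags) can be sketched as a filter over tagged sentences. The punctuation-tag inventory and the function name are assumptions of this sketch, not the exact set used to build WSJ10:

```python
# PTB-style punctuation tags; this inventory is an assumption of the sketch.
PUNCT_TAGS = {'.', ',', ':', '``', "''", '-LRB-', '-RRB-'}

def make_wsjk(tagged_sentences, k=10):
    """Strip punctuation tokens, then keep sentences with at most k POS tags."""
    out = []
    for sent in tagged_sentences:  # sent: list of (word, tag) pairs
        content = [(w, t) for w, t in sent if t not in PUNCT_TAGS]
        if 0 < len(content) <= k:
            out.append(content)
    return out
```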

State-of-the-Art

State-of-the-Art: The Standard Corpus

[plot: sentence counts (in 1,000s) and token counts (in 1,000s) of WSJk as the sentence-length cutoff k grows]

(At Least) Two Issues

Issue I: Why so little data?

extra unlabeled data helps semi-supervised parsing (Suzuki et al., 2009)

yet state-of-the-art unsupervised methods use even less than what’s available for supervised training...

we will explore (three) judicious uses of data and simple, scalable machine learning techniques

(At Least) Two Issues

Issue II: Non-convex objective...

maximizing the probability of data (sentences):

\theta_{UNS} = \arg\max_\theta \sum_s \log \underbrace{\sum_{t \in T(s)} P_\theta(t)}_{P_\theta(s)}

supervised objective would be convex (counting):

\theta_{SUP} = \arg\max_\theta \sum_s \log P_\theta(t^*(s)).

in general, \theta_{SUP} \neq \theta_{UNS} and \hat{\theta}_{UNS} \neq \theta_{UNS}... (see CoNLL)

initialization matters!
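A toy numeric contrast of the two objectives may help: the unsupervised objective marginalizes over every candidate tree of a sentence, while the supervised one counts only the gold tree t*(s). The sentences, tree names, and probabilities below are hypothetical:

```python
from math import log

def unsup_loglik(corpus, tree_probs):
    """Unsupervised objective: sum over s of log P(s), with P(s) = sum over t in T(s) of P(t)."""
    return sum(log(sum(tree_probs[s].values())) for s in corpus)

def sup_loglik(corpus, tree_probs, gold):
    """Supervised objective: sum over s of log P(t*(s)), scoring only the gold tree."""
    return sum(log(tree_probs[s][gold[s]]) for s in corpus)

# Hypothetical tree probabilities: two sentences, two candidate parses each.
tree_probs = {'s1': {'tA': 0.03, 'tB': 0.01},
              's2': {'tC': 0.02, 'tD': 0.02}}
gold = {'s1': 'tA', 's2': 'tD'}
corpus = ['s1', 's2']
# Marginalizing over trees can only raise the likelihood of the data:
print(unsup_loglik(corpus, tree_probs) >= sup_loglik(corpus, tree_probs, gold))  # True
```

The two maximizers need not coincide, which is why the initializer handed to EM matters so much in the unsupervised setting.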

(At Least) Two Issues

Issues: The Lay of the Land

[plot: directed dependency accuracy (%) on WSJk for cutoffs k up to 40, comparing an Uninformed initializer, an Oracle initializer, and K&M (Ad-Hoc Harmonic Init)]

Baby Steps

Idea I: Baby Steps ... as Non-convex Optimization

global non-convex optimization is hard ...

meta-heuristic: take guesswork out of local search

start with an easy (convex) case

slowly extend it to the fully complex target task

take tiny (cautious) steps in the problem space

... try not to stray far from relevantneighborhoods in the solution space

base case: sentences of length one (trivial — no init)

incremental step: smooth WSJk; re-init WSJ(k + 1)

... this really is grammar induction!
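The incremental recipe above (solve the trivial base case, then smooth and re-initialize on slightly harder data) can be written as a generic curriculum loop. A minimal sketch, where `run_em` and `smooth` are hypothetical stand-ins for a real DMV/EM trainer, not the authors' code:

```python
# A minimal sketch of the Baby Steps schedule, assuming a corpus
# bucketed by maximum sentence length: WSJ1 needs no initializer,
# then the model fit on WSJk is smoothed and used to re-initialize
# EM on WSJ(k+1), up to the full target task.

def baby_steps(corpus_by_length, max_len, run_em, smooth):
    """corpus_by_length[k] holds all sentences of length <= k."""
    model = None  # base case: length-one sentences need no init
    for k in range(1, max_len + 1):
        init = smooth(model) if model is not None else None
        model = run_em(corpus_by_length[k], init=init)
    return model
```

Each step only moves from WSJk to WSJ(k+1), so the local search never starts far from the previous solution.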

Idea I: Baby Steps ... as Graduated Learning

WSJ1 — Atone (verbs!)

WSJ2 — Darkness fell. It is. Judge Not (nouns!)

WSJ3 — But many have. They didn’t. Become a Lobbyist (determiners!)

Idea I: Baby Steps ... and Related Notions

shaping (Skinner, 1938)

less is more (Kail, 1984; Newport, 1988; 1990)

starting small (Elman, 1993)

◮ scaffold on model complexity [restrict memory]
◮ scaffold on data complexity [restrict input]

controversy! (Rohde and Plaut, 1999)

stepping stones (Brown et al., 1993)

coarse-to-fine (Charniak and Johnson, 2005)

curriculum learning (Bengio et al., 2009)

continuation methods (Allgower and Georg, 1990)

successive approximations!

Idea I: Baby Steps ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJk, for k = 5 to 40, comparing Uninformed, Oracle, K&M (Ad-Hoc Harmonic Init), and Baby Steps.]

Idea I: Baby Steps ... Concerns?

ignores a good initializer

unnecessarily meticulous

excruciatingly slow!

about a year behind state-of-the-art (on long sentences)

Idea II: Less is More

short sentences are not representative (and few)

long sentences are overwhelmingly difficult ...

is there a “sweet spot” data gradation?

perhaps train where Baby Steps flatlines!

Idea II: Less is More ... the Learning Curve

[Plot: cross-entropy h (in bits per token) on WSJ45, versus training gradation WSJk for k = 5 to 45; past a knee, the curve reaches a tight, flat asymptotic bound over [7, 15].]

— automatically detect the knee: [7, 15]

— train at the “sweet spot” gradation: WSJ15
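One simple way to “automatically detect the knee” of such a curve is to scan for the first gradation where the per-step drop in cross-entropy becomes a negligible fraction of the total drop. This is only an illustrative stand-in; the detection rule actually used may differ:

```python
# Illustrative knee detector for a decreasing learning curve:
# h[i] is cross-entropy (bits per token) at the i-th gradation.
# Declare the knee at the first step whose drop falls below a
# small fraction of the curve's total drop.

def find_knee(h, frac=0.05):
    total_drop = h[0] - min(h)
    for i in range(1, len(h)):
        if h[i - 1] - h[i] < frac * total_drop:
            return i  # curve has effectively flattened here
    return len(h) - 1  # never flattened: return the last index
```

On a curve like the one above, the returned index marks where additional, longer sentences stop improving the fit appreciably.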

Idea II: Less is More ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJk and on WSJ40, for k = 5 to 40, comparing Oracle, Baby Steps, K&M, K&M∗, and Less is More.]

Idea II: Less is More ... Concerns?

discards most of the data

beats state-of-the-art (on long sentences, off WSJ15)

ignores a decent complementary initialization strategy

Idea III: Leapfrog ... a Hack

use both good systems!

thorough training up to WSJ15, where it’s cheap

use both good initializers (mix their best parse trees)

execute just a few steps of EM where it’s expensive

hop on from WSJ15 to WSJ45, via WSJ30...
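The four steps above amount to a simple combination scheme. A hedged sketch, where `viterbi_parse`, `estimate_from_trees`, and `em_steps` are hypothetical placeholders rather than the authors' implementation:

```python
# Hypothetical sketch of the Leapfrog hack: mix the best (Viterbi)
# parse trees of two trained systems on cheap data (WSJ15),
# re-estimate a single model from the mixed trees, then run just a
# few EM steps on the expensive gradations (WSJ30, then WSJ45).

def leapfrog(model_a, model_b, wsj15, wsj30, wsj45,
             viterbi_parse, estimate_from_trees, em_steps, n_steps=3):
    # mix both initializers' best parse trees where training is cheap
    trees = [viterbi_parse(model_a, s) for s in wsj15]
    trees += [viterbi_parse(model_b, s) for s in wsj15]
    model = estimate_from_trees(trees)
    # hop on to longer sentences with only a few EM steps each
    model = em_steps(model, wsj30, n_steps)
    model = em_steps(model, wsj45, n_steps)
    return model
```

Most of the computation stays on WSJ15; the expensive long sentences see only a handful of EM iterations.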

Idea III: Leapfrog ... Results!

[Plot: Directed Dependency Accuracy (%) on WSJk, for k = 5 to 40, comparing Oracle, Uninformed, Baby Steps, K&M∗, and Leapfrog.]

Results: ... on Section 23 of WSJ

Right-Branching (Klein and Manning, 2004)      31.7%
DMV @10                                        34.2%
Baby Steps @15                                 39.2%
Baby Steps @45                                 39.4%
Soft Parameter Tying (Cohen and Smith, 2009)   42.2%
Less is More @15                               44.1%
Leapfrog @45                                   45.0%

Summary

explored scaffolding on data complexity

awareness of data complexity does help!

beats state-of-the-art with older techniques

Conclusion

(need a less adversarial learning algorithm)

paradox: improved performance with less data

despite discarding samples from the true (test) distribution

focusing on simple examples guides unsupervised learning

mirrors supervised boosting (Freund and Schapire, 1997)

Teaser

we push the state-of-the-art further, to 50.4% (up another 5%), using even faster and simpler methods!

... hear us at CoNLL and ACL (Spitkovsky et al., 2010)

similar approaches may apply in other settings (e.g., word alignment)

... more to come!

Thanks!

Questions?
