Synchronous Grammars and Tree Automata
David Chiang and Kevin Knight, USC/Information Sciences Institute
Viterbi School of Engineering, University of Southern California
Why Worry About Formal Language Theory?
• They already figured out most of the key things back in the 1960s & 1970s
• Lucky us!
– Helps clarify our thinking about applications
– Helps modeling & algorithm development
– Helps promote code re-use
– Opportunity to develop novel, efficient algorithms and data structures that bring ancient theorems to life
• Formal grammar and automata theory are the daughters of natural language processing – let’s keep in touch
[Chomsky 57]
• Distinguish grammatical English from ungrammatical English:
– John thinks Sara hit the boy
– * The hit thinks Sara John boy
– John thinks the boy was hit by Sara
– Who does John think Sara hit?
– John thinks Sara hit the boy and the girl
– * Who does John think Sara hit the boy and?
– John thinks Sara hit the boy with the bat
– What does John think Sara hit the boy with?
– Colorless green ideas sleep furiously.
– * Green sleep furiously ideas colorless.
This Research Program has Contributed Powerful Ideas
[Figure: the formal language hierarchy, compiler technology, and context-free grammar.]
This Research Program has NLP Applications
Alternative speech recognition or translation outputs:
[Figure: candidate outputs scored on two axes – green = how grammatical, blue = how sensible. Pick the best one!]
This Research Program Has Had Wide Reach
• What makes a legal RNA sequence?
• What is the structure of a given RNA sequence?
Yasubumi Sakakibara, Michael Brown, Richard Hughey, I. Saira Mian, Kimmen Sjölander, Rebecca C. Underwood and David Haussler. Stochastic Context-Free Grammars for tRNA Modeling. Nucleic Acids Research, 22(23):5112-5120, 1994.
This Research Program is Really Unfinished!
Type in your English sentence here:
Is this grammatical?
Is this sensible?
Acceptors and Transformers
• Chomsky’s program is about which utterances are acceptable
• Other research programs are aimed at transforming utterances
– Translate an English sentence into Japanese…
– Transform a speech waveform into transcribed words…
– Compress a sentence, summarize a text…
– Transform a syntactic analysis into a semantic analysis…
– Generate a text from a semantic representation…
Strings and Trees
• Early on, trees were realized to be a useful tool in describing what is grammatical
– A sentence is a noun phrase (NP) followed by a verb phrase (VP)
– A noun phrase is a determiner (DT) followed by a noun (NN)
– A noun phrase is a noun phrase (NP) followed by a prepositional phrase (PP)
– A PP is a preposition (IN) followed by an NP
• A string is acceptable if it has an acceptable tree …
• Transformations may take place at the tree level …
[Figure: a parse tree sketch – S over NP and VP, NP over NP and PP, PP over IN and NP.]
Natural Language Processing
• 1980s: Many tree-based grammatical formalisms
• 1990s: Regression to string-based formalisms
– Hidden Markov Models (HMMs), Finite-State Acceptors (FSAs) and Transducers (FSTs)
– N-gram models for accepting sentences [e.g., Jelinek 90]
– Taggers and other statistical transformations [e.g., Church 88]
– Machine translation [e.g., Brown et al 93]
– Software toolkits implementing generic weighted FST operations [e.g., Mohri, Pereira, Riley 00]
[Figure: a cascade of a WFSA and three WFSTs over strings w, e, j, k – “backwards application of string k through a composition of a transducer cascade, intersected with a weighted FSA language model.”]
Natural Language Processing
• 2000s: Emerging interest in tree-based probabilistic models
– Machine translation [Wu 97, Yamada & Knight 02, Melamed 03, Chiang 05, …]
– Summarization [Knight & Marcu 00, …]
– Paraphrasing [Pang et al 03, …]
– Question answering [Echihabi & Marcu 03, …]
– Natural language generation [Bangalore & Rambow 00, …]

What are the conceptual tools to help us get a grip on encoding and exploiting knowledge about linguistic tree transformations?
Goal of this tutorial
Part 1: Introduction
Part 2: Synchronous Grammars < break >
Part 3: Tree Automata
Part 4: Conclusion
Tree Automata
• David talked about synchronous grammars
• I’ll talk about tree automata
[Figure: a synchronous grammar generates two synchronized output trees; a tree transducer maps an input tree to an output tree.]
Steps to Get There

             | Strings                          | Trees
Grammars     | Regular grammar                  | ?
Acceptors    | Finite-state acceptor (FSA), PDA | ?
             | String Pairs (input→output)      | Tree Pairs (input→output)
Transducers  | Finite-state transducer (FST)    | ?
Context-Free Grammar
• Example:
– S → NP VP [p=1.0]
– NP → DET N [p=0.7]
– NP → NP PP [p=0.3]
– PP → P NP [p=1.0]
– VP → V NP [p=0.4]
– DET → the [p=1.0]
– N → boy [p=1.0]
– V → saw [p=1.0]
– P → with [p=1.0]
• Defines a set of strings
• Language described by a CFG can also be described by a push-down acceptor

Generative Process (growing the tree top-down, one rule at a time):
S
⇒ S(NP, VP)
⇒ S(NP(NP, PP), VP)
⇒ S(NP(NP(DET the, N boy), PP), VP)
⇒ S(NP(NP(DET the, N boy), PP(P, NP)), VP)
⇒ …
⇒ S(NP(NP(DET the, N boy), PP(P with, NP(DET the, N boy))), VP(V saw, NP(DET the, N boy)))
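The generative process is straightforward to simulate. Here is a minimal sketch in Python (the grammar is from the slide; the code and names are ours, and the lone VP rule is renormalized to 1.0 since its sibling rules are not shown):

    import random

    # Nonterminal -> list of (right-hand side, probability) pairs, from the slide.
    PCFG = {
        "S":   [(["NP", "VP"], 1.0)],
        "NP":  [(["DET", "N"], 0.7), (["NP", "PP"], 0.3)],
        "PP":  [(["P", "NP"], 1.0)],
        "VP":  [(["V", "NP"], 1.0)],   # slide says p=0.4; other VP rules are elided
        "DET": [(["the"], 1.0)],
        "N":   [(["boy"], 1.0)],
        "V":   [(["saw"], 1.0)],
        "P":   [(["with"], 1.0)],
    }

    def sample(symbol):
        # Symbols with no rules are terminals.
        if symbol not in PCFG:
            return [symbol]
        rhss = [rhs for rhs, _ in PCFG[symbol]]
        weights = [p for _, p in PCFG[symbol]]
        rhs = random.choices(rhss, weights=weights)[0]
        return [word for child in rhs for word in sample(child)]

    print(" ".join(sample("S")))   # e.g. "the boy saw the boy"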
Context-Free Grammar
• Is this what lies behind modern parsers like Collins and Charniak?
• No… they do not have a finite list of productions, but an essentially infinite list – a “Markovized” grammar
• Generative process is head-out: the head child is generated first, then its sisters are generated outward, left and right, until STOP:

S → VP                        Ph(VP | S)
S → NP VP                     Pleft(NP | VP, S)
S → PP NP VP                  Pleft(PP | VP, S)
S → STOP PP NP VP             Pleft(STOP | VP, S)
S → STOP PP NP VP ADVP        (right sister)
S → STOP PP NP VP ADVP STOP   (stop on the right)

then the process recurses into each child, e.g. VP → VBD, VBD NP, VBD NP NP, …
ECFG [Thatcher 67]
• Example:
– S → ADVP* NP VP PP*
– VP → VBD NP [NP] PP*
• Defines a set of strings
• Can model Charniak and Collins:
– VP → (ADVP | PP)* VBD (NP | PP)*
• Can even model the probabilistic version:
[Figure: a weighted FSA for the right-hand side of VP, with pre-head loops ADVP : Pleft(ADVP|VP) and PP : Pleft(PP|VP), ε-arcs weighted Pleft(STOP|VP) and Phead(VBD|VP), an arc VBD : 1.0, post-head arcs NP and PP, and an ε-arc weighted Pright(STOP|VP) into the final state.]
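Because each ECFG right-hand side is a regular expression over child categories, checking whether a node’s children are licensed reduces to ordinary regular-expression matching. A minimal sketch (our code; the unweighted VP pattern is from the slide):

    import re

    # Parent category -> regular expression over the space-joined child sequence.
    RHS = {"VP": r"((ADVP|PP) )*VBD( (NP|PP))*"}

    def rhs_ok(parent, children):
        # The children are licensed iff their sequence matches the RHS pattern.
        return re.fullmatch(RHS[parent], " ".join(children)) is not None

    print(rhs_ok("VP", ["PP", "ADVP", "VBD", "NP", "PP"]))  # True
    print(rhs_ok("VP", ["VBD", "ADVP"]))                    # False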
ECFG [Thatcher 67]
• Is ECFG more powerful than CFG?
• It is possible to build a CFG that generates the same set of strings as your ECFG
– As long as the right-hand side of every rule is regular
– Or even context-free! (see footnote, [Thatcher 67])
• BUT: the CFG won’t generate the same derivation trees as the ECFG
ECFG [Thatcher 67]
• For example:
– ECFG can (of course) accept the string language a*
– Therefore, we can build a CFG that also accepts a*
– But ECFG can do it via this set of derivation trees, if so desired – a single S node dominating n copies of a, for every n:
S(a), S(a, a), S(a, a, a), S(a, a, a, a), …
Tree Generating Systems
• Sometimes the trees are important
– Parsing, RNA structure prediction
– Tree transformation
• We can view CFG or ECFG as a tree-generating system, if we squint hard …
• Or, we can look at dedicated tree-generating systems
– Regular tree grammar (RTG) is standard in the automata literature
• Slightly more powerful than tree substitution grammar (TSG)
• Less powerful than tree-adjoining grammar (TAG)
– Top-down tree acceptors (TDTA) are also standard
What We Want
• Device or grammar D for compactly representing a possibly-infinite set S of trees (possibly with weights)
• Want to support operations like:
– Membership testing: Is tree X in the set S?
– Equality testing: Do the sets described by D1 and D2 contain exactly the same trees?
– Intersection: Compute the (possibly-infinite) set of trees described by both D1 and D2
– Weighted tree-set intersection, e.g.:
• D1 describes a set of candidate English translations of some Chinese sentence, including disfluent ones
• D2 describes a general model of fluent English
Regular Tree Grammar (RTG)
• Example:
– q → S(qnp, VP(V(run))) [p=1.0]
– qnp → NP(qdet, qn) [p=0.7]
– qnp → NP(qnp, qpp) [p=0.3]
– qpp → PP(qprep, qnp) [p=1.0]
– qdet → DET(the) [p=1.0]
– qprep → PREP(of) [p=1.0]
– qn → N(sons) [p=0.5]
– qn → N(daughters) [p=0.5]
• Defines a set of trees

Generative Process (start at state q; repeatedly rewrite states in place):
q
⇒ S(qnp, VP(V(run)))
⇒ S(NP(qnp, qpp), VP(V(run)))
⇒ S(NP(NP(qdet, qn), qpp), VP(V(run)))
⇒ S(NP(NP(DET(the), qn), qpp), VP(V(run)))
⇒ S(NP(NP(DET(the), N(sons)), qpp), VP(V(run)))
⇒ …
⇒ S(NP(NP(DET(the), N(sons)), PP(PREP(of), NP(DET(the), N(daughters)))), VP(V(run)))

P(t) = 1.0 × 0.3 × 0.7 × 0.5 × 0.7 × 0.5
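The RTG’s generative process can be simulated in a few lines. A minimal sketch (grammar from the slide; code and encoding ours), representing a tree as a (label, children) pair and a state as a bare string:

    import random

    RTG = {
        "q":     [(("S", ["qnp", ("VP", [("V", [("run", [])])])]), 1.0)],
        "qnp":   [(("NP", ["qdet", "qn"]), 0.7), (("NP", ["qnp", "qpp"]), 0.3)],
        "qpp":   [(("PP", ["qprep", "qnp"]), 1.0)],
        "qdet":  [(("DET", [("the", [])]), 1.0)],
        "qprep": [(("PREP", [("of", [])]), 1.0)],
        "qn":    [(("N", [("sons", [])]), 0.5), (("N", [("daughters", [])]), 0.5)],
    }

    def generate(node):
        # A bare string is a state: replace it using a randomly chosen rule.
        if isinstance(node, str):
            fragments = [f for f, _ in RTG[node]]
            weights = [p for _, p in RTG[node]]
            node = random.choices(fragments, weights=weights)[0]
        label, children = node
        return (label, [generate(child) for child in children])

    print(generate("q"))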
Relation Between RTG and CFG
• For every CFG, there is an RTG that directly generates its derivation trees
– TRUE (just convert the notation)
• For every RTG, there is a CFG that generates the same trees in its derivations
– FALSE

RTG:
q → NP(NN(clown), NN(killer))
q → NP(NN(killer), NN(clown))

This RTG accepts exactly the trees NP(NN(clown), NN(killer)) and NP(NN(killer), NN(clown)), but rejects “clown clown”. No CFG is possible: any CFG deriving both trees needs the rules NP → NN NN, NN → clown, and NN → killer, and so also derives NP(NN(clown), NN(clown)).
Relation Between RTG and TSG
• For every TSG, there is an RTG that directly generates its trees
– TRUE (just convert the notation)
• For every RTG, there is a TSG that generates the same trees
– FALSE
• Using states, an RTG can accept all trees (over symbols a and b) that contain exactly one a
Relation of RTG to Johnson [98] Syntax Model
Johnson relabels each nonterminal with its parent category, so the same category in different positions gets different probabilities, e.g. P(PRO | NP:S) = 0.21 but P(PRO | NP:VP) = 0.03 (pronouns are more likely in subject position).
[Figure: the tree TOP(S(NP(PRO), VP(VB, NP(PRO)))), first plain, then with every node annotated with its parent – S:TOP, NP:S, VP:S, PRO:NP, VB:VP, NP:VP.]
An RTG captures the same model with states:
RTG:
qstart → S(q.np:s, q.vp:s)
q.np:s → NP(q.pro:np)    p=0.21
q.np:vp → NP(q.pro:np)   p=0.03
q.pro:np → PRO(q.pro)
q.pro → he | she | him | her
Relation of RTG to Lexicalized Syntax Models
• Collins’ model assigns P(t) to every tree
• Can a weighted RTG (wRTG) implement P(t) for language modeling?
– Just like a WFSA can implement a smoothed n-gram language model…
• Something like yes.
– States can encode relevant context that is passed up and down the tree.
– Technical problems:
• “Markovized” grammar (Extended RTG?)
• Some models require keeping head-word information, and if we back off to a state that forgets the head-word, we can’t get it back.
Something an RTG Can’t Do
The set of perfectly balanced binary trees over a and b – b, a(b, b), a(a(b, b), a(b, b)), … – in which every a dominates two identical subtrees, is not a regular tree language.
Note also that the yield language is {b^(2^n) : n ≥ 0}, which is not context-free.
Language Classes in String World and Tree World
[Figure: in the String World, regular languages ⊂ CFLs ⊂ indexed languages ⊂ …; in the Tree World, regular tree languages (RTL) ⊂ context-free tree languages (CFTL) ⊂ …. The yield operation maps RTLs to CFLs and CFTLs to indexed languages.]
Picture So Far

             | Strings                          | Trees
Grammars     | Regular grammar, CFG             | DONE (RTG)
Acceptors    | Finite-state acceptor (FSA), PDA | ?
             | String Pairs (input→output)      | Tree Pairs (input→output)
Transducers  | Finite-state transducer (FST)    | ?
FSA Analog: Top-Down Tree Acceptor [Rabin 69; Doner 70]
For any RTG, there is a TDTA that accepts the same trees.
For any TDTA, there is an RTG that accepts the same trees.

RTG:
q → NP(NN(clown), NN(killer))
q → NP(NN(killer), NN(clown))

TDTA:
q NP → q2 q3
q NP → q3 q2
q2 NN → q2
q3 NN → q3
q2 clown → accept
q3 killer → accept

[Figure: the TDTA walking down NP(NN(clown), NN(killer)): state q at the root sends q2 down the clown branch and q3 down the killer branch, accepting at both leaves.]
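Acceptance by a top-down tree acceptor is a short nondeterministic recursion. A minimal sketch (rules from the slide; code and encoding ours):

    RULES = {  # (state, node label) -> possible child-state assignments
        ("q", "NP"):  [("q2", "q3"), ("q3", "q2")],
        ("q2", "NN"): [("q2",)],
        ("q3", "NN"): [("q3",)],
    }
    ACCEPT = {("q2", "clown"), ("q3", "killer")}  # accepting (state, leaf) pairs

    def accepts(state, tree):
        label, children = tree
        if not children:
            return (state, label) in ACCEPT
        return any(all(accepts(s, c) for s, c in zip(assign, children))
                   for assign in RULES.get((state, label), [])
                   if len(assign) == len(children))

    t = ("NP", [("NN", [("clown", [])]), ("NN", [("killer", [])])])
    print(accepts("q", t))                                          # True
    bad = ("NP", [("NN", [("clown", [])]), ("NN", [("clown", [])])])
    print(accepts("q", bad))                                        # False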
FSA Analog: Bottom-Up Tree Acceptor [Thatcher & Wright 68; Doner 70]
• A similar story for bottom-up acceptors…
• To summarize tree acceptor automata and RTGs:
– Tree acceptors are the analogs of string FSAs
– They are often used in proofs
– Their visual appeal is not as great as that of FSAs
– People often prefer to read and write RTGs
Properties of Language Classes

                           String Sets                        Tree Sets
                           RSL (FSA, regexp)  CFL (PDA, CFG)  RTL (TDTA, RTG)
Closed under union         YES                YES             YES
Closed under intersection  YES                NO              YES
Closed under complement    YES                NO              YES
Membership testing         O(n)               O(n^3)          O(n)
Emptiness decidable?       YES                YES             YES
Equality decidable?        YES                NO              YES

(references: Hopcroft & Ullman 79, Gécseg & Steinby 84)
Picture So Far

             | Strings                          | Trees
Grammars     | Regular grammar, CFG             | DONE (RTG)
Acceptors    | Finite-state acceptor (FSA), PDA | DONE (TDTA)
             | String Pairs (input→output)      | Tree Pairs (input→output)
Transducers  | Finite-state transducer (FST)    | ?
Transducers

String Transducer:
– An FST compactly represents a possibly-infinite set of string pairs
– A probabilistic FST assigns P(s2 | s1) to every string pair
– Can ask: What’s the best transformation of input string s1?

Tree Transducer:
– Tree transducers compactly represent a possibly-infinite set of tree pairs
– A probabilistic tree transducer assigns P(t2 | t1) to every tree pair
– Can ask: What’s the best transformation of tree t1?
Example 1: Machine Translation [Yamada & Knight 01]
[Figure: the noisy-channel pipeline from an English Parse Tree (E) to a Japanese Sentence (J), applied to the parse of “he adores listening to music”. Reorder: each node’s children are permuted, e.g. VB1 and VB2 swap. Insert: Japanese function words (ha, ga, no, desu) are inserted. Translate: each English leaf is translated (he → kare, music → ongaku, to → wo, listening → kiku, adores → daisuki). Take Leaves: reading off the leaves yields “Kare ha ongaku wo kiku no ga daisuki desu”.]
Example 2: Sentence Compression [Knight & Marcu 00]
[Figure: the parse tree of “he adores listening to good music at home” is compressed to the parse tree of “he adores listening to music” by deleting the subtrees JJ(good) and PP(at home).]
What We Want
• Device T for compactly representing a possibly-infinite set of tree pairs (possibly with weights)
• Want to support operations like:
– Forward application: Send input tree X through T, get all possible output trees, perhaps represented as an RTG.
– Backward application: What input trees, when sent through T, would output tree X?
– Composition: Given T1 and T2, can we build a transducer T3 that does the work of both in sequence?
– Equality testing: Do T1 and T2 contain exactly the same tree pairs?
– Training: How to set weights for T given a training corpus of input/output tree pairs?
Finite-State (String) Transducer

FST transitions (pronouncing “knight”):
q k → q2 ε
q2 n → q N
q i → q AY
q g → q3 ε
q3 h → q4 ε
q4 t → qfinal T

[Figure: the same FST drawn as a graph with states q, q2, q3, q4, qfinal and arcs k:ε, n:N, i:AY, g:ε, h:ε, t:T.]

Original input: k n i g h t

Transformation (the state marker moves left to right, emitting output):
q k n i g h t
⇒ q2 n i g h t
⇒ N q i g h t
⇒ N AY q g h t
⇒ N AY q3 h t
⇒ N AY q4 t
⇒ N AY T qfinal

Final output: N AY T
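Applying this (deterministic) FST is a single left-to-right pass. A minimal sketch (transitions from the slide; code ours):

    TRANS = {  # (state, input symbol) -> (next state, output symbol or "")
        ("q", "k"): ("q2", ""),  ("q2", "n"): ("q", "N"),
        ("q", "i"): ("q", "AY"), ("q", "g"): ("q3", ""),
        ("q3", "h"): ("q4", ""), ("q4", "t"): ("qfinal", "T"),
    }

    def apply_fst(symbols, state="q"):
        output = []
        for symbol in symbols:
            state, out = TRANS[(state, symbol)]
            if out:
                output.append(out)
        return output, state

    print(apply_fst(list("knight")))   # (['N', 'AY', 'T'], 'qfinal')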
Tree Transformation

Original input: the English parse tree of “he enjoys listening to music” – S(NP(PRO(he)), VP(VBZ(enjoys), …)).
Target output: the Japanese parse tree whose yield is “kare wa ongaku o kiku no ga daisuki desu”.

[Figure: the transformation proceeds top-down. First the entire input is marked with a single state: q S(…). Rules then split the state across subtrees – S(q.np NP(PRO(he)), S(q.vbz VBZ(enjoys), q.np NP(…))) – and build Japanese structure around the remaining state-marked English subtrees, e.g. the subject becomes NP(PRO(kare)) PN(wa). The process continues until the final output tree contains no state-marked subtrees.]
Top-Down Tree Transducer
• Introduced by Rounds (1970) & Thatcher (1970):
“Recent developments in the theory of automata have pointed to an extension of the domain of definition of automata from strings to trees … parts of mathematical linguistics can be formalized easily in a tree-automaton setting … Our results should clarify the nature of syntax-directed translations and transformational grammars …” (Mappings on Grammars and Trees, Math. Systems Theory 4(3), Rounds 1970)
• “FST for trees”
• Large literature
– e.g., Gécseg & Steinby (1984), Comon et al (1997)
• Many classes of transducers
– Top-down (R), bottom-up (F), copying, deleting, look-ahead…
Top-Down Tree Transducer
R (“root to frontier”) transducer.

Original input: S(PRO(he), VP(VB(likes), NP(spinach)))
Target output: X(VP(VB(likes), NP(spinach)), PRO(he), AUX(does))

Rule, in computer-speak: q S(x0, x1) → X(q x1, q x0, AUX(does))

Transformation:
q S(PRO(he), VP(VB(likes), NP(spinach)))
⇒ X(q VP(VB(likes), NP(spinach)), q PRO(he), AUX(does))
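One derivation step of a top-down tree transducer can be coded directly from the rule notation. A minimal sketch (the S rule is from the slide; the copy-through rules for the other labels and the encoding are ours):

    def transduce(state, tree, rules):
        label, children = tree
        if (state, label) not in rules:      # leaf with no rule: copy it
            return tree
        def build(template):
            if template[0] == "STATE":       # "q xi": recurse into child i
                _, new_state, i = template
                return transduce(new_state, children[i], rules)
            lab, kids = template             # fixed output structure
            return (lab, [build(k) for k in kids])
        return build(rules[(state, label)])

    # q S(x0, x1) -> X(q x1, q x0, AUX(does)); other labels copy themselves.
    rules = {
        ("q", "S"):   ("X", [("STATE", "q", 1), ("STATE", "q", 0),
                             ("AUX", [("does", [])])]),
        ("q", "PRO"): ("PRO", [("STATE", "q", 0)]),
        ("q", "VP"):  ("VP", [("STATE", "q", 0), ("STATE", "q", 1)]),
        ("q", "VB"):  ("VB", [("STATE", "q", 0)]),
        ("q", "NP"):  ("NP", [("STATE", "q", 0)]),
    }
    t = ("S", [("PRO", [("he", [])]),
               ("VP", [("VB", [("likes", [])]), ("NP", [("spinach", [])])])])
    print(transduce("q", t, rules))
    # ('X', [('VP', ...), ('PRO', [('he', [])]), ('AUX', [('does', [])])])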
Relation of Tree Transducers to (String) FSTs
• Tree transducers generalize FSTs (strings are “monadic trees”)

FST transition   Equivalent tree transducer rule
q –A/B→ r        q A(x0) → B(r x0)
q –A/ε→ r        q A(x0) → r x0
q –ε/B→ r        q x → B(r x)
Tree Transducer Simulating FST
R (“root to frontier”) transducer.

Write the input string vertically as a monadic tree: k(n(i(g(h(t))))). Rules like
q k(x0) → q2 x0
q2 n(x0) → N(q x0)
replay the FST transitions:
q k(n(i(g(h(t))))) ⇒ q2 n(i(g(h(t)))) ⇒ N(q i(g(h(t)))) ⇒ … (and so on)
Top-Down Transducers can Copy and Delete

R transducer rule that deletes (sentence compression):
q VP(x0, x1) → VP(q x0)      e.g. VP(VB, PP) ⇒ VP(q VB)

R transducer rule that copies (example from Rounds 70 – calculus: d sin(y) = cos(y) · d y):
d sin(x0) → MULT(cos(i x0), d x0)
so that d sin(y) ⇒ MULT(cos(i y), d y)
Complex Re-Ordering

Original input: S(PRO, VP(VB, NP))
Target output: S(VB, PRO, NP)

R transducer:
q S(x0, x1) → S(qleft x1, q x0, qright x1)
qleft VP(x0, x1) → q x0
qright VP(x0, x1) → q x1
q PRO → PRO
q VB → VB
q NP → NP

Transformation:
q S(PRO, VP(VB, NP))
⇒ S(qleft VP(VB, NP), q PRO, qright VP(VB, NP))
⇒ S(q VB, q PRO, qright VP(VB, NP))
⇒ S(VB, q PRO, qright VP(VB, NP))
⇒ S(VB, PRO, q NP)
⇒ S(VB, PRO, NP)

Final output: S(VB, PRO, NP) – the VP subtree is first copied, then each copy is partly deleted, achieving a re-ordering no single one-level rule could.
Extended Left-Hand Side Transducer: xR

Original input: S(PRO, VP(VB, NP))
Final output: S(VB, PRO, NP)

xR transducer (the left-hand side matches two levels of the input at once):
q S(x0:PRO, VP(x1:VB, x2:NP)) → S(q x1, q x0, q x2)
q PRO → PRO
q VB → VB
q NP → NP

Mentioned already in Section 4 of Rounds 1970, but not defined or used in proofs there. See [Knight & Graehl 05].
Tree-to-String Transducers: xRS
Syntax-directed translation for compilers [Aho & Ullman 71]

xRS transducer rule (the right-hand side is a string of output words and state-marked input subtrees):
q S(x0:NP, VP(x1:VBZ, x2:NP)) → q x0, q x2, q x1

Original input: the English parse tree of “he enjoys listening to music”.

Transformation: the rule above turns the tree into a string of state-marked subtrees –
q.np NP(PRO(he)), q.np NP(… listening to music …), q.vbz VBZ(enjoys) –
and further rules replace each state-marked subtree with Japanese words, e.g. the subject becomes “kare, wa”, until only words remain.

Final output: kare, wa, ongaku, o, kiku, no, ga, daisuki, desu

Used in a practical MT system [Galley et al 04].
Are Tree Transducers Expressive Enough for Natural Language?
• Can published tree-based probabilistic models be cast in this framework?
– This was previously done for string-based models, e.g.
• [Knight & Al-Onaizan 98] – word-based MT as FST
• [Kumar & Byrne 03] – phrase-based MT as FST
• Can published tree-based probabilistic models be naturally extended in this framework?
Are Tree Transducers Expressive Enough for Natural Language?
[Figure: the [Yamada & Knight 01] Reorder–Insert–Translate–Take-Leaves pipeline again, from Parse (E) to Sentence (J).]
The whole pipeline can be cast as a single 4-state tree transducer.
See [Graehl & Knight 04].
Extensions Possible within the Transducer Framework

Multilevel Re-Ordering:            S(x0:NP, VP(x1:VB, x2:NP2)) → x1, x0, x2
Non-constituent Phrases:           S(PRO(there), VP(VB(are), x0:NP)) → hay, x0
Lexicalized Re-Ordering:           NP(x0:NP, PP(P(of), x1:NP)) → x1, , x0
Phrasal Translation:               VP(VBZ(is), VBG(singing)) → está, cantando
Non-contiguous Phrases:            VP(VB(put), x0:NP, PRT(on)) → poner, x0
Context-Sensitive Word Insertion:  NPB(DT(the), x0:NNS) → x0

Practical MT systems [Galley et al 04, 06; Marcu et al 06] use 40m such automatically collected rules.
Limitations of the Top-Down Transducer Model

[Figure: relating “Who does John think Mary believes I saw?” to “John thinks Mary believes I saw who?”. Starting from q at the root, a state qwho must carry the fronted WhNP(who) down through an unbounded number of clause levels – S(NP VP(V S(NP VP(V S(…))))) – until it reaches the gap position after “saw”. A single word like “who” can be teleported this way, because a state can remember it.]

[Figure: the same construction for “Whose blue dog does John think Mary believes I saw?”, which must move the whole phrase WhNP(DT(whose), ADJ(blue), N(dog)) an unbounded distance. Can’t do this: a finite set of states cannot remember an arbitrarily large subtree.]
Tree Transducers: A Digression

[Figure, built up over several slides: a lattice of tree transducer classes, grown from WFST by adding branching left-hand-side rules, arranged along two axes – copying vs. non-copying and deleting vs. non-deleting. Classes shown include R, RL, RN, RLN | FLN, RT, RD, RR, RTD, F, FL, FLD, F2, R2, PDTT, and the extended-LHS classes xR, xRL, xRLN. Edges are labeled with the capability added, e.g. + delete, + copy (before/after nondeterministic processing), + complex re-order, + finite-check or reg-check before delete, + non-determinism. Bold circles mark classes closed under composition.]
Top Down Tree Transducers are Not Closed Under Composition [Rounds 70]

• An R transducer can take a monadic tree X(Y(X(…(a)))) and change its X’s and Y’s to X’s and Y’s non-deterministically. No problem!
• An R transducer can duplicate its input tree under a new root label Z, producing Z(t, t). No problem!
• But no single R transducer can do both in sequence – non-deterministically relabel, then duplicate the resulting tree under Z. Impossible for R: copying happens rule by rule, so the two copies are relabeled independently and need not come out identical.

NOTE: Deterministic R-transducers are composable.
Properties of Transducer Classes

                                  String Transducers | Tree Transducers
                                  FST    R        RL       RLN=FLN  F        FL
Closed under composition          YES    NO       NO       YES      NO       YES
Closed under intersection         NO     NO       NO       NO       NO       NO
Image of tree (or finite forest)  RSL    RTL      RTL      RTL      Not RTL  RTL
Image of RTL                      RSL    Not RTL  RTL      RTL      Not RTL  RTL
Inverse image of tree             RSL    RTL      RTL      RTL      RTL      RTL
Inverse image of RTL              RSL    RTL      RTL      RTL      RTL      RTL
Efficiently trainable             YES    YES      YES      YES

(references: Hopcroft & Ullman 79, Gécseg & Steinby 84, Baum et al 70, Graehl & Knight 04)
K-best Trees & Determinization
• String sets (FSA)
– [Dijkstra 59] gives the best path through an FSA
• O(e + n log n)
– [Eppstein 98] gives the k-best paths through an FSA, including cycles
• O(e + n log n + k log k) /* to print k-best lengths */
– [Mohri & Riley 02] give the k-best unique strings (determinization followed by k-best paths)
• Tree sets (RTG)
– [Knuth 77] gives the best derivation tree from a CFG (easily adapted to RTG)
• O(t + n log n) running time
– [Huang & Chiang 05] give the k-best derivations in an RTG
– [May & Knight 05] give the k-best unique trees (determinization followed by k-best derivations)
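For acyclic grammars, the best derivation can be computed by simple memoized recursion; [Knuth 77]’s priority-queue algorithm handles the general (cyclic) case. A minimal sketch (our code and toy grammar, not from the slides):

    from functools import lru_cache
    from math import prod

    RTG = {  # state -> list of ((label, child states), weight)
        "q":   [(("S", ["qnp", "qvp"]), 1.0)],
        "qnp": [(("NP", []), 0.7), (("NPB", []), 0.3)],
        "qvp": [(("VP", ["qnp"]), 1.0)],
    }

    @lru_cache(maxsize=None)
    def best(state):
        # Return (tree, weight) of the highest-weight derivation from state.
        candidates = []
        for (label, kids), w in RTG[state]:
            subs = [best(k) for k in kids]
            tree = (label, [t for t, _ in subs])
            candidates.append((tree, w * prod(s for _, s in subs)))
        return max(candidates, key=lambda c: c[1])

    print(best("q"))   # (('S', [('NP', []), ('VP', [('NP', [])])]), 0.49)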
The Training Problem [Graehl & Knight 04]
• Given:
– an xR transducer with a set of non-deterministic rules (r1…rm)
– a training corpus of input-tree/output-tree pairs
• Produce:
– conditional probabilities p1…pm that maximize

P(output trees | input trees)
  = Π<i,o> P(o | i)          over input/output tree pairs <i, o>
  = Π<i,o> Σd P(d, o | i)    over derivations d mapping i to o
  = Π<i,o> Σd Πr pr          over rules r in derivation d
Derivations in R

Input tree: A(B(D, E), C(F, G))

R Transducer rules (rule1 [1.0] rewrites the root into the output skeleton, re-ordering B and C; rule2 [1.0] maps B while deleting its subtrees; the rest, in computer-speak):
rule3: q C(x0, x1) → T(q x0, q x1)   [0.6]
rule4: q C(x0, x1) → T(q x1, q x0)   [0.4]
rule5: q F → V   [0.9]
rule6: q F → W   [0.1]
rule7: q G → V   [0.5]
rule8: q G → W   [0.5]

[Figure: the rule set drawn as tree fragments, plus the shared output tree.]

Packed Derivations:
qstart   → rule1(q.1.12, q.2.11)      1.0
q.1.12   → rule2                      1.0
q.2.11   → rule3(q.21.111, q.22.112)  0.6
q.2.11   → rule4(q.21.112, q.22.111)  0.4
q.21.111 → rule5   0.9
q.22.112 → rule8   0.5
q.21.112 → rule6   0.1
q.22.111 → rule7   0.5

Derivations (two ways of producing the same output tree):
rule1(rule2, rule3(rule5, rule8))   total weight = 1.0 × 1.0 × 0.6 × 0.9 × 0.5 = 0.27
rule1(rule2, rule4(rule6, rule7))   total weight = 1.0 × 1.0 × 0.4 × 0.1 × 0.5 = 0.02
Naïve EM Algorithm for xR

Initialize uniform probabilities
Until tired do:
  Make zero rule count table
  For each training case <i, o>:
    Compute P(o | i) by summing over derivations d
    For each derivation d for <i, o>:   /* exponential! */
      Compute P(d | i, o) = P(d, o | i) / P(o | i)
      For each rule r used in d:
        count(r) += P(d | i, o)
  Normalize counts to probabilities
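The naïve loop can be written down directly if derivations are enumerated by hand. A minimal sketch (our code; a derivation is the list of rules it uses, and the groups dictionary – with hypothetical state names like "q.A" – collects the rules competing under each state for normalization), run on the two derivations from the “Derivations in R” example:

    from collections import defaultdict
    from math import prod

    def naive_em(cases, groups, iterations=10):
        p = {r: 1.0 / len(rs) for rs in groups.values() for r in rs}
        for _ in range(iterations):
            count = defaultdict(float)                  # zero rule count table
            for derivations in cases:                   # one <i, o> training pair
                weights = [prod(p[r] for r in d) for d in derivations]
                total = sum(weights)                    # P(o | i)
                for d, w in zip(derivations, weights):  # exponential in general!
                    for r in d:
                        count[r] += w / total           # P(d | i, o), per rule use
            for rs in groups.values():                  # normalize to probabilities
                z = sum(count[r] for r in rs) or 1.0
                for r in rs:
                    p[r] = count[r] / z
        return p

    groups = {"q.A": ["rule1"], "q.B": ["rule2"], "q.C": ["rule3", "rule4"],
              "q.F": ["rule5", "rule6"], "q.G": ["rule7", "rule8"]}
    cases = [[["rule1", "rule2", "rule3", "rule5", "rule8"],
              ["rule1", "rule2", "rule4", "rule6", "rule7"]]]
    print(naive_em(cases, groups))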
Efficient EM for xR

Initialize uniform probabilities
Until tired do:
  Make zero rule count table
  For each training case <i, o>:
    Build packed derivation forest          O(q n² r) time/space
    Use inside-outside algorithm to collect rule counts:
      Inside pass                           O(q n² r) time/space
      Outside pass                          O(q n² r) time/space
      Count collection pass                 O(q n² r) time/space
  Normalize counts to probabilities (joint or conditional)

Per-example training complexity is O(n²) × “transducer constant” – the same as forward-backward training for string transducers [Baum & Welch 71]. A variation for xRS runs in O(q n⁴ r) time (if |RHS| = 2).
Training: Related Work
• Generalizes the specific MT model training algorithm in the appendix of [Yamada & Knight 01]
– xR training is not tied to a particular MT model, or to MT at all
• Generalizes forward-backward HMM training [Baum et al 70]
– Write strings vertically, and use xR training
• Generalizes inside-outside PCFG training [Lari & Young 90]
– Attach a fixed input tree to each “output” string, and use xRS training
• Generalizes synchronous tree-substitution grammar (STSG) training [Eisner 03]
– xR and xRS allow copying of subtrees
Picture So Far

             | Strings                          | Trees
Grammars     | Regular grammar, CFG             | DONE (RTG)
Acceptors    | Finite-state acceptor (FSA), PDA | DONE (TDTA)
             | String Pairs (input→output)      | Tree Pairs (input→output)
Transducers  | Finite-state transducer (FST)    | DONE
If You Want to Play Around …
• Tiburon: A Weighted Tree Automata Toolkit
– www.isi.edu/licensed-sw/tiburon
– Developed by Jonathan May at USC/ISI
– Portable, written in Java
• Implements operations on tree grammars and tree transducers
– k-best trees, determinization, application, training, …
• Tutorial primer comes with explanations, sample automata, and exercises
– Hands-on follow-up to a lot of what we covered today
• Beta version released June 2006
Part 1: Introduction
Part 2: Synchronous Grammars < break >
Part 3: Tree Automata
Part 4: Conclusion
Why Worry About Formal Languages?
• They already figured out most of the key things back in the 1960s & 1970s
• Lucky us!
Relation of Synchronous Grammars with Tree Transducers
• Developed largely independently
– Synchronous grammars
• [Aho & Ullman 69, Lewis & Stearns 68] inspired by compilers
• [Shieber and Schabes 90] inspired by syntax/semantics mapping
– Tree transducers
• [Rounds 70] inspired by transformational grammar
• Recent work is starting to relate the two
– [Shieber 04, 06] explains both in terms of abstract bimorphisms
– [Graehl, Hopkins, Knight in prep] relates STSG to xRLN and gives composition closure results
Wait, Maybe They Didn’t Figure Out Everything in the 1960s and 1970s …
• Is there a formal tree transformation model that fits natural language problems and is closed under composition?
• Is there a “Markovized” version of R-style transducers, with horizontal Thatcher-like processing of children?
• Are there transducers that can move material across unbounded distances in a tree? Are they trainable?
• Are there automata models that can work with string/string data?
• Can we build large, accurate, probabilistic syntax-based language models of English using RTGs?
• What about root-less transformations like
– JJ(big) NN(cheese) → jefe
• Can we build probabilistic tree toolkits as powerful and ubiquitous as string toolkits?
More To Do on Synchronous Grammars
• Connections between synchronous grammars and tree transducers and transfer results between them
• Degree of formal power needed in theory and practice for translation and paraphrase [Wellington et al 06]
• New kinds of synchronous grammars that are more powerful but not more expensive, or not much more
• Efficient and optimal algorithms for minimizing the rank of a synchronous CFG [Gildea, Satta, Zhang 06]
• Faster strategies for translation with an n-gram language model [Zollmann & Venugopal 06; Chiang, to appear?]
Further Readings
• Synchronous grammars
– Attached notes by David (with references)
• Tree automata
– [Knight and Graehl 05] overview and relationship to natural language processing
– Tree Automata textbook [Gécseg & Steinby 84]
– TATA textbook on the Internet [Comon et al 97]
References on These Slides

[Aho & Ullman 69] ACM STOC
[Aho & Ullman 71] Information and Control 19
[Bangalore & Rambow 00] COLING
[Baum et al 70] Ann. Math. Stat. 41
[Chiang 05] ACL
[Chomsky 57] Syntactic Structures
[Church 88] ANLP
[Comon et al 97] www.grappa.univ-lille3.fr/tata/
[Dijkstra 59] Numer. Math. 1
[Doner 70] J. Computer & Sys. Sci. 4
[Echihabi & Marcu 03] ACL
[Eisner 03] HLT
[Eppstein 98] SIAM J. Computing 28
[Galley et al 04, 06] HLT, ACL-COLING
[Gécseg & Steinby 84] textbook
[Gildea, Satta, Zhang 06] COLING-ACL
[Graehl & Knight 04] HLT
[Jelinek 90] Readings in Speech Recog.
[Knight & Al-Onaizan 98] AMTA
[Knight & Graehl 05] CICLing
[Knight & Marcu 00] AAAI
[Knuth 77] Info. Proc. Letters 6
[Kumar & Byrne 03] HLT
[Lari & Young 90] Comp. Speech and Lang. 4
[Lewis & Stearns 68] JACM
[Marcu et al 06] EMNLP
[May & Knight 05] HLT
[Melamed 03] HLT
[Mohri & Riley 02] ICSLP
[Mohri, Pereira, Riley 00] Theo. Comp. Sci.
[Pang et al 03] HLT
[Rabin 69] Trans. Am. Math. Soc. 141
[Rounds 70] Math. Syst. Theory 4
[Shieber & Schabes 90] COLING
[Shieber 04, 06] TAG+, EACL
[Thatcher 67] J. Computer & Sys. Sci. 1
[Thatcher 70] J. Computer & Sys. Sci. 4
[Thatcher & Wright 68] Math. Syst. Theory 2
[Wellington et al 06] ACL-COLING
[Wu 97] Comp. Linguistics
[Yamada & Knight 01, 02] ACL
[Zollmann & Venugopal 06] HLT