Globally Normalized Transition-Based Neural Networks
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, Michael Collins
Parsey McParseface Now Has 40 Multi-lingual Cousins!
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, Michael Collins
Transition-Based Parsing
[Figure: parser state with Stack "Alice saw Bob" and Buffer "eat pizza"; successive builds illustrate the RIGHT-ARC, LEFT-ARC, and SHIFT transitions, then ask which action to take next.]
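The three transitions on the slide can be sketched as pure functions on a (stack, buffer, arcs) state. This is a minimal, hypothetical rendering of an arc-standard system, not the paper's implementation; the final action sequence below is just one possible analysis of the example.

```python
# Minimal sketch of an arc-standard transition system.
# State = (stack, buffer, arcs); arcs are (head, dependent) pairs.

def shift(stack, buffer, arcs):
    """SHIFT: move the front of the buffer onto the stack."""
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs):
    """LEFT-ARC: second-topmost stack word becomes a dependent of the top."""
    head, dep = stack[-1], stack[-2]
    return stack[:-2] + [head], buffer, arcs + [(head, dep)]

def right_arc(stack, buffer, arcs):
    """RIGHT-ARC: topmost stack word becomes a dependent of the second-topmost."""
    head, dep = stack[-2], stack[-1]
    return stack[:-2] + [head], buffer, arcs + [(head, dep)]

# The state from the slide: stack = Alice saw Bob, buffer = eat pizza.
state = (["Alice", "saw", "Bob"], ["eat", "pizza"], [])
state = shift(*state)       # stack: Alice saw Bob eat
state = shift(*state)       # stack: Alice saw Bob eat pizza
state = right_arc(*state)   # eat -> pizza
state = right_arc(*state)   # Bob -> eat   (one possible analysis)
state = right_arc(*state)   # saw -> Bob
state = left_arc(*state)    # saw -> Alice
print(state)
```

Each transition consumes one decision, so a sentence of n words is parsed in 2n transitions, and a classifier only ever has to choose among a handful of actions.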
Transition-Based Neural Networks
[Figure: the parser state (Stack "Alice saw Bob", Buffer "eat pizza") feeds a feed-forward network: Embeddings → ReLU 1 → ReLU 2 → Activations → Action Softmax.]
Locally normalized model: P(action | context)
• Locally normalized models are often easy to train
• Globally normalized models using the same #params can be much more accurate
• Applies to multiple tasks
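The architecture on the slide, sketched numerically: a context embedding passes through two ReLU layers and a softmax over actions, giving a locally normalized P(action | context) at every step. All layer sizes and weights here are illustrative, not the paper's.

```python
import numpy as np

# Feed-forward scorer: Embeddings -> ReLU 1 -> ReLU 2 -> Action Softmax.
rng = np.random.default_rng(0)
ctx_dim, h1, h2, n_actions = 12, 16, 16, 3   # SHIFT, LEFT-ARC, RIGHT-ARC

W1, b1 = rng.normal(size=(h1, ctx_dim)), np.zeros(h1)
W2, b2 = rng.normal(size=(h2, h1)), np.zeros(h2)
W3, b3 = rng.normal(size=(n_actions, h2)), np.zeros(n_actions)

def action_probs(context_embedding):
    """P(action | context): a softmax normalized locally at each step."""
    a1 = np.maximum(0.0, W1 @ context_embedding + b1)   # ReLU 1
    a2 = np.maximum(0.0, W2 @ a1 + b2)                  # ReLU 2
    logits = W3 @ a2 + b3                               # Activations
    z = np.exp(logits - logits.max())                   # stable softmax
    return z / z.sum()

p = action_probs(rng.normal(size=ctx_dim))
print(p, p.sum())   # a distribution over the 3 actions
```

The per-step normalization is exactly what "locally normalized" means here: each softmax sums to one on its own, independent of the rest of the action sequence.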
Alice saw Bob eat pizza with Charlie

Locally Normalized Training
[Chen & Manning '14, Weiss et al. '15]
Oracle maps gold structures to gold action sequences; gold sentences are batched into mini-batches for training.
Some advantages:
• Trivially parallelizable
• SGD training recipes
• Standard NN packages
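The training recipe above can be sketched in a few lines: an oracle turns each gold structure into a gold action sequence, and the loss is a plain sum of per-step cross-entropies. The `oracle` here is a placeholder, not a real derivation of transitions from a tree.

```python
import numpy as np

def oracle(gold_tree):
    # Placeholder: a real oracle derives SHIFT/LEFT-ARC/RIGHT-ARC
    # decisions from the gold dependency tree.
    return gold_tree["actions"]

def local_loss(step_probs, gold_actions):
    """Sum of per-step cross-entropies: -sum_t log P(gold_t | context_t)."""
    return -sum(np.log(p[a]) for p, a in zip(step_probs, gold_actions))

gold = {"actions": [0, 0, 2]}              # e.g. SHIFT, SHIFT, RIGHT-ARC
probs = [np.array([0.7, 0.2, 0.1]),
         np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.1, 0.7])]        # model outputs at each step
loss = local_loss(probs, oracle(gold))
print(loss)
```

Because every step is an independent classification example, the (context, gold action) pairs can be shuffled freely into mini-batches, which is what makes this setup trivially parallelizable with standard NN packages.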
Locally Normalized Inference
Alice saw Bob eat pizza with Charlie?
How Important is Lookahead?
Alice saw Bob eat pizza with Charlie?
[Chart: UAS (§22 of the WSJ), 75–95, vs. tokens of lookahead 0–4 for the Local model, compared against a Bi-LSTM encoding of the full sentence; LSTM result from Kiperwasser & Goldberg '16.]
Beam Search with Local Model
[Schematic: beam search over action-sequence hypotheses for "Alice saw Bob eat pizza with Charlie"; higher-scoring paths are kept at each step.]
[Chart: UAS (75–95) vs. lookahead 0–4, comparing Local and Local + Beam.]
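The schematic above can be sketched directly: at each step, extend every prefix in the beam by every action, score extensions by summed log-probabilities under the local model, and keep the top few. The toy probability table is made up for illustration.

```python
import math

def beam_search(step_probs, beam_size):
    """Keep the `beam_size` highest-scoring prefixes at each step.

    Scores are sums of log-probabilities, so higher is better."""
    beam = [((), 0.0)]
    for probs in step_probs:               # one action distribution per step
        candidates = [
            (prefix + (a,), score + math.log(p))
            for prefix, score in beam
            for a, p in enumerate(probs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]      # prune to the beam
    return beam

toy = [[0.6, 0.4], [0.5, 0.5], [0.9, 0.1]]
best, score = beam_search(toy, beam_size=2)[0]
print(best, score)
```

Note the limitation the later slides exploit: the per-step probabilities are still locally normalized, so the beam can only reorder hypotheses the local model already scores well; it cannot fix the model's probability mass.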
Training with Early Updates
[Collins and Roark '04, Zhou et al. '15]
Globally normalized with respect to the beam:

$$\frac{\exp \sum_i \phi^{(*)}_i}{\sum_{j=1}^{|\mathrm{Beam}|} \exp \sum_i \phi^{(j)}_i}$$

where $\sum_i \phi^{(j)}_i$ is the summed per-step score of the $j$-th beam path and $\phi^{(*)}$ the scores of the gold path (the figure shows beam paths 1–4 plus the gold path $*$).
BACKPROP: backpropagate through all steps, paths, and layers.
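The slide's objective, sketched numerically: the gold path's total score is normalized against the path totals of everything in the beam, and the negative log of that softmax is the loss. The per-step scores below are made-up numbers; in the real model, gradients of this loss flow back through every step, path, and layer.

```python
import math

def global_beam_loss(gold_scores, beam_scores):
    """-log( exp(sum_i phi_i^(*)) / sum_j exp(sum_i phi_i^(j)) )."""
    gold = sum(gold_scores)
    totals = [sum(s) for s in beam_scores]
    # log-sum-exp over the beam's path totals, shifted for stability
    m = max(totals)
    log_z = m + math.log(sum(math.exp(t - m) for t in totals))
    return log_z - gold

gold = [1.0, 2.0, 1.5]                          # per-step scores, gold path
beam = [gold,                                   # gold path is still in the beam
        [1.2, 1.0, 0.5],
        [0.3, 0.9, 1.1],
        [0.0, 0.5, 0.2]]
loss = global_beam_loss(gold, beam)
print(loss)
```

The "early update" part is what happens when the gold path falls out of the beam: decoding stops there and this loss is applied to the truncated prefixes, so the normalizer is always computed over paths the search actually explored.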
Globally Normalized Model
[Chart: UAS (75–95) vs. lookahead 0–4, comparing Local, Local + Beam, and Global.]
English WSJ Results
[Bar chart: UAS for nine parsers — Chen & Manning '14, Zhang & Nivre '11, Zhang & McDonald '14, LSTM (Dyer et al. '15), LSTM (Kiperwasser & Goldberg '16), Local (Weiss et al. '15), NN Perceptron (Weiss et al. '15), Zhou et al. '15, and This Work: Global (supervised). Bar values: 91.80, 92.83, 93.00, 93.19, 93.20, 93.22, 93.90, 93.99, 94.61; This Work is best at 94.61 UAS.]
CoNLL'09 POS Tagging and Parsing Results
[Charts over Ca, Ch, Cz, En, Ge, Jp, Sp:
Tagging — accuracy (94–100), LSTM (Ling et al. '15) vs. This Work.
Parsing — UAS (80–95), Bohnet and Nivre '12 vs. Alberti et al. '15 vs. This Work.]
Sentence Compression Results
In Pakistan, former leader Pervez Musharraf has appeared in court for the first time, on treason charges.
Transition system decides to KEEP or DROP words.
Output: Pervez Musharraf has appeared in court on treason charges.

                                      Whole-sentence   Human eval   Relative
                                      test accuracy    rating       throughput
Seq2seq LSTM (Filippova et al. '15)   35.36            4.66         1x
Global model (This work)              35.16            4.67         100x

Sentence Compression: Label Bias
Predicted compressions of the same sentence from the three decoders (token-level highlighting lost in extraction), with each sequence's probability under the Local and the Global model:

            Under Local   Under Global
Local       0.13          0.05
+Beam       0.16          <10⁻⁴
Global      0.06          0.07
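The KEEP/DROP transition system above can be sketched as a single function over token-level decisions. The decision sequence here is hand-written to reproduce the slide's example output, not model output.

```python
def compress(tokens, decisions):
    """Keep the tokens whose decision is True (KEEP); drop the rest."""
    assert len(tokens) == len(decisions)
    return " ".join(t for t, keep in zip(tokens, decisions) if keep)

sentence = ("In Pakistan, former leader Pervez Musharraf has appeared "
            "in court for the first time, on treason charges.")
tokens = sentence.split()
#        In     Pakistan, former leader Pervez Musharraf has   appeared in    court
keep = [False, False,    False, False, True,  True,     True, True,    True, True,
#        for    the    first  time,  on    treason charges.
        False, False, False, False, True, True,   True]
print(compress(tokens, keep))
```

One binary decision per token means the same transition-based machinery (and the same global training) applies unchanged; only the action set differs from parsing.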
Why does it work?
1. Global Models are More Expressive
Let P_L be the set of distributions expressible by a Local model, and P_G the set expressible by a Global model.
Theorem [This work; Smith and Johnson '07]: P_L ⊊ P_G.
Therefore there are some distributions over sequences that cannot be captured by a finite-lookahead, locally normalized model.
2. Backprop with a Beam
[Bar chart: WSJ UAS for local vs. global training, varying how far down the network the global gradient reaches.]

    Local training:    Greedy                   92.85
                       +Beam                    93.32
    Global training:   Train only Activations   93.45
                       +ReLU 2                  94.01
                       +ReLU 1                  94.09
                       +Embeddings              94.38
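The ablation above varies which layers receive the global gradient. A framework-agnostic way to express that (hypothetical parameter names following the slide's layer labels):

```python
import numpy as np

# Toy parameter store; each layer is a (hypothetical) array of weights.
params = {name: np.ones(4) for name in
          ["embeddings", "relu1", "relu2", "activations"]}
grads = {name: np.full(4, 0.5) for name in params}

def global_update(params, grads, trainable, lr=0.1):
    """SGD step applied only to `trainable` layers; the rest stay frozen."""
    return {name: (p - lr * grads[name] if name in trainable else p)
            for name, p in params.items()}

# e.g. the "+ReLU 2" setting trains only the top two layers globally
new = global_update(params, grads, trainable={"activations", "relu2"})
print(new["relu2"][0], new["embeddings"][0])
```

The trend in the table, where accuracy keeps improving as the global gradient reaches deeper layers, is the argument for backpropagating all the way to the embeddings rather than only retraining the output layer.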
Conclusions
Global models:
• can be taught to do search better
• more accurate, in exchange for more training time
• same wicked fast decoding
• applicable to multiple tasks
Open Source: SyntaxNet
Parsey McParseface + 40 languages
https://github.com/tensorflow/models/tree/master/syntaxnet
ACL 2016 Google Booth
And check out the Natural Language Understanding
team page: g.co/NLUTeam
Come by for demos, info and swag
Thank You!
[Nivre '06] [Nivre '09]
[Bohnet and Nivre '12] [Martins et al. '13]
[Chen and Manning '14] [Zhang and McDonald '14]
[Alberti et al. '15] [Ballesteros et al. '15]
[Dyer et al. '15] [Weiss et al. '15]
[Yazdani and Henderson '15] [Zhou et al. '15]
[Vaswani and Sagae '16]
[Henderson '03] [Henderson '04]
[Durrett and Klein '15] [Vinyals et al. '15]
[Watanabe and Sumita '15]
[Ross et al. '11] [Yao et al. '14]
[Zheng et al. '15] [Zhou and Xu '15] [Lei et al. '14]
[Ling et al. '15] [Peng et al. '09]
[Do and Artières '10] [Filippova et al. '15]
[Goldberg and Nivre '13] [Hochreiter and Schmidhuber '97]
[Huang et al. '15]
[Collins and Roark '04] [Collins '99]
[Liang et al. '08] [Daumé III et al. '09]
[Abney et al. '99] [Chi '99]
[Smith and Johnson '07]
[Bottou '91] [Bottou et al. '97]
[Lafferty et al. '01] [Bottou and LeCun '05]
[LeCun et al. '98]
Appendix
Longer examples of ambiguity