Globally Normalized Transition-Based Neural Networks
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, Michael Collins
Parsey McParseface Now Has 40 Multi-lingual Cousins!
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, Michael Collins
Transition-Based Parsing
[Figure: parser state with Stack "Alice saw Bob" and Buffer "eat pizza"; successive builds illustrate the RIGHT-ARC, LEFT-ARC, and SHIFT transitions, then ask which action to take next.]
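The three transitions on the slide can be sketched as pure functions on a (stack, buffer, arcs) state. This is a minimal, hypothetical rendering of an arc-standard system, not the paper's implementation; the final action sequence below is just one possible analysis of the example.

```python
# Minimal sketch of an arc-standard transition system.
# State = (stack, buffer, arcs); arcs are (head, dependent) pairs.

def shift(stack, buffer, arcs):
    """SHIFT: move the front of the buffer onto the stack."""
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs):
    """LEFT-ARC: second-topmost stack word becomes a dependent of the top."""
    head, dep = stack[-1], stack[-2]
    return stack[:-2] + [head], buffer, arcs + [(head, dep)]

def right_arc(stack, buffer, arcs):
    """RIGHT-ARC: topmost stack word becomes a dependent of the second-topmost."""
    head, dep = stack[-2], stack[-1]
    return stack[:-2] + [head], buffer, arcs + [(head, dep)]

# The state from the slide: stack = Alice saw Bob, buffer = eat pizza.
state = (["Alice", "saw", "Bob"], ["eat", "pizza"], [])
state = shift(*state)       # stack: Alice saw Bob eat
state = shift(*state)       # stack: Alice saw Bob eat pizza
state = right_arc(*state)   # eat -> pizza
state = right_arc(*state)   # Bob -> eat   (one possible analysis)
state = right_arc(*state)   # saw -> Bob
state = left_arc(*state)    # saw -> Alice
print(state)
```

Each transition consumes one decision, so a sentence of n words is parsed in 2n transitions, and a classifier only ever has to choose among a handful of actions.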
Transition-Based Neural Networks
[Figure: the parser state (Stack "Alice saw Bob", Buffer "eat pizza") feeds a feed-forward network: Embeddings → ReLU 1 → ReLU 2 → Activations → Action Softmax.]
Locally normalized model: P(action | context)
• Locally normalized models are often easy to train
• Globally normalized models using the same #params can be much more accurate
• Applies to multiple tasks
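The architecture on the slide, sketched numerically: a context embedding passes through two ReLU layers and a softmax over actions, giving a locally normalized P(action | context) at every step. All layer sizes and weights here are illustrative, not the paper's.

```python
import numpy as np

# Feed-forward scorer: Embeddings -> ReLU 1 -> ReLU 2 -> Action Softmax.
rng = np.random.default_rng(0)
ctx_dim, h1, h2, n_actions = 12, 16, 16, 3   # SHIFT, LEFT-ARC, RIGHT-ARC

W1, b1 = rng.normal(size=(h1, ctx_dim)), np.zeros(h1)
W2, b2 = rng.normal(size=(h2, h1)), np.zeros(h2)
W3, b3 = rng.normal(size=(n_actions, h2)), np.zeros(n_actions)

def action_probs(context_embedding):
    """P(action | context): a softmax normalized locally at each step."""
    a1 = np.maximum(0.0, W1 @ context_embedding + b1)   # ReLU 1
    a2 = np.maximum(0.0, W2 @ a1 + b2)                  # ReLU 2
    logits = W3 @ a2 + b3                               # Activations
    z = np.exp(logits - logits.max())                   # stable softmax
    return z / z.sum()

p = action_probs(rng.normal(size=ctx_dim))
print(p, p.sum())   # a distribution over the 3 actions
```

The per-step normalization is exactly what "locally normalized" means here: each softmax sums to one on its own, independent of the rest of the action sequence.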
Alice saw Bob eat pizza with Charlie

Locally Normalized Training
[Chen & Manning '14, Weiss et al. '15]
Oracle maps gold structures to gold action sequences; gold sentences are batched into mini-batches for training.
Some advantages:
• Trivially parallelizable
• SGD training recipes
• Standard NN packages
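The training recipe above can be sketched in a few lines: an oracle turns each gold structure into a gold action sequence, and the loss is a plain sum of per-step cross-entropies. The `oracle` here is a placeholder, not a real derivation of transitions from a tree.

```python
import numpy as np

def oracle(gold_tree):
    # Placeholder: a real oracle derives SHIFT/LEFT-ARC/RIGHT-ARC
    # decisions from the gold dependency tree.
    return gold_tree["actions"]

def local_loss(step_probs, gold_actions):
    """Sum of per-step cross-entropies: -sum_t log P(gold_t | context_t)."""
    return -sum(np.log(p[a]) for p, a in zip(step_probs, gold_actions))

gold = {"actions": [0, 0, 2]}              # e.g. SHIFT, SHIFT, RIGHT-ARC
probs = [np.array([0.7, 0.2, 0.1]),
         np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.1, 0.7])]        # model outputs at each step
loss = local_loss(probs, oracle(gold))
print(loss)
```

Because every step is an independent classification example, the (context, gold action) pairs can be shuffled freely into mini-batches, which is what makes this setup trivially parallelizable with standard NN packages.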
Locally Normalized Inference
Alice saw Bob eat pizza with Charlie?
How Important is Lookahead?
Alice saw Bob eat pizza with Charlie?
[Chart: UAS (§22 of the WSJ), 75–95, vs. tokens of lookahead 0–4 for the Local model, compared against a Bi-LSTM encoding of the full sentence; LSTM result from Kiperwasser & Goldberg '16.]
Beam Search with Local Model
[Schematic: beam search over action-sequence hypotheses for "Alice saw Bob eat pizza with Charlie"; higher-scoring paths are kept at each step.]
[Chart: UAS (75–95) vs. lookahead 0–4, comparing Local and Local + Beam.]
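The schematic above can be sketched directly: at each step, extend every prefix in the beam by every action, score extensions by summed log-probabilities under the local model, and keep the top few. The toy probability table is made up for illustration.

```python
import math

def beam_search(step_probs, beam_size):
    """Keep the `beam_size` highest-scoring prefixes at each step.

    Scores are sums of log-probabilities, so higher is better."""
    beam = [((), 0.0)]
    for probs in step_probs:               # one action distribution per step
        candidates = [
            (prefix + (a,), score + math.log(p))
            for prefix, score in beam
            for a, p in enumerate(probs)
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]      # prune to the beam
    return beam

toy = [[0.6, 0.4], [0.5, 0.5], [0.9, 0.1]]
best, score = beam_search(toy, beam_size=2)[0]
print(best, score)
```

Note the limitation the later slides exploit: the per-step probabilities are still locally normalized, so the beam can only reorder hypotheses the local model already scores well; it cannot fix the model's probability mass.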
Training with Early Updates
[Collins and Roark '04, Zhou et al. '15]
Globally normalized with respect to the beam:

$$\frac{\exp \sum_i \phi^{(*)}_i}{\sum_{j=1}^{|\mathrm{Beam}|} \exp \sum_i \phi^{(j)}_i}$$

where $\sum_i \phi^{(j)}_i$ is the summed per-step score of the $j$-th beam path and $\phi^{(*)}$ the scores of the gold path (the figure shows beam paths 1–4 plus the gold path $*$).
BACKPROP: backpropagate through all steps, paths, and layers.
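The slide's objective, sketched numerically: the gold path's total score is normalized against the path totals of everything in the beam, and the negative log of that softmax is the loss. The per-step scores below are made-up numbers; in the real model, gradients of this loss flow back through every step, path, and layer.

```python
import math

def global_beam_loss(gold_scores, beam_scores):
    """-log( exp(sum_i phi_i^(*)) / sum_j exp(sum_i phi_i^(j)) )."""
    gold = sum(gold_scores)
    totals = [sum(s) for s in beam_scores]
    # log-sum-exp over the beam's path totals, shifted for stability
    m = max(totals)
    log_z = m + math.log(sum(math.exp(t - m) for t in totals))
    return log_z - gold

gold = [1.0, 2.0, 1.5]                          # per-step scores, gold path
beam = [gold,                                   # gold path is still in the beam
        [1.2, 1.0, 0.5],
        [0.3, 0.9, 1.1],
        [0.0, 0.5, 0.2]]
loss = global_beam_loss(gold, beam)
print(loss)
```

The "early update" part is what happens when the gold path falls out of the beam: decoding stops there and this loss is applied to the truncated prefixes, so the normalizer is always computed over paths the search actually explored.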
Globally Normalized Model
[Chart: UAS (75–95) vs. lookahead 0–4, comparing Local, Local + Beam, and Global.]
English WSJ Results
[Bar chart: UAS for nine parsers — Chen & Manning '14, Zhang & Nivre '11, Zhang & McDonald '14, LSTM (Dyer et al. '15), LSTM (Kiperwasser & Goldberg '16), Local (Weiss et al. '15), NN Perceptron (Weiss et al. '15), Zhou et al. '15, and This Work: Global (supervised). Bar values: 91.80, 92.83, 93.00, 93.19, 93.20, 93.22, 93.90, 93.99, 94.61; This Work is best at 94.61 UAS.]
CoNLL'09 POS Tagging and Parsing Results
[Charts over Ca, Ch, Cz, En, Ge, Jp, Sp:
Tagging — accuracy (94–100), LSTM (Ling et al. '15) vs. This Work.
Parsing — UAS (80–95), Bohnet and Nivre '12 vs. Alberti et al. '15 vs. This Work.]
Sentence Compression Results
In Pakistan, former leader Pervez Musharraf has appeared in court for the first time, on treason charges.
Transition system decides to KEEP or DROP words.
Output: Pervez Musharraf has appeared in court on treason charges.

                                      Whole-sentence   Human eval   Relative
                                      test accuracy    rating       throughput
Seq2seq LSTM (Filippova et al. '15)   35.36            4.66         1x
Global model (This work)              35.16            4.67         100x

Sentence Compression: Label Bias
Predicted compressions of the same sentence from the three decoders (token-level highlighting lost in extraction), with each sequence's probability under the Local and the Global model:

            Under Local   Under Global
Local       0.13          0.05
+Beam       0.16          <10⁻⁴
Global      0.06          0.07
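The KEEP/DROP transition system above can be sketched as a single function over token-level decisions. The decision sequence here is hand-written to reproduce the slide's example output, not model output.

```python
def compress(tokens, decisions):
    """Keep the tokens whose decision is True (KEEP); drop the rest."""
    assert len(tokens) == len(decisions)
    return " ".join(t for t, keep in zip(tokens, decisions) if keep)

sentence = ("In Pakistan, former leader Pervez Musharraf has appeared "
            "in court for the first time, on treason charges.")
tokens = sentence.split()
#        In     Pakistan, former leader Pervez Musharraf has   appeared in    court
keep = [False, False,    False, False, True,  True,     True, True,    True, True,
#        for    the    first  time,  on    treason charges.
        False, False, False, False, True, True,   True]
print(compress(tokens, keep))
```

One binary decision per token means the same transition-based machinery (and the same global training) applies unchanged; only the action set differs from parsing.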
Why does it work?
1. Global Models are More Expressive
Let P_L be the set of distributions expressible by a Local model, and P_G the set expressible by a Global model.
Theorem [This work; Smith and Johnson '07]: P_L ⊊ P_G.
Therefore there are some distributions over sequences that cannot be captured by a finite-lookahead, locally normalized model.
2. Backprop with a Beam
[Bar chart: WSJ UAS for local vs. global training, varying how far down the network the global gradient reaches.]

    Local training:    Greedy                   92.85
                       +Beam                    93.32
    Global training:   Train only Activations   93.45
                       +ReLU 2                  94.01
                       +ReLU 1                  94.09
                       +Embeddings              94.38
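The ablation above varies which layers receive the global gradient. A framework-agnostic way to express that (hypothetical parameter names following the slide's layer labels):

```python
import numpy as np

# Toy parameter store; each layer is a (hypothetical) array of weights.
params = {name: np.ones(4) for name in
          ["embeddings", "relu1", "relu2", "activations"]}
grads = {name: np.full(4, 0.5) for name in params}

def global_update(params, grads, trainable, lr=0.1):
    """SGD step applied only to `trainable` layers; the rest stay frozen."""
    return {name: (p - lr * grads[name] if name in trainable else p)
            for name, p in params.items()}

# e.g. the "+ReLU 2" setting trains only the top two layers globally
new = global_update(params, grads, trainable={"activations", "relu2"})
print(new["relu2"][0], new["embeddings"][0])
```

The trend in the table, where accuracy keeps improving as the global gradient reaches deeper layers, is the argument for backpropagating all the way to the embeddings rather than only retraining the output layer.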
Conclusions
Global models:
• can be taught to do search better
• more accurate, in exchange for more training time
• same wicked fast decoding
• applicable to multiple tasks
Open Source: SyntaxNet
Parsey McParseface + 40 languages
https://github.com/tensorflow/models/tree/master/syntaxnet
ACL 2016 Google Booth
And check out the Natural Language Understanding
team page: g.co/NLUTeam
Come by for demos, info and swag
Thank You!
[Nivre '06] [Nivre '09]
[Bohnet and Nivre '12] [Martins et al. '13]
[Chen and Manning '14] [Zhang and McDonald '14]
[Alberti et al. '15] [Ballesteros et al. '15]
[Dyer et al. '15] [Weiss et al. '15]
[Yazdani and Henderson '15] [Zhou et al. '15]
[Vaswani and Sagae '16]
[Henderson '03] [Henderson '04]
[Durrett and Klein '15] [Vinyals et al. '15]
[Watanabe and Sumita '15]
[Ross et al. '11] [Yao et al. '14]
[Zheng et al. '15] [Zhou and Xu '15] [Lei et al. '14]
[Ling et al. '15] [Peng et al. '09]
[Do and Artières '10] [Filippova et al. '15]
[Goldberg and Nivre '13] [Hochreiter and Schmidhuber '97]
[Huang et al. '15]
[Collins and Roark '04] [Collins '99]
[Liang et al. '08] [Daumé III et al. '09]
[Abney et al. '99] [Chi '99]
[Smith and Johnson '07]
[Bottou '91] [Bottou et al. '97]
[Lafferty et al. '01] [Bottou and LeCun '05]
[LeCun et al. '98]
Appendix
Longer examples of ambiguity