Should Neural Network Architecture Reflect Linguistic Structure?

Chris Dyer (CMU → Google/DeepMind)

Joint work with: Wang Ling (Google/DeepMind), Austin Matthews (CMU), Miguel Ballesteros (UPF), Noah A. Smith (UW)

ICLR 2016 · May 2, 2016
Learning language

ARBITRARINESS (de Saussure, 1916) → Memorize
  Similar forms, unrelated meanings:  car − c + b = bar;   cat − c + b = bat
  One meaning, arbitrary forms:  "car" = Auto, voiture, xe hơi, ọkọ ayọkẹlẹ, koloi, sakyanan

COMPOSITIONALITY (Frege, 1892) → Generalize
  John dances − John + Mary = Mary dances    dance(John) vs. dance(Mary)
  John sings − John + Mary = Mary sings      sing(John) vs. sing(Mary)
Learning language

John gave the book to Mary

Each word is looked up as a one-hot vector, e.g. John ↦ (0, 0, 1, 0, …)ᵀ, and projected into a vector space:
  v_John = P × (0, 0, 1, 0, …)ᵀ

A function f (Recurrent NN, ConvNet, Recursive NN, …) maps the sequence of word vectors to a prediction.

Memorize ↔ Generalize
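A minimal sketch (numpy, toy vocabulary; not the talk's code) of the point above: multiplying the projection matrix P by a one-hot vector is just a row lookup, which is how word embedding tables are implemented in practice.

```python
import numpy as np

vocab = ["John", "gave", "the", "book", "to", "Mary"]
dim = 4
rng = np.random.default_rng(0)
P = rng.normal(size=(len(vocab), dim))   # one embedding row per vocabulary word

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

v_john = one_hot("John") @ P                           # explicit projection
assert np.allclose(v_john, P[vocab.index("John")])     # identical to a table lookup
```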
Learning language

CHALLENGE 1: IDIOMS
  John saw the football     → see(John, football)
  John saw the bucket       → see(John, bucket)
  John kicked the football  → kick(John, football)
  John kicked the bucket    → die(John)

CHALLENGE 2: MORPHOLOGY
  cool | coooool | coooooooool
  cat + s = cats      bat + s = bats
Compositional words

cats
  Memorize:   one-hot lookup (0, 0, 1, 0, …)
  Generalize: compose the word vector from its characters with an LSTM over
              START c a t s STOP
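A minimal sketch (assumptions: PyTorch and a toy character inventory; not the authors' released code) of building a word representation by running an LSTM over the word's characters, in the spirit of the character-based embeddings above.

```python
import torch
import torch.nn as nn

chars = ["<w>", "</w>"] + list("abcdefghijklmnopqrstuvwxyz")
char2id = {c: i for i, c in enumerate(chars)}

class CharWordEncoder(nn.Module):
    def __init__(self, char_dim=16, word_dim=32):
        super().__init__()
        self.emb = nn.Embedding(len(chars), char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim, batch_first=True)

    def forward(self, word):
        ids = [char2id["<w>"]] + [char2id[c] for c in word] + [char2id["</w>"]]
        x = self.emb(torch.tensor(ids).unsqueeze(0))   # (1, len, char_dim)
        _, (h, _) = self.lstm(x)
        return h[-1, 0]                                 # final hidden state = word vector

enc = CharWordEncoder()
v_cats, v_bats = enc("cats"), enc("bats")  # any word over the character inventory gets a vector
```

Because the encoder shares parameters across all words, "cats" and "bats" (and unseen words) get related representations instead of independent rows in a lookup table.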
Compositional words: Questions

• Does a "compositional" model have the capacity to learn the "arbitrariness" that is required?
• We might think so—RNNs/LSTMs can definitely overfit!
• Will we see bigger improvements in languages with richer morphology?
Example: Dependency parsing

ROOT  I  saw  her  duck

Word embedding models: each word is a one-hot lookup (0, 0, 1, 0, …)

Memorize ↔ Generalize
Dependency parsing: CharLSTM > Word Lookup

Language    Word   Chars   Δ
English     91.2   91.5    +0.3
Turkish     71.7   76.3    +4.6
Hungarian   72.8   80.4    +7.6
Basque      77.1   85.2    +8.1
Korean      78.7   88.4    +9.7
Swedish     76.4   79.2    +2.8
Arabic      85.2   86.1    +0.9
Chinese     79.1   79.9    +0.8

In English parsing, the character LSTM is roughly equivalent (≈) to the lookup approach.

What about languages with richer lexicons?
  Turkish:   Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsinizcesine
  Hungarian: Megszentségteleníthetetlenségeskedéseitekért

In agglutinative languages, word lookup ≪ character LSTM.
In fusional/templatic languages, word lookup < character LSTM.
In analytic languages, the models are roughly equivalent.
Language modeling: Word similarities

query:                increased   John      Noahshire         phding
5 nearest neighbors:  reduced     Richard   Nottinghamshire   mixing
                      improved    George    Bucharest         modelling
                      expected    James     Saxony            styling
                      decreased   Robert    Johannesburg      blaming
                      targeted    Edward    Gloucestershire   christening
Character vs. word modeling: Summary

• Lots of exciting work from a variety of places
  • Google Brain: language models
  • Harvard/NYU (Kim, Rush, Sontag): language models
  • NYU/FB: document representation "from scratch"
  • CMU (me, Cohen, Salakhutdinov): Twitter, morphologically rich languages, translation
• Now for something a bit more controversial…
Structure-aware words

cats
  Memorize:   one-hot lookup (0, 0, 1, 0, …)
  Generalize: character LSTM over START c a t s STOP

Would we benefit from a more knowledge-rich decomposition?
  cat +PL
Open Vocabulary LMs

• Rather than assuming a fixed vocabulary, model any sequence in Σ*, where Σ is the inventory of characters.
Open Vocabulary LMs: Turkish morphology

sürecini → characters: s ü r e c i n i
         → morphological analyses: süreç+NOUN+A3SG+P3SG+ACC, süreç+NOUN+A3SG+P2SG+ACC

POOLING combines the alternative analyses into a single representation.
Input word representation: (figure)
Output generation process: (marginalized; figure)
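The "marginalized" output generation process can be written as a mixture over generation modes. The following is only a sketch (assumption: the model mixes character-, word-, and morpheme-level generation, consistent with the perplexity table below; the exact factorization is not shown on the slide):

```latex
% Sketch: next-word probability as a marginal over generation modes m
p(w_t \mid \mathbf{h}_t)
  = \sum_{m \in \{\text{chars},\, \text{word},\, \text{morphs}\}}
      p(m \mid \mathbf{h}_t)\; p(w_t \mid m, \mathbf{h}_t)
```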
Open Vocabulary LM: perplexity per word

Characters                     18600
Characters + Morphs             8165
Characters + Words              5021
Characters + Words + Morphs     4116
Character vs. word modeling: Summary

• Model performance is essentially equivalent in morphologically simple languages (e.g., Chinese, English)
• In morphologically rich languages (e.g., Hungarian, Turkish, Finnish), performance improvements are most pronounced
• We need far fewer parameters to represent words as "compositions" of characters
• Word- and morpheme-level information adds further value
• Where else could we add linguistic structural knowledge?
Modeling syntax: Language is hierarchical

(1) a.  The talk I gave did not appeal to anybody.
    b. *The talk I gave appealed to anybody.        [anybody is a negative polarity item (NPI)]

Generalization hypothesis: not must come before anybody

(2) *The talk I did not give appealed to anybody.   [not precedes anybody, yet the sentence is bad]

Examples adapted from Everaert et al. (TICS 2015)
Language is hierarchical

   [The talk [I gave]] did not [appeal to anybody]        NPI licensed
  *[The talk [I did not give]] [appealed to anybody]      NPI not licensed

Generalization: not must "structurally precede" (c-command) anybody
- many theories of the details of structure
- the psychological reality of structural sensitivity is not empirically controversial
- much more than NPIs follow such constraints

Examples adapted from Everaert et al. (TICS 2015). The slide reproduces that paper's Figure 1: "Negative Polarity. (A) Negative polarity licensed: negative element c-commands negative polarity item. (B) Negative polarity not licensed: negative element does not c-command negative polarity item."
One theory of hierarchy

• Generate symbols sequentially using an RNN
• Add some "control symbols" to rewrite the history periodically
• Periodically "compress" a sequence into a single "constituent"
• Augment the RNN with an operation to compress recent history into a single vector (→ "reduce")
• The RNN predicts the next symbol based on the history of compressed elements and non-compressed terminals ("shift" or "generate")
• The RNN must also predict "control symbols" that decide how big constituents are
• We call such models recurrent neural network grammars (RNNGs).
Generating "The hungry cat meows ." with an RNNG:

Stack                                    Terminals                Action
                                                                  NT(S)
(S                                                                NT(NP)
(S (NP                                                            GEN(The)
(S (NP The                               The                      GEN(hungry)
(S (NP The hungry                        The hungry               GEN(cat)
(S (NP The hungry cat                    The hungry cat           REDUCE
(S (NP The hungry cat)                   The hungry cat           NT(VP)
(S (NP The hungry cat) (VP               The hungry cat           GEN(meows)
(S (NP The hungry cat) (VP meows         The hungry cat meows     REDUCE
(S (NP The hungry cat) (VP meows)        The hungry cat meows     GEN(.)
(S (NP The hungry cat) (VP meows) .      The hungry cat meows .   REDUCE
(S (NP The hungry cat) (VP meows) .)     The hungry cat meows .

REDUCE compresses "The hungry cat" into a single composite symbol (NP The hungry cat).
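A minimal sketch (plain Python; this is only the stack mechanics, with no neural scoring) showing how the NT / GEN / REDUCE actions in the trace above build the tree top-down.

```python
def run_actions(actions):
    stack, terminals = [], []
    for act in actions:
        if act.startswith("NT("):                 # push an open nonterminal, e.g. (S
            stack.append(("OPEN", act[3:-1]))
        elif act.startswith("GEN("):              # generate a terminal word
            word = act[4:-1]
            stack.append(word)
            terminals.append(word)
        elif act == "REDUCE":                     # close the most recent open nonterminal
            children = []
            while not isinstance(stack[-1], tuple):
                children.append(stack.pop())
            _, label = stack.pop()
            stack.append("(" + label + " " + " ".join(reversed(children)) + ")")
    return stack, terminals

actions = ["NT(S)", "NT(NP)", "GEN(The)", "GEN(hungry)", "GEN(cat)", "REDUCE",
           "NT(VP)", "GEN(meows)", "REDUCE", "GEN(.)", "REDUCE"]
stack, terminals = run_actions(actions)
print(stack[0])              # (S (NP The hungry cat) (VP meows) .)
print(" ".join(terminals))   # The hungry cat meows .
```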
Syntactic Composition

Need a representation for: (NP The hungry cat)
What head type? NP

The composition function reads the nonterminal label NP together with the children
The, hungry, cat (and the closing bracket), and returns a single vector for the
constituent; that vector lives in the same space as the word vectors.

Recursion

Need a representation for: (NP The (ADJP very hungry) cat)
The embedded constituent (ADJP very hungry) is first composed into a single vector v;
v then participates in composing the larger NP, exactly like a word.
Syntactic Composition

• Inspired by Socher et al. (2011, 2012, …)
• Words and constituents are embedded in the same space
• Composition functions are designed to:
  • capture the linguistic notion of headedness (the LSTMs know what type of head they are looking for while they traverse the children)
  • support any number of children
• Composition functions are learned via backpropagation through structure (a sketch follows below)
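A hedged sketch (assumptions: PyTorch, a bidirectional LSTM reader; this is not the released RNNG code) of such a composition function: an LSTM reads the nonterminal label followed by the child vectors, and its final states become the vector for the new constituent.

```python
import torch
import torch.nn as nn

class Composer(nn.Module):
    def __init__(self, dim=64, num_nt=32):
        super().__init__()
        self.nt_emb = nn.Embedding(num_nt, dim)            # embeddings for NP, VP, ...
        self.lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)                 # back to the shared space

    def forward(self, nt_id, children):
        # children: vectors for words or already-composed constituents
        label = self.nt_emb(torch.tensor([nt_id]))                       # (1, dim)
        seq = torch.cat([label] + [c.unsqueeze(0) for c in children]).unsqueeze(0)
        _, (h, _) = self.lstm(seq)                                       # one final state per direction
        return torch.tanh(self.proj(torch.cat([h[0, 0], h[1, 0]])))

composer = Composer()
the, hungry, cat = (torch.randn(64) for _ in range(3))     # stand-ins for word vectors
np_vec = composer(nt_id=0, children=[the, hungry, cat])    # vector for (NP The hungry cat)
very, hungry2 = torch.randn(64), torch.randn(64)
adjp = composer(nt_id=1, children=[very, hungry2])         # vector for (ADJP very hungry)
np_rec = composer(nt_id=0, children=[the, adjp, cat])      # recursion: (NP The (ADJP very hungry) cat)
```

Because composed constituents and words share one vector space, the same Composer is reused at every level of the tree.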
Implementing RNNGs: Parameter Estimation

• RNNGs jointly model sequences of words together with a tree structure: p_θ(x, y)
• Any parse tree can be converted to a sequence of actions (depth-first traversal) and vice versa (subject to well-formedness constraints); a sketch of this conversion follows below
• We use trees from the Penn Treebank
• We could treat the non-generation actions as latent variables or learn them with RL, effectively making this a problem of grammar induction. Future work…
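A hedged sketch (hypothetical helper, not the authors' oracle code) of the tree-to-actions conversion mentioned above: a depth-first pass over a bracketed tree emits the generative action sequence.

```python
def tree_to_actions(tree):
    actions, token = [], ""
    for ch in tree:
        if ch == "(":
            token = "NT:"                         # start reading a nonterminal label
        elif ch in " )":
            if token.startswith("NT:") and len(token) > 3:
                actions.append(f"NT({token[3:]})")
            elif token:
                actions.append(f"GEN({token})")
            token = ""
            if ch == ")":
                actions.append("REDUCE")          # close the current constituent
        else:
            token += ch
    return actions

print(tree_to_actions("(S (NP The hungry cat) (VP meows) .)"))
# ['NT(S)', 'NT(NP)', 'GEN(The)', 'GEN(hungry)', 'GEN(cat)', 'REDUCE',
#  'NT(VP)', 'GEN(meows)', 'REDUCE', 'GEN(.)', 'REDUCE']
```

Feeding this output to the run_actions sketch above reconstructs the original bracketing, which is the sense in which the mapping is invertible.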
Implementing RNNGs: Inference

• An RNNG is a joint distribution p(x, y) over strings (x) and parse trees (y)
• We are interested in two inference questions:
  • What is p(x) for a given x? [language modeling]
  • What is max_y p(y | x) for a given x? [parsing]
• Unfortunately, the dynamic programming algorithms we often rely on are of no help here
• We can use importance sampling to do both by sampling from a discriminatively trained model
English PTB (Parsing)

                                   Type   F1
Petrov and Klein (2007)            G      90.1
Shindo et al. (2012), single model G      91.1
Shindo et al. (2012), ensemble     ~G     92.4
Vinyals et al. (2015), PTB only    D      90.5
Vinyals et al. (2015), ensemble    S      92.8
Discriminative                     D      89.8
Generative (IS)                    G      92.4
English PTB (LM)        Perplexity
5-gram IKN              169.3
LSTM + Dropout          113.4
Generative (IS)         102.4

Chinese CTB (LM)        Perplexity
5-gram IKN              255.2
LSTM + Dropout          207.3
Generative (IS)         171.9
This Talk, In a Nutshell

• Facts about language:
  • Arbitrariness and compositionality exist at all levels
  • Language is sensitive to hierarchy, not strings
• My work's hypothesis: models designed to make this structure explicit will outperform models that don't

START t h a n k s ! STOP

questions?
Implementing RNNGs: Stack RNNs

• Augment a sequential RNN with a stack pointer
• Two constant-time operations:
  • push: read input, add to the top of the stack, connect to the current location of the stack pointer
  • pop: move the stack pointer to its parent
• A summary of the stack contents is obtained by reading the output of the RNN at the location of the stack pointer
• Note: push and pop are discrete actions here (cf. Grefenstette et al., 2015)
A sketch follows after the diagram below.
(Diagram: starting from the empty state with output y0, PUSH x1 creates y1 on top of y0; POP moves the pointer back to y0; PUSH x2 then computes y2 from y0; another POP returns to y0, and PUSH x3 computes y3. Popped states are not overwritten, so the RNN states form a tree rather than a single chain, and the current summary is always read at the pointer.)
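A minimal sketch (assumptions: PyTorch; simplified from the stack-RNN idea described above) of an RNN with a stack pointer: push computes a new state from the state at the pointer, pop just moves the pointer back to that state's parent.

```python
import torch
import torch.nn as nn

class StackRNN:
    def __init__(self, input_dim=8, hidden_dim=16):
        self.cell = nn.RNNCell(input_dim, hidden_dim)
        empty = torch.zeros(hidden_dim)                 # y0: the empty-stack summary
        self.states = [(empty, None)]                   # list of (output, parent index)
        self.ptr = 0                                    # stack pointer

    def push(self, x):
        h = self.cell(x.unsqueeze(0), self.states[self.ptr][0].unsqueeze(0)).squeeze(0)
        self.states.append((h, self.ptr))               # remember the parent for pop
        self.ptr = len(self.states) - 1

    def pop(self):
        self.ptr = self.states[self.ptr][1]             # move the pointer to the parent

    def summary(self):
        return self.states[self.ptr][0]                 # output read at the pointer

s = StackRNN()
x1, x2, x3 = (torch.randn(8) for _ in range(3))
s.push(x1); s.pop()        # y1 is computed, then the pointer returns to y0
s.push(x2); s.pop()        # y2 is computed from y0, not from y1
s.push(x3)                 # y3
print(s.summary().shape)   # torch.Size([16])
```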
Importance Sampling

Assume we've got a conditional distribution q(y | x) such that
  (i)   p(x, y) > 0 ⟹ q(y | x) > 0
  (ii)  sampling y ∼ q(y | x) is tractable, and
  (iii) evaluating q(y | x) is tractable.

Let the importance weights be  w(x, y) = p(x, y) / q(y | x).

  p(x) = Σ_{y ∈ Y(x)} p(x, y) = Σ_{y ∈ Y(x)} w(x, y) q(y | x) = E_{y ∼ q(y|x)}[ w(x, y) ]

Replace this expectation with its Monte Carlo estimate, using samples
y^(i) ∼ q(y | x) for i ∈ {1, 2, …, N}:

  E_{q(y|x)}[ w(x, y) ] ≈ (1/N) Σ_{i=1}^{N} w(x, y^(i))
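A hedged sketch (toy hand-made distributions, not the RNNG or its discriminative proposal) of the estimator above: draw trees y from a tractable proposal q(y | x) and average the importance weights w(x, y) = p(x, y) / q(y | x) to estimate p(x).

```python
import random

# Toy setting: three candidate "parses" for a fixed sentence x, with made-up scores.
joint_p = {"y1": 0.20, "y2": 0.10, "y3": 0.05}          # p(x, y); true p(x) = 0.35
proposal_q = {"y1": 0.50, "y2": 0.30, "y3": 0.20}        # q(y | x), covers the support of p

def estimate_p_x(num_samples=10000, seed=0):
    rng = random.Random(seed)
    ys, qs = list(proposal_q), list(proposal_q.values())
    total = 0.0
    for _ in range(num_samples):
        y = rng.choices(ys, weights=qs)[0]                # y ~ q(y | x)
        total += joint_p[y] / proposal_q[y]               # importance weight w(x, y)
    return total / num_samples                            # Monte Carlo estimate of p(x)

print(estimate_p_x())   # close to 0.35
```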