CS224d: Deep NLP Lecture 12: Midterm Review Richard Socher [email protected]
Transcript
Page 1

CS224d: Deep NLP

Lecture 12: Midterm Review

[email protected]

Page 2

Overview
Today – mostly open for questions!

•  Linguistic Background: Levels and tasks

•  Word Vectors

•  Backprop

•  RNNs

Page 3

Overview of linguistic levels

Page 4

Tasks: NER

Page 5

Tasks: POS

Page 6

Tasks: Sentiment analysis

Page 7

Machine Translation

Page 8

Skip-gram

•  Task: given a center word, predict its context words

•  For each word, we have an "input vector" v_w and an "output vector" v'_w

[Figure 1 from Mikolov et al.: the Skip-gram architecture, with the current word w(t) projected and used to predict the context words w(t−2), w(t−1), w(t+1), w(t+2).]

From the paper: the model uses R words from the history and R words from the future of the current word as correct labels, which requires R × 2 word classifications with the current word as input; the experiments use C = 10.

To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that the word France is similar to Italy and perhaps some other countries, it is much more challenging to subject those vectors to a more complex similarity task. There can be many different types of similarities between words: for example, big is similar to bigger in the same sense that small is similar to smaller, while another type of relationship is the word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: "What is the word that is similar to small in the same sense as biggest is similar to big?"

Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representations of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute the vector X = vector("biggest") − vector("big") + vector("small"). Then we search the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (the input question words are discarded during this search). When the word vectors are well trained, it is possible to find the correct answer (the word smallest) using this method.

Finally, when high dimensional word vectors are trained on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented.

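To make the analogy arithmetic above concrete, here is a minimal NumPy sketch. The embedding matrix E, the vocabulary list, and the word-to-index map are hypothetical stand-ins for trained skip-gram vectors (nothing here comes from the slides' own code); with real vectors the query below would return "smallest".

    import numpy as np

    # Hypothetical trained word vectors: one row per vocabulary word.
    vocab = ["big", "bigger", "biggest", "small", "smaller", "smallest"]
    word2idx = {w: i for i, w in enumerate(vocab)}
    rng = np.random.default_rng(0)
    E = rng.normal(size=(len(vocab), 50))   # stand-in for real skip-gram vectors

    def analogy(a, b, c, E, vocab, word2idx):
        """Word closest (by cosine similarity) to vector(b) - vector(a) + vector(c)."""
        x = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
        sims = (E @ x) / (np.linalg.norm(E, axis=1) * np.linalg.norm(x) + 1e-8)
        for w in (a, b, c):                 # discard the input question words
            sims[word2idx[w]] = -np.inf
        return vocab[int(np.argmax(sims))]

    # With well-trained vectors this prints "smallest"; with the random
    # stand-in above it only demonstrates the mechanics.
    print(analogy("big", "biggest", "small", E, vocab, word2idx))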
Page 9

Skip-gram vs. CBOW

All word2vec figures are from http://arxiv.org/pdf/1301.3781.pdf

[Figure 1 from Mikolov et al., both panels: Skip-gram predicts the surrounding context words w(t−2), w(t−1), w(t+1), w(t+2) from the current word w(t); CBOW sums the projected context words to predict the current word w(t).]

Task (Skip-gram): center word → context, using the center word's vector v_{w_i}
Task (CBOW): context → center word, using f(v_{w_{i−C}}, ..., v_{w_{i−1}}, v_{w_{i+1}}, ..., v_{w_{i+C}})

Page 10

word2vec as matrix factorization (conceptually)

•  Matrix factorization:

   M (n×n)  ≈  A^T (n×k)  ·  B (k×n),    i.e.  M_ij ≈ a_i^T b_j

•  Imagine M is a matrix of counts for co-occurring events, but we only get to observe the co-occurrences one at a time. E.g.

   M = [ 1 0 4 ]
       [ 0 0 2 ]
       [ 1 3 0 ]

   but we only see (1,1), (2,3), (3,2), (2,3), (1,3), ...

Page 11

word2vec as matrix factorization (conceptually)

M_ij ≈ a_i^T b_j

•  Whenever we see a pair (i, j) co-occur, we try to increase a_i^T b_j.

•  We also try to make all the other inner products smaller, to account for pairs never observed (or not observed yet), by decreasing a_{¬i}^T b_j and a_i^T b_{¬j}.

•  Remember from the lecture that the word co-occurrence matrix usually captures the semantic meaning of a word? For word2vec models, roughly speaking, M is the windowed word co-occurrence matrix, A is the output vector matrix, and B is the input vector matrix.

•  Why not just use one set of vectors? That would be equivalent to setting A = B in this formulation, but the less constrained version (separate A and B) is usually easier to optimize.

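A minimal sketch of the idea above: stream (i, j) co-occurrences one at a time, nudge a_i^T b_j up for each observed pair, and sample a random "negative" column to push down. The squared-error updates and learning rate are illustrative assumptions (word2vec itself uses a logistic loss), and the observed stream is the toy example from the slide.

    import numpy as np

    n, k, lr = 3, 2, 0.1
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.1, size=(n, k))   # "output" vectors (rows a_i)
    B = rng.normal(scale=0.1, size=(n, k))   # "input" vectors  (rows b_j)

    # Stream of observed co-occurrences from the slide (0-indexed).
    observed = [(0, 0), (1, 2), (2, 1), (1, 2), (0, 2)]

    for i, j in observed * 200:
        # Push the observed pair's inner product up (toward 1 here).
        grad = 1.0 - A[i] @ B[j]
        A[i] += lr * grad * B[j]
        B[j] += lr * grad * A[i]
        # Push a randomly sampled "unobserved" pair down toward 0.
        jn = rng.integers(n)
        if (i, jn) not in observed:
            gneg = 0.0 - A[i] @ B[jn]
            A[i] += lr * gneg * B[jn]
            B[jn] += lr * gneg * A[i]

    print(np.round(A @ B.T, 2))   # reconstructed pattern: M_ij ≈ a_i^T b_j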
Page 12

GloVe vs. word2vec

                                Fast training                Efficient usage    Quality affected       Captures complex
                                                             of statistics      by size of corpora     patterns
Direct prediction (word2vec)    Scales with size of corpus   No                 No*                    Yes
GloVe                           Yes                          Yes                No                     Yes

* Skip-gram and CBOW are qualitatively different when it comes to smaller corpora

Page 13

Overview
•  Neural Network Example
•  Terminology
•  Example 1:
   •  Forward Pass
   •  Backpropagation Using Chain Rule
   •  What is delta? From Chain Rule to Modular Error Flow
•  Example 2:
   •  Forward Pass
   •  Backpropagation

Page 14

Neural Networks
•  One of many different types of non-linear classifiers (i.e. leads to non-linear decision boundaries)
•  The most common design stacks affine transformations followed by point-wise (element-wise) non-linearities

Page 15

An example of a neural network
•  This is a 4-layer neural network,
•  or a 2-hidden-layer neural network,
•  or a 2-10-10-3 neural network (the complete architecture definition).

Page 16

Our first example
•  This is a 3-layer neural network
•  (a 1-hidden-layer neural network)

[Figure: inputs x1, x2, x3, x4 (plus a bias unit) feed Layer 1 nodes z1(1)...z4(1) with activations a1(1)...a4(1); Layer 2 has z1(2), z2(2) with activations a1(2), a2(2) (plus a bias unit); Layer 3 has a single node z1(3) with activation a1(3), which is the output score s.]

Page 17

Our first example: Terminology

[Same network figure; x1...x4 are labeled "Model Input", s is labeled "Model Output", and the columns are Layer 1, Layer 2, Layer 3.]

Page 18

Our first example: Terminology

[Same network figure; in addition to Model Input and Model Output, the a's are labeled "Activation Units".]

Page 19

Our first example: Activation Unit Terminology

[Figure: on the left, what we draw: z1(2) → σ → a1(2); on the right, what's actually going on: the weighted inputs are summed into z1(2), which then passes through σ to give a1(2).]

z1(2) = W11(1) a1(1) + W12(1) a2(1) + W13(1) a3(1) + W14(1) a4(1)

a1(2) = σ(z1(2))

a1(2) is the 1st activation unit of layer 2.

Page 20

Our first example: Forward Pass

z1(1) = x1

z2(1) = x2

z3(1) = x3

z4(1) = x4

Page 21

Our first example: Forward Pass

a1(1) = z1(1)
a2(1) = z2(1)
a3(1) = z3(1)
a4(1) = z4(1)

Page 22

Our first example: Forward Pass

z1(2) = W11(1) a1(1) + W12(1) a2(1) + W13(1) a3(1) + W14(1) a4(1)
z2(2) = W21(1) a1(1) + W22(1) a2(1) + W23(1) a3(1) + W24(1) a4(1)

Page 23

Our first example: Forward Pass

In matrix form, z(2) = W(1) a(1):

[ z1(2) ]   [ W11(1) W12(1) W13(1) W14(1) ]   [ a1(1) ]
[ z2(2) ] = [ W21(1) W22(1) W23(1) W24(1) ] · [ a2(1) ]
                                              [ a3(1) ]
                                              [ a4(1) ]

Page 24

Our first example: Forward Pass

z(2) = W(1) a(1)

Affine transformation

Page 25

Our first example: Forward Pass

a(2) = σ(z(2))

Point-wise/element-wise non-linearity

Page 26

Our first example: Forward Pass

z(3) = W(2) a(2)

Affine transformation

Page 27

Our first example: Forward Pass

a(3) = z(3)

s = a(3)

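The whole forward pass above fits in a few lines of NumPy. This is a sketch with made-up dimensions and random weights; biases are omitted, matching the slides' equations rather than the bias units drawn in the figure.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # model input x1..x4
    W1 = rng.normal(size=(2, 4))      # W(1): layer 1 -> layer 2
    W2 = rng.normal(size=(1, 2))      # W(2): layer 2 -> layer 3

    # Forward pass, following the slides:
    a1 = x                            # z(1) = x and a(1) = z(1)
    z2 = W1 @ a1                      # affine transformation
    a2 = sigmoid(z2)                  # point-wise non-linearity
    z3 = W2 @ a2                      # affine transformation
    s = z3                            # a(3) = z(3), s = a(3): a single score
    print(s)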
Page 28

Our first example: Backpropagation using chain rule

Let us try to calculate the error gradient w.r.t. W14(1). Thus we want to find:

∂s / ∂W14(1)

Page 30

Our first example: Backpropagation using chain rule

Starting the chain rule at the output: ∂s/∂a1(3) is simply 1, since s = a1(3) = z1(3).

Page 31

Our first example: Backpropagation using chain rule

∂s / ∂W14(1) = ∂(W11(2) a1(2) + W12(2) a2(2)) / ∂W14(1) = W11(2) · ∂a1(2) / ∂W14(1)

Page 33

Our first example: Backpropagation using chain rule

= W11(2) · σ'(z1(2)) · ∂z1(2) / ∂W14(1)

Page 35

Our first example: Backpropagation using chain rule

= W11(2) · σ'(z1(2)) · a4(1) = δ1(2) · a4(1),   where δ1(2) = W11(2) · σ'(z1(2))

Page 36

Our first example: Backpropagation Observations

We got the error gradient w.r.t. W14(1). Required:
•  the signal forwarded by W14(1): a4(1)
•  the error propagating backwards: W11(2)
•  the local gradient: σ'(z1(2))

Page 37

Our first example: Backpropagation Observations

We tried to get the error gradient w.r.t. W14(1). Required:
•  the signal forwarded by W14(1): a4(1)
•  the error propagating backwards: W11(2)
•  the local gradient: σ'(z1(2))

We can do this for all of W(1) at once (as an outer product):

[ δ1(2)a1(1)  δ1(2)a2(1)  δ1(2)a3(1)  δ1(2)a4(1) ]   [ δ1(2) ]
[ δ2(2)a1(1)  δ2(2)a2(1)  δ2(2)a3(1)  δ2(2)a4(1) ] = [ δ2(2) ] · [ a1(1)  a2(1)  a3(1)  a4(1) ]

Page 38

Our first example: Let us define δ

[Figure: the forward pass, z1(2) → σ → a1(2), and the backpropagation, with δ1(2) flowing backwards through the same node.]

Recall that z1(2) → σ → a1(2) is the forward pass; δ1(2) flowing back through that node is the backpropagation.

δ1(2) is the error flowing backwards at the same point where z1(2) passed forwards. Thus it is simply the gradient of the error w.r.t. z1(2).

Page 39

Our first example: Backpropagation using error vectors

The chain rule of differentiation boils down to two very simple patterns in error backpropagation:
1.  An error x flowing backwards past a neuron gets amplified by the local gradient: for a σ neuron with pre-activation z, the error becomes δ = σ'(z) · x.
2.  An error δ that goes backwards through an affine transformation distributes itself the same way the signal was combined in the forward pass: if the forward pass computed a1w1 + a2w2 + a3w3, the errors sent back along the inputs are δw1, δw2, δw3.

[Figure: orange = backprop, green = forward pass.]

Page 40

Our first example: Backpropagation using error vectors

[Computation graph: z(1) → (identity) → a(1) → W(1) → z(2) → σ → a(2) → W(2) → z(3) → (identity) → s]

Page 41

Our first example: Backpropagation using error vectors

δ(3) is the error vector at z(3); the expression shown is for a softmax output (with cross-entropy loss this is ŷ − y, as in the second example below).

Page 42

Our first example: Backpropagation using error vectors

Gradient w.r.t. W(2) = δ(3) a(2)T

Page 43

Our first example: Backpropagation using error vectors

Pass the error backwards across W(2): W(2)T δ(3)
-- Reusing δ(3) for downstream updates.
-- Moving an error vector across an affine transformation simply requires multiplication with the transpose of the forward matrix.
-- Notice that the dimensions line up perfectly too!

Page 44

Our first example: Backpropagation using error vectors

δ(2) = σ'(z(2)) ⊙ W(2)T δ(3)
-- Moving an error vector across a point-wise non-linearity requires point-wise multiplication with the local gradient of the non-linearity.

Page 45

Our first example: Backpropagation using error vectors

Gradient w.r.t. W(1) = δ(2) a(1)T; the error passed further back is W(1)T δ(2).

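Continuing the forward-pass sketch, the error-vector recipe from the last few slides in NumPy. The slides leave the loss at the output unspecified, so this assumes a simple squared-error loss J = 0.5·(s − y)^2, which makes the starting error δ(3) = s − y; everything else follows the slides' formulas.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=4), 1.0
    W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(1, 2))

    # Forward pass (as before).
    a1 = x
    z2 = W1 @ a1
    a2 = sigmoid(z2)
    z3 = W2 @ a2
    s = z3

    # Backward pass with error vectors.
    delta3 = s - y                        # assumed loss J = 0.5*(s - y)^2
    gradW2 = np.outer(delta3, a2)         # Grad W(2) = delta(3) a(2)^T
    back2 = W2.T @ delta3                 # across the affine: W(2)^T delta(3)
    delta2 = a2 * (1 - a2) * back2        # sigma'(z(2)) ⊙ W(2)^T delta(3)
    gradW1 = np.outer(delta2, a1)         # Grad W(1) = delta(2) a(1)^T
    grad_x = W1.T @ delta2                # gradient w.r.t. the input vector
    print(gradW2.shape, gradW1.shape, grad_x.shape)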
Page 46

Our second example (4-layer network): Backpropagation using error vectors

[Computation graph: z(1) → a(1) → W(1) → z(2) → σ → a(2) → W(2) → z(3) → σ → a(3) → W(3) → z(4) → softmax → yp]

Page 47

Our second example (4-layer network): Backpropagation using error vectors

δ(4) = yp − y

Page 48

Our second example (4-layer network): Backpropagation using error vectors

Grad W(3) = δ(4) a(3)T; pass back W(3)T δ(4).

Page 49

Our second example (4-layer network): Backpropagation using error vectors

δ(3) = σ'(z(3)) ⊙ W(3)T δ(4)

Page 50

Our second example (4-layer network): Backpropagation using error vectors

Grad W(2) = δ(3) a(2)T; pass back W(2)T δ(3).

Page 51

Our second example (4-layer network): Backpropagation using error vectors

δ(2) = σ'(z(2)) ⊙ W(2)T δ(3)

Page 52

Our second example (4-layer network): Backpropagation using error vectors

Grad W(1) = δ(2) a(1)T; pass back W(1)T δ(2).

Page 53

Our second example (4-layer network): Backpropagation using error vectors

Gradient w.r.t. the input vector = W(1)T δ(2)

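The second example is the same pattern with one more layer and a softmax/cross-entropy output, where the starting error is simply δ(4) = yp − y. A sketch with made-up sizes (input 6, hidden sizes 5 and 4, 3 classes); biases are again omitted.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    rng = np.random.default_rng(0)
    x = rng.normal(size=6)
    y = np.array([0.0, 1.0, 0.0])                 # one-hot target
    W1, W2, W3 = (rng.normal(size=shape) for shape in [(5, 6), (4, 5), (3, 4)])

    # Forward.
    a1 = x
    z2 = W1 @ a1; a2 = sigmoid(z2)
    z3 = W2 @ a2; a3 = sigmoid(z3)
    z4 = W3 @ a3; yp = softmax(z4)

    # Backward with error vectors.
    d4 = yp - y                                   # delta(4) for softmax + cross-entropy
    gradW3 = np.outer(d4, a3)
    d3 = a3 * (1 - a3) * (W3.T @ d4)              # sigma'(z(3)) ⊙ W(3)^T delta(4)
    gradW2 = np.outer(d3, a2)
    d2 = a2 * (1 - a2) * (W2.T @ d3)              # sigma'(z(2)) ⊙ W(2)^T delta(3)
    gradW1 = np.outer(d2, a1)
    grad_x = W1.T @ d2                            # gradient w.r.t. the input vector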
Page 54

CS224D Midterm Review

Ian Tenney

May 4, 2015

Page 55

Outline

Backpropagation (continued)
  RNN Structure
  RNN Backpropagation

Backprop on a DAG
  Example: Gated Recurrent Units (GRUs)
  GRU Backpropagation

Page 57

Basic RNN Structure

[Figure: h(t−1) and x(t) feed into h(t), which produces y(t); the chain continues back in time.]

•  Basic RNN ("Elman network")
•  You've seen this on Assignment #2 (and also in Lecture #5)

Page 58

Basic RNN Structure

•  Two layers between input and prediction, plus hidden state:

h(t) = sigmoid(H h(t−1) + W x(t) + b1)

ŷ(t) = softmax(U h(t) + b2)

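A single forward step of this RNN in NumPy, as a sketch; the dimensions (hidden size Dh, input size d, k classes), the small random parameters, and zero biases are all made up here.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    Dh, d, k = 8, 5, 3
    rng = np.random.default_rng(0)
    H = rng.normal(scale=0.1, size=(Dh, Dh))
    W = rng.normal(scale=0.1, size=(Dh, d))
    U = rng.normal(scale=0.1, size=(k, Dh))
    b1, b2 = np.zeros(Dh), np.zeros(k)

    def rnn_step(h_prev, x_t):
        """h(t) = sigmoid(H h(t-1) + W x(t) + b1); yhat(t) = softmax(U h(t) + b2)."""
        h_t = sigmoid(H @ h_prev + W @ x_t + b1)
        yhat_t = softmax(U @ h_t + b2)
        return h_t, yhat_t

    h, yhat = rnn_step(np.zeros(Dh), rng.normal(size=d))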
Page 59

Unrolled RNN

[Figure: the network unrolled through time, ... → h(t−3) → h(t−2) → h(t−1) → h(t), with inputs x(t−2), x(t−1), x(t) and outputs y(t−2), y(t−1), y(t).]

•  It helps to think of this as an "unrolled" network: distinct nodes for each timestep.
•  Just do backprop on this! Then combine shared gradients.

Page 60

Backprop on RNN

•  Usual cross-entropy loss (k-class):

P(y(t) = j | x(t), ..., x(1)) = ŷ(t)_j

J(t)(θ) = − Σ_{j=1}^{k} y(t)_j log ŷ(t)_j

•  Just do backprop on this! First timestep (τ = 1):

∂J(t)/∂U,   ∂J(t)/∂b2,   ∂J(t)/∂H |(t),   ∂J(t)/∂h(t),   ∂J(t)/∂W |(t),   ∂J(t)/∂x(t)

Page 61

Backprop on RNN

•  First timestep (s = 0):

∂J(t)/∂U,   ∂J(t)/∂b2,   ∂J(t)/∂H |(t),   ∂J(t)/∂h(t),   ∂J(t)/∂W |(t),   ∂J(t)/∂x(t)

•  Back in time (s = 1, 2, ..., τ − 1):

∂J(t)/∂H |(t−s),   ∂J(t)/∂h(t−s),   ∂J(t)/∂W |(t−s),   ∂J(t)/∂x(t−s)

Page 62

Backprop on RNN

Yuck, that's a lot of math!
•  Actually, it's not so bad.
•  Solution: error vectors (δ)

Page 63

Making sense of the madness

•  Chain rule to the rescue!
•  a(t) = U h(t) + b2
•  ŷ(t) = softmax(a(t))
•  The gradient is the transpose of the Jacobian:

∇_a J = (∂J(t)/∂a(t))^T = ŷ(t) − y(t) = δ(2)(t) ∈ R^(k×1)

•  Now the dimensions work out:

∂J(t)/∂a(t) · ∂a(t)/∂b2 = (δ(2)(t))^T · I ∈ R^(1×k)·(k×k) = R^(1×k)

Page 66

Making sense of the madness

•  Chain rule to the rescue!
•  a(t) = U h(t) + b2
•  ŷ(t) = softmax(a(t))
•  Matrix dimensions get weird:

∂a(t)/∂U ∈ R^(k×(k×Dh))

•  But we don't need fancy tensors:

∇_U J(t) = (∂J(t)/∂a(t) · ∂a(t)/∂U)^T = δ(2)(t) (h(t))^T ∈ R^(k×Dh)

•  NumPy: self.grads.U += outer(d2, hs[t])

Page 68

Going deeper

•  Really just need one simple pattern:

z(t) = H h(t−1) + W x(t) + b1
h(t) = f(z(t))

•  Compute the error delta (s = 0, 1, 2, ...):

From the top: δ(t) = [h(t) ⊙ (1 − h(t))] ⊙ U^T δ(2)(t)
Deeper:       δ(t−s) = [h(t−s) ⊙ (1 − h(t−s))] ⊙ H^T δ(t−s+1)

•  These are just chain-rule expansions!

∂J(t)/∂z(t) = ∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t) = (δ(t))^T

Page 70

Going deeper

•  These are just chain-rule expansions!

∂J(t)/∂b1 |(t) = (∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t)) · ∂z(t)/∂b1 = (δ(t))^T · ∂z(t)/∂b1

∂J(t)/∂H |(t) = (∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t)) · ∂z(t)/∂H = (δ(t))^T · ∂z(t)/∂H

∂J(t)/∂z(t−1) = (∂J(t)/∂a(t) · ∂a(t)/∂h(t) · ∂h(t)/∂z(t)) · ∂z(t)/∂h(t−1) · ∂h(t−1)/∂z(t−1) = (δ(t))^T · ∂z(t)/∂z(t−1)

Page 71

Going deeper

•  And there are shortcuts for them too:

(∂J(t)/∂b1 |(t))^T = δ(t)

(∂J(t)/∂H |(t))^T = δ(t) · (h(t−1))^T

(∂J(t)/∂z(t−1))^T = [h(t−1) ⊙ (1 − h(t−1))] ⊙ H^T δ(t) = δ(t−1)

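The shortcut formulas translate almost line-for-line into code. A sketch of backprop through time for a single loss J(t), assuming sigmoid hidden units (so f'(z) = h ⊙ (1 − h)); hs and xs are assumed to be arrays of hidden states and inputs saved from the forward pass, and d2 = ŷ(t) − y(t). The names only loosely mirror the assignment-style self.grads convention.

    import numpy as np

    def bptt_one_loss(d2, hs, xs, t, H, U, W, steps_back=3):
        """Gradients of the single loss J(t), unrolled `steps_back` steps."""
        gU = np.outer(d2, hs[t])                 # (dJ(t)/dU)   = delta2 (h(t))^T
        gb2 = d2.copy()                          # (dJ(t)/db2)^T = delta2
        gH = np.zeros_like(H)
        gW = np.zeros_like(W)
        gb1 = np.zeros(H.shape[0])

        delta = hs[t] * (1 - hs[t]) * (U.T @ d2)     # delta(t), "from the top"
        for s in range(steps_back):
            if t - s - 1 < 0:
                break
            gH += np.outer(delta, hs[t - s - 1])     # (dJ(t)/dH |(t-s))^T = delta (h(t-s-1))^T
            gW += np.outer(delta, xs[t - s])         # same pattern for W, using x(t-s)
            gb1 += delta                             # (dJ(t)/db1 |(t-s))^T = delta
            # delta(t-s-1) = [h(t-s-1) ⊙ (1 - h(t-s-1))] ⊙ H^T delta(t-s)
            delta = hs[t - s - 1] * (1 - hs[t - s - 1]) * (H.T @ delta)
        return gU, gb2, gH, gW, gb1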
Page 72

Outline

Backpropagation (continued)
  RNN Structure
  RNN Backpropagation

Backprop on a DAG
  Example: Gated Recurrent Units (GRUs)
  GRU Backpropagation

Page 73

Motivation

•  Gated units with "reset" and "output" gates
•  Reduce problems with vanishing gradients

Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)

Page 74

Intuition

•  Gates z_i and r_i for each hidden layer neuron
•  z_i, r_i ∈ [0, 1]
•  h̃ as "candidate" hidden layer
•  h̃, z, r all depend on x(t), h(t−1)
•  h(t) depends on h(t−1) mixed with h̃(t)

Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)

Page 75

Equations

•  z(t) = σ(W_z x(t) + U_z h(t−1))
•  r(t) = σ(W_r x(t) + U_r h(t−1))
•  h̃(t) = tanh(W x(t) + r(t) ⊙ U h(t−1))
•  h(t) = z(t) ⊙ h(t−1) + (1 − z(t)) ⊙ h̃(t)
•  Optionally can have biases; omitted for clarity.

Figure: You are likely to be eaten by a GRU. (Figure from Chung, et al. 2014)

Same eqs. as Lecture 8, subscripts/superscripts as in Assignment #2.

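These equations as a NumPy step function, a sketch with hypothetical weight shapes (hidden size Dh, input size d) and no biases, matching the slide.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
        """One GRU step following the slide's equations (no biases)."""
        z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update ("output") gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
        h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))   # candidate hidden state
        return z * h_prev + (1 - z) * h_tilde           # h(t)

    Dh, d = 4, 3
    rng = np.random.default_rng(0)
    Wz, Wr, W = (rng.normal(size=(Dh, d)) for _ in range(3))
    Uz, Ur, U = (rng.normal(size=(Dh, Dh)) for _ in range(3))
    h = gru_step(rng.normal(size=d), np.zeros(Dh), Wz, Uz, Wr, Ur, W, U)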
Page 76

Backpropagation

Multi-path to compute ∂J/∂x(t)

•  Start with δ(t) = (∂J/∂h(t))^T ∈ R^d
•  h(t) = z(t) ⊙ h(t−1) + (1 − z(t)) ⊙ h̃(t)
•  Expand the chain rule into a sum (a.k.a. the product rule):

∂J/∂x(t) = ∂J/∂h(t) · [ z(t) ⊙ ∂h(t−1)/∂x(t) + ∂z(t)/∂x(t) ⊙ h(t−1) ]
         + ∂J/∂h(t) · [ (1 − z(t)) ⊙ ∂h̃(t)/∂x(t) + ∂(1 − z(t))/∂x(t) ⊙ h̃(t) ]

Page 77

It gets (a little) better

Multi-path to compute ∂J/∂x(t)

•  Drop terms that don't depend on x(t):

∂J/∂x(t) = ∂J/∂h(t) · [ z(t) ⊙ ∂h(t−1)/∂x(t) + ∂z(t)/∂x(t) ⊙ h(t−1) ]
         + ∂J/∂h(t) · [ (1 − z(t)) ⊙ ∂h̃(t)/∂x(t) + ∂(1 − z(t))/∂x(t) ⊙ h̃(t) ]

         = ∂J/∂h(t) · [ ∂z(t)/∂x(t) ⊙ h(t−1) + (1 − z(t)) ⊙ ∂h̃(t)/∂x(t) ]
         − ∂J/∂h(t) · [ ∂z(t)/∂x(t) ⊙ h̃(t) ]

Page 78

Almost there!

Multi-path to compute ∂J/∂x(t)

•  Now we really just need to compute two things:
•  Output gate:

∂z(t)/∂x(t) = z(t) ⊙ (1 − z(t)) ⊙ W_z

•  Candidate h̃:

∂h̃(t)/∂x(t) = (1 − (h̃(t))^2) ⊙ W + (1 − (h̃(t))^2) ⊙ ∂r(t)/∂x(t) ⊙ U h(t−1)

•  Ok, I lied - there's a third.
•  Don't forget to check all paths!

Page 81

Almost there!

Multi-path to compute ∂J/∂x(t)

•  Last one:

∂r(t)/∂x(t) = r(t) ⊙ (1 − r(t)) ⊙ W_r

•  Now we can just add things up!
•  (I'll spare you the pain...)

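Adding the paths up in code: a sketch of ∂J/∂x(t) for one GRU step, given the incoming error δ = (∂J/∂h(t))^T from above. The W^T multiplications are the vector form of the element-wise Jacobian expressions on the slides; the shapes and weights are the hypothetical ones from the forward sketch, not anything defined in the deck.

    import numpy as np

    def gru_grad_x(delta, x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
        """dJ/dx(t) for one GRU step, given delta = dJ/dh(t) (shape Dh)."""
        # Recompute the forward quantities.
        z = 1.0 / (1.0 + np.exp(-(Wz @ x_t + Uz @ h_prev)))
        r = 1.0 / (1.0 + np.exp(-(Wr @ x_t + Ur @ h_prev)))
        h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))

        # Path through z(t): h(t) = z ⊙ h(t-1) + (1-z) ⊙ h_tilde, with dz = z ⊙ (1-z).
        d_z_pre = delta * (h_prev - h_tilde) * z * (1 - z)
        # Path through h_tilde directly: d tanh = 1 - h_tilde^2.
        d_ht_pre = delta * (1 - z) * (1 - h_tilde ** 2)
        # Path through r(t) inside h_tilde: dr = r ⊙ (1-r).
        d_r_pre = d_ht_pre * (U @ h_prev) * r * (1 - r)

        # Each pre-activation contains a W? x(t) term, so the x-gradient
        # pulls each piece back through the corresponding W?^T.
        return Wz.T @ d_z_pre + W.T @ d_ht_pre + Wr.T @ d_r_pre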
Page 82

Whew.

•  Why three derivatives?
•  Three arrows from x(t) to distinct nodes
•  Four paths total (∂z(t)/∂x(t) appears twice)

Page 83

Whew.

•  GRUs are complicated
•  All the pieces are simple
•  Same matrix gradients that you've seen before

Page 84

Summary

•  Check your dimensions!
•  Write error vectors δ; they're just parentheses around the chain rule
•  Combine simple operations to make a complex network
   •  Matrix-vector product
   •  Activation functions (tanh, sigmoid, softmax)

