Natural Language Processing
Language models
Based on slides from Michael Collins, Chris Manning, Richard Socher, Dan Jurafsky
Plan
• Problem definition
• Trigram models
• Evaluation
• Estimation
• Interpolation
• Discounting
Motivations
• Define a probability distribution over sentences
• Why?
• Machine translation
• P(“high winds”) > P(“large winds”)
• Spelling correction
• P(“The office is fifteen minutes from here”) > P(“The office is fifteen minuets from here”)
• Speech recognition (that’s where it started!)
• P(“recognize speech”) > P(“wreck a nice beach”)
• And more!
$p(01010 \mid \xi) \propto p(\xi \mid 01010) \cdot p(01010)$
Motivations
• Philosophical: a model that is good at predicting the next word must know something about language and the world
• A good representation for any NLP task
• paper 1
• paper 2
Motivations
• Techniques will be useful later
Problem definition
• Given a finite vocabulary
V = {the, a, man, telescope, Beckham, two, …}
• We have an infinite language L, which is V* concatenated with the special symbol STOP
the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP
Problem definition
• Input: a training set of example sentences
• Currently: roughly one trillion words.
• Output: a probability distribution p over L such that
$\sum_{x \in L} p(x) = 1, \quad p(x) \geq 0 \text{ for all } x \in L$
p(“the STOP”) = 10^{-12}
p(“the fan saw Beckham STOP”) = 2 × 10^{-8}
p(“the fan saw saw STOP”) = 10^{-15}
A naive method
• Assume we have N training sentences
• Let x1, x2, …, xn be a sentence, and c(x1, x2, …, xn) be the number of times it appeared in the training data.
• Define a language model:
$p(x_1, \dots, x_n) = \frac{c(x_1, \dots, x_n)}{N}$
• No generalization!
Markov processes
• Given a sequence of n random variables: $X_1, X_2, \dots, X_n$, $n = 100$, $X_i \in V$
• We want a sequence probability model $p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$
• There are $|V|^n$ possible sequences
First-order Markov process
Chain rule:
$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1})$
Markov assumption:
$p(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})$
Second-order Markov process
Relax the independence assumption:
$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p(X_1 = x_1) \times p(X_2 = x_2 \mid X_1 = x_1) \times \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$
Simplify notation: $x_0 = *, \; x_{-1} = *$
Is this reasonable?
Detail: variable length
• Probability distribution over sequences of any length
• Always define Xn = STOP, and obtain a probability distribution over all sequences
• Intuition: at every step you have probability 𝜶h of stopping (conditioned on history) and (1−𝜶h) of continuing
$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$
Trigram language model
• A trigram language model contains
• A vocabulary V
• Non-negative parameters q(w | u, v) for every trigram, such that
$w \in V \cup \{\text{STOP}\}, \quad u, v \in V \cup \{*\}$
• The probability of a sentence x1, …, xn, where xn = STOP, is
$p(x_1, \dots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$
Example
$p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \times q(\text{dog} \mid *, \text{the}) \times q(\text{barks} \mid \text{the}, \text{dog}) \times q(\text{STOP} \mid \text{dog}, \text{barks})$
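To make the definition concrete, here is a minimal Python sketch of scoring a sentence under a trigram model; the lookup `q(w, u, v)` for the estimated parameters is a hypothetical name, not part of the slides:

```python
import math

def sentence_logprob(sentence, q):
    """Log-probability of a sentence under a trigram model.

    sentence: list of tokens, e.g. ["the", "dog", "barks"] (without STOP).
    q: hypothetical lookup returning the parameter q(w | u, v)."""
    logp = 0.0
    u, v = "*", "*"                      # pad the left context with *
    for w in sentence + ["STOP"]:
        logp += math.log(q(w, u, v))     # add log q(x_i | x_{i-2}, x_{i-1})
        u, v = v, w                      # shift the context window
    return logp
```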
Limitation
• The Markov assumption is false:
He is from France, so it makes sense that his first language is…
• We would want to model longer dependencies
Sparseness
• Maximum likelihood for estimating q
• Let c(w_1, …, w_n) be the number of times that n-gram appears in a corpus
$q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
• If the vocabulary has 20,000 words → the number of parameters is 8 × 10^{12}!
• Most sentences will have zero or undefined probabilities
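A minimal counting sketch of this estimator, assuming the corpus is an iterable of token lists (function names are illustrative):

```python
from collections import defaultdict

def train_trigram_mle(corpus):
    """corpus: iterable of sentences, each a list of tokens (without STOP).
    Returns the maximum-likelihood estimate q(w | u, v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for sent in corpus:
        tokens = ["*", "*"] + sent + ["STOP"]
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1

    def q(w, u, v):
        # 0/0 (undefined) when the context bigram was never seen
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return q
```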
Berkeley restaurant project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Bigram counts
(table of bigram counts from the Berkeley restaurant corpus)
Bigram probabilities
(table of bigram probabilities, each count normalized by the count of the first word)
What did we learn
• p(English | want) < p(Chinese | want): people ask about Chinese food more, at least in this corpus
• p(to | want) = 0.66: English behaves in a certain way
• p(eat | to) = 0.28: English behaves in a certain way
Evaluation: perplexity
• Test data: S = {s1, s2, …, sM}
• Parameters are not estimated from S
• A good language model has high p(S) and low perplexity
• Perplexity is the normalized inverse probability of S
• M is the number of words in the corpus
$p(S) = \prod_{i=1}^{M} p(s_i)$
$\log_2 p(S) = \sum_{i=1}^{M} \log_2 p(s_i)$
$\text{perplexity} = 2^{-l}, \quad l = \frac{1}{M} \sum_{i=1}^{M} \log_2 p(s_i)$
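A small sketch of this computation, assuming a hypothetical helper `sentence_logprob2` that returns log2 p(s) for a sentence:

```python
def perplexity(sentences, sentence_logprob2):
    """sentences: the test corpus S; sentence_logprob2(s) returns log2 p(s).
    M counts every word plus the STOP symbol, matching the slide."""
    M = sum(len(s) + 1 for s in sentences)          # +1 for STOP
    l = sum(sentence_logprob2(s) for s in sentences) / M
    return 2 ** (-l)
```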
Evaluation: perplexity
• Say we have a vocabulary V, N = |V| + 1, and a trigram model with uniform distribution
• q(w | u, v) = 1/N
• Perplexity is the “effective” vocabulary size:
$l = \frac{1}{M} \sum_{i=1}^{M} \log_2 p(s_i) = \frac{1}{M} \log_2 \left(\frac{1}{N}\right)^{M} = \log_2 \frac{1}{N}$
$\text{perplexity} = 2^{-l} = 2^{\log_2 N} = N$
Typical values of perplexity
• When |V| = 50,000
• trigram model perplexity: 74 (≪ 50,000)
• bigram model: 137
• unigram model: 955
Evaluation
• Extrinsic evaluations: MT, speech, spelling correction, …
History
• Shannon (1950) estimated the perplexity score that humans get for printed English (we are good!)
• Test your perplexity
Estimating parameters
• Recall that the number of parameters for a trigram model with |V| = 20,000 is 8 × 10^{12}, leading to zeros and undefined probabilities
$q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
Bias-variance tradeoff
• Given a corpus of length M
• Trigram model: $q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$
• Bigram model: $q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
• Unigram model: $q(w_i) = \frac{c(w_i)}{M}$
Linear interpolation
• Combine the three models to get all their benefits:
$q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \times q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \times q(w_i \mid w_{i-1}) + \lambda_3 \times q(w_i)$
$\lambda_i \geq 0, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1$
Linear interpolation
• Need to verify the parameters define a probability distribution:
$\sum_{w \in V} q_{LI}(w \mid u, v) = \sum_{w \in V} \left[ \lambda_1 \, q(w \mid u, v) + \lambda_2 \, q(w \mid v) + \lambda_3 \, q(w) \right]$
$= \lambda_1 \sum_{w \in V} q(w \mid u, v) + \lambda_2 \sum_{w \in V} q(w \mid v) + \lambda_3 \sum_{w \in V} q(w)$
$= \lambda_1 + \lambda_2 + \lambda_3 = 1$
Estimating coefficients
• Use a validation/development set (intro to ML!)
• Partition the training data into training (90%?) and dev (10%?) data, and optimize the coefficients to minimize the perplexity (the measure we care about!) on the development data
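A simple sketch of this procedure: a grid search over the simplex of coefficients, assuming a hypothetical `dev_perplexity` helper that builds qLI with the given weights and evaluates it on the development data:

```python
import itertools

def q_li(w, u, v, q3, q2, q1, lambdas):
    """Interpolated estimate q_LI(w | u, v) from the three MLE models."""
    l1, l2, l3 = lambdas
    return l1 * q3(w, u, v) + l2 * q2(w, v) + l3 * q1(w)

def choose_lambdas(dev_perplexity, step=0.1):
    """Grid search over l1 + l2 + l3 = 1, li >= 0, minimizing dev perplexity."""
    best, best_pp = None, float("inf")
    ticks = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(ticks, ticks):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:                   # outside the simplex
            continue
        pp = dev_perplexity((l1, l2, max(l3, 0.0)))
        if pp < best_pp:
            best, best_pp = (l1, l2, max(l3, 0.0)), pp
    return best
```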
Linear interpolation
• Bucket the λs by how frequent the conditioning bigram is:
$\Pi(w_{i-2}, w_{i-1}) = \begin{cases} 1 & c(w_{i-2}, w_{i-1}) = 0 \\ 2 & 1 \leq c(w_{i-2}, w_{i-1}) \leq 2 \\ 3 & 3 \leq c(w_{i-2}, w_{i-1}) \leq 10 \\ 4 & \text{otherwise} \end{cases}$
$q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i)$
$\lambda_i^{\Pi(w_{i-2}, w_{i-1})} \geq 0, \quad \lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$
Discounting methods

x               c(x)   q(wi | wi-1)
the             48
the, dog        15     15/48
the, woman      11     11/48
the, man        10     10/48
the, park       5      5/48
the, job        2      2/48
the, telescope  1      1/48
the, manual     1      1/48
the, afternoon  1      1/48
the, country    1      1/48
the, street     1      1/48

Low-count bigrams have high estimates
Discounting methods

x               c(x)   c*(x)   q(wi | wi-1)
the             48
the, dog        15     14.5    14.5/48
the, woman      11     10.5    10.5/48
the, man        10     9.5     9.5/48
the, park       5      4.5     4.5/48
the, job        2      1.5     1.5/48
the, telescope  1      0.5     0.5/48
the, manual     1      0.5     0.5/48
the, afternoon  1      0.5     0.5/48
the, country    1      0.5     0.5/48
the, street     1      0.5     0.5/48

c*(x) = c(x) − 0.5
Katz back-off
In a bigram model:
$A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\}$
$B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}$
$q_{BO}(w_i \mid w_{i-1}) = \begin{cases} \frac{c^*(w_{i-1}, w_i)}{c(w_{i-1})} & w_i \in A(w_{i-1}) \\ \alpha(w_{i-1}) \frac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)} & w_i \in B(w_{i-1}) \end{cases}$
$\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{c^*(w_{i-1}, w)}{c(w_{i-1})}$
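A naive sketch of the bigram case, assuming unigram counts `c1`, bigram counts `c2`, and an MLE `q_unigram` (the vocabulary loops are for clarity, not efficiency):

```python
def q_bo(w, v, c1, c2, q_unigram, d=0.5):
    """Katz back-off estimate q_BO(w | v) for a bigram model.

    c1: unigram counts {word: count}; c2: bigram counts {(v, w): count};
    q_unigram: MLE unigram model. Discounted count: c*(v, w) = c(v, w) - d."""
    if c2.get((v, w), 0) > 0:                        # w in A(v)
        return (c2[(v, w)] - d) / c1[v]
    # alpha(v): the mass freed up by discounting the seen bigrams
    alpha = 1.0 - sum((c - d) / c1[v]
                      for (u, x), c in c2.items() if u == v)
    # normalize the unigram model over the unseen continuations B(v)
    mass_B = sum(q_unigram(x) for x in c1 if c2.get((v, x), 0) == 0)
    return alpha * q_unigram(w) / mass_B             # w in B(v)
```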
Katz back-off
In a trigram model:
$A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}$
$B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}$
$q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \frac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})} & w_i \in A(w_{i-2}, w_{i-1}) \\ \alpha(w_{i-2}, w_{i-1}) \frac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & w_i \in B(w_{i-2}, w_{i-1}) \end{cases}$
$\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}$
Advanced Smoothing
• Good-Turing
• Kneser-Ney
Advanced Smoothing
• Principles
• Good-Turing: take probability mass from things you have seen n times and spread it over things you have seen n−1 times (in particular, move mass from things seen once to things never seen)
• Kneser-Ney: the probability of a unigram is not its frequency in the data, but how frequently it appears after other things (“francisco” vs. “glasses”)
Unknown words
• What if we see a completely new word at test time?
• p(s) = 0 ⇒ p(S) = 0 → infinite perplexity
• Solution: create a special token <unk>
• Fix a vocabulary V and replace any word in the training set not in V with <unk>
• Train
• At test time, use p(<unk>) for words not in V
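A minimal preprocessing sketch (the count threshold and token name are illustrative choices):

```python
from collections import Counter

def replace_rare(corpus, min_count=2, unk="<unk>"):
    """Map words seen fewer than min_count times to <unk>.
    At test time, apply the same vocabulary to map unseen words to <unk>."""
    counts = Counter(w for sent in corpus for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    mapped = [[w if w in vocab else unk for w in sent] for sent in corpus]
    return mapped, vocab
```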
Summary and re-cap
• Sentence probability: decompose with the chain rule
• Use a Markov independence assumption
• Smooth estimates to do better on rare events
• Many attempts to improve LMs using syntax, but not easy
• More complex smoothing methods exist for handling rare events (Kneser-Ney, Good-Turing…)
Problem
• Our estimator q(w | u, v) is based on a one-hot representation. There is no relation between words.
• p(played | the, actress)
• p(played | the, actor)
• Can we use distributed representations?
Neural networks
Plan for today
• Feed-forward neural networks for language modeling
• Recurrent neural networks for language modeling
• Vanishing/exploding gradients
Neural networks: history
• Proposed in the mid-20th century
• Criticism from Minsky (“Perceptrons”)
• Back-propagation appeared in the 1980s
• In the 1990s SVMs appeared and were more successful
• Since the beginning of this decade, they have shown great success in speech, vision, language, robotics…
Motivation 1
Can we learn representations from raw data?
Motivation 2
Can we learn non-linear decision boundaries?
Actually very similar to motivation 1
A single neuron
• A neuron is a computational unit of the form
$f_{w,b}(x) = f(w^\top x + b)$
• x: input vector
• w: weights vector
• b: bias
• f: activation function
(intro to ML)
A single neuron
• If f is the sigmoid, a neuron is logistic regression
• Let x, y be a binary classification training example
$p(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x - b}} = \sigma(w^\top x + b)$
Provides a linear decision boundary
(intro to ML)
Single layer network
• Perform logistic regression in parallel multiple times (with different parameters)
$y = \sigma(w^\top x + b) = \sigma(w^\top x)$
$x, w \in \mathbb{R}^{d+1}, \quad x_{d+1} = 1, \quad w_{d+1} = b$
Single layer network
• L1: input layer
• L2: hidden layer
• L3: output layer
• Output layer provides the prediction
• Hidden layer is the learned representation
Multi-layer network
Repeat:
Matrix notation
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
$a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
$a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)$
$z = Wx + b, \quad a = f(z)$
$x \in \mathbb{R}^{4}, \quad z \in \mathbb{R}^{3}, \quad W \in \mathbb{R}^{3 \times 4}$ (with the bias folded in as $x_4 = 1$)
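A minimal numpy sketch of this computation (the sizes and the sigmoid activation are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one layer in matrix notation: z = Wx + b, a = f(z)
W = np.random.randn(3, 3)    # 3 units, 3 inputs (illustrative sizes)
b = np.random.randn(3)
x = np.random.randn(3)

z = W @ x + b                # pre-activations
a = sigmoid(z)               # layer outputs a_1, a_2, a_3
```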
Language modeling with NNs
• Keep the Markov assumption
• Learn a probability q(u | v, w) with distributed representations
Language modeling with NNs
Bengio et al., 2003
the dog → e(the), e(dog) → h(the, dog) → f(h(the, dog)) → p(laughed | the, dog)
$e(w) = W_e \cdot w, \quad W_e \in \mathbb{R}^{d \times |V|}, \; w \in \mathbb{R}^{|V| \times 1}$
$h(w_{i-2}, w_{i-1}) = \sigma(W_h [e(w_{i-2}); e(w_{i-1})]), \quad W_h \in \mathbb{R}^{m \times 2d}$
$f(z) = \mathrm{softmax}(W_o \cdot z), \quad W_o \in \mathbb{R}^{|V| \times m}, \; z \in \mathbb{R}^{m \times 1}$
$p(w_i \mid w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_i$
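A numpy sketch of the forward computation above; matrix names follow the slide, and the sigmoid hidden activation is assumed:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ffnn_lm(i2, i1, We, Wh, Wo):
    """p(. | w_{i-2}, w_{i-1}) for the feed-forward trigram LM.

    i2, i1: vocabulary indices of the two context words.
    We: d x |V|, Wh: m x 2d, Wo: |V| x m (shapes as on the slide)."""
    e = np.concatenate([We[:, i2], We[:, i1]])   # [e(w_{i-2}); e(w_{i-1})]
    h = 1.0 / (1.0 + np.exp(-(Wh @ e)))          # sigma(W_h [.;.])
    return softmax(Wo @ h)                        # distribution over V
```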
Loss function
• Minimize the negative log-likelihood
$L(\theta) = -\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{i-2}, w_{i-1})$
• When we learned word vectors, we were basically training a language model
Advantages
• If we see in the training set
The cat is walking in the bedroom
• We can hope to learn
A dog was running through a room
• We can use n-grams with n > 3 and pay only a linear price
• Compared to what?
Parameter estimation
• We train with SGD
• How to efficiently compute the gradients?
• Backpropagation (Rumelhart, Hinton and Williams, 1986)
• Proof in intro to ML; here we only repeat the algorithm and give an example
Backpropagation
• Notation:
$W_t$: weight matrix at the input of layer t
$z_t$: output vector at layer t
$x = z_0$: input vector
$y$: gold scalar
$\hat{y} = z_L$: predicted scalar
$l(y, \hat{y})$: loss function
$v_t = W_t \cdot z_{t-1}$: pre-activations
$\delta_t = \frac{\partial l(y, \hat{y})}{\partial z_t}$: gradient vector
Backpropagation
• Run the network forward to obtain all values $v_t, z_t$
• Base: $\delta_L = l'(y, z_L)$
• Recursion: $\delta_t = W_{t+1}^\top (\sigma'(v_{t+1}) \circ \delta_{t+1})$, where $\sigma'(v_{t+1}), \delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}$ and $W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}$
• Gradients: $\frac{\partial l}{\partial W_t} = (\delta_t \circ \sigma'(v_t)) \, z_{t-1}^\top$
Bigram LM example
• Forward pass:
$z_0 \in \mathbb{R}^{|V| \times 1}$: one-hot vector input
$z_1 = W_1 \cdot z_0, \quad W_1 \in \mathbb{R}^{d_1 \times |V|}, \; z_1 \in \mathbb{R}^{d_1 \times 1}$
$z_2 = \sigma(W_2 \cdot z_1), \quad W_2 \in \mathbb{R}^{d_2 \times d_1}, \; z_2 \in \mathbb{R}^{d_2 \times 1}$
$z_3 = \mathrm{softmax}(W_3 \cdot z_2), \quad W_3 \in \mathbb{R}^{|V| \times d_2}, \; z_3 \in \mathbb{R}^{|V| \times 1}$
$l(y, z_3) = -\sum_i y^{(i)} \log z_3^{(i)}$
Bigram LM example
• Backward pass:
$\sigma'(v_3) \circ \delta_3 = (z_3 - y)$
$\delta_2 = W_3^\top (\sigma'(v_3) \circ \delta_3) = W_3^\top (z_3 - y)$
$\delta_1 = W_2^\top (\sigma'(v_2) \circ \delta_2) = W_2^\top (z_2 \circ (1 - z_2) \circ \delta_2)$
$\frac{\partial l}{\partial W_3} = (\delta_3 \circ \sigma'(v_3)) z_2^\top = (z_3 - y) z_2^\top$
$\frac{\partial l}{\partial W_2} = (\delta_2 \circ \sigma'(v_2)) z_1^\top = (\delta_2 \circ z_2 \circ (1 - z_2)) z_1^\top$
$\frac{\partial l}{\partial W_1} = (\delta_1 \circ \sigma'(v_1)) z_0^\top = \delta_1 z_0^\top$
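A numpy sketch putting the forward and backward passes together for one training example; shapes follow the slide, and a word index stands in for the one-hot z0:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_backward(W1, W2, W3, x_idx, y_idx):
    """One example for the bigram NN LM; x_idx indexes the one-hot z0,
    y_idx is the gold next word."""
    # forward pass
    z1 = W1[:, x_idx]                          # W1 z0 with one-hot z0
    z2 = 1.0 / (1.0 + np.exp(-(W2 @ z1)))      # sigma(W2 z1)
    z3 = softmax(W3 @ z2)
    y = np.zeros_like(z3)
    y[y_idx] = 1.0
    loss = -np.log(z3[y_idx])                  # -sum_i y_i log z3_i
    # backward pass, following the deltas on the slide
    g3 = z3 - y                                # sigma'(v3) o delta3
    g2 = z2 * (1 - z2) * (W3.T @ g3)           # sigma'(v2) o delta2
    d1 = W2.T @ g2                             # delta1 (layer 1 is linear)
    dW3 = np.outer(g3, z2)                     # (z3 - y) z2^T
    dW2 = np.outer(g2, z1)
    dW1 = np.zeros_like(W1)
    dW1[:, x_idx] = d1                         # delta1 z0^T with one-hot z0
    return loss, (dW1, dW2, dW3)
```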
Summary
• Neural nets can improve language models:
• better scalability for larger N
• use of word similarity
• complex decision boundaries
• Training through backpropagation
But we still have a Markov assumption
He is from France, so it makes sense that his first language is…
Recurrent neural networks
(unrolled RNN over “the dog …”, predicting “laughed”)
Input: $w_1, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_T$, $w_i \in \mathbb{R}^{|V|}$
Model:
$x_t = W^{(e)} \cdot w_t, \quad W^{(e)} \in \mathbb{R}^{d \times |V|}$
$h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t), \quad W^{(hh)} \in \mathbb{R}^{D_h \times D_h}, \; W^{(hx)} \in \mathbb{R}^{D_h \times d}$
$\hat{y}_t = \mathrm{softmax}(W^{(s)} h_t), \quad W^{(s)} \in \mathbb{R}^{|V| \times D_h}$
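A numpy sketch of one time step of these equations; a column lookup into the embedding matrix stands in for $W^{(e)} w_t$ with one-hot $w_t$:

```python
import numpy as np

def rnn_step(w_idx, h_prev, We, Whh, Whx, Ws):
    """One step of the RNN LM; w_idx indexes the one-hot w_t."""
    x = We[:, w_idx]                                     # x_t = W^(e) w_t
    h = 1.0 / (1.0 + np.exp(-(Whh @ h_prev + Whx @ x)))  # h_t
    s = Ws @ h
    e = np.exp(s - s.max())
    return h, e / e.sum()                                # h_t, y_hat_t
```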
Recurrent neural networks
• Can exploit long-range dependencies
• Each layer has the same weights (weight sharing/tying)
• What is the loss function?
$J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)$
Recurrent neural networks
• Components:
• Model? RNN (saw before)
• Loss? Sum of cross-entropy over time
• Optimization? SGD
• Gradient computation? Back-propagation through time
Training RNNs
• Capturing long-range dependencies with RNNs is difficult
• Vanishing/exploding gradients
• Small changes in the hidden layer value at step k cause huge/minuscule changes to the values of hidden layer t, for t ≫ k
Explanation
Consider a simple linear RNN with no input:
$h_t = W \cdot h_{t-1}$
$h_t = W^t \cdot h_0$
$h_t = (Q \Lambda Q^{-1})^t \cdot h_0$
$h_t = (Q \Lambda^t Q^{-1}) \cdot h_0$
where $W = Q \Lambda Q^{-1}$ is an eigendecomposition
• Some eigenvalues will explode and some will shrink to zero
• Stretch the input in the direction of the eigenvector with the largest eigenvalue
Explanation
$h_t = W^{(hh)} \sigma(h_{t-1}) + W^{(hx)} x_t + b$
$L = \sum_{t=1}^{T} L_t = \sum_{t=1}^{T} L(h_t)$
Explanation
$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta}$
$\frac{\partial L_t}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial^+ h_k}{\partial \theta}$
$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W^{(hh)} \, \mathrm{diag}(\sigma'(h_{i-1}))$
Explanation
Assume $\|\sigma'(\cdot)\| \leq \gamma$ and $\lambda_1 < \frac{1}{\gamma}$, where $\lambda_1$ is the absolute value of the largest eigenvalue of $W^{(hh)}$
$\forall k: \left\| \frac{\partial h_k}{\partial h_{k-1}} \right\| \leq \|W^{(hh)}\| \cdot \|\mathrm{diag}(\sigma'(h_{k-1}))\| < \frac{1}{\gamma} \cdot \gamma = 1$
Let $\eta$ be a constant such that $\forall k: \left\| \frac{\partial h_k}{\partial h_{k-1}} \right\| \leq \eta < 1$
$\left\| \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right\| \leq \eta^{t-k} \left\| \frac{\partial L_t}{\partial h_t} \right\|$
so long-term influence vanishes to zero.
Solutions
• Exploding gradient: gradient clipping
• Re-normalize the gradient to have norm at most C (this is no longer the true gradient in size, but it is in direction)
• Exploding gradients are easy to detect
• Vanishing gradient: the problem is with the model!
• Change it (LSTMs, GRUs)
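A minimal sketch of the clipping step:

```python
import numpy as np

def clip_gradient(g, C):
    """Rescale g so its norm is at most C; the direction is unchanged."""
    norm = np.linalg.norm(g)
    return g * (C / norm) if norm > C else g
```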
Illustration
Pascanu et al., 2013
LSTMs and GRUs
• Bottom line: use vector addition and not matrix-vector multiplication. Allows for better propagation of gradients to the past
Gated Recurrent Unit (GRU)
• Main insight: add learnable gates to the recurrent unit that control the flow of information from the past to the present
• Vanilla RNN: $h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$
• Update and reset gates:
$z_t = \sigma(W^{(z)} h_{t-1} + U^{(z)} x_t)$
$r_t = \sigma(W^{(r)} h_{t-1} + U^{(r)} x_t)$
Cho et al., 2014
Gated Recurrent Unit (GRU)
• Use the gates to control information flow
• If z = 1, we simply copy the past and ignore the present (note that the gradient will be 1)
• If z = 0, then we have an RNN-like update, but we are also free to reset some of the past units, and if r = 0, then we have no memory of the past
• The + in the last equation is crucial
$z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$
$r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$
$\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
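A numpy sketch of one GRU step following these equations ($\circ$ is elementwise multiplication):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x + r * (U @ h_prev))    # candidate state
    return z * h_prev + (1 - z) * h_tilde          # additive combination
```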
Illustration
(diagram of the GRU cell: inputs $h_{t-1}, x_t$; gates $r_t, z_t$; candidate $\tilde{h}_t$; output $h_t$)
Long short-term memory (LSTM)
• z has been split into i and f
• There is no r
• There is a new gate o that distinguishes between the memory and the output
• c is like h in GRUs
• h is the output
Hochreiter and Schmidhuber, 1997
$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})$
$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})$
$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})$
$\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})$
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ \tanh(c_t)$
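A numpy sketch of one LSTM step following these equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    i = sigmoid(Wi @ x + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev)  # candidate memory
    c = f * c_prev + i * c_tilde             # additive memory update
    h = o * np.tanh(c)                       # output
    return h, c
```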
Illustration
(diagram of the LSTM cell: inputs $h_{t-1}, x_t, c_{t-1}$; gates $i_t, f_t, o_t$; candidate $\tilde{c}_t$; outputs $c_t, h_t$)
Illustration
Chris Olah’s blog
More GRU intuition from Stanford
• Go over sequence of slides from Chris Manning
Results
Jozefowicz et al., 2016
Summary
• Language modeling is a fundamental NLP task used in machine translation, spelling correction, speech recognition, etc.
• Traditional models use n-gram counts and smoothing
• Feed-forward models take word similarity into account to generalize better
• Recurrent models can potentially learn to exploit long-range interactions
• Neural models dramatically reduced perplexity
• Recurrent networks are now used in many other NLP tasks (bidirectional RNNs, deep RNNs)