Page 1

Recurrent Neural Networks

Fall 2019

COS 484: Natural Language Processing

How to model sequences using neural networks?

(Some slides adapted from Chris Manning, Abigail See, Andrej Karpathy)

Page 2

Overview

• What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs

Page 3

Recurrent neural networks (RNNs)

A class of neural networks that can handle variable-length inputs

A function: y = RNN(x1, x2, …, xn) ∈ ℝd

where x1, …, xn ∈ ℝdin

Page 4

Recurrent neural networks (RNNs)

Proven to be a highly effective approach to language modeling, sequence tagging, and text classification tasks:

(Figure: example architectures for language modeling, sequence tagging, and text classification, e.g. "The movie sucks ." → 👎)

Page 5

Recurrent neural networks (RNNs)

Form the basis for modern approaches to machine translation, question answering, and dialogue:

Page 6

Why variable-length?

Recall the feedforward neural LMs we learned:

The dogs are barking

the dogs in the neighborhood are ___

x = [e_the, e_dogs, e_are] ∈ ℝ^{3d}

(fixed-window size = 3)

Page 7

Simple RNNs

h0 ∈ ℝd is an initial state

ht = f(ht−1, xt) ∈ ℝd

ht = g(Wht−1 + Uxt + b) ∈ ℝd

Simple RNNs:

W ∈ ℝd×d, U ∈ ℝd×din, b ∈ ℝd

g: nonlinearity (e.g. tanh)

ht: hidden states, which store information from x1 to xt
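As a concrete illustration of this recurrence, here is a minimal numpy sketch of a simple RNN forward pass; the sizes, the random parameters, and the toy inputs are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in, n = 4, 3, 5                      # hidden size d, input size d_in, sequence length n

W = rng.normal(0, 0.1, (d, d))            # W ∈ R^{d×d}
U = rng.normal(0, 0.1, (d, d_in))         # U ∈ R^{d×d_in}
b = np.zeros(d)                           # b ∈ R^d
xs = rng.normal(size=(n, d_in))           # toy inputs x_1, ..., x_n ∈ R^{d_in}

h = np.zeros(d)                           # h_0: initial state
states = []
for x_t in xs:
    h = np.tanh(W @ h + U @ x_t + b)      # h_t = g(W h_{t-1} + U x_t + b), with g = tanh
    states.append(h)                      # h_t stores information from x_1 ... x_t
```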

Page 8

Simple RNNs

Key idea: apply the same weights W repeatedly

ht = g(Wht−1 + Uxt + b) ∈ ℝd

Page 9

RNNs vs Feedforward NNs

Page 10

Recurrent Neural Language Models (RNNLMs)

P(w1, w2, …, wn) = P(w1) × P(w2 ∣ w1) × P(w3 ∣ w1, w2) × … × P(wn ∣ w1, w2, …, wn−1)

= P(w1 ∣ h0) × P(w2 ∣ h1) × P(w3 ∣ h2) × … × P(wn ∣ hn−1)

• Denote yt = softmax(Wo ht), where Wo ∈ ℝ|V|×d

• Cross-entropy loss:

L(θ) = −(1/n) ∑_{t=1}^{n} log y_{t−1}(w_t)

the students opened their … exams

θ = {W, U, b, Wo, E}
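A minimal numpy sketch of this loss for one sentence; the vocabulary size, the embedding matrix E, the random parameters, and the word indices are toy assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                   # toy vocabulary size and hidden size
E = rng.normal(0, 0.1, (V, d))                 # word embeddings (row w is e_w)
W, U = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
b, Wo = np.zeros(d), rng.normal(0, 0.1, (V, d))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

words = [3, 1, 4, 1, 5]                        # w_1 ... w_n as vocabulary indices
h = np.zeros(d)                                # h_0
loss = 0.0
for w_t in words:
    y = softmax(Wo @ h)                        # y_{t-1} = softmax(W_o h_{t-1}): distribution over w_t
    loss += -np.log(y[w_t])                    # cross-entropy term -log y_{t-1}(w_t)
    h = np.tanh(W @ h + U @ E[w_t] + b)        # advance the hidden state with the observed word
loss /= len(words)                             # L(θ) = -(1/n) Σ_t log y_{t-1}(w_t)
```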

Page 11

Training RNNLMs

• Backpropagation? Yes, but not that simple!

• The algorithm is called Backpropagation Through Time (BPTT).

Page 12

Backpropagation through time

h1 = g(Wh0 + Ux1 + b)

h2 = g(Wh1 + Ux2 + b)

h3 = g(Wh2 + Ux3 + b)

L3 = − log y3(w4)

You should know how to compute ∂L3/∂h3, and:

∂L3/∂W = (∂L3/∂h3)(∂h3/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂W) + (∂L3/∂h3)(∂h3/∂h2)(∂h2/∂h1)(∂h1/∂W)

In general:

∂L/∂W = (1/n) ∑_{t=1}^{n} ∑_{k=1}^{t} (∂Lt/∂ht) (∏_{j=k+1}^{t} ∂hj/∂hj−1) (∂hk/∂W)
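The three-term sum above can be accumulated one outer product at a time while backpropagating through the tanh nonlinearity. Below is a minimal numpy sketch with toy random values; the vector standing in for ∂L3/∂h3 is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in = 4, 3
W, U, b = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d_in)), np.zeros(d)
xs = rng.normal(size=(3, d_in))

# Forward pass: h_1, h_2, h_3
hs = [np.zeros(d)]                                # hs[0] = h_0
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x + b))

dL3_dh3 = rng.normal(size=d)                      # stand-in for ∂L3/∂h3

# Backward pass: each k = 3, 2, 1 contributes one outer product to ∂L3/∂W
dW = np.zeros_like(W)
delta = (1 - hs[3] ** 2) * dL3_dh3                # gradient w.r.t. the pre-activation at step 3
for k in range(3, 0, -1):
    dW += np.outer(delta, hs[k - 1])              # the ∂h_k/∂W term for this k
    delta = (1 - hs[k - 1] ** 2) * (W.T @ delta)  # chain through ∂h_k/∂h_{k-1}
```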

Page 13

Truncated backpropagation through time

• Backpropagation is very expensive if you handle long sequences

• Run forward and backward through chunks of the sequence instead of whole sequence

• Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
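A minimal numpy sketch of the chunking idea, reusing the simple RNN from earlier slides; the chunk length, the toy loss (applied only to the last state of each chunk), and the random data are assumptions for the example. The hidden state is carried across chunk boundaries, but the backward loop never crosses one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in, T, chunk = 8, 5, 100, 20                # hidden size, input size, sequence length, truncation length
W, U, b = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d_in)), np.zeros(d)
xs = rng.normal(size=(T, d_in))

h = np.zeros(d)                                  # carried forward across chunks "forever"
dW = np.zeros_like(W)
for start in range(0, T, chunk):
    # Forward through one chunk, remembering the states needed for the backward pass
    hs = [h]
    for x in xs[start:start + chunk]:
        hs.append(np.tanh(W @ hs[-1] + U @ x + b))
    h = hs[-1]                                   # this state is carried into the next chunk
    # Toy loss on the last state of the chunk: L = 0.5 * ||h||^2, so dL/dh = h
    delta = (1 - hs[-1] ** 2) * hs[-1]
    # Backward only through this chunk; earlier chunks are never revisited
    for k in range(len(hs) - 1, 0, -1):
        dW += np.outer(delta, hs[k - 1])
        delta = (1 - hs[k - 1] ** 2) * (W.T @ delta)
```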

Page 14

Progress on language models

On the Penn Treebank (PTB) dataset. Metric: perplexity.

(Mikolov and Zweig, 2012): Context dependent recurrent neural network language model

KN5: Kneser-Ney 5-gram

Page 15

Progress on language models

(Yang et al, 2018): Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

On the Penn Treebank (PTB) dataset. Metric: perplexity.

Page 16

Vanishing/exploding gradients

• Consider the gradient of Lt at step t, with respect to the hidden state hk at some previous step k (k < t):

∂Lt/∂hk = (∂Lt/∂ht) ∏_{t≥j>k} ∂hj/∂hj−1

(advanced)

• (Pascanu et al, 2013) showed that if the largest eigenvalue of W is less than 1 for g = tanh, then the gradient will shrink exponentially. This problem is called vanishing gradients.

• In contrast, if the gradients are getting too large, it is called exploding gradients.

∂Lt/∂hk = (∂Lt/∂ht) × ∏_{t≥j>k} (diag(g′(W hj−1 + U xj + b)) W)

Page 17

Why is exploding gradient a problem?

• Gradients become too big and we take a very large step in SGD.

• Solution: Gradient clipping — if the norm of the gradient is greater than some threshold, scale it down before applying SGD update.
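A minimal sketch of norm clipping; the threshold of 5.0 and the list-of-arrays representation of the gradients are assumptions for the example (frameworks ship their own version, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch).

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """If the global L2 norm of the gradients exceeds max_norm, scale them all down."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Usage: clip right before the SGD update, e.g.
#   dW, dU, db = clip_gradients([dW, dU, db])
#   W -= lr * dW; U -= lr * dU; b -= lr * db
```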

Page 18

Why is vanishing gradient a problem?

• If the gradients become vanishingly small over long distances (from step t back to step k), then we can't tell whether:

• We don't need long-term dependencies • We have the wrong parameters to capture the true dependency

the dogs in the neighborhood are ___ (still difficult to predict "barking")

• How to fix the vanishing gradient problem? • LSTMs: Long short-term memory networks • GRUs: Gated recurrent units

Page 19

Long Short-term Memory (LSTM)

• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem

ht = f(ht−1, xt) ∈ ℝd

• Work extremely well in practice

• Basic idea: turning multiplication into addition

• Use “gates” to control how much information to add/erase

• At each timestep, there is a hidden state ht ∈ ℝd and also a cell state ct ∈ ℝd

• ct stores long-term information

• We write/erase ct after each step

• We read ht from ct

Page 20

Long Short-term Memory (LSTM)

There are 4 gates:

• Input gate (how much to write): it = σ(W(i)ht−1 + U(i)xt + b(i)) ∈ ℝd

• Forget gate (how much to erase): ft = σ(W( f )ht−1 + U( f )xt + b( f )) ∈ ℝd

• Output gate (how much to reveal): ot = σ(W(o)ht−1 + U(o)xt + b(o)) ∈ ℝd

• New memory cell (what to write): c̃t = tanh(W(c)ht−1 + U(c)xt + b(c)) ∈ ℝd

How many parameters in total?

• Final memory cell: ct = ft ⊙ ct−1 + it ⊙ c̃t

• Final hidden state: ht = ot ⊙ ct (⊙: element-wise product)
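On the question above: with this parameterization there are 4 weight matrices W ∈ ℝ^{d×d}, 4 matrices U ∈ ℝ^{d×d_in}, and 4 bias vectors, i.e. 4(d² + d·d_in + d) parameters. Below is a minimal numpy sketch of one LSTM timestep following these equations; the parameter shapes and random initialization are assumptions for the example, and the update ht = ot ⊙ ct follows this slide (many formulations use ot ⊙ tanh(ct)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM timestep; p is a dict holding the four (W, U, b) triples."""
    i_t = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])      # input gate: how much to write
    f_t = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])      # forget gate: how much to erase
    o_t = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])      # output gate: how much to reveal
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])  # new memory cell: what to write
    c_t = f_t * c_prev + i_t * c_tilde                             # final memory cell
    h_t = o_t * c_t                                                # final hidden state (as on this slide)
    return h_t, c_t

# Toy usage with random parameters
rng = np.random.default_rng(0)
d, d_in = 4, 3
p = {}
for g in "ifoc":
    p["W" + g] = rng.normal(0, 0.1, (d, d))
    p["U" + g] = rng.normal(0, 0.1, (d, d_in))
    p["b" + g] = np.zeros(d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=d_in), h, c, p)
```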

Page 21

Long Short-term Memory (LSTM)

• LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies

• LSTMs were invented in 1997 but only started working well (and became widely used) around 2013–2015.

Page 22

Is the LSTM architecture optimal?

(Jozefowicz et al, 2015): An Empirical Exploration of Recurrent Network Architectures

Page 23

Overview

• What is a recurrent neural network (RNN)? • Simple RNNs • Backpropagation through time • Long short-term memory networks (LSTMs) • Applications • Variants: Stacked RNNs, Bidirectional RNNs

Page 24

Application: Text Generation

You can generate text by repeated sampling. Sampled output is next step’s input.
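A minimal numpy sketch of the sampling loop with a toy random model; the vocabulary size, the embedding matrix E, and the use of index 0 as a start symbol are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                  # toy vocabulary and hidden sizes
E = rng.normal(0, 0.1, (V, d))                # word embeddings
W, U, b = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)), np.zeros(d)
Wo = rng.normal(0, 0.1, (V, d))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

h, token, generated = np.zeros(d), 0, []      # index 0 plays the role of a <s> start symbol
for _ in range(20):
    h = np.tanh(W @ h + U @ E[token] + b)     # step the RNN on the previous token
    token = rng.choice(V, p=softmax(Wo @ h))  # sample the next word ...
    generated.append(token)                   # ... and feed it back in as the next input
```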

Page 25

Fun with RNNs

Andrej Karpathy “The Unreasonable Effectiveness of Recurrent Neural Networks”

Obama speeches, LaTeX generation

Page 26

Application: Sequence Tagging

Input: a sentence of n words: x1, …, xn

Output: y1, …, yn, yi ∈ {1, …, C}

P(yi = k) = softmax_k(Wo hi), where Wo ∈ ℝC×d

L = −(1/n) ∑_{i=1}^{n} log P(yi = yi*)   (yi*: the gold tag for word i)
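A minimal numpy sketch of the per-token prediction and loss; the hidden states are random stand-ins for the RNN outputs, and the gold tags are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, n = 5, 8, 4                               # number of tags, hidden size, sentence length
Wo = rng.normal(0, 0.1, (C, d))
hs = rng.normal(size=(n, d))                    # stand-ins for h_1 ... h_n from an RNN
gold = [1, 0, 3, 2]                             # toy gold tags y_1* ... y_n*

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

loss = 0.0
for h_i, y_i in zip(hs, gold):
    probs = softmax(Wo @ h_i)                   # P(y_i = k) = softmax_k(W_o h_i)
    loss += -np.log(probs[y_i])                 # -log P(y_i = y_i*)
loss /= n                                       # L = -(1/n) Σ_i log P(y_i = y_i*)
```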

Page 27

Application: Text Classification

the movie was terribly exciting !

P(y = k) = softmax_k(Wo hn), where Wo ∈ ℝC×d and hn is the final hidden state

Input: a sentence of n words

Output: y ∈ {1,2,…, C}
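A minimal numpy sketch: run the simple RNN over toy word vectors and classify from the final hidden state hn; the random parameters stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, d_in, n = 2, 8, 6, 6                      # classes, hidden size, input size, sentence length
W, U, b = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d_in)), np.zeros(d)
Wo = rng.normal(0, 0.1, (C, d))
xs = rng.normal(size=(n, d_in))                 # word vectors, e.g. for "the movie was terribly exciting !"

h = np.zeros(d)
for x in xs:                                    # run the RNN over the whole sentence
    h = np.tanh(W @ h + U @ x + b)

logits = Wo @ h                                 # classify from h_n only
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # P(y = k) = softmax_k(W_o h_n)
pred = int(np.argmax(probs))
```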

Page 28

Multi-layer RNNs

• RNNs are already “deep” on one dimension (unroll over time steps)

• We can also make them “deep” in another dimension by applying multiple RNNs

• Multi-layer RNNs are also called stacked RNNs.

Page 29

Multi-layer RNNs

The hidden states from RNN layer i are the inputs to RNN layer i + 1

• In practice, using 2 to 4 layers is common (usually better than 1 layer)

• Transformer-based networks can be up to 24 layers with lots of skip-connections.
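A minimal numpy sketch of stacking: each layer is its own simple RNN, and layer i's hidden states become layer i + 1's inputs. The layer count, sizes, and random parameters are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, d, d_in, n = 3, 8, 5, 6
xs = rng.normal(size=(n, d_in))                        # word vectors x_1 ... x_n

layers = []
for layer in range(num_layers):
    in_dim = d_in if layer == 0 else d                 # layer 0 reads the words, later layers read hidden states
    layers.append({
        "W": rng.normal(0, 0.1, (d, d)),
        "U": rng.normal(0, 0.1, (d, in_dim)),
        "b": np.zeros(d),
    })

inputs = xs
for p in layers:                                       # run each layer over the full sequence
    h, outputs = np.zeros(d), []
    for x in inputs:
        h = np.tanh(p["W"] @ h + p["U"] @ x + p["b"])
        outputs.append(h)
    inputs = np.stack(outputs)                         # these hidden states feed the next layer
```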

Page 30

Bidirectional RNNs

• Bidirectionality is important in language representations:

terribly: • left context “the movie was” • right context “exciting !”

Page 31

Bidirectional RNNs

ht = f(ht−1, xt) ∈ ℝd

Forward: →ht = f1(→ht−1, xt), t = 1, 2, …, n

Backward: ←ht = f2(←ht+1, xt), t = n, n − 1, …, 1

ht = [→ht ; ←ht] ∈ ℝ2d
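A minimal numpy sketch of the two passes and the concatenation; f1 and f2 are two independently parameterized simple RNNs with random toy weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_in, n = 4, 3, 5
xs = rng.normal(size=(n, d_in))

def make_rnn():
    return rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d_in)), np.zeros(d)

(Wf, Uf, bf), (Wb, Ub, bb) = make_rnn(), make_rnn()    # f1 (forward) and f2 (backward) parameters

fwd, h = [], np.zeros(d)
for t in range(n):                                     # left-to-right pass: t = 1 ... n
    h = np.tanh(Wf @ h + Uf @ xs[t] + bf)
    fwd.append(h)

bwd, h = [None] * n, np.zeros(d)
for t in reversed(range(n)):                           # right-to-left pass: t = n ... 1
    h = np.tanh(Wb @ h + Ub @ xs[t] + bb)
    bwd[t] = h

hs = [np.concatenate([fwd[t], bwd[t]]) for t in range(n)]   # h_t = [→h_t ; ←h_t] ∈ R^{2d}
```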

Page 32

Bidirectional RNNs

• Sequence tagging: Yes! • Text classification: Yes! With slight modifications.

• Text generation: No. Why?

(Figure: a bidirectional RNN over "the movie was terribly exciting !"; a sentence encoding can be formed by an element-wise mean/max over the hidden states)

