Neural Network with Memory
Hung-yi Lee
Memory is important
Example: the input is 2-dimensional, the output is 1-dimensional.
[Figure: a toy sequence task with inputs such as (1, 4) → 4 and (1, 7) → 7. The same input vector appears at different time steps but must produce different outputs, which no memoryless feedforward network can do.]
The network needs memory to achieve this.
Memory is important
[Figure, three animation steps: a small network with inputs x1, x2, output y, and a memory cell c1. At every time step the network's hidden value is written into c1 and fed back as an extra input at the next step, so the same input (x1, x2) can produce different outputs y at different time steps.]
Outline
Vanilla Recurrent Neural Network (RNN)
Variants of RNN
Long Short-term Memory (LSTM)
Application
• (Simplified) Speech Recognition
[Figure: two utterances. Each acoustic frame x1, x2, x3, x4, …… gets a phoneme label y1, y2, y3, y4, ……; e.g. Utterance 1 is labeled "TSI TSI TSI I I N N N" and Utterance 2 "S S @ @ @ @".]
If we use a DNN, all the frames are considered independently.
RNN
RNN input: x1 x2 x3 …… xN. The input of the RNN is one whole utterance, and the order of the frames cannot change.
Step 1: the memory is initialized to a0 = 0; the network computes a1 = σ(Wi x1 + Wh a0) and y1 = softmax(Wo a1), then copies a1 into the memory.
Step 2: a2 = σ(Wi x2 + Wh a1), y2 = softmax(Wo a2); the memory now stores a2.
Step 3: a3 = σ(Wi x3 + Wh a2), y3 = softmax(Wo a3); and so on for the rest of the utterance.
RNN (unrolled over time)
The same network is used again and again: the weights Wi, Wh, Wo are shared across all time steps. Starting from the initial memory a0 = 0 ("init"), step i reads xi and the previous memory ai−1 and produces yi and the new memory ai.
Output yi therefore depends on x1, x2, …… xi.
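To make the recurrence concrete, here is a minimal NumPy sketch of this forward pass. The dimensions and random weights are made up for illustration, and biases are omitted as on the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def rnn_forward(xs, Wi, Wh, Wo):
    """a_i = sigmoid(Wi x_i + Wh a_{i-1}), y_i = softmax(Wo a_i)."""
    a = np.zeros(Wh.shape[0])      # a0 = 0: the initial memory
    ys = []
    for x in xs:                   # the order of the frames cannot change
        a = 1.0 / (1.0 + np.exp(-(Wi @ x + Wh @ a)))   # update the memory
        ys.append(softmax(Wo @ a))                     # per-frame distribution
    return ys

# Toy utterance: 5 frames of 4-dim features, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
Wi = rng.normal(size=(8, 4))
Wh = rng.normal(size=(8, 8))
Wo = rng.normal(size=(3, 8))
for i, y in enumerate(rnn_forward([rng.normal(size=4) for _ in range(5)],
                                  Wi, Wh, Wo), start=1):
    print(f"y{i} =", np.round(y, 3))
```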
Cost
RNN input: x1 x2 x3 …… xN
RNN output: y1 y2 y3 …… yN
Target: ŷ1 ŷ2 ŷ3 …… ŷN
Squared error: $C = \frac{1}{2}\sum_{n=1}^{N} \lVert y^n - \hat{y}^n \rVert^2$
Cross entropy: $C = \sum_{n=1}^{N} -\log y^n_{r^n}$, where $r^n$ is the index of the correct class at step $n$.
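Both costs are easy to compute from the per-step outputs; a short sketch, assuming `ys` are the softmax outputs from the forward pass above:

```python
import numpy as np

def squared_error_cost(ys, targets):
    # C = 1/2 * sum_n ||y^n - yhat^n||^2
    return 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, targets))

def cross_entropy_cost(ys, labels):
    # C = sum_n -log y^n[r^n], with r^n the correct class index at step n
    return float(sum(-np.log(y[r]) for y, r in zip(ys, labels)))
```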
Training
RNN training is very difficult in practice.
Backpropagation through time (BPTT): unroll the network over the whole utterance (starting from the initial memory a0 = 0), compare every output yn against its target ŷn, and update the weights by gradient descent:
$w \leftarrow w - \eta \, \partial C / \partial w$, where $w$ is an element of Wh, Wi or Wo.
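A minimal single-utterance BPTT sketch for the network above, using the cross-entropy cost (a real implementation would batch utterances and clip gradients, since training is difficult in practice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt_step(xs, labels, Wi, Wh, Wo, eta=0.1):
    """One gradient-descent step on one utterance, for the
    cross-entropy cost C = sum_n -log y^n[r^n]."""
    # Forward pass, remembering every hidden state for the backward pass.
    a_hist, ys = [np.zeros(Wh.shape[0])], []          # a0 = 0
    for x in xs:
        a_hist.append(sigmoid(Wi @ x + Wh @ a_hist[-1]))
        logits = Wo @ a_hist[-1]
        e = np.exp(logits - logits.max())
        ys.append(e / e.sum())
    # Backward pass through time.
    dWi, dWh, dWo = np.zeros_like(Wi), np.zeros_like(Wh), np.zeros_like(Wo)
    carry = np.zeros(Wh.shape[0])        # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dlogit = ys[t].copy()
        dlogit[labels[t]] -= 1.0                       # softmax + CE gradient
        dWo += np.outer(dlogit, a_hist[t + 1])
        da = Wo.T @ dlogit + carry
        dz = da * a_hist[t + 1] * (1 - a_hist[t + 1])  # sigmoid derivative
        dWi += np.outer(dz, xs[t])
        dWh += np.outer(dz, a_hist[t])
        carry = Wh.T @ dz
    # w <- w - eta * dC/dw for every element of Wi, Wh, Wo (in place).
    Wi -= eta * dWi; Wh -= eta * dWh; Wo -= eta * dWo
    return float(sum(-np.log(y[r]) for y, r in zip(ys, labels)))
```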
More Applications
• Input and output are vector sequences with the same length
POS Tagging
John saw the saw. → PN V D N
(Each word x1 x2 x3 x4 is tagged with a part of speech y1 y2 y3 y4.)
More Applications
• Named entity recognition
• Identifying names of people, places, organizations, etc. from a sentence
• Harry Potter is a student of Hogwarts and lived on Privet Drive.
• Each word is tagged as a person, an organization, a place, or not a named entity.
• Information extraction
• Extracting the pieces of information relevant to a specific application, e.g. flight booking
• I would like to leave Boston on November 2nd and arrive in Taipei before 2 p.m.
• Each word is tagged as place of departure, destination, time of departure, time of arrival, or other.
Outline
Vanilla Recurrent Neural Network (RNN)
Variants of RNN
Long Short-term Memory (LSTM)
Elman Network & Jordan Network
[Figure: two recurrent architectures sharing weights Wi, Wh, Wo across time. In the Elman network the hidden-layer activations are stored and fed back as the memory for the next step; in the Jordan network the outputs yt are fed back instead.]
Deep RNN
[Figure: several recurrent layers stacked on top of each other. At each time step t, every layer passes its activations both upward to the next layer and forward to itself at time t+1, producing yt, yt+1, yt+2, ……]
Bidirectional RNN
[Figure: one RNN reads xt, xt+1, xt+2, …… forward and another reads the same sequence backward; each output yt is computed from both hidden states, so it depends on the entire input sequence.]
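A minimal sketch of this idea, assuming the common design in which the two directions' hidden states are concatenated before the output layer (the combination rule is an assumption; the slides only show the two directions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_states(xs, Wi, Wh):
    """Hidden states of a vanilla RNN run over the sequence xs."""
    a, states = np.zeros(Wh.shape[0]), []
    for x in xs:
        a = sigmoid(Wi @ x + Wh @ a)
        states.append(a)
    return states

def birnn_outputs(xs, fWi, fWh, bWi, bWh, Wo):
    """y_t sees the forward state at t and the backward state at t,
    so every output depends on the whole input sequence."""
    fwd = hidden_states(xs, fWi, fWh)                # x1 -> xN
    bwd = hidden_states(xs[::-1], bWi, bWh)[::-1]    # xN -> x1, re-aligned
    return [Wo @ np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```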
Many to One
• Input is a vector sequence, but output is only one vector
Sentiment Analysis
[Figure: the RNN reads a sentence character by character, e.g. 我 覺 得 太 糟 了 ("I think it's terrible"), and emits a single label at the end.]
Label set from PTT movie reviews, from most positive to most negative: 超好雷, 好雷, 普雷, 負雷, 超負雷.
Examples:
看了這部電影覺得很高興……. ("Watching this movie made me very happy") → Positive (正雷)
這部電影太糟了……. ("This movie is terrible") → Negative (負雷)
這部電影很棒……. ("This movie is great") → Positive (正雷)
Many to Many (Output is shorter)
• Both input and output are vector sequences, but the output is shorter.
Speech Recognition
[Figure: the frame-level outputs are 好 好 好 棒 棒 棒 棒 棒; trimming the repeated labels gives "好棒" ("great").]
But with trimming alone you can never recognize "好棒棒" (a sarcastic phrase that genuinely repeats the character 棒)!
Many to Many (Output is shorter)
• Both input and output are vector sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC)
• Add an extra symbol "φ" meaning "no output at this frame".
好 φ φ 棒 φ φ φ φ → "好棒"
好 φ φ 棒 φ 棒 φ φ → "好棒棒"
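The decoding rule is simple to state in code (this is only the output-collapsing rule, not the CTC training objective):

```python
def ctc_collapse(frame_labels, blank="φ"):
    """Merge consecutive repeats, then drop the blank symbol."""
    out, prev = [], None
    for s in frame_labels:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse("好φφ棒φφφφ"))   # -> 好棒
print(ctc_collapse("好φφ棒φ棒φφ"))  # -> 好棒棒 (the φ between the two 棒
                                    #    keeps the repeat from merging)
```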
Many to Many (No Limitation)
• Both input and output are vector sequences with different lengths. → Sequence to sequence learning
Machine Translation
[Figure: the RNN reads the input "machine learning" and then generates the translation 機 器 學 習 character by character. Without a stop symbol the generation is never-ending: 機 器 學 習 慣 性 ……, where 學習 runs on into 習慣 ("habit") and 慣性 ("inertia").]
Many to Many (No Limitation)
• 推文接龍 ("comment chaining", a word game on the PTT forum where each comment continues from the previous one)
• Ref: http://pttpedia.pixnet.net/blog/post/168133002-%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
Example chain: 推xxx: ptt萬歲 → 推dd: 歲平安 → 噓dddf: 全…… → 推zzzzzzzzzzz: 家就是你家 → ……
The chain only stops when someone posts a break line: 推tlkagk: =========斷==========
Many to Many (No Limitation)
• Both input and output are vector sequences with different lengths. → Sequence to sequence learning
Solution: add a stop symbol "===" (斷, "break").
[Figure: the RNN reads "machine learning", outputs 機 器 學 習, then emits === and stops.]
One to Many
• Input is one vector, but output is a vector sequence
Caption Generation
[Figure: from a single input image vector, the RNN generates the caption word by word, e.g. "a woman is throwing a ……", ending with the stop symbol === (斷).]
Outline
Vanilla Recurrent Neural Network (RNN)
Variants of RNN
Long Short-term Memory (LSTM)
Long Short-term Memory (LSTM)
[Figure: a memory cell guarded by three gates. A signal from another part of the network controls the input gate, another controls the output gate, and a third controls the forget gate; the cell's input and output also connect to other parts of the network.]
LSTM
A special neuron with 4 inputs and 1 output: the block input $z$ and the three gate signals $z_i$, $z_f$, $z_o$.
The gate activation function $f$ is usually a sigmoid, so its value lies between 0 and 1, mimicking an open (≈1) or closed (≈0) gate.
The input $g(z)$ is multiplied by the input gate $f(z_i)$, and the old memory $c$ by the forget gate $f(z_f)$:
$c' = g(z) f(z_i) + c f(z_f)$
The output is the transformed new memory, gated by the output gate:
$a = h(c') f(z_o)$
Original network: the inputs x1, x2 are weighted into each neuron's activation z1, z2, producing outputs a1, a2.
Simply replace the neurons with LSTM cells: each cell now needs four separate weighted sums of x1, x2 — one for each of z, zi, zf, zo.
⇒ 4 times the number of parameters.
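A minimal NumPy sketch of one such LSTM step; the four weight matrices (hypothetical names Wz, Wzi, Wzf, Wzo) make the 4× parameter count explicit. Here g and h are taken to be tanh, a common choice, though the slides leave them generic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, c, Wz, Wzi, Wzf, Wzo, g=np.tanh, h=np.tanh):
    """c' = g(z) f(zi) + c f(zf);  a = h(c') f(zo)."""
    z, zi, zf, zo = Wz @ x, Wzi @ x, Wzf @ x, Wzo @ x   # 4 weighted sums
    c_new = g(z) * sigmoid(zi) + c * sigmoid(zf)  # gated write + gated keep
    a = h(c_new) * sigmoid(zo)                    # gated read of new memory
    return a, c_new

# Two inputs, two cells: four 2x2 weight matrices instead of one.
rng = np.random.default_rng(0)
Wz, Wzi, Wzf, Wzo = (rng.normal(size=(2, 2)) for _ in range(4))
a, c = lstm_step(np.array([1.0, 2.0]), np.zeros(2), Wz, Wzi, Wzf, Wzo)
print("output a =", a, " memory c' =", c)
```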
LSTM - Example
When x2 = 1, add the value of x1 into the memory.
When x2 = −1, reset the memory to 0.
When x3 = 1, output the number stored in the memory.

x1:  1  3  2  4  2  1   3  6  1
x2:  0  1  0  1  0  0  −1  1  0
x3:  0  0  0  0  0  1   0  0  1
y:   0  0  0  0  0  7   0  0  6
Memory content before each step: 0 0 3 3 7 7 7 0 6
[Figure, animation: a single LSTM cell wired by hand to solve this task. Here g and h are linear, and the gate activation f is a sigmoid. The weights, as read off the slides:
• Block input: z = x1 (weights (1, 0, 0) on (x1, x2, x3), bias 0)
• Input gate: zi = 100·x2 − 10, so f(zi) ≈ 1 only when x2 = 1
• Forget gate: zf = 100·x2 + 10, so f(zf) ≈ 0 (forget) only when x2 = −1
• Output gate: zo = 100·x3 − 10, so f(zo) ≈ 1 only when x3 = 1
Stepping through the inputs (3,1,0), (4,1,0), (2,0,0), (1,0,1), (3,−1,0): the memory goes 3 → 7 → 7 → 7 → 0, and the cell outputs ≈7 at the step where x3 = 1 and ≈0 everywhere else.]
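A sketch that simulates this hand-wired cell on the full example sequence (the weights are the ones read off the slides above; the outputs are only ≈ exact because the gates are sigmoids rather than hard switches):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The 9-step input sequence from the example table.
seq = [(1, 0, 0), (3, 1, 0), (2, 0, 0), (4, 1, 0), (2, 0, 0),
       (1, 0, 1), (3, -1, 0), (6, 1, 0), (1, 0, 1)]

c = 0.0                           # the memory cell, initially 0
for x1, x2, x3 in seq:
    z  = x1                       # block input: weights (1, 0, 0), bias 0
    zi = 100 * x2 - 10            # input gate: open only when x2 = 1
    zf = 100 * x2 + 10            # forget gate: closes only when x2 = -1
    zo = 100 * x3 - 10            # output gate: open only when x3 = 1
    c  = z * sigmoid(zi) + c * sigmoid(zf)   # g and h are linear here
    y  = c * sigmoid(zo)
    print(f"x = ({x1:2d},{x2:2d},{x3})   memory ≈ {c:4.1f}   y ≈ {y:4.1f}")
# The memory runs 0, 3, 3, 7, 7, 7, 0, 6, 6; y ≈ 7 and later ≈ 6 where x3 = 1.
```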
What is the next wave?
• Attention-based Model
[Figure: a DNN maps input x to output y. A reading head controller positions a reading head over an external memory, and a writing head controller positions a writing head that updates the memory.]
Recommended Reading List
• The Unreasonable Effectiveness of Recurrent Neural Networks
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Understanding LSTM Networks
• http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Acknowledgement
• Thanks to 葉軒銘 for spotting errors on the slides during the lecture.