Tips for Training Deep Networks
Outline
• Training Strategy: Batch Normalization
• Activation Function: SELU
• Network Structure: Highway Network
Batch Normalization
Feature Scaling
Given R training examples x^1, x^2, …, x^R, for each dimension i compute the mean m_i and the standard deviation σ_i over all examples, then rescale:

x_i^r ← (x_i^r − m_i) / σ_i

After scaling, the means of all dimensions are 0 and the variances are all 1.
In general, gradient descent converges much faster with feature scaling than without it.
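As a rough illustration (not from the original slides), per-dimension feature scaling takes a few lines of NumPy; the array X and the small epsilon added to σ are my own choices:

```python
import numpy as np

# X: R training examples as rows, one column per input dimension
X = np.random.rand(100, 3) * np.array([1.0, 100.0, 0.01])

m = X.mean(axis=0)                      # mean m_i of each dimension
sigma = X.std(axis=0)                   # standard deviation sigma_i of each dimension
X_scaled = (X - m) / (sigma + 1e-8)     # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))            # ~0 in every dimension
print(X_scaled.var(axis=0))             # ~1 in every dimension
```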
How about the Hidden Layers?
x → Layer 1 → a^1 → Layer 2 → a^2 → …
Feature scaling is applied to x; should it also be applied to the hidden-layer outputs a^1, a^2, …?
A smaller learning rate can help, but then training becomes slower.
Difficulty: the statistics of the hidden-layer outputs change during training …
Batch normalization
Internal Covariate Shift: the input distribution of each layer keeps changing as the parameters of the preceding layers are updated.
Batch
Consider a batch of three examples x^1, x^2, x^3 processed in parallel: the first layer computes z^i = W^1 x^i, the activation (e.g. sigmoid) gives a^i, the next layer applies W^2, and so on. Stacking the examples as columns:

[z^1 z^2 z^3] = W^1 [x^1 x^2 x^3]
Batch normalization
For a batch of three examples, compute z^i = W^1 x^i and then the batch statistics (element-wise over the units):

μ = (1/3) Σ_{i=1}^{3} z^i
σ = sqrt( (1/3) Σ_{i=1}^{3} (z^i − μ)^2 )

Note that μ and σ depend on z^1, z^2, z^3.
Note: batch normalization does not work on small batches, because μ and σ estimated from only a few examples are unreliable.
Batch normalization
Normalize each z^i with the batch statistics:

z̃^i = (z^i − μ) / σ

(μ and σ depend on z^1, z^2, z^3.)
The normalized z̃^1, z̃^2, z̃^3 are then passed through the activation (e.g. sigmoid) to obtain a^1, a^2, a^3.
How do we do backpropagation? Since μ and σ depend on every z^i in the batch, the gradients also have to flow through μ and σ.
Batch normalization
Finally, scale and shift with the learnable parameters γ and β:

z̃^i = (z^i − μ) / σ
ẑ^i = γ ⊙ z̃^i + β

(μ and σ depend on z^1, z^2, z^3; γ and β are learned by gradient descent.)
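A minimal NumPy sketch of this training-time forward pass, assuming the pre-activations of a batch are stacked as rows of Z; the eps constant is added for numerical stability and is not part of the formulas above:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Z: (batch_size, num_units) pre-activations z^i of one layer."""
    mu = Z.mean(axis=0)                    # mu = (1/N) * sum_i z^i
    sigma = Z.std(axis=0)                  # sigma = sqrt((1/N) * sum_i (z^i - mu)^2)
    Z_norm = (Z - mu) / (sigma + eps)      # z~^i = (z^i - mu) / sigma
    Z_hat = gamma * Z_norm + beta          # z^^i = gamma * z~^i + beta
    return Z_hat, mu, sigma

# toy usage: a batch of 3 examples, 4 units
Z = np.random.randn(3, 4) * 5.0 + 2.0
gamma, beta = np.ones(4), np.zeros(4)
Z_hat, mu, sigma = batchnorm_forward(Z, gamma, beta)
```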
Batch normalization
• At testing stage:
x → W^1 → z → z̃ = (z − μ)/σ → ẑ = γ ⊙ z̃ + β

μ and σ come from the batch; γ and β are network parameters.
But we do not have a batch at the testing stage.
Ideal solution: compute μ and σ over the whole training set.
Practical solution: keep a moving average of the μ and σ of the batches seen during training.
(Figure: batch means μ^1, …, μ^100, …, μ^300 recorded across training updates.)
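A sketch of the practical solution, assuming an exponential moving average; the momentum value 0.99 and the stand-in batches are illustrative choices, not from the slides:

```python
import numpy as np

num_units = 4
running_mu, running_sigma = np.zeros(num_units), np.ones(num_units)
momentum = 0.99   # assumed value; the exact weighting is a design choice

# during training, after computing mu and sigma for each batch:
for _ in range(100):
    Z = np.random.randn(32, num_units) * 5.0 + 2.0        # stand-in training batch
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_sigma = momentum * running_sigma + (1 - momentum) * sigma

# at the testing stage there is no batch; the running estimates replace mu and sigma
def batchnorm_inference(z, gamma, beta, eps=1e-5):
    return gamma * (z - running_mu) / (running_sigma + eps) + beta

print(batchnorm_inference(np.random.randn(num_units) * 5.0 + 2.0,
                          np.ones(num_units), np.zeros(num_units)))
```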
Batch normalization - Benefit
• BN reduces training time and makes very deep networks trainable.
• Because internal covariate shift is reduced, we can use larger learning rates.
• Less exploding/vanishing gradients
• Especially effective for sigmoid, tanh, etc.
• Learning is less affected by initialization.
• BN reduces the demand for regularization.
x^i → W^1 → z^i → z̃^i = (z^i − μ)/σ → ẑ^i = γ ⊙ z̃^i + β
If W^1 is multiplied by k, then z^i, μ, and σ are all multiplied by k as well, so z̃^i is kept unchanged. This is why the gradients are less likely to explode or vanish.
To learn more ……
• Batch Renormalization
• Layer Normalization
• Instance Normalization
• Weight Normalization
• Spectral Normalization
Activation Function: SELU
ReLU
• Rectified Linear Unit (ReLU)
Reasons to use it:
1. Fast to compute
2. Biological motivation
3. Equivalent to an infinite number of sigmoids with different biases
4. Alleviates the vanishing gradient problem
ReLU: a = z for z > 0, a = 0 for z ≤ 0 (compare with the sigmoid σ(z)).
ReLU - variant
Leaky ReLU: a = z for z > 0, a = 0.01z for z ≤ 0.
Parametric ReLU: a = z for z > 0, a = αz for z ≤ 0, where α is also learned by gradient descent.
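A quick NumPy sketch of these variants; the Leaky ReLU slope 0.01 is from the slide, while the fixed α passed to the parametric version stands in for a parameter that would normally be learned:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                 # a = z for z > 0, a = 0 otherwise

def leaky_relu(z):
    return np.where(z > 0, z, 0.01 * z)       # a = 0.01z in the negative region

def parametric_relu(z, alpha):
    # alpha would be learned by gradient descent; fixed here for illustration
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), parametric_relu(z, alpha=0.25))
```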
ReLU - variant
Exponential Linear Unit (ELU): a = z for z > 0, a = α(e^z − 1) for z ≤ 0.
Scaled ELU (SELU): a = λz for z > 0, a = λα(e^z − 1) for z ≤ 0, with
α = 1.6732632423543772848170429916717
λ = 1.0507009873554804934193349852946
https://github.com/bioinf-jku/SNNs
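A sketch of ELU and SELU using the constants above; np.expm1(z) computes e^z − 1 and is just a numerically safer way to write the same formula:

```python
import numpy as np

ALPHA = 1.6732632423543772848170429916717
LAMBDA = 1.0507009873554804934193349852946

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * np.expm1(z))   # a = alpha * (e^z - 1) for z <= 0

def selu(z):
    return LAMBDA * elu(z, alpha=ALPHA)              # SELU = lambda * ELU with fixed alpha

print(selu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))
```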
SELU
a = λz for z > 0, a = λα(e^z − 1) for z ≤ 0, with α = 1.673263242… and λ = 1.050700987…
• Positive and negative values: the whole ReLU family has this property except the original ReLU.
• Saturation region: ELU also has this property.
• Slope larger than 1 (since λ > 1): only SELU has this property.
SELU
Consider one neuron z = Σ_{k=1}^{K} a_k w_k with output a = f(z).
Assume the inputs a_1, …, a_K are i.i.d. random variables with mean μ = 0 and variance σ² = 1 (they do not have to be Gaussian). Then

μ_z = E[z] = Σ_{k=1}^{K} E[a_k] w_k = μ Σ_{k=1}^{K} w_k = μ · K μ_w = 0
SELU
Consider the same neuron z = Σ_{k=1}^{K} a_k w_k with inputs of mean μ = 0 and variance σ² = 1, and weights with mean μ_w = 0. We already have μ_z = 0, so

σ_z² = E[(z − μ_z)²] = E[z²] = E[(a_1 w_1 + a_2 w_2 + ⋯)²]

The squared terms give E[(a_k w_k)²] = w_k² E[a_k²] = w_k² σ², and the cross terms vanish because E[a_i a_j w_i w_j] = w_i w_j E[a_i] E[a_j] = 0. Hence

σ_z² = Σ_{k=1}^{K} w_k² σ² = σ² · K σ_w² = 1   (the target, when K σ_w² = 1)

z is then assumed to be Gaussian, and the SELU constants α and λ are chosen so that a = f(z) again has mean 0 and variance 1.
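The derivation can be checked numerically. A small sketch, assuming standard-normal inputs and weights drawn so that μ_w ≈ 0 and K·σ_w² ≈ 1 (the sample sizes are arbitrary):

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543773, 1.0507009873554805

def selu(z):
    return np.where(z > 0, LAMBDA * z, LAMBDA * ALPHA * np.expm1(z))

rng = np.random.default_rng(0)
K, N = 500, 10000
a = rng.standard_normal((N, K))            # inputs: mean 0, variance 1
w = rng.standard_normal(K) / np.sqrt(K)    # weights: mu_w ~ 0, K * sigma_w^2 ~ 1
z = a @ w                                  # z = sum_k a_k w_k

print(z.mean(), z.var())                   # ~0 and ~1, as derived above
print(selu(z).mean(), selu(z).var())       # also ~0 and ~1: the output stays normalized
```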
Demo
Source of joke: https://zhuanlan.zhihu.com/p/27336839
The 93-page proof
SELU is actually more general.
• The latest activation function: Self-Normalizing Neural Networks (SELU)
(Experimental results on MNIST and CIFAR-10.)
Demo
Highway Network & Grid LSTM
Feedforward network:
x → f_1 → a^1 → f_2 → a^2 → f_3 → a^3 → f_4 → y
a^t = f_t(a^{t−1}) = σ(W^t a^{t−1} + b^t)   (t is the layer index)

Recurrent network:
h^0 → f → h^1 → f → h^2 → f → h^3 → f → y^4, with an input x^t at every step
h^t = f(h^{t−1}, x^t) = σ(W^h h^{t−1} + W^i x^t + b^i)   (t is the time step)
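The two update rules side by side as a sketch; the dimensions and random parameters are placeholders chosen only for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4

# Feedforward: each layer t has its own W^t and b^t, and there is no input per step
a = rng.standard_normal(d)                          # a^0 = x
for _ in range(3):
    W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
    a = sigmoid(W @ a + b)                          # a^t = sigma(W^t a^(t-1) + b^t)

# Recurrent: one shared set of parameters, and a new input x^t at every time step
Wh, Wi, b = rng.standard_normal((d, d)), rng.standard_normal((d, d)), rng.standard_normal(d)
h = np.zeros(d)                                     # h^0
for x_t in rng.standard_normal((3, d)):
    h = sigmoid(Wh @ h + Wi @ x_t + b)              # h^t = sigma(W^h h^(t-1) + W^i x^t + b^i)
```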
Idea: apply the gated structure of recurrent networks in a feedforward network.
Feedforward vs. Recurrent:
1. A feedforward network does not take an input at each step.
2. A feedforward network has different parameters for each layer.
GRU → Highway Network
(Figure: a GRU cell. The reset gate r and update gate z are computed from x^t and h^{t−1}; a candidate h′ is computed from x^t and the reset hidden state, and the new state is h^t = z ⊙ h^{t−1} + (1 − z) ⊙ h′, with y^t as the output.)

To turn a GRU into a highway layer:
• No input x^t at each step: a^{t−1} is the output of the (t−1)-th layer and a^t is the output of the t-th layer.
• No output y^t at each step.
• No reset gate.
Highway Network
• Highway Network: Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
• Residual Network: Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
Residual Network: the input a^{t−1} is copied and added to the layer output, a^t = h′ + a^{t−1}.

Highway Network:
h′ = σ(W a^{t−1})
z = σ(W′ a^{t−1})   (gate controller)
a^t = z ⊙ a^{t−1} + (1 − z) ⊙ h′
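A minimal sketch of a stack of highway layers following these equations; the random weight matrices and the depth of 10 are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(a_prev, W, W_gate):
    h = sigmoid(W @ a_prev)             # h' = sigma(W a^(t-1))
    z = sigmoid(W_gate @ a_prev)        # z  = sigma(W' a^(t-1)), the gate controller
    return z * a_prev + (1 - z) * h     # a^t = z * a^(t-1) + (1 - z) * h'

rng = np.random.default_rng(0)
d = 8
a = rng.standard_normal(d)
for _ in range(10):                     # stack several highway layers
    a = highway_layer(a, rng.standard_normal((d, d)), rng.standard_normal((d, d)))
print(a)
```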
Highway Network
(Figure: networks of different depths, each going from the input layer to the output layer.)
The highway network automatically determines how many layers it actually needs!
Grid LSTM
(Figure: a standard LSTM block takes input x, memory c, and hidden state h, and produces output y, new memory c′, and new hidden state h^t. A Grid LSTM block takes (c, h) along the time direction and (a, b) along the depth direction, and produces (c′, h′) and (a′, b′).)
Grid LSTM: memory for both time and depth.
(Figure: Grid LSTM blocks arranged in a two-dimensional grid. Along the time direction, the block at time t and layer l takes h^{t−1}, c^{t−1} and produces h^t, c^t; along the depth direction it takes a^{l−1}, b^{l−1} from the layer below and passes a^l, b^l to the layer above.)
Grid LSTM
(Figure: inside a Grid LSTM block. The time-direction pair (h, c) and the depth-direction pair (a, b) enter the block; gate activations z, z^i, z^f, z^o are computed from the hidden vectors, the memory is updated as c′ = z^f ⊙ c + z^i ⊙ z, and h′ = z^o ⊙ tanh(c′); a′ and b′ are produced in the same way for the depth direction.)
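A rough sketch of a 2-D Grid LSTM block: each dimension keeps its own memory and applies its own LSTM transform, driven by the concatenation of the hidden vectors from both dimensions. The weight layout and names here are my own simplification, not the exact parameterization from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transform(H, c, W):
    """Standard LSTM update driven by the shared hidden vector H."""
    z = np.tanh(W["z"] @ H)             # candidate
    zi = sigmoid(W["i"] @ H)            # input gate
    zf = sigmoid(W["f"] @ H)            # forget gate
    zo = sigmoid(W["o"] @ H)            # output gate
    c_new = zf * c + zi * z             # c' = z^f * c + z^i * z
    h_new = zo * np.tanh(c_new)         # h' = z^o * tanh(c')
    return h_new, c_new

def grid_lstm_block(h_time, c_time, h_depth, c_depth, W_time, W_depth):
    H = np.concatenate([h_time, h_depth])            # hidden vectors of both dimensions
    h_t, c_t = lstm_transform(H, c_time, W_time)     # update along the time dimension
    h_d, c_d = lstm_transform(H, c_depth, W_depth)   # update along the depth dimension
    return h_t, c_t, h_d, c_d

rng = np.random.default_rng(0)
d = 4
make_W = lambda: {k: rng.standard_normal((d, 2 * d)) for k in "zifo"}
h, c, a, b = (rng.standard_normal(d) for _ in range(4))
h, c, a, b = grid_lstm_block(h, c, a, b, make_W(), make_W())
```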
3D Grid LSTM
(Figure: a 3-D Grid LSTM block has three pairs of memory and hidden vectors, (c, h), (a, b), and (e, f), one per dimension, each updated to (c′, h′), (a′, b′), and (e′, f′).)
3D Grid LSTM
• Images are composed of pixels
(Example: 3 × 3 images.)