Short Trip In The Valley of Deep Learning
Pantelis Vlachas, Guido Novati
Computational Science and Engineering Lab, ETH Zürich
Motivation - What is machine learning?
Classical Machine Learning: data → feature extraction → regression/classification/etc. → result
Deep Learning: data → (feature extraction + regression/classification/etc., learned jointly) → result
Deep Learning
Algorithms
• Backpropagation
• Backpropagation through time (BPTT)
• Variational Inference (Bayesian)
• GEMM (General Matrix-Matrix Multiplication)
Sophisticated Architectures
• LeNet of Yann LeCun et al., 1998
• LSTM (1997), GRU (2014)
Hardware
• Graphics Processing Units (GPUs)
Convolutional Neural Networks
• Heavily based on GEMM (General Matrix-Matrix Multiplication)
• Parametric models suited for image processing (classification, object detection, etc.)
• Applications in self-driving cars, robotics, healthcare, physics, image and video recognition, recommender systems, medical image analysis, natural language processing, financial time series, etc.
[Figure: LeNet of Yann LeCun et al., 1998]
Biological Intuition
• Very roughly, biological brains have neurons that activate when they recognize a triggering pattern in their input
• Each unit performs "simple" pattern recognition
• Complexity emerges from sheer numbers
[Figure: convolutional model of a part of a fruit fly's brain, Jonathan Schneider et al., 2018]
Convolutional Neural Network
What is a Convolution?
• Input is a tensor of size dIY × dIX × dIC (height × width × input channels)
• Parameters are a tensor of size dKY × dKX × dIC × dKC (we have dKC filters)
• Output is a tensor of size dOY × dOX × dKC
• Mapping an image to another image
• The channel counts dIC and dKC can be any numbers
• Parameters are called "filters" or "kernels"
Convolution Operation (filtering, sliding)
• Kernels of size dKY × dKX × dIC × dKC
• Sliding a kernel along the spatial dimensions (iterating along IX and IY)
• At each position, and for each filter in the KC dimension, we compute the scalar product between the filter (of size dKY × dKX × dIC) and a "patch" of the image of size dKY × dKX × dIC
• The output of the scalar product is a number, which is written into a single pixel (channel) of the output image
1D Convolution, 1D Filter
• Example: input [1, −1, 2, 0, 1], filter [1, 0, −1]
• Sliding the filter across the input and computing the scalar product at each position gives the outputs [−1, −1, 1]
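A minimal NumPy sketch of this sliding scalar product (a hedged illustration; note that deep-learning "convolutions" do not flip the filter, i.e. they are technically cross-correlations):

```python
import numpy as np

x = np.array([1, -1, 2, 0, 1])   # input signal from the slide
k = np.array([1, 0, -1])         # 1D filter from the slide

# Slide the filter over the input; one scalar product per position.
out = np.array([x[i:i + len(k)] @ k for i in range(len(x) - len(k) + 1)])
print(out)                               # [-1 -1  1]

# np.correlate computes the same (unflipped) operation:
print(np.correlate(x, k, mode="valid"))  # [-1 -1  1]
```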
Padding
• What if we want to keep the output equal to the input in the spatial dimensions (dIX = dOX, dIY = dOY)?
• The size of the image is extended in both directions by dPY and dPX
• Usually zero padding
[Figure: the 1D example with zero padding — the input [1, −1, 2, 0, 1] is extended with a 0 on each side before sliding the filter [1, 0, −1], so the output has the same length as the input]
Stride
• The convolution does not have to be computed in increments of 1 pixel
• Stride (skip, stepping): dSY and dSX
• Here: padding 1, stride dS = 2
[Figure: the padded 1D example evaluated with stride dS = 2 — the filter is applied at every second position only]
2-D Convolutions
• If the input is dIY × dIX × dIC
• The filters (kernels) are dKY × dKX × dIC × dKC
• With strides dSY, dSX
• Padding dPY, dPX
• The output image has size:
dOY = (dIY − dKY + 2 dPY) / dSY + 1
dOX = (dIX − dKX + 2 dPX) / dSX + 1
dOC = dKC
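To make the formula concrete, here is a hedged NumPy sketch of a naive 2D convolution layer under these definitions (the function name conv2d and its arguments are illustrative, not from the slides):

```python
import numpy as np

def conv2d(img, kernels, stride=(1, 1), pad=(0, 0)):
    """Naive 2D convolution: img is (dIY, dIX, dIC), kernels is (dKY, dKX, dIC, dKC)."""
    dIY, dIX, dIC = img.shape
    dKY, dKX, _, dKC = kernels.shape
    dSY, dSX = stride
    dPY, dPX = pad
    img = np.pad(img, ((dPY, dPY), (dPX, dPX), (0, 0)))  # zero padding, spatial dims only
    # Output size from the formula on this slide.
    dOY = (dIY - dKY + 2 * dPY) // dSY + 1
    dOX = (dIX - dKX + 2 * dPX) // dSX + 1
    out = np.zeros((dOY, dOX, dKC))
    for oy in range(dOY):
        for ox in range(dOX):
            patch = img[oy * dSY:oy * dSY + dKY, ox * dSX:ox * dSX + dKX, :]
            for kc in range(dKC):                        # one scalar product per filter
                out[oy, ox, kc] = np.sum(patch * kernels[:, :, :, kc])
    return out

out = conv2d(np.random.rand(32, 32, 3), np.random.rand(5, 5, 3, 8), stride=(2, 2), pad=(2, 2))
print(out.shape)  # (16, 16, 8): (32 - 5 + 2*2)//2 + 1 = 16, and dOC = dKC = 8
```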
Convolutional Neural Network
Pooling Operations (Subsampling)
[Figure: pooling layers subsample the feature maps between convolutional layers]
Short detour in classification…
Classification
INPUT (object, image, etc.) → ALGORITHM → OUTPUT (binary: 0/1)
Logistic Regression
INPUT x ∈ ℝ^dx → ALGORITHM (weights W) → OUTPUT y ∈ {0, 1}
• Training examples {(x1, y1), …, (xNtrain, yNtrain)}
• Testing examples {(x1, y1), …, (xNtest, yNtest)}
Logistic Regression
INPUT x ∈ ℝ^dx → ALGORITHM (weights W) → OUTPUT ŷ = fW(x) ∈ ℝ, with σ(x) = 1 / (1 + e^(−x))
• Output is a real number (one class)
• Ideally we want ŷ = P(y = 1 | x)
• Training data {(x1, y1), …, (xNtrain, yNtrain)}, yn ∈ {0, 1}
• Model 1: linear regression: ŷ = fw(x) = wᵀx + b
• Model 2: sigmoid output layer: ŷ = fw(x) = σ(wᵀx + b)
• W★ = argmin_W L(y, ŷ) — which loss? The squared loss L(y, ŷ) = ½ (y − ŷ)²?
• Cross-entropy loss: L(y, ŷ) = −(y log ŷ + (1 − y) log(1 − ŷ))
• If y = 1: L(y, ŷ) = −log ŷ = −log P(y = 1 | x) → maximum log-likelihood!
• If y = 0: L(y, ŷ) = −log(1 − ŷ) = −log P(y = 0 | x) → maximum log-likelihood!
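A minimal NumPy sketch of logistic regression trained by gradient descent on the cross-entropy loss (the synthetic data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dx = 200, 2
X = rng.normal(size=(N, dx))               # training inputs x_n
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # binary labels y_n in {0, 1}

w, b = np.zeros(dx), 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    y_hat = sigmoid(X @ w + b)             # model 2: sigmoid output layer
    # Gradient of the mean cross-entropy loss w.r.t. w and b.
    w -= 0.1 * X.T @ (y_hat - y) / N
    b -= 0.1 * np.mean(y_hat - y)

eps = 1e-12                                # avoid log(0)
loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
print(f"cross-entropy: {loss:.3f}, accuracy: {np.mean((y_hat > 0.5) == y):.2f}")
```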
Logistic Regression
(Same setup) Sigmoid output layer ŷ = fw(x) = σ(wᵀx + b), trained with W★ = argmin_W L(y, ŷ) and the cross-entropy loss L(y, ŷ) = −(y log ŷ + (1 − y) log(1 − ŷ))
• How can we classify an object into more than 2 classes?
Classification
INPUT (object, image, etc.) → ALGORITHM → OUTPUT (multi-class)
[Figure: one-hot output vector, e.g. [0, 1, 0, 0] over the classes Dog, Cat, Mouse, Elephant]
Back in CNNs…
Classification on Images
• Input to the network is an image
• CNN output: { 0.02 0.03 0.01 0.01 0.70 0.02 0.02 0.01 0.06 0.12 } — the probability that the image is a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• The output of the network is the probability of the input image being one of the digits (belonging to one of the target classes)
Classification Layer
• SoftMax output layer: f(xi) = exp(xi) / Σ_{j=1..10} exp(xj)
• Sum of outputs is equal to 1
• The outputs represent probabilities for the target classes
• In the network: o = softmax(W hl + b), where hl is the last hidden layer
• Loss function? Cross-entropy loss: L(f, f̂) = −Σ_{i=1..10} fi log f̂(xi)
• A measure of dissimilarity between distributions
• One-hot target, e.g. f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0] ⟺ the image shows the digit 4
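A small NumPy sketch of the SoftMax layer and the cross-entropy loss against a one-hot target (the logits are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

logits = np.array([0.1, 0.5, -0.2, 0.0, 3.0, 0.2, 0.1, -0.1, 1.2, 1.9])  # 10 classes
probs = softmax(logits)
print(probs.sum())             # 1.0: the outputs form a probability distribution

f = np.zeros(10)
f[4] = 1.0                     # one-hot target: the image shows the digit 4
loss = -np.sum(f * np.log(probs))
print(loss)                    # cross-entropy reduces to -log P(class 4)
```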
Automatic Feature Detection
• Low-level features (edges, circles, mesh, text, etc.)
• Second-layer features
• High-level features
Architectures
• LeNet by Yann LeCun et al., 1998
• AlexNet by Alex Krizhevsky et al., 2012
• VGG Net by Oxford's Visual Geometry Group, 2014
• GoogLeNet by Christian Szegedy et al., 2014
• ResNet (Residual Network) by Kaiming He et al., 2015
• DenseNet by Gao Huang et al., 2016
[Figures: LeNet of Yann LeCun et al., 1998; AlexNet of Alex Krizhevsky et al., 2012]
Heuristics for Deep Learning
Data Preprocessing
• Scaling (e.g. zero mean, unit variance)
• Random cropping
• Flipping data
• PCA whitening
• Noise
Initialization of Weights
• Scale the weights of each layer by the inverse of the square root of the number of input neurons (1/√Nl)
• Xavier initialization (see the sketch after this list)
Activation Functions
• tanh
• sigmoid
• ReLU
• ELU
Regularization
• Dropout (shown on a fully-connected layer)
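A minimal NumPy sketch of the two initialization rules named above (the layer sizes are made up; the uniform Xavier/Glorot variant is one common formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 256  # illustrative layer sizes

# Rule 1: scale by the inverse square root of the number of input neurons.
W1 = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)

# Xavier/Glorot (uniform variant): accounts for both fan-in and fan-out
# to keep activation variance roughly constant across layers.
limit = np.sqrt(6.0 / (n_in + n_out))
W2 = rng.uniform(-limit, limit, size=(n_out, n_in))

print(W1.std(), W2.std())  # both on the order of 1/sqrt(n_in)
```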
Operating on Sequences
• In many applications, the data have a temporal order (language, time series, etc.)
• Fully-connected networks and CNNs do not take this into account, and have fixed input and output sizes
• Recurrent Neural Networks: networks with feedback loops
Operating on Sequences
[Figure: an RNN cell applied recurrently — ht = tanh(W [ht−1, xt] + b), with the hidden state fed back into the cell at every step]
Weight Sharing in Time
[Figure: the unrolled network — the same parameters W, b are applied at every time step, mapping the inputs x1, …, xT and the initial state h0 to hidden states h1, …, hT and outputs y1, …, yT]
• A loss is computed at every step, e.g. L1 = |ŷ1 − y1|², giving L1, L2, L3, …, LT
• Total loss: L = (1/T) Σ_{t=1..T} Lt (see the sketch below)
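A minimal NumPy sketch of the unrolled forward pass with weights shared across time, plus the averaged loss (all sizes, the data, and the output layer W_y are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dh, T = 3, 8, 10
xs = rng.normal(size=(T, dx))  # input sequence x_1 ... x_T
ys = rng.normal(size=(T, 1))   # target sequence y_1 ... y_T

# The SAME parameters are used at every time step (weight sharing in time).
W_h = rng.normal(size=(dh, dh)) / np.sqrt(dh)
W_x = rng.normal(size=(dh, dx)) / np.sqrt(dx)
W_y = rng.normal(size=(1, dh)) / np.sqrt(dh)
b = np.zeros(dh)

h = np.zeros(dh)               # initial hidden state h_0
losses = []
for t in range(T):
    h = np.tanh(W_h @ h + W_x @ xs[t] + b)       # h_t = tanh(W [h_{t-1}, x_t] + b)
    y_hat = W_y @ h                              # per-step prediction
    losses.append(np.sum((y_hat - ys[t]) ** 2))  # L_t = |yhat_t - y_t|^2

print(np.mean(losses))         # L = (1/T) sum_t L_t
```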
Backpropagation Through Time (BPTT)
• FORWARD PASS over the entire sequence: compute the loss
• BACKWARD PASS over the entire sequence: compute the gradients
Truncated BPTT
• BACKWARD PASS only over some smaller number of steps
• The hidden state is still "carried" forward through the entire sequence; only the gradient computation is truncated
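One common way to implement truncated BPTT, sketched here with PyTorch (the chunk length, sizes, and data are illustrative): the hidden state is carried across chunks, but detach() cuts the computation graph so the backward pass only spans one chunk.

```python
import torch

rnn = torch.nn.RNN(input_size=3, hidden_size=8, batch_first=True)
head = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=1e-2)

x = torch.randn(1, 100, 3)  # one long sequence of 100 steps
y = torch.randn(1, 100, 1)
h = torch.zeros(1, 1, 8)    # initial hidden state

chunk = 20                  # truncation length
for t in range(0, 100, chunk):
    h = h.detach()          # carry the state forward, but stop gradients here
    out, h = rnn(x[:, t:t + chunk], h)
    loss = torch.mean((head(out) - y[:, t:t + chunk]) ** 2)
    opt.zero_grad()
    loss.backward()         # gradients flow back only through the last `chunk` steps
    opt.step()
```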
[Figure: a chain of tanh(W ·) cells — h1 = tanh(W [h0, x1]), h2 = tanh(W [h1, x2]), h3 = tanh(W [h2, x3]), h4 = tanh(W [h3, x4])]
Vanishing Gradients Problem
• Computing the gradient of the loss w.r.t. h0 involves many factors of W and repeated tanh
• In the case of a linear activation and no bias, you would have factors like W(W(…(W h0)))
• The gradient vanishes (explodes) if the largest singular value of W is < 1 (> 1)
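A quick NumPy illustration of this effect (a scaled orthogonal W is used so that all singular values equal the chosen scale; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

for scale in (0.5, 1.5):  # largest singular value below / above 1
    Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    W = scale * Q         # all singular values of W equal `scale`
    g = np.ones(8)        # stand-in for a gradient flowing backwards in time
    for _ in range(50):   # 50 time steps of backpropagation
        g = W.T @ g
    print(f"sigma_max = {scale}: |grad| after 50 steps = {np.linalg.norm(g):.2e}")
# sigma_max = 0.5 -> the norm vanishes; sigma_max = 1.5 -> the norm explodes
```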
Gating Architectures
[Figure: Long Short-Term Memory (LSTM) cell — the input gate g_i, forget gate g_f, and output gate g_o control how the cell state ct is updated from ct−1 and how the hidden state ht is emitted]
[Figure: Gated Recurrent Unit (GRU) — a reset gate rt and an update gate zt (with a 1 − zt branch) blend the previous state ht−1 with a tanh candidate to form ht]
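For reference, a NumPy sketch of one step of the standard LSTM update that the figure depicts (the stacked weight layout and sizes are the usual textbook convention, not taken from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W has shape (4*dh, dh+dx), b has shape (4*dh,)."""
    dh = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    g_i = sigmoid(z[0 * dh:1 * dh])      # input gate
    g_f = sigmoid(z[1 * dh:2 * dh])      # forget gate
    g_o = sigmoid(z[2 * dh:3 * dh])      # output gate
    c_tilde = np.tanh(z[3 * dh:4 * dh])  # candidate cell state
    c = g_f * c_prev + g_i * c_tilde     # additive cell-state update
    h = g_o * np.tanh(c)                 # gated hidden state
    return h, c

dx, dh = 3, 8
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh),
                 rng.normal(size=(4 * dh, dh + dx)), np.zeros(4 * dh))
print(h.shape, c.shape)  # (8,) (8,)
```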
Gating Architectures
[Figure: four LSTM cells unrolled in time over inputs x1, …, x4, hidden states h0, …, h4, and cell states C0, …, C4 — the additive cell-state path gives uninterrupted gradient flow!]
RNN structure
• One-to-one, e.g. classification
• Many-to-one, e.g. sentiment analysis
• One-to-many, e.g. image captioning, video generation
• Many-to-many, e.g. machine translation, time-series prediction
Prediction of Chaotic Dynamics
• Forecasting the state of the Kuramoto-Sivashinsky equation: ∂u/∂t = −ν ∂⁴u/∂x⁴ − ∂²u/∂x² − u ∂u/∂x
• RNNs can be chaotic!
• They are dynamical systems themselves
Word embeddings
• Words can be represented by numbers (vectors) that encode semantic meaning
• E.g. Word2Vec
• Input: a LARGE CORPUS OF TEXT
• Learns a vector space where each word is assigned a vector
• How? Predict a word (target) from its neighboring words (context), or vice versa
• Encodes context information
[Figure: in the learned space, vector arithmetic captures analogies — Paris − France + Greece lands near Athens (closest word)]
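A toy NumPy sketch of that analogy test with made-up 2D embeddings (real Word2Vec vectors are typically hundreds of dimensions; the numbers here are purely illustrative):

```python
import numpy as np

# Made-up 2D embeddings: one axis for "which country", one for "is a capital".
emb = {
    "France": np.array([1.0, 0.0]), "Paris":  np.array([1.0, 1.0]),
    "Greece": np.array([2.0, 0.0]), "Athens": np.array([2.0, 1.0]),
}

query = emb["Paris"] - emb["France"] + emb["Greece"]  # "capital-of" offset on Greece

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
best = max((w for w in emb if w != "Greece"), key=lambda w: cos(emb[w], query))
print(best)  # Athens (closest word)
```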
Applications
(see "The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches", 2018)
• Object detection, object localisation, image/video segmentation
• Autonomous driving
• Brain cancer detection
• Skin cancer recognition
• Speech recognition, machine translation, image/video captioning
• Medicine/biology
Outlook
Why Deep Learning?
• Universal approach for learning problems
• Robust approach, does not require "much" expert knowledge
• Generalization, scalability
Challenges?
• Big data and scalability
• Generalization, transfer learning, multi-task learning
• Generating new "artificial" datasets for applications where data is scarce (generative models)
• Understanding/explainable models, incorporating physics
• Causality, not plain pattern recognition/correlations
• Energy-efficient implementations on mobiles/FPGAs, etc.
[Figure: performance vs. amount of data — deep learning keeps improving as data grows, while classical ML plateaus]