
Short Trip In The Valley of Deep Learning

Pantelis Vlachas, Guido Novati

Computational Science and Engineering Lab, ETH Zürich

Motivation - What is machine learning?


Classical Machine Learning

data → feature extraction → regression/classification/etc. → result

Deep Learning

data → feature extraction + regression/classification/etc. → result

Deep Learning

Algorithms

• Backpropagation

• Backpropagation through time (BPTT)

• Variational Inference (Bayesian)

• GEMM (General Matrix to Matrix Multiplication)

Sophisticated Architectures

• LeNet of Yann LeCun et al., 1998

• LSTM, 1997

• GRU, 2014

Hardware

• Graphical Processing Units (GPUs)

Convolutional Neural Networks

• Heavily based on GEMM (General Matrix to Matrix Multiplication)

• Parametric models suited for image processing (classification, object detection, etc.)

• Applications in self-driving cars, robotics, healthcare, physics, image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, financial time series, etc.

LeNet of Yann LeCun et al., 1998

Biological Intuition

• Very roughly, biological brains have neurons that activate when they recognize a triggering pattern in their input

• Each unit does “simple” pattern recognition

• Complexity emerges from sheer numbers

Convolutional Model of a part of a Fruit-Fly’s brain, Jonathan Schneider et al., 2018

Convolutional Neural Network

What is a Convolution ?

• Input is a tensor of size d_IY × d_IX × d_IC

• Parameters are a tensor of size d_KY × d_KX × d_IC × d_KC (K_C = 1, 2, 3, 4, …; we have d_KC filters)

• Output is a tensor of size d_OY × d_OX × d_KC

• Mapping an image to another image

• Feature sizes d_IC, d_KC can be any numbers

• Parameters are called “filters” or “kernels”

Convolution Operation (filtering, sliding)

• Kernels of size d_KY × d_KX × d_IC × d_KC

• Sliding a kernel along the spatial dimensions I_X and I_Y (iterating along I_X and I_Y)

• At each position and for each filter in the K_C dimension, we compute the scalar product between the filter (of size d_KY × d_KX × d_IC) and a “patch” of the image of size d_KY × d_KX × d_IC

• The output of the scalar product is a number which is written in a single color pixel (channel) of the output image, as illustrated in the sketch below
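To make the sliding and the per-filter scalar product concrete, here is a minimal loop-based sketch in NumPy (the function name, the unit stride and the absence of padding are assumptions for illustration, not the convention of any particular framework):

```python
import numpy as np

def conv2d_naive(image, kernels):
    """Naive convolution: image (dIY, dIX, dIC), kernels (dKY, dKX, dIC, dKC).
    Stride 1, no padding. Returns an output of shape (dOY, dOX, dKC)."""
    dIY, dIX, dIC = image.shape
    dKY, dKX, _, dKC = kernels.shape
    dOY, dOX = dIY - dKY + 1, dIX - dKX + 1
    out = np.zeros((dOY, dOX, dKC))
    for y in range(dOY):                                # slide along the spatial dimensions
        for x in range(dOX):
            patch = image[y:y + dKY, x:x + dKX, :]      # dKY x dKX x dIC patch
            for c in range(dKC):                        # one scalar product per filter
                out[y, x, c] = np.sum(patch * kernels[:, :, :, c])
    return out

# Example: a 5x5 RGB image and 4 filters of size 3x3x3 -> output 3x3x4
rng = np.random.default_rng(0)
img = rng.standard_normal((5, 5, 3))
ker = rng.standard_normal((3, 3, 3, 4))
print(conv2d_naive(img, ker).shape)   # (3, 3, 4)
```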

1D Convolution, 1D Filter

• Input: [1, −1, 2, 0, 1], filter: [1, 0, −1]

• Slide the filter along the input; at each position, multiply element-wise and sum:
  position 1: 1·1 + (−1)·0 + 2·(−1) = −1
  position 2: (−1)·1 + 2·0 + 0·(−1) = −1
  position 3: 2·1 + 0·0 + 1·(−1) = 1

• Output: [−1, −1, 1] (verified in the snippet below)
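The 1-D example can be checked in a few lines of NumPy; note that what deep-learning libraries call “convolution” is usually cross-correlation (the filter is not flipped), which is what np.correlate computes:

```python
import numpy as np

signal = np.array([1, -1, 2, 0, 1])
kernel = np.array([1, 0, -1])

# Cross-correlation (no kernel flip), as used in CNNs:
print(np.correlate(signal, kernel, mode="valid"))   # [-1 -1  1]

# Manual sliding-window version of the same operation:
out = [int(np.dot(signal[i:i + len(kernel)], kernel))
       for i in range(len(signal) - len(kernel) + 1)]
print(out)                                           # [-1, -1, 1]
```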

Padding

• What if we want to keep the output equal to the input in the spatial dimensions, d_IX = d_OX, d_IY = d_OY ?

• The size of the image is extended in both directions by d_PY and d_PX

• Usually zero padding

• Example: padding [1, −1, 2, 0, 1] with one zero on each side gives [0, 1, −1, 2, 0, 1, 0]; sliding the filter [1, 0, −1] over it produces [1, −1, −1, 1, 0], the same length as the input

Stride

• Convolution does not have to be computed by increments of 1 pixel

• Stride (skip, stepping) d_SY and d_SX

• Here padding 1, stride d_S = 2: the filter [1, 0, −1] is applied to the zero-padded input [0, 1, −1, 2, 0, 1, 0] at every second position, giving the output [1, −1, 0] (see the snippet below)
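A quick NumPy check of the padded and strided 1-D example above, using np.pad and plain slicing as stand-ins for what a convolution layer does internally:

```python
import numpy as np

signal = np.array([1, -1, 2, 0, 1])
kernel = np.array([1, 0, -1])

padded = np.pad(signal, 1)                         # zero padding: [0 1 -1 2 0 1 0]
full = np.correlate(padded, kernel, mode="valid")
print(full)                                        # [ 1 -1 -1  1  0]  same length as input
print(full[::2])                                   # [ 1 -1  0]        stride dS = 2
```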

2-D Convolutions

• If the input is d_IY × d_IX × d_IC

• The filters (kernels) are d_KY × d_KX × d_IC × d_KC

• With strides d_SY, d_SX

• Padding d_PY, d_PX

• The output image has size (see the helper function below):

d_OY = (d_IY − d_KY + 2 d_PY) / d_SY + 1

d_OX = (d_IX − d_KX + 2 d_PX) / d_SX + 1

d_OC = d_KC
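A minimal helper implementing the output-size formula above (the function name and the use of floor division are assumptions; frameworks differ in how they round and in how they handle asymmetric padding):

```python
def conv_output_size(d_in, d_k, d_p=0, d_s=1):
    """Spatial output size of a convolution: (d_in - d_k + 2*d_p) // d_s + 1."""
    return (d_in - d_k + 2 * d_p) // d_s + 1

# 1-D example from the previous slides: input 5, kernel 3, padding 1, stride 2
print(conv_output_size(5, 3, d_p=1, d_s=2))   # 3

# No padding, stride 1: input 5, kernel 3 -> 3 outputs
print(conv_output_size(5, 3))                 # 3
```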

Convolutional Neural Network

Pooling Operations (Subsampling)

Convolutional Neural Network

Short detour in classification…


Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT: 0/1 (binary)

Logistic Regression

INPUT x ∈ ℝ^d_x → ALGORITHM (weights W) → OUTPUT y ∈ {0, 1}

• Training examples {(x_1, y_1), …, (x_N_train, y_N_train)}

• Testing examples {(x_1, y_1), …, (x_N_test, y_N_test)}

Logistic Regression

INPUT x ∈ ℝ^d_x → ALGORITHM (weights W) → OUTPUT ŷ = f_W(x) ∈ ℝ, with σ(x) = 1 / (1 + e^(−x))

• Output is a real number (one class): ŷ = f_W(x)

• Ideally we want ŷ = P(y = 1 | x)

• Training data {(x_1, y_1), …, (x_N_train, y_N_train)}, y_n ∈ {0, 1}

• Model 1, linear regression: ŷ = f_w(x) = w^T x + b

• Model 2, sigmoid output layer: ŷ = f_w(x) = σ(w^T x + b)

• W⋆ = argmin_W L(y, ŷ); which loss L(y, ŷ)? One option is the squared loss L(y, ŷ) = ½ (y − ŷ)²

• Cross-entropy loss: L(y, ŷ) = −(y log ŷ + (1 − y) log(1 − ŷ)) (sketched in code below)

• If y = 1: L(y, ŷ) = −log ŷ = −log P(y = 1 | x), i.e. maximum log likelihood!

• If y = 0: L(y, ŷ) = −log(1 − ŷ) = −log P(y = 0 | x), i.e. maximum log likelihood!
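A minimal NumPy sketch of Model 2 trained with the cross-entropy loss; the toy data and the plain gradient-descent update are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y, y_hat) = -(y log y_hat + (1 - y) log(1 - y_hat)), averaged over examples
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # 100 training examples, d_x = 2
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy labels in {0, 1}

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):                       # gradient descent on the cross entropy
    y_hat = sigmoid(X @ w + b)
    grad = y_hat - y                       # dL/dz for sigmoid + cross entropy
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

print(cross_entropy(y, sigmoid(X @ w + b)))  # loss decreases during training
```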

Logistic Regression

INPUT x ∈ ℝ^d_x → ALGORITHM (weights W) → OUTPUT ŷ = f_W(x) ∈ ℝ, with σ(x) = 1 / (1 + e^(−x))

• Output is a real number (one class): ŷ = f_W(x)

• Ideally we want ŷ = P(y = 1 | x)

• Training data {(x_1, y_1), …, (x_N_train, y_N_train)}, y_n ∈ {0, 1}

• Model 2, sigmoid output layer: ŷ = f_w(x) = σ(w^T x + b)

• W⋆ = argmin_W L(y, ŷ)

• Cross-entropy loss: L(y, ŷ) = −(y log ŷ + (1 − y) log(1 − ŷ))

• How can we classify an object into more than 2 classes?

Classification

INPUT (object, image, etc.) → ALGORITHM → OUTPUT: multi-class, e.g. a one-hot vector such as [1, 0, 0, 0] over the classes Dog, Cat, Mouse, Elephant

Back to CNNs…


Classification on Images

Input to the network is an image, which is passed through the CNN.

CNN output: { 0.02, 0.03, 0.01, 0.01, 0.70, 0.02, 0.02, 0.01, 0.06, 0.12 }

Probability that the image is the digit 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, respectively.

Output of the network is the probability of the input image being one of the digits (belonging to one of the target classes)

Classification Layer

• SoftMax output layer: f̂(x_i) = exp(x_i) / ∑_{j=1}^{10} exp(x_j)

• Sum of outputs is equal to 1

• The outputs represent probabilities for the target classes

• Loss function? Cross-entropy loss: L(f, f̂) = −∑_{i=1}^{10} f_i log f̂(x_i)

• A measure of dissimilarity between distributions

• Target label f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0] ⟺ the image is the digit 4

• Output layer: o = softmax(W h_l + b), where h_l is the last hidden-layer representation of the input x (see the sketch below)
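A short NumPy sketch of the softmax layer and the cross-entropy against a one-hot target, mirroring the 10-digit case above (the logits are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract max for numerical stability
    return e / e.sum()

logits = np.array([0.1, 0.5, -0.3, -0.2, 2.5, 0.0, 0.1, -0.4, 1.0, 1.3])
probs = softmax(logits)
print(probs.sum())                 # ~1.0: outputs sum to one

target = np.zeros(10)
target[4] = 1.0                    # one-hot label: the image is the digit 4
loss = -np.sum(target * np.log(probs))
print(loss)                        # cross entropy = -log(probs[4])
```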

Automatic Feature Detection

• Low-level features (edges, circles, mesh, text, etc.)

• Second-layer features

• High-level features

Architectures

• LeNet by Yann LeCun et al., 1998
• Alex-Net by Alex Krizhevsky et al., 2012
• VGG Net by Oxford’s Visual Geometry Group, 2014
• GoogLeNet by Christian Szegedy et al., 2014
• ResNet (Residual Network) by Kaiming He et al., 2015
• DenseNet by Gao Huang et al., 2016

LeNet of Yann LeCun et al., 1998

Alex-Net of Alex Krizhevsky et al., 2012

Heuristics for Deep Learning

Data Preprocessing
• Scaling (e.g. zero mean, unit variance)
• Random cropping
• Flipping data
• PCA whitening
• Noise

Initialization of Weights
• Scale the weights of each layer by the inverse of the square root of the number of input neurons, 1/√N_l (see the sketch below)
• Xavier initialization

Activation Functions
• tanh
• sigmoid
• ReLU
• ELU

Regularization
• Dropout (illustrated on a fully-connected layer)
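A minimal sketch of the two initialization heuristics above; the exact scaling conventions vary across references, so this shows only one common choice for each:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_init(n_in, n_out):
    # Scale standard-normal weights by the inverse square root of the fan-in.
    return rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)

def xavier_init(n_in, n_out):
    # Xavier/Glorot uniform: limit = sqrt(6 / (n_in + n_out)).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W1 = scaled_init(784, 256)
W2 = xavier_init(256, 10)
print(W1.std(), W2.std())   # small standard deviations, set by the layer widths
```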

Operating on Sequences

• In many application cases, the data have temporal order (language, time series, etc.)

• Fully-connected networks and CNNs do not take this into account and have fixed input and output sizes

• Recurrent Neural Networks: networks with feedback loops

[Diagram: the same RNN cell is applied at every time step; it takes the input x_t and the previous hidden state h_(t−1), applies a tanh nonlinearity, and produces the new hidden state h_t.]

Weight Sharing in Time

[Diagram: the RNN unrolled in time starting from h_0; at every step t the same weights W, b map (x_t, h_(t−1)) to the hidden state h_t and the output y_t, up to h_T and y_T.]

At each step a loss L_t = |y_t − ŷ_t|² is computed between the target y_t and the prediction ŷ_t, and the total loss is the average over the sequence (see the sketch below):

L = (1/T) ∑_{t=1}^{T} L_t
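A minimal NumPy sketch of an unrolled vanilla RNN with shared weights W, b and the averaged loss L = (1/T) ∑_t L_t; the sizes and the random data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h = 10, 3, 8                       # sequence length, input and hidden sizes

Wxh = rng.standard_normal((d_h, d_x)) * 0.1  # shared weights, used at every step
Whh = rng.standard_normal((d_h, d_h)) * 0.1
Why = rng.standard_normal((1, d_h)) * 0.1
b = np.zeros(d_h)

xs = rng.standard_normal((T, d_x))           # input sequence
ys = rng.standard_normal(T)                  # target sequence

h = np.zeros(d_h)                            # h_0
losses = []
for t in range(T):
    h = np.tanh(Wxh @ xs[t] + Whh @ h + b)   # h_t = tanh(W [x_t, h_(t-1)] + b)
    y_hat = (Why @ h).item()                 # per-step prediction
    losses.append((ys[t] - y_hat) ** 2)      # L_t = |y_t - y_hat_t|^2

L = np.mean(losses)                          # L = (1/T) sum_t L_t
print(L)
```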

Backpropagation Through Time (BPTT)

• FORWARD PASS: over the entire sequence, compute the loss

• BACKWARD PASS: over the entire sequence, compute the gradients

Truncated BPTT

• “Carry” the hidden state forward through the whole sequence

• BACKWARD PASS: only over a smaller number of steps (see the sketch below)
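A sketch of truncated BPTT using PyTorch autograd (PyTorch is assumed to be installed; the chunk length, model sizes and random data are illustrative). The hidden state is carried across chunks but detached, so gradients only flow back through the last chunk of steps:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, chunk, d_x, d_h = 100, 10, 3, 8
xs = torch.randn(T, 1, d_x)                 # sequence of length T, batch size 1
ys = torch.randn(T, 1, 1)

cell = nn.RNNCell(d_x, d_h)                 # vanilla tanh RNN cell, shared weights
readout = nn.Linear(d_h, 1)
opt = torch.optim.SGD(list(cell.parameters()) + list(readout.parameters()), lr=0.01)

h = torch.zeros(1, d_h)
for start in range(0, T, chunk):
    h = h.detach()                          # "carry" the hidden state, but cut the graph here
    loss = 0.0
    for t in range(start, start + chunk):   # forward pass over one chunk
        h = cell(xs[t], h)
        loss = loss + (readout(h) - ys[t]).pow(2).mean()
    opt.zero_grad()
    (loss / chunk).backward()               # backward pass only over `chunk` steps
    opt.step()

print(loss.item() / chunk)                  # average loss of the last chunk
```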

[Diagram: the recurrence unrolled for four steps, h_t = tanh(W · [h_(t−1), x_t]) for t = 1, …, 4.]

Vanishing Gradients Problem

• Computing the gradient of the loss w.r.t. h_0 involves many factors of W and repeated tanh

• In the case of a linear activation and no bias, you would have factors like W(W(…(W h_0)))

• The gradient vanishes (explodes) if the largest singular value of W is < 1 (> 1), as the numeric sketch below illustrates
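A small numeric illustration (not from the slides): if every singular value of W equals σ, each repeated factor scales the vector by exactly σ, so its norm vanishes for σ < 1 and explodes for σ > 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def repeated_products(sigma, steps=50):
    # Random orthogonal matrix scaled by sigma, so every singular value equals sigma.
    Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
    W = sigma * Q
    v = np.ones(10)                       # stand-in for the backpropagated gradient
    for _ in range(steps):
        v = W @ v                         # one factor of W per time step
    return np.linalg.norm(v)

print(repeated_products(0.9))   # ~0.016: the gradient vanishes (0.9**50 * ||v0||)
print(repeated_products(1.1))   # ~371:   the gradient explodes (1.1**50 * ||v0||)
```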

Gating Architectures

Long Short-Term Memory (LSTM) Cell

[Diagram: input, forget and output gates g_i_t, g_f_t, g_o_t control how the cell state c_(t−1) is updated to c_t and how much of tanh(c_t) is exposed as the hidden state h_t; a tanh nonlinearity produces the candidate cell state.]

Gated Recurrent Unit (GRU)

[Diagram: a reset gate r_t and an update gate z_t blend the previous hidden state h_(t−1) with a tanh candidate state h̃_t to produce h_t.]

A minimal code sketch of one LSTM step is given below.
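A minimal NumPy sketch of one LSTM step using the standard gate equations; the stacked weight layout and all sizes are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*d_h, d_h + d_x), b has shape (4*d_h,)."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    g_i = sigmoid(z[0*d_h:1*d_h])          # input gate
    g_f = sigmoid(z[1*d_h:2*d_h])          # forget gate
    g_o = sigmoid(z[2*d_h:3*d_h])          # output gate
    c_tilde = np.tanh(z[3*d_h:4*d_h])      # candidate cell state
    c_t = g_f * c_prev + g_i * c_tilde     # cell state: element-wise update only
    h_t = g_o * np.tanh(c_t)               # hidden state exposed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_h = 3, 5
W = rng.standard_normal((4 * d_h, d_h + d_x)) * 0.1
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(4):                          # unroll over a short sequence
    h, c = lstm_step(rng.standard_normal(d_x), h, c, W, b)
print(h.shape, c.shape)                     # (5,) (5,)
```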

Gating Architectures

[Diagram: the LSTM cell unrolled over four steps, with inputs x_1, …, x_4, hidden states h_0, …, h_4 and cell states C_0, …, C_4.]

Uninterrupted gradient flow! The cell state path C_0 → C_1 → … → C_4 is modified only by element-wise gate operations, so gradients can propagate along it without repeatedly passing through the squashing nonlinearities.

RNN structure

• One-to-one, e.g. classification

• Many-to-one, e.g. sentiment analysis

• One-to-many, e.g. image captioning, video generation

• Many-to-many, e.g. machine translation, time-series prediction

Prediction of Chaotic Dynamics

• Forecasting the state of the Kuramoto-Sivashinsky equation

∂u/∂t = −ν ∂⁴u/∂x⁴ − ∂²u/∂x² − u ∂u/∂x

• RNNs can be chaotic !

• They are dynamical systems

Word embeddings

• Words can be represented by numbers (vectors) that encode semantic meaning
• E.g. Word2Vec
• Input: a LARGE CORPUS OF TEXT
• Learns a vector space where each word is assigned a vector
• How? Predict a word (target) from its neighboring words (context), or vice versa
• Encodes context information (see the sketch below)

[Diagram: a 2-D view of the embedding space; the offset from France to Paris matches the offset from Greece to Athens (closest word).]
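A hedged sketch of the France/Paris vs. Greece/Athens analogy using the gensim library (assumed installed, parameter names follow gensim 4.x); the toy corpus is far too small to learn real semantics and only illustrates the intended usage:

```python
# Assumes gensim is installed: pip install gensim
from gensim.models import Word2Vec

# Toy corpus; in practice Word2Vec is trained on a large corpus of text.
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["athens", "is", "the", "capital", "of", "greece"],
    ["berlin", "is", "the", "capital", "of", "germany"],
] * 100

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=20)

# Analogy: vector(paris) - vector(france) + vector(greece) should be close to athens.
# On a corpus this small the top result is not guaranteed to be "athens".
print(model.wv.most_similar(positive=["paris", "greece"], negative=["france"], topn=3))
```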

Applications

(Figure from “The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches”, 2018)

• Object detection, object localisation, image/video segmentation

• Autonomous driving

• Brain cancer detection

• Skin cancer recognition

• Speech recognition, machine translation, image/video captioning, medicine/biology

Outlook


Why Deep Learning?
• Universal approach for learning problems
• Robust approach, does not require “much” expert knowledge
• Generalization, scalability

Challenges?
• Big data and scalability
• Generalization, transfer learning, multi-task learning
• Generating new “artificial” datasets for applications where data is scarce (generative models)
• Understanding/explainable models, incorporating physics
• Causality, and not plain pattern recognition/correlations
• Energy-efficient implementations on mobiles/FPGAs, etc.

[Plot: performance as a function of the amount of data, comparing classical ML and deep learning.]