
Foundations of Artificial Intelligence 14. Deep Learning

An Overview

Joschka Boedecker and Wolfram Burgard and Bernhard Nebel
Guest lecturer: Frank Hutter

Albert-Ludwigs-Universität Freiburg

July 14, 2017


Motivation: Deep Learning in the News



Motivation: Why is Deep Learning so Popular?

Excellent empirical results, e.g., in computer vision



Motivation: Why is Deep Learning so Popular?

Excellent empirical results, e.g., in speech recognition



Motivation: Why is Deep Learning so Popular?

Excellent empirical results, e.g., in reasoning in games

- Superhuman performance in playing Atari games [Mnih et al., Nature 2015]

- Beating the world's best Go player [Silver et al., Nature 2016]

More reasons for the popularity of deep learning appear throughout this lecture



Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup



Some definitions

Representation learning

“a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification”

Deep learning

“representation learning methods with multiple levels of representation, obtained by composing simple but nonlinear modules that each transform the representation at one level into a [...] higher, slightly more abstract (one)”

(LeCun et al., 2015)



Standard Machine Learning Pipeline

Standard machine learning algorithms are based on high-level attributes or features of the data

E.g., the binary attributes we used for decision trees

This requires (often substantial) feature engineering


Representation Learning Pipeline

Jointly learn features and classifier, directly from raw data

This is also referred to as end-to-end learning



Shallow vs. Deep Learning

[Figure: feature hierarchy learned from an image: Pixels → Edges → Contours → Object Parts → Classes (Human, Cat, Dog)]

Deep Learning: learning a hierarchy of representations that build on each other, from simple to complex

Quintessential deep learning model: Multilayer Perceptrons


Biological Inspiration of Artificial Neural Networks

Dendrites input information to the cell

Neuron fires (has action potential) if a certain threshold for the voltage is exceeded

Output of information by axon

The axon is connected to dendrites of other cells via synapses

Learning: adaptation of the synapse's efficiency, its synaptic weight

[Figure: schematic neuron with soma, dendrites, axon, and synapses]


A Very Brief History of Neural Networks

Neural networks have a long history

- 1942: artificial neurons (McCulloch/Pitts)
- 1958/1969: perceptron (Rosenblatt; Minsky/Papert)
- 1986: multilayer perceptrons and backpropagation (Rumelhart)
- 1989: convolutional neural networks (LeCun)

Alternative theoretically motivated methods outperformed NNs

- Exaggerated expectations: "It works like the brain" (No, it does not!)

Why the sudden success of neural networks in the last 5 years?

- Data: availability of massive amounts of labelled data
- Compute power: ability to train very large neural networks on GPUs
- Methodological advances: many since the first renewed popularization


Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup


Multilayer Perceptrons

[Figure: two-layer network with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weight matrices w^(1), w^(2); figure from Bishop, Ch. 5]

Network is organized in layers

- Outputs of the k-th layer serve as inputs of the (k+1)-th layer

Each layer k only does quite simple computations:

- Linear function of the previous layer's outputs z_{k−1}: a_k = W_k z_{k−1} + b_k

- Nonlinear transformation z_k = h_k(a_k) through activation function h_k (a small sketch of this forward pass follows below)
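To make these two computations concrete, here is a minimal NumPy sketch of the forward pass (the layer sizes, the tanh hidden activation, and the softmax output are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np

def forward(x, weights, biases, activations):
    """MLP forward pass: a_k = W_k z_{k-1} + b_k, then z_k = h_k(a_k)."""
    z = x
    for W, b, h in zip(weights, biases, activations):
        a = W @ z + b   # linear function of the previous layer's outputs
        z = h(a)        # nonlinear transformation via the activation h_k
    return z

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# toy network: 4 inputs -> 5 hidden units (tanh) -> 3 outputs (softmax)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
x = rng.normal(size=4)
y = forward(x, [W1, W2], [b1, b2], [np.tanh, softmax])
print(y, y.sum())   # class probabilities that sum to 1
```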


Activation Functions - Examples

Logistic sigmoid activation function:

h_logistic(a) = 1 / (1 + exp(−a))

[Plot: logistic sigmoid, saturating at 0 and 1]

Hyperbolic tangent activation function:

h_tanh(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a))

[Plot: tanh, saturating at −1 and 1]


Activation Functions - Examples (cont.)

Linear activation function:

h_linear(a) = a

[Plot: identity line]

Rectified Linear (ReLU) activation function:

h_relu(a) = max(0, a)

[Plot: zero for a < 0, identity for a ≥ 0]


Output unit activation functions

Depending on the task, typically:

for regression: single output neuron with linear activation

for binary classification: single output neuron with logistic/tanh activation

for multiclass classification: K output neurons and softmax activation

y_k(x, w) = h_softmax(a)_k = exp(a_k) / Σ_j exp(a_j)

→ so for the complete output layer:

y(x, w) = [p(y_1 = 1 | x), p(y_2 = 1 | x), ..., p(y_K = 1 | x)]ᵀ = exp(a) / Σ_{j=1}^K exp(a_j)
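As a small NumPy sketch of this softmax output layer (the max-subtraction is a standard numerical-stability trick added for the example, not something the slide states):

```python
import numpy as np

def softmax(a):
    """Map a vector of output activations a to class probabilities."""
    a = a - a.max()          # stabilization: exp() would overflow for large activations
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, -1.0])   # activations of K = 3 output neurons
p = softmax(a)
print(p, p.sum())                # three probabilities that sum to 1
```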


Loss function to be minimized

Consider a binary classification task using a single output unit with logistic sigmoid activation function:

y(x, w) = h_logistic(a) = 1 / (1 + exp(−a))

This defines a (Bernoulli) probability distribution over the label of each data point x_n:

p(y_n = 1 | x_n, w) = y(x_n, w)

p(y_n = 0 | x_n, w) = 1 − y(x_n, w)

Rewritten:

p(y_n | x_n, w) = y(x_n, w)^{y_n} · (1 − y(x_n, w))^{1−y_n}

Minimize the negative log likelihood of this distribution (aka cross entropy):

L(w) = −Σ_{n=1}^N { y_n ln y(x_n, w) + (1 − y_n) ln(1 − y(x_n, w)) }
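A minimal NumPy sketch of this loss (the clipping constant eps is an addition to avoid log(0); the labels and predictions are made up for the example):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log likelihood of the Bernoulli model above.

    y_true: labels in {0, 1}; y_pred: network outputs y(x_n, w) in (0, 1).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))   # ≈ 0.84
```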


Loss function to be minimized

For multiclass classification, use generalization of cross-entropy error:

L(w) = −Σ_{n=1}^N Σ_{k=1}^K y_{kn} ln y_k(x_n, w)

For regression, e.g., use the squared error function:

L(w) = (1/2) Σ_{n=1}^N (y(x_n, w) − y_n)²
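Both loss functions are short to implement; a NumPy sketch (the one-hot targets and the prediction values below are invented for illustration):

```python
import numpy as np

def multiclass_cross_entropy(Y_true, Y_pred, eps=1e-12):
    """Generalized cross-entropy: Y_true is one-hot with shape (N, K), Y_pred holds softmax outputs."""
    return -np.sum(Y_true * np.log(np.clip(Y_pred, eps, 1.0)))

def squared_error(y_true, y_pred):
    """Squared error loss for regression."""
    return 0.5 * np.sum((y_pred - y_true) ** 2)

Y_true = np.array([[1, 0, 0], [0, 0, 1]])
Y_pred = np.array([[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]])
print(multiclass_cross_entropy(Y_true, Y_pred))                      # ≈ 1.05
print(squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.5])))     # 0.25
```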


Optimizing a loss / error function

Given training data D = ⟨(x_i, y_i)⟩_{i=1}^N and the topology of an MLP

Task: adapt the weights w to minimize the loss:

minimize_w L(w; D)

Interpret L just as a mathematical function depending on w and forget about its semantics; then we are faced with a problem of mathematical optimization


Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup


Optimization theory

Discusses mathematical problems of the form:

minimize_u f(u),

where u is any vector of suitable size.

Simplification: here, we only consider functions f which are continuous and differentiable

[Figure: a non-continuous (disrupted) function, a continuous but non-differentiable (folded) function, and a differentiable (smooth) function]


Optimization theory (cont.)

A global minimum u* is a point such that:

f(u*) ≤ f(u) for all u.

A local minimum u+ is a point such that there exists r > 0 with

f(u+) ≤ f(u) for all points u with ||u − u+|| < r

[Figure: function with one global minimum and several local minima]


Optimization theory (cont.)

Analytical way to find a minimum: for a local minimum u+, the gradient of f becomes zero:

∂f/∂u_i (u+) = 0 for all i

Hence, calculating all partial derivatives and looking for zeros is a good idea

But: for neural networks, we can't write down a solution for the minimization problem in closed form

- even though ∂f/∂u_i = 0 holds at (local) solution points
→ need to resort to iterative methods


Optimization theory (cont.)

Numerical way to find a minimum: searching. Assume we start at a point u.

Which is the best direction to search for a point v with f(v) < f(u)?

Which is the best step width?

General principle:

v_i ← u_i − ε ∂f/∂u_i

ε > 0 is called the learning rate

[Figures: 1D examples at a point u: slope is negative (descending), go right! / slope is positive (ascending), go left! / slope is small, small step! / slope is large, large step!]


Gradient descent

Gradient descent approach:

Require: mathematical function f, learning rate ε > 0
Ensure: returned vector is close to a local minimum of f
1: choose an initial point u
2: while ||∇_u f(u)|| not close to 0 do
3:     u ← u − ε · ∇_u f(u)
4: end while
5: return u

Note: ∇_u f := [∂f/∂u_1, ..., ∂f/∂u_K] for K-dimensional u
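A direct NumPy translation of this procedure (the quadratic test function, learning rate, and stopping tolerance are assumptions chosen for the example):

```python
import numpy as np

def gradient_descent(grad_f, u0, eps=0.1, tol=1e-6, max_iter=10_000):
    """Plain gradient descent: repeat u <- u - eps * grad_f(u) until the gradient is (almost) zero."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:     # ||grad f(u)|| close to 0
            break
        u = u - eps * g
    return u

# f(u) = (u1 - 3)^2 + 2*(u2 + 1)^2 has its minimum at (3, -1)
grad_f = lambda u: np.array([2 * (u[0] - 3), 4 * (u[1] + 1)])
print(gradient_descent(grad_f, u0=[0.0, 0.0]))   # ≈ [ 3. -1.]
```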


Calculating partial derivatives

Our typical loss functions are defined across data points:

L(w) = Σ_{n=1}^N L_n(w),   with L_n(w) = L(f(x_n; w), y_n)

We can compute their partial derivatives as a sum over data points:

∂L/∂w_j = Σ_{n=1}^N ∂L_n/∂w_j

The method of backpropagation makes consistent use of the chain rule of calculus to compute the partial derivatives ∂L_n/∂w_j w.r.t. each network weight w_j, re-using previously computed results

- Backpropagation is not covered here; it is covered, e.g., in the ML lecture
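A tiny sketch of this decomposition (a one-parameter linear model with the squared-error loss from above; the data are invented), checked against a finite-difference approximation:

```python
import numpy as np

# one-parameter model f(x; w) = w * x with squared-error loss
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 6.5])
w = 1.5

loss_n = lambda w, x, y: 0.5 * (w * x - y) ** 2    # L_n(w)
grad_n = lambda w, x, y: (w * x - y) * x           # dL_n/dw by the chain rule

total_grad = sum(grad_n(w, x, y) for x, y in zip(xs, ys))   # dL/dw = sum over data points

# sanity check with a finite-difference approximation of dL/dw
L = lambda w: sum(loss_n(w, x, y) for x, y in zip(xs, ys))
h = 1e-6
print(total_grad, (L(w + h) - L(w - h)) / (2 * h))   # both ≈ -8.5
```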


Do we need gradients based on the entire data set?

Using the entire set is referred to as batch gradient descent

Gradients get more accurate when based on more data points

- But using more data has diminishing returns w.r.t. reduction in error
- Usually faster progress by updating more often based on cheaper, less accurate estimates of the gradient

Common approach in practice: compute gradients over mini-batches

- Mini-batch: small subset of the training data
- Today, this is commonly called stochastic gradient descent (SGD)


Stochastic gradient descent

Stochastic gradient descent (SGD)

Require: mathematical function f, learning rate ε > 0
Ensure: returned vector is close to a local minimum of f
1: choose an initial point w
2: while stopping criterion not met do
3:     Sample a minibatch of m examples x^(1), ..., x^(m) with corresponding targets y^(i) from the training set
4:     Compute gradient g ← (1/m) ∇_w Σ_{i=1}^m L(f(x^(i); w), y^(i))
5:     Update parameter: w ← w − ε · g
6: end while
7: return w
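A NumPy sketch of this loop for a linear model with squared-error loss (the synthetic data, learning rate, and minibatch size are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data: y = 2*x1 - 3*x2 + noise
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
eps, m = 0.05, 32                               # learning rate and minibatch size

for step in range(2000):
    idx = rng.integers(0, len(X), size=m)       # step 3: sample a minibatch of m examples
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ w - yb) / m                # step 4: gradient of the minibatch squared-error loss
    w = w - eps * g                             # step 5: update parameter
print(w)                                        # ≈ [ 2. -3.]
```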


Problems with suboptimal choices for learning rate

choice of ε

1. case small ε: convergence

2. case very small ε: convergence, but it may take very long

3. case medium size ε: convergence

4. case large ε: divergence

[Figures: loss trajectories for each of the four cases]


Other reasons for problems with gradient descent

flat spots and steep valleys: need a larger ε in u to jump over the uninteresting flat area, but a smaller ε in v to meet the minimum

[Figure: 1D function with a flat region around u and a steep valley around v]

zig-zagging: in higher dimensions, a single ε is not appropriate for all dimensions


Learning rate quiz

Which curve denotes low, high, very high, and good learning rate?

[Figure: four loss-vs-epoch curves labeled a) to d)]


Gradient descent – Conclusion

Pure gradient descent is a nice framework

In practice, stochastic gradient descent is used

Finding the right learning rate ε is tedious

Heuristics to overcome problems of gradient descent:

Gradient descent with momentum (see the sketch below)

Individual learning rates for each dimension

Adaptive learning rates

Decoupling step length from partial derivatives
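A minimal sketch of the momentum heuristic (the momentum coefficient 0.9 and the quadratic test function are assumptions chosen for illustration, not values from the lecture):

```python
import numpy as np

def gd_momentum(grad_f, u0, eps=0.05, beta=0.9, steps=200):
    """Gradient descent with momentum: update along a running velocity instead of the raw gradient."""
    u = np.asarray(u0, dtype=float)
    v = np.zeros_like(u)
    for _ in range(steps):
        v = beta * v - eps * grad_f(u)   # accumulate a velocity (dampens zig-zagging)
        u = u + v                        # move along the velocity
    return u

# narrow quadratic valley: f(u) = 0.5*u1^2 + 10*u2^2, minimum at the origin
grad_f = lambda u: np.array([u[0], 20.0 * u[1]])
print(gd_momentum(grad_f, u0=[5.0, 1.0]))   # close to [0, 0]
```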


Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup


Historical context and inspiration from Neuroscience

Hubel & Wiesel (Nobel prize 1981) found in several studies in the 1950s and 1960s:

Visual cortex has feature detectors (e.g., cells with a preference for edges with specific orientation)

- edge location did not matter

Simple cells as local feature detectors

Complex cells pool responses of simple cells

There is a feature hierarchy


Learned feature hierarchy

[Figure: learned feature hierarchy, from Yann LeCun's slides]

[slide credit: Andrej Karpathy]


Convolutions illustrated

[Figure: Convolution Layer: a 32x32x3 image and a 5x5x3 filter]

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products"

Filters always extend the full depth of the input volume

[slide credit: Andrej Karpathy]


Convolutions illustrated (cont.)

[Figure: Convolution Layer: 32x32x3 image, 5x5x3 filter]

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

[slide credit: Andrej Karpathy]


Convolutions illustrated (cont.)

[Figure: Convolution Layer: convolve (slide) the 5x5x3 filter over all spatial locations of the 32x32x3 image, producing a 28x28x1 activation map]

[slide credit: Andrej Karpathy]


Convolutions – several filters

[Figure: consider a second (green) 5x5x3 filter, convolved over all spatial locations, giving a second 28x28x1 activation map]

[slide credit: Andrej Karpathy]


Convolutions – several filters

[Figure: Convolution Layer mapping the 32x32x3 image to six 28x28 activation maps]

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

We stack these up to get a “new image” of size 28x28x6!

[slide credit: Andrej Karpathy]
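To make the shapes concrete, here is a naive NumPy sketch of this convolution (valid convolution, stride 1, zero bias; the random filter values are placeholders):

```python
import numpy as np

def conv2d(image, filters):
    """Slide each fh x fw x C filter over the H x W x C image, taking dot products (stride 1, no padding)."""
    H, W, C = image.shape
    n_f, fh, fw, fc = filters.shape
    assert fc == C, "filters always extend the full depth of the input volume"
    out = np.zeros((H - fh + 1, W - fw + 1, n_f))
    for k in range(n_f):
        for i in range(H - fh + 1):
            for j in range(W - fw + 1):
                patch = image[i:i + fh, j:j + fw, :]         # small 5x5x3 chunk of the image
                out[i, j, k] = np.sum(patch * filters[k])    # 75-dimensional dot product
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))       # 32x32x3 image
filters = rng.normal(size=(6, 5, 5, 3))    # six 5x5x3 filters
print(conv2d(image, filters).shape)        # (28, 28, 6): the stacked activation maps
```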


Stacking several convolutional layers

Convolutional layers stacked in a ConvNet

Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions

[Figure: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → ...]

[slide credit: Andrej Karpathy]


Learned feature hierarchy

[Figure: learned feature hierarchy, from Yann LeCun's slides]

[slide credit: Andrej Karpathy]


Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup


Feedforward vs Recurrent Neural Networks

1. Recurrent neural networks

1.1 First impression

There are two major types of neural networks, feedforward and recurrent. In feedforward networks, activation is "piped" through the network from input units to output units (from left to right in the left drawing in Fig. 1.1):

[Figure 1.1: Typical structure of a feedforward network (left) and a recurrent network (right).]

Short characterization of feedforward networks:

- typically, activation is fed forward from input to output through "hidden layers" ("Multi-Layer Perceptrons", MLP), though many other architectures exist
- mathematically, they implement static input-output mappings (functions)
- basic theoretical result: MLPs can approximate arbitrary (term needs some qualification) nonlinear maps with arbitrary precision ("universal approximation property")
- most popular supervised training algorithm: backpropagation algorithm
- huge literature, 95 % of neural network publications concern feedforward nets (my estimate)
- have proven useful in many practical applications as approximators of nonlinear functions and as pattern classificators
- are not the topic considered in this tutorial

By contrast, a recurrent neural network (RNN) has (at least one) cyclic path of synaptic connections. Basic characteristics:

- all biological neural networks are recurrent
- mathematically, RNNs implement dynamical systems
- basic theoretical result: RNNs can approximate arbitrary (term needs some qualification) dynamical systems with arbitrary precision ("universal approximation property")
- several types of training algorithms are known, no clear winner
- theoretical and practical difficulties by and large have prevented practical applications so far

[Source: Jaeger, 2001]


Recurrent Neural Networks (RNNs)

Neural Networks that allow for cycles in the connectivity graph

Cycles let information persist in the network for some time (state), and provide a time-context or (fading) memory (a small sketch of this follows below)

Very powerful for processing sequences

Implement dynamical systems rather than function mappings, and can approximate any dynamical system with arbitrary precision

They are Turing-complete [Siegelmann and Sontag, 1991]
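A minimal sketch of one recurrent layer processing a sequence (the tanh update h_t = tanh(W_x x_t + W_h h_{t-1} + b) is a common simple choice, used here as an illustrative assumption rather than something prescribed by the slide):

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a simple recurrent layer over a sequence; the hidden state h carries the (fading) time context."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:                               # the cycle: h feeds back into the next step
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(4, 3))       # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(4, 4))       # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
sequence = rng.normal(size=(5, 3))             # 5 time steps of 3-dimensional inputs
print(rnn_forward(sequence, W_x, W_h, b).shape)   # (5, 4): one hidden state per time step
```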


Abstract schematic

With fully connected hidden layer:


Sequence-to-sequence mapping

[Figure: one-to-many (e.g., image caption generation) and many-to-one (e.g., temporal classification) mappings]


Sequence-to-sequence mapping (cont.)

[Figure: many-to-many mappings, e.g., video frame labeling and automatic translation]


Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup


Reinforcement Learning

Finding optimal policies for MDPs

Reminder: states s ∈ S, actions a ∈ A, transition model T , rewards r

Policy: complete mapping π : S → A that specifies for each state s which action π(s) to take


Deep Reinforcement Learning

Policy-based deep RL
- Represent policy π : S → A as a deep neural network with weights w
- Evaluate w by "rolling out" the policy defined by w (see the toy sketch below)
- Optimize weights to obtain higher rewards (using approx. gradients)
- Examples: AlphaGo & modern Atari agents

Value-based deep RL
- Basically value iteration, but using a deep neural network (= function approximator) to generalize across many states and actions
- Approximate optimal state-value function U(s) or state-action value function Q(s, a)

Model-based deep RL
- If transition model T is not known
- Approximate T with a deep neural network (learned from data)
- Plan using this approximate transition model

→ Use deep neural networks to represent policy / value function / model
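To make the value-based variant concrete, here is a minimal sketch of semi-gradient Q-learning with a small one-hidden-layer network as the function approximator, on an invented random toy MDP. All names, dimensions, and hyperparameters are placeholder assumptions, not the setup used in the Atari or Go systems.

import numpy as np

# Value-based deep RL in miniature: a (shallow) neural network Q(s, .) trained
# with semi-gradient Q-learning on a random toy MDP.
rng = np.random.default_rng(0)
n_states, n_actions, n_hidden = 10, 4, 32
gamma, lr, epsilon = 0.95, 0.01, 0.1

# One-hidden-layer Q-network: Q(s, .) = W2 @ tanh(W1 @ x(s) + b1) + b2
W1 = rng.normal(0, 0.1, (n_hidden, n_states)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_actions, n_hidden)); b2 = np.zeros(n_actions)

def one_hot(s):
    x = np.zeros(n_states); x[s] = 1.0
    return x

def q_values(s):
    h = np.tanh(W1 @ one_hot(s) + b1)
    return W2 @ h + b2, h

# Random toy environment (transition and reward tables), just so the loop runs.
T = rng.integers(0, n_states, (n_states, n_actions))
R = rng.normal(0, 1, (n_states, n_actions))

s = 0
for step in range(1000):
    q, h = q_values(s)
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(q))
    s_next, reward = int(T[s, a]), float(R[s, a])

    # TD target, bootstrapped from the network's own next-state estimate
    q_next, _ = q_values(s_next)
    delta = q[a] - (reward + gamma * np.max(q_next))

    # Manual gradient step on 0.5 * delta^2 through the chosen action's output
    x, mask = one_hot(s), (np.arange(n_actions) == a)
    dh = delta * W2[a] * (1.0 - h ** 2)
    W2 -= lr * np.outer(delta * mask, h); b2 -= lr * delta * mask
    W1 -= lr * np.outer(dh, x);           b1 -= lr * dh

    s = s_next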

Page 93: Foundations of Artificial Intelligence – 14. Deep Learning

Lecture Overview

1 Representation Learning and Deep Learning

2 Multilayer Perceptrons

3 Optimization of Neural Networks in a Nutshell

4 Overview of Some Advanced Topics
- Convolutional neural networks
- Recurrent neural networks
- Deep reinforcement learning

5 Wrapup

Page 94: Foundations of Artificial Intelligence – 14. Deep Learning

An Exciting Approach to AI: Learning as an Alternative to Traditional Programming

We don’t understand how the human brain solves certain problems

- Face recognition
- Speech recognition
- Playing Atari games
- Picking the next move in the game of Go

We can nevertheless learn these tasks from data/experience

If the task changes, we simply re-train

We can construct computer systems that are too complex for us to understand ourselves anymore ...

- E.g., deep neural networks have millions of weights.
- E.g., AlphaGo, the system that beat world champion Lee Sedol

+ David Silver, lead author of AlphaGo, cannot say why a move is good
+ Paraphrased: “You would have to ask a Go expert.”

Page 97: Foundations of Artificial Intelligence – 14. Deep Learning

Summary: Why is Deep Learning so Popular?

Excellent empirical results in many domains

- very scalable to big data
- but beware: not a silver bullet

Analogy to the ways humans process information

- mostly tangential

Allows end-to-end learning

- no more need for many complicated subsystems
- e.g., dramatically simplified Google’s translation system

Very versatile/flexible

- easy to combine building blocks
- allows supervised, unsupervised, and reinforcement learning

Page 101: Foundations of Artificial Intelligence – 14. Deep Learning

Lots of Work on Deep Learning in Freiburg

Computer Vision (Thomas Brox)

- Images, video

Robotics (Wolfram Burgard)

- Navigation, grasping, object recognition

Neurorobotics (Joschka Boedecker)

- Robotic control

Machine Learning (Frank Hutter)

- Optimization of deep nets, learning to learn

Neuroscience (Tonio Ball, Michael Tangermann, and others)

- EEG data and other applications from BrainLinks-BrainTools

→ Details when the individual groups present their research

Page 102: Foundations of Artificial Intelligence – 14. Deep Learning

Summary by learning goals

Having heard this lecture, you can now ...

Explain the terms representation learning and deep learning

Describe the main principles behind MLPs

Describe how neural networks are optimized in practice

On a high level, describe

- Convolutional Neural Networks
- Recurrent Neural Networks
- Deep Reinforcement Learning
