Introduction to Machine Learning (67577)

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Deep Learning


Outline

1 Gradient-Based Learning

2 Computation Graph and Backpropagation

3 Expressiveness and Sample Complexity

4 Computational Complexity

5 Deep Learning — Examples

6 Convolutional Networks


Gradient-Based Learning

Consider a hypothesis class which is parameterized by a vector θ ∈ R^d

Loss function of h_θ on example (x, y) is denoted ℓ(θ; (x, y))

The true and empirical risks are

L_D(θ) = E_{(x,y)∼D}[ℓ(θ; (x, y))] ,   L_S(θ) = (1/m) ∑_{i=1}^m ℓ(θ; (x_i, y_i))

Assumption: ℓ is differentiable w.r.t. θ and we can calculate ∇ℓ(θ; (x, y)) efficiently

Minimize L_D or L_S with Stochastic Gradient Descent (SGD): start with θ^(0) and update θ^(t+1) = θ^(t) − η_t ∇ℓ(θ^(t); (x, y))

SGD converges for convex problems. It may work for non-convex problems if we initialize "close enough" to a "good minimum"
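A minimal sketch of this update rule in Python (illustrative only: the gradient oracle grad_loss, the decaying step size, and the toy least-squares data are assumptions, not part of the lecture):

import numpy as np

def sgd(grad_loss, theta0, data, lr=0.1, epochs=10, seed=0):
    # theta^(t+1) = theta^(t) - eta_t * grad ell(theta^(t); (x, y)) over randomly drawn examples
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x, y = data[i]
            eta = lr / np.sqrt(t + 1)        # one common choice of decaying step size
            theta = theta - eta * grad_loss(theta, x, y)
            t += 1
    return theta

# Toy example: one-dimensional least squares, ell(w; (x, y)) = (wx - y)^2, so grad = 2(wx - y)x
data = [(x, 3.0 * x) for x in np.linspace(-1.0, 1.0, 50)]
w = sgd(lambda th, x, y: np.array([2.0 * (th[0] * x - y) * x]), [0.0], data)
print(w)    # approaches [3.0]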



Computation Graph

A computation graph for one-dimensional Least Squares (numbering of nodes corresponds to topological sort):

0  Input layer: x
1  Input layer: y
2  Variable layer: w
3  Linear layer: p = wx
4  Subtract layer: r = p − y
5  Squared layer: s = r²

Gradient Calculation using the Chain Rule

Fix x, y and write ℓ as a function of w by

ℓ(w) = s(r_y(p_x(w))) = (s ∘ r_y ∘ p_x)(w) .

Chain rule:

ℓ′(w) = (s ∘ r_y ∘ p_x)′(w)
      = s′(r_y(p_x(w))) · (r_y ∘ p_x)′(w)
      = s′(r_y(p_x(w))) · r_y′(p_x(w)) · p_x′(w)

Backpropagation: Calculate by a Forward-Backward pass over the graph
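Instantiating the three factors for the least-squares graph above (an illustrative worked step, not on the slide): with p_x(w) = wx, r_y(p) = p − y and s(r) = r², the factors are s′(r) = 2r, r_y′ = 1 and p_x′ = x, so

\[
\ell'(w) = s'(r_y(p_x(w))) \cdot r_y'(p_x(w)) \cdot p_x'(w)
         = 2(wx - y) \cdot 1 \cdot x = 2(wx - y)\,x .
\]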


Computation Graph — Forward

For t = 0, 1, . . . , T − 1:
    Layer[t]->output = Layer[t]->function(Layer[t]->inputs)

(Same least-squares computation graph as above: 0 x, 1 y, 2 w, 3 p = wx, 4 r = p − y, 5 s = r².)
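A minimal Python rendering of this forward loop on the least-squares graph (an illustrative sketch; the Layer class and its fields are assumptions mirroring the pseudocode, not the lecture's actual code):

class Layer:
    def __init__(self, function, inputs=()):
        self.function = function          # computes this layer's output from its inputs' outputs
        self.inputs = list(inputs)        # layers feeding into this one
        self.output = None

def forward(layers):
    # layers are in topological order, so every input is computed before it is used
    for layer in layers:
        layer.output = layer.function(*[i.output for i in layer.inputs])
    return layers[-1].output

# Least-squares graph: 0 x, 1 y, 2 w, 3 p = wx, 4 r = p - y, 5 s = r^2
x = Layer(lambda: 2.0)
y = Layer(lambda: 5.0)
w = Layer(lambda: 1.5)
p = Layer(lambda w_, x_: w_ * x_, [w, x])
r = Layer(lambda p_, y_: p_ - y_, [p, y])
s = Layer(lambda r_: r_ ** 2, [r])
print(forward([x, y, w, p, r, s]))        # (1.5*2.0 - 5.0)^2 = 4.0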

Computation Graph — Backward

Recall: ℓ′(w) = s′(r_y(p_x(w))) · r_y′(p_x(w)) · p_x′(w)

Layer[T-1]->delta = 1
For t = T − 1, T − 2, . . . , 0:
    For i in Layer[t]->inputs:
        i->delta = Layer[t]->delta *
                   Layer[t]->derivative(i, Layer[t]->inputs)

(Same least-squares computation graph as above: 0 x, 1 y, 2 w, 3 p = wx, 4 r = p − y, 5 s = r².)
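The same backward rule written out by hand for the least-squares graph, starting from delta = 1 at the loss node (an illustrative sketch with arbitrary numbers, not lecture code):

x, y, w = 2.0, 5.0, 1.5
p = w * x; r = p - y; s = r ** 2           # forward values

delta_s = 1.0                               # Layer[T-1]->delta = 1
delta_r = delta_s * (2.0 * r)               # ds/dr = 2r
delta_p = delta_r * 1.0                     # dr/dp = 1
delta_y = delta_r * (-1.0)                  # dr/dy = -1
delta_w = delta_p * x                       # dp/dw = x
delta_x = delta_p * w                       # dp/dx = w
print(delta_w)                              # 2*(w*x - y)*x = -8.0, matching the chain rule above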

Layers

Nodes in the computation graph are often called layers

Each layer is a simple differentiable function

Layers can implement multivariate functions

Examples of popular layers:

Affine layer: O = WX + b 1^⊤ where W ∈ R^{m,n}, X ∈ R^{n,c}, b ∈ R^m

Unary layer: ∀i, o_i = f(x_i) for some f : R → R, e.g.
  Sigmoid: f(x) = (1 + exp(−x))^{−1}
  Rectified Linear Unit (ReLU): f(x) = max{0, x} (discuss: derivative?)

Binary layer: ∀i, o_i = f(x_i, y_i) for some f : R² → R, e.g.
  Add layer: f(x, y) = x + y
  Hinge loss: f(x, y) = [1 − yx]_+
  Logistic loss: f(x, y) = log(1 + exp(−yx))

Main message

Computation graph enables us to construct very complicated functions from simple building blocks
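Illustrative numpy versions of a few of these layers (shapes follow the slide: W is m×n, X is n×c, b has length m; the concrete numbers below are placeholders):

import numpy as np

def affine(W, X, b):
    return W @ X + b[:, None]               # O = W X + b 1^T, broadcasting b across the c columns

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

def relu(X):
    return np.maximum(0.0, X)               # derivative: 0 for X < 0, 1 for X > 0 (either choice at 0)

W = np.array([[1.0, -1.0], [0.5, 2.0]])                 # m = 2, n = 2
X = np.array([[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]])       # n = 2, c = 3
b = np.array([0.1, -0.2])
print(relu(affine(W, X, b)).shape)                      # (2, 3)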


Backpropagation for multivariate layers

Recall the backpropagation rule:
    For i in Layer[t]->inputs:
        i->delta = Layer[t]->delta *
                   Layer[t]->derivative(i, Layer[t]->inputs)

"delta" is now a vector (same dimension as the output of the layer)

"derivative" is the Jacobian matrix: the Jacobian of f : R^n → R^m at x ∈ R^n, denoted J_x(f), is the m × n matrix whose i, j element is the partial derivative of f_i : R^n → R w.r.t. its j'th variable at x.

The multiplication is matrix multiplication

The correctness of the algorithm follows from the multivariate chain rule

J_w(f ∘ g) = J_{g(w)}(f) J_w(g)
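A small numpy illustration of one such step (an assumed example, not from the slides): for the layer f(x) = Wx the Jacobian with respect to x is W, so multiplying the incoming delta (a row vector of the output's dimension) by this Jacobian gives the delta of the input, exactly as in the rule above.

import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
W = rng.normal(size=(m, n))
x = rng.normal(size=n)

delta_out = rng.normal(size=m)      # same dimension as the layer output W @ x
delta_x = delta_out @ W             # "delta * Jacobian": (1 x m) times (m x n) gives (1 x n)
print(delta_x.shape)                # (5,) -- one entry per input coordinate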


Jacobian — Examples

If f : R^n → R^n is element-wise application of σ : R → R then J_x(f) = diag((σ′(x_1), . . . , σ′(x_n))).

Let f(x, w, b) = w^⊤ x + b for w, x ∈ R^n, b ∈ R. Then:

J_x(f) = w^⊤ ,   J_w(f) = x^⊤ ,   J_b(f) = 1

Let f(W, x) = Wx. Then:

J_x(f) = W ,   J_W(f) = diag(x^⊤, . . . , x^⊤) (block-diagonal, with x^⊤ in each of the m diagonal blocks)
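A finite-difference check of the last example (illustrative; the sizes and the step eps are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 4, 3, 1e-6
W = rng.normal(size=(m, n))
x = rng.normal(size=n)

f = lambda x_: W @ x_
# Column j of the Jacobian is approximated by a central difference along the j-th basis vector
J = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)], axis=1)
print(np.allclose(J, W, atol=1e-5))   # J_x(f) = W, as claimed above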



Sample Complexity

If we learn d parameters, and each one is stored in, say, 32 bits, then the number of hypotheses in our class is at most 2^{32d}. It follows that the sample complexity is order of d.

Other ways to improve generalization include various forms of regularization
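The counting argument can be made quantitative with the standard bound for finite hypothesis classes (stated here for completeness with the usual constants; this is a reconstruction, not taken from the slide): with |H| ≤ 2^{32d},

\[
m_{\mathcal{H}}(\epsilon,\delta) \;\le\; \left\lceil \frac{2\log(2|\mathcal{H}|/\delta)}{\epsilon^{2}} \right\rceil
\;=\; O\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right),
\]

so the number of examples needed grows linearly with the number of parameters d.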


Expressiveness

So far in the course we considered hypotheses of the form x ↦ w^⊤ x + b

Now, consider the following computation graph, known as a "one hidden layer network":

0  Input layer: x
1  Input layer: y
2  Variable layer: W^(1)
3  Variable layer: b^(1)
4  Variable layer: W^(2)
5  Variable layer: b^(2)
6  Affine layer: a^(1) = W^(1) x + b^(1)
7  ReLU layer: h^(1) = [a^(1)]_+
8  Affine layer: p = W^(2) h^(1) + b^(2)
9  Loss layer

Expressiveness of "One Hidden Layer Network"

Claim: Every Boolean function f : {±1}^n → {±1} can be expressed by a one hidden layer network.

Proof:
  Show that for integer x we have sign(x) = 2([x + 1]_+ − [x]_+) − 1
  Show that any f can be written as f(x) = ∨_i (x == u_i) for some vectors u_1, . . . , u_k
  Show that sign(x^⊤ u_i − (n − 1)) is an indicator of (x == u_i)
  Conclude that we can adjust the weights so that y p(x) ≥ 1 for all examples (x, y)

Theorem: For every n, let s(n) be the minimal integer such that there exists a one hidden layer network with s(n) hidden neurons that implements all functions from {0, 1}^n to {0, 1}. Then, s(n) is exponential in n.

Proof: Think of the VC dimension ...

What type of functions can be implemented by small size networks?
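A quick numeric sanity check of the two identities used in the proof sketch (illustrative only; the vector sizes and the convention sign(0) = +1 are assumptions):

import numpy as np

relu = lambda z: np.maximum(z, 0)

# (1) For integer x: sign(x) = 2([x + 1]_+ - [x]_+) - 1, taking sign(0) = +1
xs = np.arange(-5, 6)
print(np.all(2 * (relu(xs + 1) - relu(xs)) - 1 == np.where(xs >= 0, 1, -1)))   # True

# (2) For x, u in {+-1}^n: x^T u - (n - 1) > 0 exactly when x == u
n = 4
u = np.array([1, -1, 1, 1])
for trial in range(5):
    x = np.random.default_rng(trial).choice([-1, 1], size=n)
    print(bool(x @ u - (n - 1) > 0) == np.array_equal(x, u))                   # always True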



Geometric Intuition

One hidden layer networks can express intersections of halfspaces

Geometric Intuition

Two hidden layer networks can express unions of intersections of halfspaces

What can we express with T-depth networks?

Theorem: Let T : N → N and for every n, let F_n be the set of functions that can be implemented using a Turing machine using runtime of at most T(n). Then, there exist constants b, c ∈ R_+ such that for every n, there is a network of depth at most T and size at most c T(n)² + b such that it implements all functions in F_n.

Sample complexity is order of the number of variables (in our case polynomial in T)

Conclusion: A very weak notion of prior knowledge suffices — if we only care about functions that can be implemented in time T(n), we can use neural networks of depth T and size O(T(n)²), and the sample complexity is also bounded by a polynomial in T(n)!


The ultimate hypothesis class

(Figure: a spectrum running toward less prior knowledge and more data — expert systems, then using prior knowledge to construct φ(x) and learn ⟨w, φ(x)⟩, then deep networks, with No Free Lunch at the far end.)


Runtime of learning neural networks

Theorem: It is NP-hard to implement the ERM rule even for one hidden layer networks with just 4 neurons in the hidden layer.

But maybe ERM is hard while some improper algorithm works?

Theorem: Under some average-case complexity assumption, it is hard to learn one hidden layer networks with ω(log(d)) hidden neurons even improperly


How to train a neural network?

So, neural networks can form an excellent hypothesis class, but it is intractable to train it.

How is this different from the class of all Python programs that can be implemented in code length of b bits?

Main technique: Gradient-based learning (using SGD)

Not convex, no guarantees, can take a long time, but:
  Often (but not always) it still works fine and finds a good solution
  Easier than optimizing over Python programs ...
  Need to apply some tricks (initialization, learning rate, mini-batching, architecture), and need some luck



The MNIST dataset

The task: Handwritten digits recognition

Input space: X = {0, 1, . . . , 255}^{28×28}
Output space: Y = {0, 1, . . . , 9}

Multiclass categorization:
  We take hypotheses of the form h : X → R^{|Y|}
  We interpret h(x) as a vector that gives scores for all the labels
  The actual prediction is the label with the highest score: argmax_i h_i(x)

Network architecture: x → Affine(500) → ReLU → Affine(10).

Logistic loss for multiclass categorization:
  SoftMax: ∀i, p_i = exp(h_i(x)) / ∑_j exp(h_j(x))
  LogLoss: If the correct label is y then the loss is −log(p_y) = log(∑_j exp(h_j(x) − h_y(x)))
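A numpy sketch of this architecture and loss on a random input (illustrative: the weight initialization, the input scaling, and the variable names are assumptions, and real MNIST data loading is omitted):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 28 * 28, 500, 10

# x -> Affine(500) -> ReLU -> Affine(10)
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
b2 = np.zeros(d_out)

def scores(x):
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU(Affine(500))
    return W2 @ h + b2                      # Affine(10): one score per label

def log_loss(h, y):
    # -log(p_y) with p = SoftMax(h), written in the slide's form log(sum_j exp(h_j - h_y));
    # the j = y term contributes exp(0) = 1 to the sum.
    return np.log(np.sum(np.exp(h - h[y])))

x = rng.integers(0, 256, size=d_in).astype(float) / 255.0   # a fake 28x28 image, scaled to [0, 1]
y = 3                                                        # a fake correct label
h = scores(x)
print(np.argmax(h), log_loss(h, y))                          # predicted label and its log loss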

Page 80: Introduction to Machine Learning (67577)

Some Training Tricks

Input normalization: divide each element of x by 255 to make sure it is in [0, 1]

Initialization is important: One trick that works well in practice is to initialize the bias to be zero and initialize the rows of W to be random in [−1/√n, 1/√n]

Mini-batches: At each iteration of SGD we calculate the average loss on k random examples for k > 1. Advantages:

Reduces the variance of the update direction (w.r.t. the full gradient), hence converges faster
We don’t pay a lot in time because of parallel implementation

Learning rate: Choice of learning rate is important. One way is to start with some fixed η and decrease it by 1/2 whenever the training stops making progress.

Variants of SGD: There are plenty of variants that work better than vanilla SGD.

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 25 / 30
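
To show how the tricks above fit together, here is a rough SGD training-loop sketch in NumPy. It is an illustration, not the lecture's code, and for brevity it trains only a linear SoftMax classifier (a single Affine(10) layer) on random stand-in data: inputs are scaled to [0, 1], biases start at zero, the rows of W are drawn uniformly from [−1/√n, 1/√n], updates use mini-batches of size k, and the learning rate η is halved when the loss stops improving.

import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for MNIST (illustration only): 1000 "images" with labels.
X = rng.integers(0, 256, size=(1000, 784)).astype(np.float64)
y = rng.integers(0, 10, size=1000)

X /= 255.0                                   # input normalization to [0, 1]

n = X.shape[1]
W = rng.uniform(-1 / np.sqrt(n), 1 / np.sqrt(n), size=(10, n))  # init trick
b = np.zeros(10)                                                # zero biases

def minibatch_loss_and_grad(W, b, Xb, yb):
    scores = Xb @ W.T + b                          # shape (k, 10)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)              # SoftMax probabilities
    k = len(yb)
    loss = -np.log(p[np.arange(k), yb]).mean()
    p[np.arange(k), yb] -= 1.0                     # gradient of log loss w.r.t. scores
    return loss, (p.T @ Xb) / k, p.mean(axis=0)

eta, k, best = 0.5, 32, np.inf
for epoch in range(20):
    for _ in range(len(X) // k):
        idx = rng.choice(len(X), size=k, replace=False)   # a random mini-batch
        loss, gW, gb = minibatch_loss_and_grad(W, b, X[idx], y[idx])
        W -= eta * gW                            # SGD step
        b -= eta * gb
    if loss >= best:                             # crude progress check on the last
        eta *= 0.5                               # mini-batch loss: halve the rate
    best = min(best, loss)
    print(f"epoch {epoch}: last mini-batch loss {loss:.3f}, eta {eta}")

In practice the progress check would use a held-out validation loss rather than the last mini-batch loss; the halving rule here is only a stand-in for the schedule described on the slide.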

Page 87: Introduction to Machine Learning (67577)

Failures of Deep Learning

Parity of more than 30 bits

Multiplication of large numbers

Matrix inversion

...

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 26 / 30

Page 88: Introduction to Machine Learning (67577)

Outline

1 Gradient-Based Learning

2 Computation Graph and Backpropagation

3 Expressiveness and Sample Complexity

4 Computational Complexity

5 Deep Learning — Examples

6 Convolutional Networks

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 27 / 30

Page 89: Introduction to Machine Learning (67577)

Convolutional Networks

Convolution layer:

Input: C images
Output: C′ images
Calculation:

O[c′, h′, w′] = b^(c′) + ∑_{c=0}^{C−1} ∑_{h=0}^{k−1} ∑_{w=0}^{k−1} W^(c′)[c, h, w] · X[c, h + h′, w + w′]

Observe: equivalent to an Affine layer with weight sharing
Observe: can be implemented as a combination of an Im2Col layer and an Affine layer

Pooling layer:

Input: Image of size H × W
Output: Image of size (H/k) × (W/k)
Calculation: Divide the input image into k × k windows and for each such window output the maximal value (or average value)

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 28 / 30
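
For reference, here is a naive NumPy sketch of the two layers above, written with explicit loops rather than the Im2Col trick mentioned on the slide; the array shapes and the "valid" convolution boundary handling are assumptions made for this illustration.

import numpy as np

def conv_layer(X, W, b):
    # X: (C, H, W_in) input images; W: (C_out, C, k, k) filters; b: (C_out,)
    C, H, W_in = X.shape
    C_out, _, k, _ = W.shape
    H_out, W_out = H - k + 1, W_in - k + 1
    O = np.zeros((C_out, H_out, W_out))
    for c_out in range(C_out):
        for h2 in range(H_out):
            for w2 in range(W_out):
                # O[c', h', w'] = b(c') + sum_{c,h,w} W(c')[c,h,w] X[c, h+h', w+w']
                O[c_out, h2, w2] = b[c_out] + np.sum(
                    W[c_out] * X[:, h2:h2 + k, w2:w2 + k])
    return O

def max_pool(X, k):
    # X: (C, H, W) with H and W divisible by k  ->  (C, H/k, W/k)
    C, H, W_in = X.shape
    return X.reshape(C, H // k, k, W_in // k, k).max(axis=(2, 4))

# Toy usage: 3 input channels, 8 output channels, 5x5 filters, 2x2 pooling.
rng = np.random.default_rng(0)
X = rng.random((3, 28, 28))
W = rng.normal(size=(8, 3, 5, 5))
b = np.zeros(8)
out = max_pool(np.maximum(conv_layer(X, W, b), 0.0), 2)   # conv -> ReLU -> pool
print(out.shape)   # (8, 12, 12)

A real implementation would use Im2Col followed by a single matrix multiplication, as noted on the slide, which is far faster than these Python loops.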

Page 99: Introduction to Machine Learning (67577)

Historical Remarks

1940s-70s:
Inspired by learning/modeling the brain (Pitts, Hebb, and others)
Perceptron Rule (Rosenblatt), Multilayer perceptron (Minsky and Papert)
Backpropagation (Werbos 1975)

1980s – early 1990s:
Practical Back-prop (Rumelhart, Hinton et al 1986) and SGD (Bottou)
Initial empirical success

1990s-2000s:
Lost favor to implicit linear methods: SVM, Boosting

2006 –:
Regained popularity because of unsupervised pre-training (Hinton, Bengio, LeCun, Ng, and others)
Computational advances and several new tricks allow training HUGE networks. Empirical success leads to renewed interest
2012: Krizhevsky, Sutskever, Hinton: significant improvement of the state-of-the-art on the ImageNet dataset (object recognition of 1000 classes), without unsupervised pre-training

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 29 / 30

Page 100: Introduction to Machine Learning (67577)

Summary

Deep Learning can be used to construct the ultimate hypothesis class

Worst-case complexity is exponential

. . . but, empirically, it works reasonably well and leads to state-of-the-art results on many real-world problems

Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks 30 / 30

