
EE-559 – Deep learning

3a. Linear classifiers, perceptron

François Fleuret

https://fleuret.org/dlc/

[version of: May 17, 2018]

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE


A bit of history, the perceptron


The first mathematical model for a neuron was the Threshold Logic Unit, with Boolean inputs and outputs:

f(x) = 1{w ∑_i x_i + b ≥ 0}.

It can in particular implement

or(u, v) = 1{u + v − 0.5 ≥ 0} (w = 1, b = −0.5)

and(u, v) = 1{u + v − 1.5 ≥ 0} (w = 1, b = −1.5)

not(u) = 1{−u + 0.5 ≥ 0} (w = −1, b = 0.5)

Hence, any Boolean function can be built with such units.

(McCulloch and Pitts, 1943)
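As a small illustration (our sketch, not from the slides), these gates can be written directly as threshold units in Python:

# A Threshold Logic Unit with Boolean inputs: 1 if w * sum(x) + b >= 0, else 0
def tlu(x, w, b):
    return 1 if w * sum(x) + b >= 0 else 0

def or_(u, v):  return tlu((u, v), w = 1, b = -0.5)
def and_(u, v): return tlu((u, v), w = 1, b = -1.5)
def not_(u):    return tlu((u,), w = -1, b = 0.5)

assert [or_(u, v) for u in (0, 1) for v in (0, 1)] == [0, 1, 1, 1]
assert [and_(u, v) for u in (0, 1) for v in (0, 1)] == [0, 0, 0, 1]
assert [not_(u) for u in (0, 1)] == [1, 0]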


The perceptron is very similar:

f(x) = 1 if ∑_i w_i x_i + b ≥ 0, and 0 otherwise,

but the inputs are real values and the weights can be different.

This model was originally motivated by biology, with w_i being the synaptic weights, and x_i and f firing rates.

It is a (very) crude biological model.

(Rosenblatt, 1957)


To make things simpler we take responses ±1. Let

σ(x) = 1 if x ≥ 0, and −1 otherwise.

The perceptron classification rule boils down to

f(x) = σ(w · x + b).

For neural networks, the function σ that follows a linear operator is called the activation function.


We can represent this “neuron” as follows:

[Diagram: the values x1, x2, x3 are each multiplied (×) by the parameters w1, w2, w3, summed (Σ) together with the bias b, and passed through the operation σ to produce y.]


We can also use tensor operations, as in

f (x) = σ(w · x + b).

[Block diagram: x → (· w) → (+ b) → σ → y.]


Given a training set

(x_n, y_n) ∈ R^D × {−1, 1}, n = 1, …, N,

a very simple scheme to train such a linear operator for classification is the perceptron algorithm:

1. Start with w^0 = 0,

2. while ∃ n_k s.t. y_{n_k} (w^k · x_{n_k}) ≤ 0, update w^{k+1} = w^k + y_{n_k} x_{n_k}.

The bias b can be introduced as one of the ws by adding a constant component to x equal to 1.

(Rosenblatt, 1957)


from torch import Tensor

def train_perceptron(x, y, nb_epochs_max):
    # One weight per input component, initialized to zero
    w = Tensor(x.size(1)).zero_()
    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):
            # A sample is misclassified when y_i (w . x_i) <= 0
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes = nb_changes + 1
        # Stop as soon as a full pass makes no update
        if nb_changes == 0: break
    return w
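As an illustration, the function above could be exercised as follows on synthetic, linearly separable data; the data generation is ours and not from the slides, and a constant component equal to 1 is appended to absorb the bias, as in the previous slide:

import torch

torch.manual_seed(0)
n = 100
# Two well-separated Gaussian clouds, with labels in {-1, 1}
x = torch.cat((torch.randn(n, 2) + 2.0, torch.randn(n, 2) - 2.0), 0)
y = torch.cat((torch.ones(n), - torch.ones(n)), 0)
# Constant component equal to 1, so that the bias is one of the weights
x = torch.cat((x, torch.ones(x.size(0), 1)), 1)

w = train_perceptron(x, y, nb_epochs_max = 100)
pred = torch.sign(x @ w)
print('train error', (pred != y).float().mean().item())   # typically 0.0 on such easy data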


This crude algorithm often works surprisingly well. With MNIST's "0"s as the negative class, and "1"s as the positive one:

epoch 0 nb_changes 64 train_error 0.23% test_error 0.19%

epoch 1 nb_changes 24 train_error 0.07% test_error 0.00%

epoch 2 nb_changes 10 train_error 0.06% test_error 0.05%

epoch 3 nb_changes 6 train_error 0.03% test_error 0.14%

epoch 4 nb_changes 5 train_error 0.03% test_error 0.09%

epoch 5 nb_changes 4 train_error 0.02% test_error 0.14%

epoch 6 nb_changes 3 train_error 0.01% test_error 0.14%

epoch 7 nb_changes 2 train_error 0.00% test_error 0.14%

epoch 8 nb_changes 0 train_error 0.00% test_error 0.14%


We can get a convergence result under two assumptions:

[Figure: the samples x_n inside a sphere of radius R, separated by a hyperplane of normal w∗ with margin γ.]

1. The x_n are in a sphere of radius R: ∃R > 0, ∀n, ‖x_n‖ ≤ R.

2. The two populations can be separated with a margin γ > 0: ∃w∗, ‖w∗‖ = 1, ∃γ > 0, ∀n, y_n (x_n · w∗) ≥ γ/2.


To prove the convergence, let us make the assumption that there still is a misclassified sample at iteration k, and w^{k+1} is the weight vector updated with it. We have

w^{k+1} · w∗ = (w^k + y_{n_k} x_{n_k}) · w∗
             = w^k · w∗ + y_{n_k} (x_{n_k} · w∗)
             ≥ w^k · w∗ + γ/2
             ≥ (k + 1) γ/2.

Since ‖w^k‖ ‖w∗‖ ≥ w^k · w∗, we get

‖w^k‖² ≥ (w^k · w∗)² / ‖w∗‖² ≥ k² γ² / 4.


And

‖w^{k+1}‖² = w^{k+1} · w^{k+1}
           = (w^k + y_{n_k} x_{n_k}) · (w^k + y_{n_k} x_{n_k})
           = w^k · w^k + 2 y_{n_k} w^k · x_{n_k} + ‖x_{n_k}‖²
           ≤ ‖w^k‖² + R²    (since y_{n_k} w^k · x_{n_k} ≤ 0 and ‖x_{n_k}‖² ≤ R²)
           ≤ (k + 1) R².


Putting these two results together, we get

k² γ² / 4 ≤ ‖w^k‖² ≤ k R²

hence k ≤ 4 R² / γ²,

hence no misclassified sample can remain after ⌊4 R² / γ²⌋ iterations.

This result makes sense:

• The bound does not change if the population is scaled, and

• the larger the margin, the more quickly the algorithm classifies all the samples correctly.


The perceptron stops as soon as it finds a separating boundary.

Other algorithms maximize the distance of samples to the decision boundary, which improves robustness to noise.

Support Vector Machines (SVM) achieve this by minimizing

L(w, b) = λ ‖w‖² + (1/N) ∑_n max(0, 1 − y_n (w · x_n + b)),

which is convex and has a global optimum.
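As a sketch only, this objective can be minimized by plain gradient descent; the step size, the value of λ, and the number of steps below are our arbitrary choices, not from the slides:

import torch

def train_svm(x, y, lam = 1e-2, lr = 1e-1, nb_steps = 1000):
    w = torch.zeros(x.size(1), requires_grad = True)
    b = torch.zeros(1, requires_grad = True)
    for _ in range(nb_steps):
        # L2 penalty on w plus the average hinge loss over the samples
        loss = lam * (w ** 2).sum() + torch.clamp(1 - y * (x @ w + b), min = 0).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
            w.grad.zero_()
            b.grad.zero_()
    return w.detach(), b.detach()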


L(w, b) = λ ‖w‖² + (1/N) ∑_n max(0, 1 − y_n (w · x_n + b))

[Figure: the separating hyperplane, the two planes w · x + b = ±1, and the support vectors.]

Minimizing max(0, 1 − y_n (w · x_n + b)) pushes the n-th sample beyond the plane w · x + b = y_n, and minimizing ‖w‖² increases the distance between the planes w · x + b = ±1.

At convergence, only a small number of samples matter, the “support vectors”.


The term

max(0, 1 − α)

is the so-called "hinge loss".


Probabilistic interpretation of linear classifiers


The Linear Discriminant Analysis (LDA) algorithm provides a nice bridge between these linear classifiers and probabilistic modeling.

Consider the following class populations:

∀y ∈ {0, 1}, x ∈ R^D,

μ_{X|Y=y}(x) = 1 / √((2π)^D |Σ|) exp( −(1/2) (x − m_y) Σ⁻¹ (x − m_y)^T ).

That is, they are Gaussian with the same covariance matrix Σ. This is the homoscedasticity assumption.


We have

P(Y = 1 | X = x)

  = μ_{X|Y=1}(x) P(Y = 1) / μ_X(x)

  = μ_{X|Y=1}(x) P(Y = 1) / ( μ_{X|Y=0}(x) P(Y = 0) + μ_{X|Y=1}(x) P(Y = 1) )

  = 1 / ( 1 + ( μ_{X|Y=0}(x) / μ_{X|Y=1}(x) ) ( P(Y = 0) / P(Y = 1) ) ).

It follows that, with

σ(x) = 1 / (1 + e^{−x}),

we get

P(Y = 1 | X = x) = σ( log( μ_{X|Y=1}(x) / μ_{X|Y=0}(x) ) + log( P(Y = 1) / P(Y = 0) ) ).


So with our Gaussians μ_{X|Y=y} of same Σ, we get

P(Y = 1 | X = x)

  = σ( log( μ_{X|Y=1}(x) / μ_{X|Y=0}(x) ) + log( P(Y = 1) / P(Y = 0) ) )      [write a for the prior term log( P(Y = 1) / P(Y = 0) )]

  = σ( log μ_{X|Y=1}(x) − log μ_{X|Y=0}(x) + a )

  = σ( −(1/2) (x − m_1) Σ⁻¹ (x − m_1)^T + (1/2) (x − m_0) Σ⁻¹ (x − m_0)^T + a )

  = σ( −(1/2) x Σ⁻¹ x^T + m_1 Σ⁻¹ x^T − (1/2) m_1 Σ⁻¹ m_1^T + (1/2) x Σ⁻¹ x^T − m_0 Σ⁻¹ x^T + (1/2) m_0 Σ⁻¹ m_0^T + a )

  = σ( (m_1 − m_0) Σ⁻¹ x^T + (1/2) ( m_0 Σ⁻¹ m_0^T − m_1 Σ⁻¹ m_1^T ) + a )

  = σ( w · x + b ),

with w = (m_1 − m_0) Σ⁻¹ and b = (1/2) ( m_0 Σ⁻¹ m_0^T − m_1 Σ⁻¹ m_1^T ) + a.

The homoscedasticity makes the second-order terms vanish.
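To make this result concrete, here is a minimal sketch (ours, not from the slides) that estimates m_0, m_1 and the shared Σ from data and forms w and b exactly as above; it assumes x is an N × D tensor and y an N-vector with values in {0, 1}:

import torch

def lda_linear_classifier(x, y):
    x0, x1 = x[y == 0], x[y == 1]
    m0, m1 = x0.mean(0), x1.mean(0)
    # Pooled covariance, as required by the homoscedasticity assumption
    sigma = ((x0 - m0).t() @ (x0 - m0) + (x1 - m1).t() @ (x1 - m1)) / (x.size(0) - 2)
    sigma_inv = torch.linalg.inv(sigma)
    # a = log P(Y = 1) / P(Y = 0), estimated from the class counts
    a = torch.tensor(x1.size(0) / x0.size(0)).log()
    w = sigma_inv @ (m1 - m0)
    b = 0.5 * (m0 @ sigma_inv @ m0 - m1 @ sigma_inv @ m1) + a
    return w, b

# P(Y = 1 | X = x) is then torch.sigmoid(x @ w + b)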


[Figure: the two class densities μ_{X|Y=0} and μ_{X|Y=1}, and the resulting posterior P(Y = 1 | X = x).]


Note that the (logistic) sigmoid function

σ(x) = 1 / (1 + e^{−x}),

looks like a "soft Heaviside":

[Plot: the sigmoid, increasing from 0 to 1.]

So the overall model

f(x; w, b) = σ(w · x + b)

looks very similar to the perceptron.


We can use the model from LDA

f(x; w, b) = σ(w · x + b)

but instead of modeling the densities and deriving the values of w and b from them, we directly compute them by maximizing their probability given the training data.

First, to simplify the next slide, note that we have

1 − σ(x) = 1 − 1 / (1 + e^{−x}) = σ(−x),

hence if Y takes values in {−1, 1} then

∀y ∈ {−1, 1}, P(Y = y | X = x) = σ(y(w · x + b)).


We have

log μ_{W,B}(w, b | D = d)

  = log [ μ_D(d | W = w, B = b) μ_{W,B}(w, b) / μ_D(d) ]

  = log μ_D(d | W = w, B = b) + log μ_{W,B}(w, b) − log Z

  = ∑_n log σ( y_n (w · x_n + b) ) + log μ_{W,B}(w, b) − log Z′.

This is logistic regression, whose loss aims at minimizing

− log σ( y_n f(x_n) ).
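A minimal training sketch for this criterion (ours, not from the slides; it assumes a flat prior on (w, b), i.e. plain maximum likelihood, and uses gradient descent):

import torch

def train_logistic_regression(x, y, lr = 1e-1, nb_steps = 1000):
    w = torch.zeros(x.size(1), requires_grad = True)
    b = torch.zeros(1, requires_grad = True)
    optimizer = torch.optim.SGD([w, b], lr = lr)
    for _ in range(nb_steps):
        optimizer.zero_grad()
        # -log sigma(z) equals softplus(-z), which is the numerically stable form
        loss = torch.nn.functional.softplus(- y * (x @ w + b)).mean()
        loss.backward()
        optimizer.step()
    return w.detach(), b.detach()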


Although the probabilistic and Bayesian formulations may be helpful in certain contexts, the bulk of deep learning is disconnected from such modeling.

We will come back at times to a probabilistic interpretation, but most of the methods will be envisioned from the signal-processing and optimization angles.


Multi-dimensional output


We can combine multiple linear predictors into a "layer" that takes several inputs and produces several outputs:

∀i = 1, …, M,  y_i = σ( ∑_{j=1}^{N} w_{i,j} x_j + b_i ),

where b_i is the "bias" of the i-th unit, and w_{i,1}, …, w_{i,N} are its weights.


With M = 2 and N = 3, we can picture such a layer as

[Diagram: the inputs x1, x2, x3 are each multiplied (×) by the weights w1,1, …, w2,3, summed (Σ) together with the biases b1 and b2, and passed through σ to produce y1 and y2.]


If we forget the historical interpretation as "neurons", we can use a clearer algebraic / tensorial formulation:

y = σ(w x + b)

where x ∈ R^N, w ∈ R^{M×N}, b ∈ R^M, y ∈ R^M, and σ denotes a component-wise extension of the R → R mapping:

σ : (y_1, …, y_M) ↦ (σ(y_1), …, σ(y_M)).
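A minimal sketch of such a layer in tensor form (the sizes and the choice of the logistic sigmoid as σ are ours, for illustration):

import torch

M, N = 2, 3                      # sizes as in the diagram above
w = torch.randn(M, N)            # weight matrix, w in R^{M x N}
b = torch.randn(M)               # bias vector, b in R^M
x = torch.randn(N)               # input, x in R^N

y = torch.sigmoid(w @ x + b)     # component-wise activation; y in R^M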


With "×" for the matrix-vector product, the "tensorial block figure" remains almost identical to that of the single neuron:

[Block diagram: x ∈ R^N → (× w, with w ∈ R^{M×N}) → (+ b, with b ∈ R^M) → σ → y ∈ R^M.]


Limitations of linear predictors, feature design


The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

[Figure: the "xor" configuration, two populations that are not linearly separable.]


The xor example can be solved by pre-processing the data to make the two populations linearly separable:

Φ : (x_u, x_v) ↦ (x_u, x_v, x_u x_v).

So we can model the xor with

f(x) = σ(w · Φ(x) + b).
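For instance, a small sketch (ours) applying the earlier train_perceptron to the mapped xor points, with a constant component appended to absorb the bias:

import torch

# The four xor points; the label is -1 when x_u == x_v and 1 otherwise
x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([-1., 1., 1., -1.])

def phi(x):
    # (x_u, x_v) -> (x_u, x_v, x_u * x_v), plus a constant 1 for the bias
    return torch.cat((x, (x[:, 0] * x[:, 1]).unsqueeze(1), torch.ones(x.size(0), 1)), 1)

w = train_perceptron(phi(x), y, nb_epochs_max = 1000)
print(torch.sign(phi(x) @ w))   # matches y once the algorithm has converged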


[Block diagram: x → Φ → (× w) → (+ b) → σ → y.]


This is similar to polynomial regression. If we have

Φ : x ↦ (1, x, x², …, x^D)

and

α = (α_0, …, α_D),

then

∑_{d=0}^{D} α_d x^d = α · Φ(x).

By increasing D, we can approximate any continuous real function on a compact space (Stone-Weierstrass theorem).

It means that we can make the capacity as high as we want.
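A minimal sketch of this feature map used for regression (the data and the least-squares solve are our illustration, not from the slides):

import torch

def phi_poly(x, D):
    # x: tensor of shape (N,); result: (N, D + 1) with columns 1, x, x^2, ..., x^D
    return torch.stack([x ** d for d in range(D + 1)], 1)

x = torch.linspace(-1, 1, 100)
y = torch.sin(3 * x) + 0.1 * torch.randn(100)   # illustrative noisy target

D = 5
alpha = torch.linalg.lstsq(phi_poly(x, D), y.unsqueeze(1)).solution.squeeze(1)
y_hat = phi_poly(x, D) @ alpha                  # alpha . Phi(x) for every sample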


We can apply the same to a more realistic binary classification problem: MNIST's "8" vs. the other classes with a perceptron.

The original 28 × 28 features are supplemented with the products of pairs of features taken at random.

[Plot: train error and test error (%) as a function of the number of features, from 10³ to 10⁴.]
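A minimal sketch of this augmentation (the number of random pairs is our choice):

import torch

def add_random_products(x, nb_products):
    # x: (N, 784) flattened images; result: (N, 784 + nb_products)
    i = torch.randint(x.size(1), (nb_products,))
    j = torch.randint(x.size(1), (nb_products,))
    return torch.cat((x, x[:, i] * x[:, j]), 1)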


Remember the bias-variance tradeoff.

E((Y − y)²) = (E(Y) − y)² + V(Y),

where the first term is the bias (squared) and the second the variance.

The right class of models reduces the bias more and increases the variance less.

Beside increasing capacity to reduce the bias, "feature design" may also be a way of reducing capacity without hurting the bias, or even improving it.

In particular, good features should be invariant to perturbations of the signal known to keep the value to predict unchanged.
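For reference, this decomposition follows from expanding around E(Y):

E((Y − y)²) = E( (Y − E(Y) + E(Y) − y)² )
            = V(Y) + 2 (E(Y) − y) E(Y − E(Y)) + (E(Y) − y)²
            = V(Y) + (E(Y) − y)²,

since E(Y − E(Y)) = 0.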


[Figures (K = 11): training points, votes, and prediction, first with the raw coordinates, then with an added radial feature.]

A classical example is the "Histogram of Oriented Gradients" descriptors (HOG), initially designed for person detection.

Roughly: divide the image into 8 × 8 blocks, and compute in each the distribution of edge orientations over 9 bins.

Dalal and Triggs (2005) combined them with an SVM, and Dollar et al. (2009) extended them with other modalities into the "channel features".


Many methods (perceptron, SVM, k-means, PCA, etc.) only require computing κ(x, x′) = Φ(x) · Φ(x′) for any pair (x, x′).

So one needs to specify κ alone, and may keep Φ undefined.

This is the kernel trick, which we will not talk about in this course.
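As a small illustration (ours, not from the slides), the quadratic kernel κ(x, x′) = (x · x′)² is exactly the dot product of an explicit feature map made of all the pairwise products x_i x_j, so it can be evaluated without ever forming Φ:

import torch

def phi_quadratic(x):
    # All products x_i x_j, flattened into a single vector
    return torch.outer(x, x).flatten()

x, xp = torch.randn(5), torch.randn(5)
print((x @ xp) ** 2, phi_quadratic(x) @ phi_quadratic(xp))   # equal up to rounding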


Training a model composed of manually engineered features and a parametric model such as logistic regression is now referred to as "shallow learning".

The signal goes through a single processing step trained from data.


The end


References

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.

P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In British Machine Vision Conference, pages 91.1–91.11, 2009.

W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.

F. Rosenblatt. The perceptron – A perceiving and recognizing automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory, 1957.

