15 Machine Learning Multilayer Perceptron


Neural Networks: Multilayer Perceptron

Andres Mendez-Vazquez

December 12, 2015


Outline

1 Introduction
   The XOR Problem
2 Multi-Layer Perceptron
   Architecture
   Back-propagation
   Gradient Descent
   Hidden-to-Output Weights
   Input-to-Hidden Weights
   Total Training Error
   About Stopping Criteria
   Final Basic Batch Algorithm
3 Using Matrix Operations to Simplify
   Using Matrix Operations to Simplify the Pseudo-Code
   Generating the Output zk
   Generating zk
   Generating the Weights from Hidden to Output Layer
   Generating the Weights from Input to Hidden Layer
   Activation Functions
4 Heuristic for Multilayer Perceptron
   Maximizing information content
   Activation Function
   Target Values
   Normalizing the inputs
   Virtues and limitations of Back-Propagation


Do you remember?

The Perceptron has the following problem
Given that the perceptron is a linear classifier, it is clear that it will never be able to classify data that are not linearly separable.


Example: XOR Problem

The Problem
[Figure: the four XOR inputs plotted in the plane, with (0,0) and (1,1) in one class and (0,1) and (1,0) in the other; no single line can separate Class 1 from Class 2.]

The Perceptron cannot solve it

Because
The perceptron is a linear classifier!!!

Thus
Something needs to be done!!!

Maybe
Add an extra layer!!!


A little bit of history

It was first cited by Vapnik
Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. "Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions." AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines."

It was first used by
Arthur E. Bryson and Yu-Chi Ho, who described it as a multi-stage dynamic system optimization method in 1969.

However
It was not until 1974 and later, when it was applied in the context of neural networks through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained recognition.


Then

Something Notable
It led to a "renaissance" in the field of artificial neural network research.

Nevertheless
During the 2000s it fell out of favour, but it has returned in the 2010s, now able to train much larger networks using huge modern computing power such as GPUs.


Multi-Layer Perceptron (MLP)

Multi-Layer Architecture
[Figure: an input layer feeding a hidden layer with sigmoid activation functions, followed by an output layer with identity activation functions whose outputs are compared against the targets.]

Information Flow

We have the following information flow
[Figure: function signals flow forward through the network; error signals flow backward.]

Explanation

Problems with Hidden Layers
1 They increase the complexity of training.
2 It is necessary to think about a "Long and Narrow" network vs. a "Short and Fat" network.

Intuition for One Hidden Layer
1 For every input region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer.
2 A hidden unit in the second layer then ANDs them together to bound the region (a hand-crafted example follows this slide).

Advantages
It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input.
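As a concrete illustration of this hyperplane-plus-AND intuition (my own sketch, not part of the original slides), the network below solves the XOR problem from the introduction with hand-picked weights: one hidden unit acts as OR, the other as NAND, and the output unit ANDs them together. The weights are only one of many possible choices.

```python
import numpy as np

def step(v):
    # Heaviside threshold activation
    return (v >= 0).astype(float)

# XOR inputs (one row per pattern) and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked hidden layer: unit 1 ~ OR(x1, x2), unit 2 ~ NAND(x1, x2)
W_ih = np.array([[1.0, 1.0],      # OR hyperplane
                 [-1.0, -1.0]])   # NAND hyperplane
b_h = np.array([-0.5, 1.5])

# Output unit ~ AND of the two hidden units
w_ho = np.array([1.0, 1.0])
b_o = -1.5

h = step(X @ W_ih.T + b_h)   # hidden activations, one row per pattern
z = step(h @ w_ho + b_o)     # network outputs

print(z)                     # [0. 1. 1. 0.]
print(np.array_equal(z, t))  # True
```

Each hidden unit contributes one separating hyperplane; the output unit's AND carves out exactly the region containing (0,1) and (1,0).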


The Process

We have something like this
[Figure: a Layer 1 of hidden units whose hyperplanes delimit a region of the input space, and a Layer 2 unit that combines them to bound the region.]


Remember!!! The Quadratic Learning Error function

Cost Function, our well-known error at pattern m:

$$J(m) = \frac{1}{2} e_k^2(m) \quad (1)$$

Delta Rule or Widrow-Hoff Rule:

$$\Delta w_{kj}(m) = -\eta\, e_k(m)\, x_j(m) \quad (2)$$

Actually, this is known as Gradient Descent:

$$w_{kj}(m+1) = w_{kj}(m) + \Delta w_{kj}(m) \quad (3)$$


Back-propagation

Setup
Let $t_k$ be the k-th target (or desired) output and $z_k$ be the k-th computed output, with $k = 1, \ldots, c$, and let $\mathbf{w}$ represent all the weights of the network.

Training Error for a single Pattern or Sample!!!

$$J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\,\|\mathbf{t} - \mathbf{z}\|^2 \quad (4)$$


Gradient Descent

The back-propagation learning rule is based on gradient descent.

Reducing the Error
The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

$$\Delta \mathbf{w} = -\eta \frac{\partial J}{\partial \mathbf{w}} \quad (5)$$

Where
η is the learning rate, which indicates the relative size of the change in weights:

$$\mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m) \quad (6)$$

where m is the m-th pattern presented.
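In code, equations (5) and (6) are a single update step; the sketch below is mine, and grad_J is an assumed function returning ∂J/∂w for the current pattern.

```python
def gradient_step(w, grad_J, eta=0.1):
    """One gradient-descent update: w(m+1) = w(m) - eta * dJ/dw."""
    return w - eta * grad_J(w)
```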


Multilayer Architecture: hidden-to-output weights
[Figure: the same input-hidden-output architecture, highlighting the weights that connect the hidden layer to the output layer and the targets used to form the error.]

Observation about the activation function

The hidden output is equal to

$$y_j = f\left(\sum_{i=1}^{d} w_{ji}\, x_i\right)$$

The output is equal to

$$z_k = f\left(\sum_{j=1}^{n_H} w_{kj}\, y_j\right)$$
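A minimal sketch of this forward pass for one input vector, assuming a logistic activation and randomly initialized weights; the names W_ih, w_o and the dimensions are mine, not the slides'.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_H = 3, 4                      # input dimension, number of hidden units

W_ih = rng.normal(size=(n_H, d))   # input-to-hidden weights w_ji
w_o = rng.normal(size=n_H)         # hidden-to-output weights w_kj (single output)

def f(v):
    return 1.0 / (1.0 + np.exp(-v))   # logistic activation

x = rng.normal(size=d)             # one input pattern
y = f(W_ih @ x)                    # hidden outputs y_j
z = f(w_o @ y)                     # network output z_k
```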


Hidden–to-Output Weights

Error on the hidden-to-output weights:

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = -\delta_k \cdot \frac{\partial net_k}{\partial w_{kj}} \quad (7)$$

net_k
It describes how the overall error changes with the unit's net activation:

$$net_k = \sum_{j=1}^{n_H} w_{kj}\, y_j = \mathbf{w}_k^T \cdot \mathbf{y} \quad (8)$$

Now

$$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k}\cdot\frac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k) \quad (9)$$


Hidden–to-Output Weights

Why?

$$z_k = f(net_k) \quad (10)$$

Thus

$$\frac{\partial z_k}{\partial net_k} = f'(net_k) \quad (11)$$

Since $net_k = \mathbf{w}_k^T \cdot \mathbf{y}$, therefore:

$$\frac{\partial net_k}{\partial w_{kj}} = y_j \quad (12)$$


Finally

The weight update (or learning rule) for the hidden-to-output weights is:

$$\Delta w_{kj} = \eta\, \delta_k\, y_j = \eta\, (t_k - z_k)\, f'(net_k)\, y_j \quad (13)$$


Multi-Layer Architecture: Input-to-Hidden weights
[Figure: the same architecture, now highlighting the weights that connect the input layer to the hidden layer.]

Input-to-Hidden Weights

Error on the Input-to-Hidden weights:

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\cdot\frac{\partial y_j}{\partial net_j}\cdot\frac{\partial net_j}{\partial w_{ji}} \quad (14)$$

Thus

$$\begin{aligned}
\frac{\partial J}{\partial y_j} &= \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2\right] \\
&= -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial y_j} \\
&= -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial y_j} \\
&= -\sum_{k=1}^{c}(t_k - z_k)\,\frac{\partial f(net_k)}{\partial net_k}\cdot w_{kj}
\end{aligned}$$


Input–to-Hidden Weights

Finally

$$\frac{\partial J}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\, f'(net_k)\cdot w_{kj} \quad (15)$$

Remember

$$\delta_k = -\frac{\partial J}{\partial net_k} = (t_k - z_k)\, f'(net_k) \quad (16)$$

What is $\frac{\partial y_j}{\partial net_j}$?

First

$$net_j = \sum_{i=1}^{d} w_{ji}\, x_i = \mathbf{w}_j^T \cdot \mathbf{x} \quad (17)$$

Then

$$y_j = f(net_j)$$

Then

$$\frac{\partial y_j}{\partial net_j} = \frac{\partial f(net_j)}{\partial net_j} = f'(net_j)$$


Then, we can define δj

By defining the sensitivity for a hidden unit:

$$\delta_j = f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k \quad (18)$$

Which means that:
"The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights $w_{kj}$, all multiplied by $f'(net_j)$."


What about $\frac{\partial net_j}{\partial w_{ji}}$?

We have that

$$\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial\, \mathbf{w}_j^T\cdot\mathbf{x}}{\partial w_{ji}} = \frac{\partial \sum_{i=1}^{d} w_{ji}\, x_i}{\partial w_{ji}} = x_i$$

Finally

The learning rule for the input-to-hidden weights is:

$$\Delta w_{ji} = \eta\, x_i\, \delta_j = \eta\left[\sum_{k=1}^{c} w_{kj}\,\delta_k\right] f'(net_j)\, x_i \quad (19)$$
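Putting equations (9), (13), (18) and (19) together, the sketch below performs one online backpropagation step for a single pattern with a logistic activation (so f'(net) = f(net)(1 - f(net))). The variable names follow the slides; the dimensions and data are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_H, c = 3, 4, 2                 # input dim, hidden units, outputs
eta = 0.1                           # learning rate

W_ih = rng.normal(size=(n_H, d))    # weights w_ji
W_ho = rng.normal(size=(c, n_H))    # weights w_kj

f = lambda v: 1.0 / (1.0 + np.exp(-v))

x = rng.normal(size=d)              # one pattern
t = np.array([0.0, 1.0])            # its target

# Forward pass
y = f(W_ih @ x)                     # hidden outputs
z = f(W_ho @ y)                     # network outputs

# Sensitivities, eqs. (9) and (18), using f' = f(1 - f)
delta_k = (t - z) * z * (1 - z)
delta_j = y * (1 - y) * (W_ho.T @ delta_k)

# Weight updates, eqs. (13) and (19)
W_ho += eta * np.outer(delta_k, y)
W_ih += eta * np.outer(delta_j, x)
```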

Basically, the entire training process has the following steps

Initialization
Assuming that no prior information is available, pick the synaptic weights and thresholds.

Forward Computation
Compute the induced function signals of the network by proceeding forward through the network, layer by layer.

Backward Computation
Compute the local gradients of the network.

Finally
Adjust the weights!!!


Now, Calculating Total Change

We have for that
The Total Training Error is the sum over the errors of the N individual patterns.

The Total Training Error

$$J = \sum_{p=1}^{N} J_p = \frac{1}{2}\sum_{p=1}^{N}\sum_{k=1}^{c}\left(t_k^p - z_k^p\right)^2 = \frac{1}{2}\sum_{p=1}^{N}\left\|\mathbf{t}^p - \mathbf{z}^p\right\|^2 \quad (20)$$
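In the matrix notation used later in these slides, equation (20) is a one-liner; T and Z are assumed to hold the targets and the network outputs for all N patterns column-wise.

```python
import numpy as np

def total_training_error(T, Z):
    """Eq. (20): half the squared error summed over all outputs and patterns."""
    return 0.5 * np.sum((T - Z) ** 2)
```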


About the Total Training Error

Remarks
A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set.
However, given a large number of such individual updates, the total error of equation (20) decreases.


Now, we want the training to stop

Therefore
It is necessary to have a way to stop when the change of the weights is small enough!!!

A simple way to stop the training
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value Θ:

$$\Delta J(\mathbf{w}) = \left|J(\mathbf{w}(t+1)) - J(\mathbf{w}(t))\right| \quad (21)$$

There are other stopping criteria that lead to better performance than this one.


Other Stopping Criteria

Norm of the Gradient
The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold:

$$\left\|\nabla_{\mathbf{w}} J(m)\right\| < \Theta \quad (22)$$

Rate of change in the average error per epoch
The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small:

$$\left|\frac{1}{N}\sum_{p=1}^{N} J_p\right| < \Theta \quad (23)$$
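A small sketch of both tests; grad_w is assumed to hold the concatenated gradient of J with respect to all the weights, and J_per_pattern the per-pattern errors of the current epoch.

```python
import numpy as np

def converged(grad_w, J_per_pattern, theta=1e-4):
    small_gradient = np.linalg.norm(grad_w) < theta         # eq. (22)
    small_avg_error = abs(np.mean(J_per_pattern)) < theta   # eq. (23)
    return small_gradient or small_avg_error
```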


About the Stopping Criteria

Observations
1 Before training starts, the error on the training set is high. Through the learning process, the error becomes smaller.
2 The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network.
3 The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase.
4 A validation set is used in order to decide when to stop training: we do not want to over-fit the network and decrease the generalization power of the classifier, so "we stop training at a minimum of the error on the validation set."


Some More Terminology

Epoch
As with other types of backpropagation, 'learning' is a supervised process that occurs with each cycle or 'epoch': a forward activation flow of outputs, followed by the backward propagation of error and the corresponding weight adjustments.

In our case
I am using the batch sum of all the weight corrections to define an epoch.


Final Basic Batch Algorithm

Perceptron(X)
 1.  Initialize random weights w, number of hidden units n_H, number of outputs c, stopping criterion Θ, learning rate η, epoch counter m = 0
 2.  do
 3.      m = m + 1
 4.      for s = 1 to N
 5.          x = X(:, s)
 6.          for j = 1 to n_H                        // forward pass: hidden layer
 7.              net_j = w_j^T · x;  y_j = f(net_j)
 8.          for k = 1 to c                          // forward pass: output layer
 9.              net_k = w_k^T · y;  z_k = f(net_k)
10.          for k = 1 to c                          // output sensitivities and hidden-to-output updates
11.              δ_k = (t_k − z_k) f'(net_k)
12.              for j = 1 to n_H
13.                  w_kj = w_kj + η δ_k y_j
14.          for j = 1 to n_H                        // hidden sensitivities and input-to-hidden updates
15.              δ_j = f'(net_j) Σ_{k=1}^{c} w_kj δ_k
16.              for i = 1 to d
17.                  w_ji = w_ji + η δ_j x_i
18.  until ‖∇_w J(m)‖ < Θ
19.  return w
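The following NumPy sketch implements this training loop for a logistic activation. It is my illustration rather than code from the slides; the array shapes, the stopping test, the learning rate, and the bias-free formulation (matching the slides' equations) are all assumptions.

```python
import numpy as np

def train_mlp(X, T, n_H=4, eta=0.5, theta=1e-3, max_epochs=10000, seed=0):
    """Online backpropagation for a one-hidden-layer MLP with logistic units.

    X: (d, N) inputs as columns; T: (c, N) targets as columns.
    Returns the input-to-hidden and hidden-to-output weight matrices.
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    c = T.shape[0]
    W_ih = rng.normal(scale=0.5, size=(n_H, d))
    W_ho = rng.normal(scale=0.5, size=(c, n_H))
    f = lambda v: 1.0 / (1.0 + np.exp(-v))

    for epoch in range(max_epochs):
        update_norm = 0.0
        for s in range(N):
            x, t = X[:, s], T[:, s]
            y = f(W_ih @ x)                             # hidden outputs
            z = f(W_ho @ y)                             # network outputs
            delta_k = (t - z) * z * (1 - z)             # eq. (9)
            delta_j = y * (1 - y) * (W_ho.T @ delta_k)  # eq. (18)
            dW_ho = eta * np.outer(delta_k, y)          # eq. (13)
            dW_ih = eta * np.outer(delta_j, x)          # eq. (19)
            W_ho += dW_ho
            W_ih += dW_ih
            update_norm += np.sum(dW_ho**2) + np.sum(dW_ih**2)
        if np.sqrt(update_norm) < theta:                # eq. (22)-style stopping test
            break
    return W_ih, W_ho

# Example usage on XOR, with a constant-1 input appended as a bias feature
# (convergence depends on the seed and learning rate).
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 1]], dtype=float)
T = np.array([[0, 1, 1, 0]], dtype=float)
W_ih, W_ho = train_mlp(X, T)
```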


Example of Architecture to be used
Given the following architecture and assuming N samples:
[Figure: an MLP with d inputs, n_H hidden units and a single output unit, used for the matrix formulation that follows.]


Generating the output zk

Given the input

$$X = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{pmatrix} \quad (24)$$

Where
$\mathbf{x}_i$ is a vector of features:

$$\mathbf{x}_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix} \quad (25)$$


Therefore
We must have the following matrix of input-to-hidden weights:

$$W_{IH} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_H 1} & w_{n_H 2} & \cdots & w_{n_H d} \end{pmatrix} = \begin{pmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \vdots \\ \mathbf{w}_{n_H}^T \end{pmatrix} \quad (26)$$

given that $\mathbf{w}_j = \begin{pmatrix} w_{j1} & w_{j2} & \cdots & w_{jd} \end{pmatrix}^T$.

Thus
We can create the $net_j$ for all the inputs by simply

$$net_j = W_{IH}\, X = \begin{pmatrix} \mathbf{w}_1^T\mathbf{x}_1 & \mathbf{w}_1^T\mathbf{x}_2 & \cdots & \mathbf{w}_1^T\mathbf{x}_N \\ \mathbf{w}_2^T\mathbf{x}_1 & \mathbf{w}_2^T\mathbf{x}_2 & \cdots & \mathbf{w}_2^T\mathbf{x}_N \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{w}_{n_H}^T\mathbf{x}_1 & \mathbf{w}_{n_H}^T\mathbf{x}_2 & \cdots & \mathbf{w}_{n_H}^T\mathbf{x}_N \end{pmatrix} \quad (27)$$


Now, we need to generate the yk

We apply the activation function element by element to $net_j$:

$$\mathbf{y}_1 = \begin{pmatrix} f(\mathbf{w}_1^T\mathbf{x}_1) & f(\mathbf{w}_1^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_1^T\mathbf{x}_N) \\ f(\mathbf{w}_2^T\mathbf{x}_1) & f(\mathbf{w}_2^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_2^T\mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(\mathbf{w}_{n_H}^T\mathbf{x}_1) & f(\mathbf{w}_{n_H}^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_{n_H}^T\mathbf{x}_N) \end{pmatrix} \quad (28)$$

IMPORTANT about overflows!!!
Be careful about the numeric stability of the activation function. In the case of Python, we can use the implementations provided by scipy.special.
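As a sketch of equations (26)-(28) in NumPy (mine, with assumed shapes), using the numerically stable logistic function scipy.special.expit mentioned above:

```python
import numpy as np
from scipy.special import expit   # numerically stable logistic function

rng = np.random.default_rng(0)
d, n_H, N = 3, 5, 10

X = rng.normal(size=(d, N))        # samples as columns, eq. (24)
W_IH = rng.normal(size=(n_H, d))   # input-to-hidden weights, eq. (26)

net_j = W_IH @ X                   # eq. (27), shape (n_H, N)
Y = expit(net_j)                   # eq. (28), hidden outputs for all samples
```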


However, We can create a Sigmoid function

It is possible to use the following pseudo-code:

Sigmoid(x)
1  try
2      return 1.0 / (1.0 + exp{−αx})   // 1.0 refers to the floating point value
3  catch OVERFLOW                       // exp{−αx} overflowed
4      if x < 0
5          return 0
6      else
7          return 1
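In Python the same idea can be written without exceptions; the sketch below (mine, not from the slides) clips the argument to avoid overflow, and scipy.special.expit already does the equivalent internally.

```python
import numpy as np
from scipy.special import expit

def sigmoid(x, alpha=1.0):
    """Numerically safe logistic function 1 / (1 + exp(-alpha * x))."""
    # Clipping keeps exp() within double-precision range, so it never overflows
    v = np.clip(alpha * np.asarray(x, dtype=float), -700.0, 700.0)
    return 1.0 / (1.0 + np.exp(-v))

print(sigmoid(1000.0), sigmoid(-1000.0))   # ~1.0 and ~0.0
print(expit(1000.0), expit(-1000.0))       # same values, via SciPy
```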


For this, we get netk

For this, we obtain $W_{HO}$:

$$W_{HO} = \begin{pmatrix} w_{11}^o & w_{12}^o & \cdots & w_{1 n_H}^o \end{pmatrix} = \mathbf{w}_o^T \quad (29)$$

Thus

$$net_k = \begin{pmatrix} w_{11}^o & w_{12}^o & \cdots & w_{1 n_H}^o \end{pmatrix}\begin{pmatrix} f(\mathbf{w}_1^T\mathbf{x}_1) & f(\mathbf{w}_1^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_1^T\mathbf{x}_N) \\ f(\mathbf{w}_2^T\mathbf{x}_1) & f(\mathbf{w}_2^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_2^T\mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(\mathbf{w}_{n_H}^T\mathbf{x}_1) & f(\mathbf{w}_{n_H}^T\mathbf{x}_2) & \cdots & f(\mathbf{w}_{n_H}^T\mathbf{x}_N) \end{pmatrix} \quad (30)$$

where the columns of the matrix on the right are denoted $\mathbf{y}_{k1}, \mathbf{y}_{k2}, \ldots, \mathbf{y}_{kN}$.

In matrix notation:

$$net_k = \begin{pmatrix} \mathbf{w}_o^T\mathbf{y}_{k1} & \mathbf{w}_o^T\mathbf{y}_{k2} & \cdots & \mathbf{w}_o^T\mathbf{y}_{kN} \end{pmatrix} \quad (31)$$


Now, we have

Thus, we have $z_k$ (in our case k = 1, but it could be a range of values):

$$\mathbf{z}_k = \begin{pmatrix} f(\mathbf{w}_o^T\mathbf{y}_{k1}) & f(\mathbf{w}_o^T\mathbf{y}_{k2}) & \cdots & f(\mathbf{w}_o^T\mathbf{y}_{kN}) \end{pmatrix} \quad (32)$$

Thus, we generate a vector of differences:

$$\mathbf{d} = \mathbf{t} - \mathbf{z}_k = \begin{pmatrix} t_1 - f(\mathbf{w}_o^T\mathbf{y}_{k1}) & t_2 - f(\mathbf{w}_o^T\mathbf{y}_{k2}) & \cdots & t_N - f(\mathbf{w}_o^T\mathbf{y}_{kN}) \end{pmatrix} \quad (33)$$

where $\mathbf{t} = \begin{pmatrix} t_1 & t_2 & \cdots & t_N \end{pmatrix}$ is a row vector with the desired output for each sample.


Now, we multiply element wise

We have the following vector of derivatives of the net:

$$D_f = \begin{pmatrix} \eta f'(\mathbf{w}_o^T\mathbf{y}_{k1}) & \eta f'(\mathbf{w}_o^T\mathbf{y}_{k2}) & \cdots & \eta f'(\mathbf{w}_o^T\mathbf{y}_{kN}) \end{pmatrix} \quad (34)$$

where η is the step rate.

Finally, by element-wise multiplication (Hadamard product):

$$\mathbf{d} = \begin{pmatrix} \eta\left[t_1 - f(\mathbf{w}_o^T\mathbf{y}_{k1})\right] f'(\mathbf{w}_o^T\mathbf{y}_{k1}) & \eta\left[t_2 - f(\mathbf{w}_o^T\mathbf{y}_{k2})\right] f'(\mathbf{w}_o^T\mathbf{y}_{k2}) & \cdots & \eta\left[t_N - f(\mathbf{w}_o^T\mathbf{y}_{kN})\right] f'(\mathbf{w}_o^T\mathbf{y}_{kN}) \end{pmatrix}$$


Tile d

Tile downward:

$$\mathbf{d}_{tile} = \left.\begin{pmatrix} \mathbf{d} \\ \mathbf{d} \\ \vdots \\ \mathbf{d} \end{pmatrix}\right\} n_H \text{ rows} \quad (35)$$

Finally, we multiply element-wise against $\mathbf{y}_1$ (Hadamard product):

$$\Delta w_{temp_{1j}} = \mathbf{y}_1 \circ \mathbf{d}_{tile} \quad (36)$$


We obtain the total ∆w1j

We sum along the rows of $\Delta w_{temp_{1j}}$:

$$\Delta w_{1j} = \begin{pmatrix} \eta\left[t_1 - f(\mathbf{w}_o^T\mathbf{y}_{k1})\right] f'(\mathbf{w}_o^T\mathbf{y}_{k1})\, y_{11} + \cdots + \eta\left[t_N - f(\mathbf{w}_o^T\mathbf{y}_{kN})\right] f'(\mathbf{w}_o^T\mathbf{y}_{kN})\, y_{1N} \\ \vdots \\ \eta\left[t_1 - f(\mathbf{w}_o^T\mathbf{y}_{k1})\right] f'(\mathbf{w}_o^T\mathbf{y}_{k1})\, y_{n_H 1} + \cdots + \eta\left[t_N - f(\mathbf{w}_o^T\mathbf{y}_{kN})\right] f'(\mathbf{w}_o^T\mathbf{y}_{kN})\, y_{n_H N} \end{pmatrix} \quad (37)$$

where $y_{hm} = f(\mathbf{w}_h^T\mathbf{x}_m)$ with $h = 1, 2, \ldots, n_H$ and $m = 1, 2, \ldots, N$.

Finally, we update the first weights

We have then

$$W_{HO}(t+1) = W_{HO}(t) + \Delta w_{1j}^T(t) \quad (38)$$
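The sketch below (mine, under the same assumptions as the earlier forward-pass snippet: one output unit, logistic activation) condenses equations (32)-(38) into NumPy; the tiling, Hadamard product and row sum of equation (37) collapse into a broadcasted product and a sum.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
d, n_H, N, eta = 3, 5, 10, 0.1

X = rng.normal(size=(d, N))                    # samples as columns
t = rng.integers(0, 2, size=N).astype(float)   # one desired output per sample

W_IH = rng.normal(size=(n_H, d))               # input-to-hidden weights
W_HO = rng.normal(size=(1, n_H))               # hidden-to-output weights

Y = expit(W_IH @ X)                            # eq. (28), shape (n_H, N)
z = expit(W_HO @ Y)                            # eq. (31)-(32), shape (1, N)

# eq. (33)-(34): error times derivative (f' = f(1-f) for the logistic), times eta
d_vec = eta * (t - z) * z * (1 - z)            # shape (1, N)

# eq. (35)-(37): tiling + Hadamard product + row sum
delta_W_HO = (Y * d_vec).sum(axis=1)           # shape (n_H,)

W_HO = W_HO + delta_W_HO                       # eq. (38)
```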


First

We multiply element-wise $W_{HO}$ and $\Delta w_{1j}$:

$$T = \Delta w_{1j} \circ W_{HO}^T \quad (39)$$

Now, we obtain the element-wise derivative of $net_j$:

$$D_{net_j} = \begin{pmatrix} f'(\mathbf{w}_1^T\mathbf{x}_1) & f'(\mathbf{w}_1^T\mathbf{x}_2) & \cdots & f'(\mathbf{w}_1^T\mathbf{x}_N) \\ f'(\mathbf{w}_2^T\mathbf{x}_1) & f'(\mathbf{w}_2^T\mathbf{x}_2) & \cdots & f'(\mathbf{w}_2^T\mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ f'(\mathbf{w}_{n_H}^T\mathbf{x}_1) & f'(\mathbf{w}_{n_H}^T\mathbf{x}_2) & \cdots & f'(\mathbf{w}_{n_H}^T\mathbf{x}_N) \end{pmatrix} \quad (40)$$


Thus

We tile T to the right:

$$T_{tile} = \underbrace{\begin{pmatrix} T & T & \cdots & T \end{pmatrix}}_{N \text{ columns}} \quad (41)$$

Now, we multiply element-wise together with η:

$$P_t = \eta\left(D_{net_j} \circ T_{tile}\right) \quad (42)$$

where η is a constant multiplied against the result of the Hadamard product (the result is an $n_H \times N$ matrix).


Finally

We use the transpose of X, which is an N × d matrix:

$$X^T = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix} \quad (43)$$

Finally, we get an $n_H \times d$ matrix:

$$\Delta w_{ij} = P_t\, X^T \quad (44)$$

Thus, given $W_{IH}$:

$$W_{IH}(t+1) = W_{IH}(t) + \Delta w_{ij}(t) \quad (45)$$
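Continuing the previous sketch with the same assumed shapes, equations (39)-(45) become the following (again my own translation of the slides' matrix recipe, not library code):

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
d, n_H, N, eta = 3, 5, 10, 0.1

X = rng.normal(size=(d, N))
t = rng.integers(0, 2, size=N).astype(float)
W_IH = rng.normal(size=(n_H, d))
W_HO = rng.normal(size=(1, n_H))

# Pieces reused from the hidden-to-output sketch
Y = expit(W_IH @ X)                          # (n_H, N)
z = expit(W_HO @ Y)                          # (1, N)
d_vec = eta * (t - z) * z * (1 - z)          # (1, N)
delta_W_HO = (Y * d_vec).sum(axis=1)         # (n_H,), eq. (37)

# eq. (39): element-wise product of the output update and the output weights
T_vec = delta_W_HO * W_HO.ravel()            # (n_H,)

# eq. (40): derivative of the hidden activations, f' = f(1 - f)
D_netj = Y * (1 - Y)                         # (n_H, N)

# eq. (41)-(42): tiling T over N columns is just broadcasting
P_t = eta * D_netj * T_vec[:, None]          # (n_H, N)

# eq. (44)-(45): input-to-hidden update
W_IH = W_IH + P_t @ X.T                      # (n_H, d)
```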


We have different activation functions

The two most important:
1 The sigmoid function.
2 The hyperbolic tangent function.


Logistic Function

This non-linear function has the following definition for a neuron j:

$$f_j(v_j(n)) = \frac{1}{1 + \exp\{-a\, v_j(n)\}}, \quad a > 0,\ -\infty < v_j(n) < \infty \quad (46)$$

Example
[Figure: plot of the logistic function.]


The differential of the sigmoid function

Now, if we differentiate, we have:

$$f_j'(v_j(n)) = a\left[\frac{1}{1 + \exp\{-a\, v_j(n)\}}\right]\left[1 - \frac{1}{1 + \exp\{-a\, v_j(n)\}}\right] = \frac{a\exp\{-a\, v_j(n)\}}{\left(1 + \exp\{-a\, v_j(n)\}\right)^2}$$


The outputs finish as

For the output neurons:

$$\delta_k = (t_k - z_k)\, f'(net_k) = a\,(t_k - f_k(v_k(n)))\, f_k(v_k(n))\,(1 - f_k(v_k(n)))$$

For the hidden neurons:

$$\delta_j = a\, f_j(v_j(n))\,(1 - f_j(v_j(n)))\sum_{k=1}^{c} w_{kj}\,\delta_k$$


Hyperbolic tangent function

Another commonly used form of sigmoidal nonlinearity is the hyperbolic tangent function:

$$f_j(v_j(n)) = a\tanh(b\, v_j(n)) \quad (47)$$

Example
[Figure: plot of the hyperbolic tangent function.]


The differential of the hyperbolic tangent

We have

$$f_j'(v_j(n)) = ab\,\operatorname{sech}^2(b\, v_j(n)) = ab\left(1 - \tanh^2(b\, v_j(n))\right)$$

BTW
I leave it to you to figure out the outputs.
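As a quick check (not in the slides), the derivative of equation (47) can be verified symbolically with SymPy:

```python
import sympy as sp

a, b, v = sp.symbols('a b v', real=True)
f = a * sp.tanh(b * v)

# d/dv [a tanh(b v)] should equal a b (1 - tanh^2(b v))
diff = sp.simplify(sp.diff(f, v) - a * b * (1 - sp.tanh(b * v) ** 2))
print(diff)   # 0
```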


Maximizing information content

Two ways of achieving this (LeCun, 1993):
The use of an example that results in the largest training error.
The use of an example that is radically different from all those previously used.

For this
Randomize the samples presented to the multilayer perceptron when not doing batch training.

Or use an emphasizing scheme
Using the error, identify the difficult vs. easy patterns and use them to train the neural network.

Maximizing information content

Two ways of achieving this, LeCun 1993The use of an example that results in the largest training error.The use of an example that is radically different from all thosepreviously used.

For thisRandomized the samples presented to the multilayer perceptron when notdoing batch training.

Or use an emphasizing schemeBy using the error identify the difficult vs. easy patterns:

Use them to train the neural network

71 / 94

Maximizing information content

Two ways of achieving this, LeCun 1993The use of an example that results in the largest training error.The use of an example that is radically different from all thosepreviously used.

For thisRandomized the samples presented to the multilayer perceptron when notdoing batch training.

Or use an emphasizing schemeBy using the error identify the difficult vs. easy patterns:

Use them to train the neural network

71 / 94

Maximizing information content

Two ways of achieving this, LeCun 1993The use of an example that results in the largest training error.The use of an example that is radically different from all thosepreviously used.

For thisRandomized the samples presented to the multilayer perceptron when notdoing batch training.

Or use an emphasizing schemeBy using the error identify the difficult vs. easy patterns:

Use them to train the neural network

71 / 94

Maximizing information content

Two ways of achieving this, LeCun 1993The use of an example that results in the largest training error.The use of an example that is radically different from all thosepreviously used.

For thisRandomized the samples presented to the multilayer perceptron when notdoing batch training.

Or use an emphasizing schemeBy using the error identify the difficult vs. easy patterns:

Use them to train the neural network

71 / 94
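A minimal sketch of both ideas, assuming a per-sample training function train_step(x, t) that updates the network and returns that example's error; everything here (names, the epoch loop, the fraction of hard examples) is illustrative and not code from the deck.

```python
import numpy as np

def run_epoch(X, T, train_step, rng, emphasize_frac=0.0):
    """One epoch of pattern-by-pattern training.

    The presentation order is shuffled every epoch; if emphasize_frac > 0,
    that fraction of the hardest (largest-error) examples is re-presented
    at the end of the epoch.
    """
    order = rng.permutation(len(X))          # randomize sample order
    errors = np.empty(len(X))
    for i in order:
        errors[i] = train_step(X[i], T[i])

    if emphasize_frac > 0.0:                 # simple emphasizing scheme
        n_hard = max(1, int(emphasize_frac * len(X)))
        hardest = np.argsort(errors)[-n_hard:]
        for i in rng.permutation(hardest):
            train_step(X[i], T[i])
    return errors.mean()

# Example with a dummy train_step that just reports a squared error.
rng = np.random.default_rng(0)
X, T = rng.normal(size=(20, 3)), rng.normal(size=(20, 1))
dummy_step = lambda x, t: float((t ** 2).sum())
print(run_epoch(X, T, dummy_step, rng, emphasize_frac=0.1))
```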

However

Be careful with the emphasizing scheme
- The distribution of examples within an epoch presented to the network is distorted.
- The presence of an outlier or a mislabeled example can have a catastrophic consequence on the performance of the algorithm.

Definition of Outlier
An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe, 1999).

(Figure: a data distribution with a single point labeled "Outlier" lying outside the rest of the sample.)

72 / 94


Activation Function

We say that
An activation function f (v) is antisymmetric (odd) if f (-v) = -f (v).

It seems to be the case
That the multilayer perceptron learns faster using an antisymmetric activation function.

Example: the hyperbolic tangent (see the quick check below).

74 / 94
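A two-line check (mine, purely illustrative) that the hyperbolic tangent satisfies f(-v) = -f(v), while the logistic function does not:

```python
import numpy as np

v = np.linspace(-3, 3, 13)
logistic = lambda v: 1.0 / (1.0 + np.exp(-v))

print(np.allclose(np.tanh(-v), -np.tanh(v)))     # True: antisymmetric
print(np.allclose(logistic(-v), -logistic(v)))   # False: not antisymmetric
```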


Target Values

Important
It is important that the target values be chosen within the range of the sigmoid activation function.

Specifically
The desired response for a neuron in the output layer of the multilayer perceptron should be offset from the limiting value of the activation function by some amount ε.

76 / 94

For example

Given a limiting value ±a, we then have
- If the limiting value is +a, we set t = a - ε.
- If the limiting value is -a, we set t = -a + ε.

(A small helper implementing this offset follows below.)

77 / 94
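A small helper (assumed names and default values, not from the deck) that maps ±1 class labels to targets offset by ε from the limiting values ±a of a tanh-style activation:

```python
import numpy as np

def offset_targets(labels, a=1.7159, eps=0.2):
    # labels in {-1, +1}; targets become a - eps or -a + eps
    labels = np.asarray(labels, dtype=float)
    return np.where(labels > 0, a - eps, -a + eps)

print(offset_targets([+1, -1, +1]))   # [ 1.5159 -1.5159  1.5159]
```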


Normalizing the inputs

Something important (LeCun, 1993)
Each input variable should be preprocessed so that:
- Its mean value, averaged over the entire training set, is close to zero.
- Or it is at least small compared to its standard deviation.

Example: mean removal (a minimal sketch follows below).

79 / 94
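A minimal preprocessing sketch (illustrative, with statistics estimated on the training set only and reused on new data) that removes the mean of each input variable and, optionally, scales by its standard deviation:

```python
import numpy as np

def center_inputs(X_train, X_test, scale=True):
    # Column-wise statistics from the training set, applied to both sets.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12        # avoid division by zero
    if scale:
        return (X_train - mu) / sd, (X_test - mu) / sd
    return X_train - mu, X_test - mu

rng = np.random.default_rng(0)
Xtr = rng.normal(5.0, 2.0, size=(100, 3))
Xte = rng.normal(5.0, 2.0, size=(20, 3))
Xtr_c, Xte_c = center_inputs(Xtr, Xte)
print(np.abs(Xtr_c.mean(axis=0)).max() < 1e-10)   # True: means are (numerically) zero
```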

The normalization must include two other measures

Uncorrelated
The input variables should be uncorrelated; we can use principal component analysis (PCA) for this.

Example (see the whitening sketch below).

80 / 94

In addition

Quite interesting
The decorrelated input variables should be scaled so that their covariances are approximately equal.

Why
This ensures that the different synaptic weights in the network learn at approximately the same speed.

81 / 94
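A sketch of PCA-based decorrelation and whitening, so the transformed inputs have approximately equal (unit) covariances; this is one standard way to implement the two measures above, not necessarily the exact procedure LeCun describes.

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    # Center, rotate onto the principal axes, and rescale each axis so the
    # covariance matrix of the result is approximately the identity.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 3.0]])
X = rng.normal(size=(500, 3)) @ mixing          # correlated, unequal variances
Z = pca_whiten(X)
print(np.round(np.cov(Z, rowvar=False), 2))     # close to the 3x3 identity
```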

There are other heuristics

Such as
- Initialization
- Learning from hints
- Learning rates
- etc.

82 / 94

In addition

In Section 4.15, Simon Haykin
We have the following techniques:
- Network growing: you start with a small network and add neurons and layers until the learning task is accomplished.
- Network pruning: start with a large network, then prune weights that are not necessary in an orderly fashion (a simple illustration follows below).

83 / 94
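A minimal sketch of one very simple pruning criterion: zero out the weights with the smallest magnitudes. This is only an illustration of the idea; Haykin's Section 4.15 discusses more principled, sensitivity-based procedures, and the function name and threshold rule here are my own.

```python
import numpy as np

def magnitude_prune(W, fraction=0.2):
    """Return a pruned copy of W and a boolean mask of the surviving weights."""
    W = np.asarray(W, dtype=float)
    k = int(fraction * W.size)
    if k == 0:
        return W.copy(), np.ones_like(W, dtype=bool)
    threshold = np.sort(np.abs(W), axis=None)[k - 1]   # k-th smallest magnitude
    mask = np.abs(W) > threshold
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))
W_pruned, mask = magnitude_prune(W, fraction=0.25)
print(mask.sum(), "of", W.size, "weights survive")
```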


Virtues and limitations of Back-Propagation Learning

Something Notable
The back-propagation algorithm has emerged as the most popular algorithm for the training of multilayer perceptrons.

It has two distinct properties
- It is simple to compute locally.
- It performs stochastic gradient descent in weight space when doing pattern-by-pattern training.

85 / 94

Connectionism

Back-propagation
It is an example of a connectionist paradigm that relies on local computations to discover the processing capabilities of neural networks.

This form of restriction
It is known as the locality constraint.

86 / 94

Why this is advocated in Artificial Neural Networks

First
Artificial neural networks that perform local computations are often held up as metaphors for biological neural networks.

Second
The use of local computations permits a graceful degradation in performance due to hardware errors, and therefore provides the basis for a fault-tolerant network design.

Third
Local computations favor the use of parallel architectures as an efficient method for the implementation of artificial neural networks.

87 / 94

However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

First
- The reciprocal synaptic connections between the neurons of a multilayer perceptron may assume weights that are excitatory or inhibitory.
- In the real nervous system, neurons usually appear to be one or the other.

Second
In a multilayer perceptron, hormonal and other types of global communications are ignored.

88 / 94

However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Third
- In back-propagation learning, a synaptic weight is modified by a presynaptic activity and an error (learning) signal independent of postsynaptic activity.
- There is evidence from neurobiology to suggest otherwise.

Fourth
- In a neurobiological sense, the implementation of back-propagation learning requires the rapid transmission of information backward along an axon.
- It appears highly unlikely that such an operation actually takes place in the brain.

89 / 94

However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Fifth
- Back-propagation learning implies the existence of a "teacher," which in the context of the brain would presumably be another set of neurons with novel properties.
- The existence of such neurons is biologically implausible.

90 / 94

Computational Efficiency

Something Notable
The computational complexity of an algorithm is usually measured in terms of the number of multiplications, additions, and storage involved in its implementation.

This is the electrical engineering approach!!!

Taking into account the total number of synapses W (including biases)
We have Δw_kj = η δ_k y_j = η (t_k - z_k) f' (net_k) y_j (the hidden-to-output update, computed during the backward pass).

We have that for this step
1 We need to calculate net_k, which is linear in the number of weights.
2 We need to calculate y_j = f (net_j), which is also linear in the number of weights.

91 / 94
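Written as an outer product, the hidden-to-output update Δw_kj = η δ_k y_j clearly costs one multiplication per weight in that layer; a tiny sketch with assumed toy names:

```python
import numpy as np

rng = np.random.default_rng(1)
nH, c, eta = 5, 3, 0.1
y = rng.random(nH)                     # hidden outputs y_j
t, z = rng.random(c), rng.random(c)    # targets and network outputs

delta_k = (t - z) * z * (1.0 - z)      # output deltas (logistic units, a = 1)
dW_out = eta * np.outer(delta_k, y)    # Delta w_kj = eta * delta_k * y_j

print(dW_out.shape)                    # (3, 5): one entry (one multiply) per weight
```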

Computational Efficiency

Now the input-to-hidden update

Δw_ji = η x_i δ_j = η f' (net_j) [ Σ_{k=1}^{c} w_kj δ_k ] x_i

We have that for this step
The sum Σ_{k=1}^{c} w_kj δ_k takes, thanks to the previously computed δ_k's, time linear in the number of weights.

Clearly all of this requires memory
In addition to the calculation of the derivatives of the activation functions, which we assume takes constant time per unit.

92 / 94
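To make the linear-in-W claim concrete, here is a small sketch with my own rough accounting (counting only multiplications, which is an assumption, not the deck's derivation) of one forward plus one backward pass, compared against the total number of weights W:

```python
def backprop_multiplications(d, nH, c):
    """Rough multiplication count for one training example.

    Counts only the dominant terms: the forward products, the output and
    hidden deltas, and the two weight-update outer products.
    """
    W = nH * d + c * nH                 # total number of weights (biases ignored)
    forward = nH * d + c * nH           # net_j and net_k
    deltas = 2 * c + (c * nH + 2 * nH)  # delta_k and the back-propagated delta_j
    updates = c * nH + nH * d           # eta * delta * activation, per weight
    return W, forward + deltas + updates

for dims in [(10, 20, 5), (100, 200, 50), (1000, 2000, 500)]:
    W, ops = backprop_multiplications(*dims)
    print(f"W = {W:>9,d}   multiplications = {ops:>9,d}   ratio = {ops / W:.2f}")
```

The ratio stays roughly constant as the network grows, which is exactly the O(W) behavior stated on the next slide.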

We have that

The complexity of back-propagation for the multilayer perceptron is

O (W)

93 / 94

Exercises

From Neural Networks by Haykin
4.2, 4.3, 4.6, 4.8, 4.16, 4.17, 3.7

94 / 94