CS344: Introduction to Artificial Intelligence
(associated lab: CS386)Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 31: Feedforward N/W; sigmoid neuron
28th March, 2011
Feedforward Network
Limitations of the perceptron
• Non-linear separability is all-pervading.
• A single perceptron does not have enough computing power.
• E.g.: XOR cannot be computed by a perceptron.
Solutions
• Tolerate error (ex.: the pocket algorithm used by connectionist expert systems): try to get the best possible hyperplane using only perceptrons.
• Use higher-dimension surfaces, ex.: degree-2 surfaces like the parabola.
• Use a layered network.
Pocket Algorithm
• Evolved in 1985 – essentially uses the PTA (Perceptron Training Algorithm).
• Basic idea:
  – Always preserve the best weights obtained so far in the "pocket".
  – Replace the pocketed weights whenever changed weights are found to be better (i.e., the changed weights result in reduced error).
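A minimal sketch of the pocket idea (the function name and details such as the error measure are illustrative, assuming a standard perceptron update and misclassification count as the error; the threshold is absorbed as a bias weight):

```python
import numpy as np

def pocket_train(X, y, epochs=100, seed=0):
    """Perceptron training that keeps the best weights seen so far in a 'pocket'."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # absorb threshold as a bias weight
    w = rng.normal(size=Xb.shape[1])            # current weights
    errors = lambda wt: np.sum((Xb @ wt >= 0).astype(int) != y)
    pocket_w, pocket_err = w.copy(), errors(w)
    for _ in range(epochs):
        for xi, ti in zip(Xb, y):
            oi = int(xi @ w >= 0)
            w += (ti - oi) * xi                 # standard perceptron update
        e = errors(w)
        if e < pocket_err:                      # better weights found: update pocket
            pocket_w, pocket_err = w.copy(), e
    return pocket_w, pocket_err

# XOR is not linearly separable, so the pocket can never reach zero error
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
w, err = pocket_train(X, y)
```

Because XOR is not linearly separable, the pocket only retains the least-bad hyperplane; the error never reaches zero.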
XOR using 2 layers

x1 ⊕ x2 = x1·x̄2 + x̄1·x2
        = OR(AND(x1, NOT(x2)), AND(NOT(x1), x2))
• Non-LS function expressed as a linearly separable function of individual linearly separable functions.
Example – XOR: computing x̄1·x2

x1  x2  x̄1·x2
0   0   0
0   1   1
1   0   0
1   1   0

A single perceptron with w1 = -1, w2 = 1.5, θ = 1 computes x̄1·x2.
In general the weights must satisfy w2 ≥ θ and w1 + w2 < θ, which together force w1 < 0.
Calculation of XOR

x1 ⊕ x2 = OR(x̄1·x2, x1·x̄2): the OR of the two hidden outputs is computed by an output neuron with w1 = 1, w2 = 1, θ = 0.5.

The complete two-layer network for XOR:
• Hidden unit h1 computes x̄1·x2: weights (-1, 1.5), θ = 1.
• Hidden unit h2 computes x1·x̄2: weights (1.5, -1), θ = 1.
• Output unit computes OR(h1, h2): weights (1, 1), θ = 0.5.
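The two-layer network above can be checked directly (a sketch; each step neuron fires when its weighted sum reaches the threshold θ):

```python
def step(weighted_sum, theta):
    """Threshold (step) neuron: fires iff the weighted sum reaches theta."""
    return 1 if weighted_sum >= theta else 0

def xor_net(x1, x2):
    h1 = step(-1.0 * x1 + 1.5 * x2, theta=1.0)   # computes (NOT x1) AND x2
    h2 = step(1.5 * x1 + -1.0 * x2, theta=1.0)   # computes x1 AND (NOT x2)
    return step(1.0 * h1 + 1.0 * h2, theta=0.5)  # OR of the hidden outputs

# Reproduces the XOR truth table:
# (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0
```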
Some Terminology
A multilayer feedforward neural network has:
• an input layer,
• an output layer,
• hidden layer(s) (assist computation).
Output units and hidden units are called computation units.
Training of the Multilayer Perceptron (MLP)
Question: how to find weights for the hidden layers when no target output is available for them?
Credit assignment problem – to be solved by "Gradient Descent".
Consider a 2-2-1 feedforward network with inputs x1, x2 and hidden units h1, h2, where each neuron has linear I/O behavior y = m·x + c:

y1 = m1·x + c1,  y2 = m2·x + c2,  y3 = m3·x + c3

h1 = m1(w11x1 + w12x2) + c1
h2 = m2(w21x1 + w22x2) + c2
Out = m3(w5h1 + w6h2) + c3 = k1x1 + k2x2 + k3   (for constants k1, k2, k3)
Can Linear Neurons Work?
Note: the whole structure shown on the earlier slide is reducible to a single neuron with the same linear behavior.
Claim: a neuron with linear I/O behavior can't compute X-OR.
Proof: Considering all possible cases:
[assuming 0.1 and 0.9 as the lower and upper thresholds]
With Out = m·(w1x1 + w2x2) + c = k1x1 + k2x2 + k3:

For (0,0), zero class:  m·(w1·0 + w2·0) + c ≤ 0.1  ⟹  c ≤ 0.1
For (0,1), one class:   m·(w1·0 + w2·1) + c ≥ 0.9  ⟹  m·w2 + c ≥ 0.9
For (1,0), one class:   m·(w1·1 + w2·0) + c ≥ 0.9  ⟹  m·w1 + c ≥ 0.9
For (1,1), zero class:  m·(w1·1 + w2·1) + c ≤ 0.1  ⟹  m·w1 + m·w2 + c ≤ 0.1

Adding the (0,1) and (1,0) inequalities gives m·w1 + m·w2 + 2c ≥ 1.8, while adding the (0,0) and (1,1) inequalities gives m·w1 + m·w2 + 2c ≤ 0.2. These constraints are inconsistent; hence X-OR can't be computed.

Observations:
1. A linear neuron can't compute X-OR.
2. A multilayer FFN with linear neurons is collapsible to a single linear neuron; hence there is no additional power due to the hidden layer.
3. Non-linearity is essential for computing power.
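A quick numerical illustration of the inconsistency (not part of the lecture): the best linear fit Out = k1x1 + k2x2 + k3 to the XOR targets lands at 0.5 on all four inputs, far from both the 0.1 and 0.9 thresholds:

```python
import numpy as np

# Design matrix for Out = k1*x1 + k2*x2 + k3 (column of ones for the constant k3)
A = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])          # XOR targets

k, *_ = np.linalg.lstsq(A, t, rcond=None)   # least-squares fit of k1, k2, k3
out = A @ k                                 # best achievable linear outputs
# k comes out as [0, 0, 0.5]: every output is 0.5, violating both thresholds
```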
Multilayer Perceptron
Gradient Descent Technique
Let E be the error at the output layer
ti = target output; oi = observed output
i is the index going over n neurons in the outermost layer
j is the index going over the p patterns (1 to p)
Ex: XOR:– p=4 and n=1
E = (1/2) Σ_{j=1..p} Σ_{i=1..n} (t_i − o_i)_j²
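The double sum can be sketched directly in code (the function name and the example output values are illustrative, not from the lecture):

```python
def total_error(targets, outputs):
    """E = 1/2 * sum over patterns j and output neurons i of (t_i - o_i)_j^2."""
    return 0.5 * sum((t - o) ** 2
                     for t_row, o_row in zip(targets, outputs)
                     for t, o in zip(t_row, o_row))

# XOR: p = 4 patterns, n = 1 output neuron
targets = [[0.0], [1.0], [1.0], [0.0]]
outputs = [[0.5], [0.5], [0.5], [0.5]]   # e.g., a net stuck at 0.5 everywhere
# E = 0.5 * (4 * 0.25) = 0.5
```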
Weights in a FF NN
• wmn is the weight of the connection from the nth neuron to the mth neuron.
• E as a function of the weight vector W is a complex surface in the space defined by the weights wij.
• −∂E/∂wmn gives the direction in which a movement of the operating point in the wmn co-ordinate space will result in the maximum decrease in error:

Δwmn ∝ −∂E/∂wmn
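The update rule Δw ∝ −η·∂E/∂w can be seen on a one-weight toy error surface E(w) = (w − 3)² (an illustrative example, not from the lecture):

```python
def grad_E(w):
    """Gradient of the toy error surface E(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1            # start far from the minimum; eta is the learning rate
for _ in range(100):
    w += -eta * grad_E(w)    # delta_w = -eta * dE/dw
# w converges to 3, the minimum of the error surface
```

Each step shrinks the distance to the minimum by a factor of (1 − 2η) = 0.8, so the operating point slides down the surface.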
Sigmoid neurons
• Gradient descent needs a derivative computation – not possible with the perceptron due to the discontinuous step function used!
• Sigmoid neurons, with easy-to-compute derivatives, are used instead:

  y = 1 / (1 + e^(−x))
  y → 1 as x → ∞
  y → 0 as x → −∞

• Computing power comes from the non-linearity of the sigmoid function.
Derivative of the Sigmoid function

y = 1 / (1 + e^(−x))

dy/dx = e^(−x) / (1 + e^(−x))²
      = (1 / (1 + e^(−x))) · (1 − 1 / (1 + e^(−x)))
      = y(1 − y)
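The identity dy/dx = y(1 − y) can be verified against a central finite-difference approximation (a small check, not from the lecture):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    y = sigmoid(x)
    return y * (1.0 - y)     # dy/dx = y(1 - y)

# Compare with a central finite difference at a few points
h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    # numeric and sigmoid_deriv(x) agree to high precision
```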
Training algorithm
• Initialize weights to random values.
• For input x = <xn, xn-1, …, x0>, modify weights as follows
  (target output = t, observed output = o):

  Δwi = −η ∂E/∂wi,  where  E = (1/2)(t − o)²

• Iterate until E < ε (threshold).
Calculation of ∆wi

Δwi = −η ∂E/∂wi   (η = learning constant, 0 < η ≤ 1)

∂E/∂wi = (∂E/∂o)(∂o/∂net)(∂net/∂wi),  where net = Σ_{i=0..n} wi·xi

∂E/∂o = −(t − o)
∂o/∂net = o(1 − o)       (sigmoid neuron)
∂net/∂wi = xi

Hence  Δwi = η(t − o)·o·(1 − o)·xi
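The update Δwi = η(t − o)·o(1 − o)·xi for a single sigmoid neuron can be sketched as follows, training on AND as an illustration (AND is linearly separable, so one neuron suffices; the threshold is treated as weight w0 on a constant input 1; the function name and hyperparameters are illustrative):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sigmoid_neuron(patterns, eta=1.0, epochs=5000, seed=0):
    random.seed(seed)
    w = [random.uniform(-0.5, 0.5) for _ in range(3)]       # w0 = bias, w1, w2
    for _ in range(epochs):
        for (x1, x2), t in patterns:
            x = [1.0, x1, x2]                               # constant 1 feeds the bias
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(3):
                w[i] += eta * (t - o) * o * (1 - o) * x[i]  # delta rule
    return w

AND = [((0, 0), 0.0), ((0, 1), 0.0), ((1, 0), 0.0), ((1, 1), 1.0)]
w = train_sigmoid_neuron(AND)
# After training, the neuron's output rounds to the AND truth table
```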
Observations
Does the training technique support our intuition?
• The larger the xi, the larger is Δwi.
• The error burden is borne by the weight values corresponding to large input values.