Artificial Neural Networks
Dr. Lahouari Ghouti
Information & Computer Science Department
Single-Layer Perceptron (SLP)
Architecture
We consider the following architecture: a feed-forward neural network with one layer.
It is sufficient to study single-layer perceptrons with just one neuron:
Perceptron: Neuron Model
Uses a non-linear (McCulloch-Pitts) model of a neuron:
[Figure: inputs x1, x2, …, xm weighted by w1, w2, …, wm, plus a bias b, are summed into the net input z, which passes through the activation g(z) to produce the output y]
• g is the sign function:
  g(z) = +1 if z >= 0
  g(z) = -1 if z < 0
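To make the neuron model concrete, here is a minimal sketch (Python/NumPy; the example weights, bias and function names are illustrative, not from the slides) of the computation: form the net input z = w·x + b and apply the sign activation g.

```python
import numpy as np

def g(z):
    """Sign activation: +1 if z >= 0, -1 otherwise."""
    return 1.0 if z >= 0 else -1.0

def perceptron_output(x, w, b):
    """McCulloch-Pitts unit: net input z = w.x + b, output y = g(z)."""
    z = np.dot(w, x) + b
    return g(z)

# A 2-input neuron with arbitrary example weights
w = np.array([0.5, -0.3])
b = 0.1
print(perceptron_output(np.array([1.0, 1.0]), w, b))    # +1  (z = 0.3 >= 0)
print(perceptron_output(np.array([-1.0, 1.0]), w, b))   # -1  (z = -0.7 < 0)
```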
Perceptron: Applications
The perceptron is used for classification: classify correctly a set of examples into one of the two classes C1, C2:
• If the output of the perceptron is +1, the input is assigned to class C1
• If the output is -1, the input is assigned to class C2
Perceptron: Classification
The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:
  Σ_{i=1}^{m} wi xi + b = 0
Equivalently, treating the bias as a weighted input (w0 = b, x0 = +1):
  Σ_{i=0}^{m} wi xi = 0
[Figure: in the (x1, x2) plane, the decision boundary w1x1 + w2x2 + b = 0 separates the decision region for C1 (where w1x1 + w2x2 + b > 0) from the decision region for C2 (where w1x1 + w2x2 + b <= 0)]
Perceptron: Limitations
The perceptron can only model linearly-separable functions.
The perceptron can be used to model the following Boolean functions:
  AND, OR, COMPLEMENT
But it cannot model the XOR. Why?
Perceptron: Limitations (Cont'd)
The XOR is not a linearly-separable problem.
It is impossible to separate the classes C1 and C2 with only one line.
[Figure: the four XOR input points at the corners of the unit square in the (x1, x2) plane, with diagonally opposite corners labeled C1 and C2 (targets +1 and -1); no single line separates the two classes]
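To illustrate the two slides above, the following sketch (Python/NumPy; the specific weight and bias values are one possible hand-picked choice, not taken from the slides) shows a single threshold unit realizing AND, OR and COMPLEMENT, and notes why no weights exist for XOR.

```python
import numpy as np

def threshold_unit(x, w, b):
    """Single-layer perceptron unit: +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked weights (one possible choice) for the linearly-separable functions
print("AND :", [threshold_unit(x, np.array([1.0, 1.0]), -1.5) for x in inputs])
print("OR  :", [threshold_unit(x, np.array([1.0, 1.0]), -0.5) for x in inputs])
print("NOT :", [threshold_unit((x1,), np.array([-1.0]), 0.5) for x1 in (0, 1)])

# XOR targets: +1 for (0, 1) and (1, 0), -1 for (0, 0) and (1, 1).
# No (w1, w2, b) satisfies all four constraints on the sign of w1*x1 + w2*x2 + b
# simultaneously: the inequalities are mutually inconsistent, which is exactly
# the linear-separability limitation illustrated above.
```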
Perceptron: Learning Algorithm
Variables and parameters:
  x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]^T
  w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]^T
  b(n) = bias
  y(n) = actual response
  d(n) = desired response
  η = learning rate parameter (more elaboration later)
The Fixed-Increment Learning Algorithm
• Initialization: set w(0) = 0
• Activation: activate the perceptron by applying the input example (vector x(n) and desired response d(n))
• Compute actual response of the perceptron:
  y(n) = sgn[w^T(n)x(n)]
• Adapt the weight vector: if d(n) and y(n) are different, then
  w(n + 1) = w(n) + η[d(n) - y(n)]x(n)
  where d(n) = +1 if x(n) belongs to C1
        d(n) = -1 if x(n) belongs to C2
• Continuation: increment time index n by 1 and go to the Activation step
A Learning Example
Consider a training set C1 ∪ C2, where:
  C1 = {(1,1), (1,-1), (0,-1)}: elements of class 1
  C2 = {(-1,-1), (-1,1), (0,1)}: elements of class -1
Use the perceptron learning algorithm to classify these examples.
• w(0) = [1, 0, 0]^T, η = 1
A Learning Example (Cont'd)
[Figure: the training points of C1 (marked +) and C2 (marked -) in the (x1, x2) plane, separated by the learned decision boundary 2x1 - x2 = 0]
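A minimal sketch of the fixed-increment algorithm applied to this training set (Python/NumPy; the loop structure and variable names are mine, while w(0) and η are taken from the slide; the exact boundary reached can depend on the order in which examples are presented):

```python
import numpy as np

# Training set from the slide: class C1 -> d = +1, class C2 -> d = -1
X = np.array([[ 1,  1], [ 1, -1], [0, -1],    # C1
              [-1, -1], [-1,  1], [0,  1]])   # C2
d = np.array([1, 1, 1, -1, -1, -1])

# Augment inputs with x0 = +1 so that w = [b, w1, w2]
Xa = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([1.0, 0.0, 0.0])   # w(0) = [1, 0, 0]^T as on the slide
eta = 1.0                       # learning rate

for epoch in range(100):
    errors = 0
    for x_n, d_n in zip(Xa, d):
        y_n = 1 if w @ x_n >= 0 else -1          # y(n) = sgn[w^T(n) x(n)]
        if y_n != d_n:
            w = w + eta * (d_n - y_n) * x_n      # fixed-increment update
            errors += 1
    if errors == 0:                              # all examples classified correctly
        break

print("weights [b, w1, w2]:", w)
# The learned line b + w1*x1 + w2*x2 = 0 separates C1 from C2, comparable to
# the boundary 2*x1 - x2 = 0 sketched on the slide.
```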
The Learning Algorithm: Convergence
Let n = number of training samples (set X);
  X1 = set of training samples belonging to class C1;
  X2 = set of training samples belonging to class C2.
For a given sample n:
  x(n) = [+1, x1(n), …, xp(n)]^T = input vector
  w(n) = [b(n), w1(n), …, wp(n)]^T = weight vector
Net activity level: v(n) = w^T(n)x(n)
Output: y(n) = +1 if v(n) >= 0
        y(n) = -1 if v(n) < 0
The Learning Algorithm: Convergence (Cont'd)
The decision hyperplane separates classes C1 and C2.
If the two classes C1 and C2 are linearly separable, then there exists a weight vector w such that
  w^T x >= 0 for all x belonging to class C1
  w^T x < 0 for all x belonging to class C2
Error-Correction Learning
Update rule: w(n + 1) = w(n) + Δw(n)
Learning process:
– If x(n) is correctly classified by w(n), then
  w(n + 1) = w(n)
– Otherwise, the weight vector is updated as follows:
  w(n + 1) = w(n) - η(n)x(n) if w^T(n)x(n) >= 0 and x(n) belongs to C2
  w(n + 1) = w(n) + η(n)x(n) if w^T(n)x(n) < 0 and x(n) belongs to C1
Perceptron Convergence Algorithm
Variables and parameters:
– x(n) = [+1, x1(n), …, xp(n)]; w(n) = [b(n), w1(n), …, wp(n)]
– y(n) = actual response (output); d(n) = desired response
– η = learning rate, a positive number less than 1
Step 1: Initialization
– Set w(0) = 0, then do the following for n = 1, 2, 3, …
Step 2: Activation
– Activate the perceptron by applying input vector x(n) and desired output d(n)
Perceptron Convergence Algorithm (Cont'd)
Step 3: Computation of actual response
  y(n) = sgn[w^T(n)x(n)]
– where sgn(.) is the signum function
Step 4: Adaptation of weight vector
  w(n + 1) = w(n) + η[d(n) - y(n)]x(n)
  where d(n) = +1 if x(n) belongs to C1
        d(n) = -1 if x(n) belongs to C2
Step 5
– Increment n by 1, and go back to Step 2
Learning: Performance Measure
A learning rule is designed to optimize a performance measure
– However, in the development of the perceptron convergence algorithm we did not mention a performance measure
Intuitively, what would be an appropriate performance measure for a classification neural network?
Define the performance measure:
  J = -E[e(n)v(n)]
Learning: Performance Measure
Or, as an instantaneous estimate:
  J'(n) = -e(n)v(n)
The error at iteration n:
• e(n) = d(n) - y(n)
• v(n) = linear combiner output at iteration n
• E[.] = expectation operator
Learning: Performance Measure (Cont'd)
Can we derive our learning rule by minimizing this performance function [Haykin's textbook]?
  ∂J'(n)/∂w(n) = -[d(n) - y(n)] ∂v(n)/∂w(n)
Now v(n) = w^T(n)x(n), thus
  ∂J'(n)/∂w(n) = -[d(n) - y(n)]x(n)
Learning rule:
  w(n + 1) = w(n) - η ∂J'(n)/∂w(n) = w(n) + η[d(n) - y(n)]x(n)
Presentation of Training Examples
Presenting all training examples once to the ANN is called an epoch.
In incremental stochastic gradient descent, training examples can be presented in:
– Fixed order (1, 2, 3, …, M)
– Randomly permuted order (5, 2, 7, …, 3)
– Completely random order (4, 1, 7, 1, 5, 4, …)
Concluding Remarks
A single-layer perceptron can perform pattern classification only on linearly separable patterns, regardless of the type of nonlinearity (hard limiter, sigmoidal)
Papert and Minsky in 1969 elucidated the limitations of Rosenblatt's single-layer perceptron (e.g. the requirement of linear separability, the inability to solve the XOR problem) and cast doubt on the viability of neural networks
However, the multilayer perceptron and the back-propagation algorithm overcome many of the shortcomings of the single-layer perceptron
Adaline: Adaptive Linear Element
The output y is a linear combination of the inputs x:
[Figure: inputs x1, x2, …, xm with weights w1, w2, …, wm are combined linearly to produce the output y]
  y(n) = Σ_{j=0}^{m} wj(n) xj(n)
Adaline: Adaptive Linear Element (Cont'd)
Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm.
The idea: try to minimize the square error, which is a function of the weights:
  E(w(n)) = (1/2) e^2(n)
  e(n) = d(n) - Σ_{j=0}^{m} wj(n) xj(n)
We can find the minimum of the error function E by means of the steepest descent method (an optimization procedure).
Steepest Descent Method: Basics
Start with an arbitrary point,
find the direction in which E is decreasing most rapidly,
and make a small step in that direction:
  w(n + 1) = w(n) - η · (gradient of E(n))
  gradient of E(w) = [∂E/∂w1, …, ∂E/∂wm]
Steepest Descent Method: Basics (Cont'd)
[Figure: a small step in weight space from the point (w1, w2) to (w1 + Δw1, w2 + Δw2)]
Steepest Descent Method: Basics (Cont'd)
[Figure: a 3-D plot of an error surface over (w1, w2) showing a global min and a local min, with the gradient direction indicated]
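As a hedged illustration of the two figures above, the sketch below applies the rule w(n + 1) = w(n) - η · gradient(E) to a toy one-dimensional error function of my own choosing that has both a local and a global minimum; the minimum reached depends on the starting point.

```python
def E(w):       # toy error function with a local and a global minimum
    return w**4 - 3*w**2 + w

def grad_E(w):  # dE/dw
    return 4*w**3 - 6*w + 1

def steepest_descent(w0, eta=0.01, n_steps=500):
    """Iterate w(n+1) = w(n) - eta * gradient(E(w(n)))."""
    w = w0
    for _ in range(n_steps):
        w = w - eta * grad_E(w)
    return w

# The same rule ends in different minima depending on where it starts
print(steepest_descent(+2.0))   # settles near the local minimum  (w ~ +1.1)
print(steepest_descent(-2.0))   # settles near the global minimum (w ~ -1.3)
```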
Least-Mean-Square (LMS) Algorithm (Widrow-Hoff Algorithm)
Approximation of gradient(E):
  ∂E(w(n))/∂w(n) = e(n) ∂e(n)/∂w(n) = -e(n)x^T(n)
Update rule for the weights becomes:
  w(n + 1) = w(n) + η e(n)x(n)
Summary of LMS Algorithm
Training sample:
  Input signal vector x(n)
  Desired response d(n)
User-selected parameter η > 0
Initialization: set ŵ(1) = 0
Computation: for n = 1, 2, … compute
  e(n) = d(n) - ŵ^T(n)x(n)
  ŵ(n + 1) = ŵ(n) + η x(n)e(n)
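A minimal sketch of this LMS recursion fitting a linear unit to synthetic data (the data-generating weights, the value of η, and names such as w_hat are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: d(n) produced by an unknown linear unit plus noise
true_w = np.array([0.5, 2.0, -1.0])                            # [b, w1, w2] to recover
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # x(n) = [+1, x1, x2]
d = X @ true_w + 0.05 * rng.normal(size=200)

eta = 0.05
w_hat = np.zeros(3)                     # initialization: w_hat(1) = 0
for x_n, d_n in zip(X, d):              # computation: for n = 1, 2, ...
    e_n = d_n - w_hat @ x_n             # e(n) = d(n) - w_hat^T(n) x(n)
    w_hat = w_hat + eta * x_n * e_n     # w_hat(n+1) = w_hat(n) + eta x(n) e(n)

print("estimated [b, w1, w2]:", np.round(w_hat, 2))   # close to the generating weights
```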
Neuron with Sigmoid-Function
[Figure: inputs x1, x2, …, xm with weights w1, w2, …, wm feed the activation a, which passes through the sigmoid to give the output y]
  Activation: a = Σ_{j=1}^{m} wj(n) xj(n)
  Output:     y = 1 / (1 + e^(-a))
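A short sketch of this unit's forward computation (the example input and weight values are illustrative): the activation a is the weighted sum of the inputs and the output is y = 1 / (1 + e^(-a)).

```python
import numpy as np

def sigmoid_neuron(x, w):
    """a = sum_j w_j * x_j ; y = 1 / (1 + exp(-a))"""
    a = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
print(sigmoid_neuron(x, w))   # a = -0.8, so y = sigmoid(-0.8) ~ 0.31
```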
Multi-Layer Neural Networks
[Figure: a feed-forward network composed of an input layer, a hidden layer, and an output layer]
Backpropagation Principle
[Figure: inputs xi feed hidden units k through weights wki; hidden units feed output units j through weights wjk, producing outputs yj]
Forward step: propagate activation from the input to the output layer
Backward step: propagate errors from the output to the hidden layer
Backpropagation Algorithm
Initialize each wi to some small random value
Until the termination condition is met, Do
– For each training example <(x1, …, xn), t> Do
  » Input the instance (x1, …, xn) to the network and compute the network outputs yk
  » For each output unit k:
      δk = yk(1 - yk)(tk - yk)
  » For each hidden unit h:
      δh = yh(1 - yh) Σk wh,k δk
  » For each network weight wi,j Do
      wi,j = wi,j + Δwi,j, where Δwi,j = η δj xi,j
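A compact sketch of this procedure for a network with one sigmoid hidden layer, applied to XOR as a toy task (the layer sizes, learning rate, random seed, and helper names such as forward, W1, W2 are my illustrative choices; the delta formulas follow the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward step: propagate activation from input to output layer."""
    h = sigmoid(np.concatenate(([1.0], x)) @ W1)   # hidden outputs y_h (bias input +1)
    y = sigmoid(np.concatenate(([1.0], h)) @ W2)   # network outputs y_k
    return h, y

# Toy task: XOR, which a single-layer perceptron cannot model
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

eta = 0.5
W1 = rng.normal(scale=0.5, size=(3, 4))   # input (+bias)  -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden (+bias) -> 1 output unit

for epoch in range(20000):
    for x, t in zip(X, T):
        h, y = forward(x, W1, W2)
        # Backward step: propagate errors from output to hidden layer
        delta_out = y * (1 - y) * (t - y)               # delta_k = y_k(1-y_k)(t_k-y_k)
        delta_hid = h * (1 - h) * (W2[1:] @ delta_out)  # delta_h = y_h(1-y_h) sum_k w_hk delta_k
        # Weight updates: w_ij <- w_ij + eta * delta_j * x_ij
        W2 += eta * np.outer(np.concatenate(([1.0], h)), delta_out)
        W1 += eta * np.outer(np.concatenate(([1.0], x)), delta_hid)

print(np.round([forward(x, W1, W2)[1][0] for x in X], 2))
# Typically approaches [0, 1, 1, 0]; like any gradient descent it may occasionally
# stall in a local minimum (see the next slide).
```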
Backpropagation Algorithm (Cont'd)
Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum
– In practice it often works well (can be invoked multiple times with different initial weights)
Often include a weight momentum term:
  Δwi,j(n) = η δj xi,j + α Δwi,j(n-1), where α is the momentum coefficient
Minimizes error over the training examples
– Will it generalize well to unseen instances (over-fitting)?
Training can be slow: typically 1000-10000 iterations (use Levenberg-Marquardt instead of gradient descent)
Using the network after training is fast
Convergence of Backpropagation
Gradient descent converges to some local minimum, perhaps not the global minimum
– Add a momentum term: Δwki(n) = η δk(n) xi(n) + α Δwki(n-1), with α ∈ [0, 1]
– Stochastic gradient descent
– Train multiple nets with different initial weights
Nature of convergence
– Initialize weights near zero
– Therefore, initial networks are near-linear
– Increasingly non-linear functions become possible as training progresses
Optimization Methods
There are other optimization methods with faster convergence than gradient descent:
– Newton's method uses a quadratic approximation (2nd-order Taylor expansion):
  F(x + Δx) = F(x) + ∇F(x)^T Δx + (1/2) Δx^T ∇²F(x) Δx + …
– Conjugate gradients
– Levenberg-Marquardt algorithm
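A tiny sketch of the Newton step implied by this quadratic approximation, in one dimension, on the same toy error function used in the steepest-descent sketch above (my own choice of function and starting point):

```python
def dE(w):    # first derivative of the toy error E(w) = w**4 - 3*w**2 + w
    return 4*w**3 - 6*w + 1

def d2E(w):   # second derivative (the 1-D "Hessian")
    return 12*w**2 - 6

w = 2.0
for _ in range(6):
    w = w - dE(w) / d2E(w)   # Newton step: minimize the local quadratic model
print(w)   # ~1.13: reaches the nearby stationary point in a handful of steps,
           # versus hundreds of small steps for plain gradient descent
```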
Universal Approximation Property of ANN
Boolean functions:
– Every Boolean function can be represented by a network with a single hidden layer
– But it might require a number of hidden units that is exponential in the number of inputs
Continuous functions:
– Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
– Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Using Weight Derivatives
How often to update?
– After each training case?
– After a full sweep through the training data?
How much to update?
– Use a fixed learning rate?
– Adapt the learning rate?
– Add momentum?
– Don't use steepest descent?
What Next?
Bias Effect
Batch vs. Continuous Learning
Variable Learning Rate (Update Rule?)
Effect of Neurons/Layer
Effect of Hidden Layers