Artificial Neural Networks
Dr. Lahouari Ghouti
Information & Computer Science Department
Single-Layer Perceptron (SLP)
Architecture
We consider the following architecture: a feed-forward neural network with one layer.
It is sufficient to study single-layer perceptrons with just one neuron:
Perceptron: Neuron Model
Uses a non-linear (McCulloch-Pitts) model of a neuron:
[Figure: inputs x1, x2, …, xm weighted by w1, w2, …, wm, plus a bias b, are summed into the net input z, which passes through the activation g(z) to produce the output y]
• g is the sign function:
  g(z) = +1 if z >= 0
  g(z) = -1 if z < 0
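To make the neuron model concrete, here is a minimal sketch (Python/NumPy; the example weights, bias and function names are illustrative, not from the slides) of the computation: form the net input z = w·x + b and apply the sign activation g.

```python
import numpy as np

def g(z):
    """Sign activation: +1 if z >= 0, -1 otherwise."""
    return 1.0 if z >= 0 else -1.0

def perceptron_output(x, w, b):
    """McCulloch-Pitts unit: net input z = w.x + b, output y = g(z)."""
    z = np.dot(w, x) + b
    return g(z)

# A 2-input neuron with arbitrary example weights
w = np.array([0.5, -0.3])
b = 0.1
print(perceptron_output(np.array([1.0, 1.0]), w, b))    # +1  (z = 0.3 >= 0)
print(perceptron_output(np.array([-1.0, 1.0]), w, b))   # -1  (z = -0.7 < 0)
```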
Perceptron: Applications
The perceptron is used for classification: classify correctly a set of examples into one of the two classes C1, C2:
• If the output of the perceptron is +1, the input is assigned to class C1
• If the output is -1, the input is assigned to class C2
Perceptron: Classification
The equation below describes a hyperplane in the input space. This hyperplane is used to separate the two classes C1 and C2:
  Σ_{i=1}^{m} wi xi + b = 0
Equivalently, treating the bias as a weighted input (w0 = b, x0 = +1):
  Σ_{i=0}^{m} wi xi = 0
[Figure: in the (x1, x2) plane, the decision boundary w1x1 + w2x2 + b = 0 separates the decision region for C1 (where w1x1 + w2x2 + b > 0) from the decision region for C2 (where w1x1 + w2x2 + b <= 0)]
Perceptron: Limitations
The perceptron can only model linearly-separable functions.
The perceptron can be used to model the following Boolean functions:
  AND, OR, COMPLEMENT
But it cannot model the XOR. Why?
Perceptron: Limitations (Cont'd)
The XOR is not a linearly-separable problem.
It is impossible to separate the classes C1 and C2 with only one line.
[Figure: the four XOR input points at the corners of the unit square in the (x1, x2) plane, with diagonally opposite corners labeled C1 and C2 (targets +1 and -1); no single line separates the two classes]
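To illustrate the two slides above, the following sketch (Python/NumPy; the specific weight and bias values are one possible hand-picked choice, not taken from the slides) shows a single threshold unit realizing AND, OR and COMPLEMENT, and notes why no weights exist for XOR.

```python
import numpy as np

def threshold_unit(x, w, b):
    """Single-layer perceptron unit: +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked weights (one possible choice) for the linearly-separable functions
print("AND :", [threshold_unit(x, np.array([1.0, 1.0]), -1.5) for x in inputs])
print("OR  :", [threshold_unit(x, np.array([1.0, 1.0]), -0.5) for x in inputs])
print("NOT :", [threshold_unit((x1,), np.array([-1.0]), 0.5) for x1 in (0, 1)])

# XOR targets: +1 for (0, 1) and (1, 0), -1 for (0, 0) and (1, 1).
# No (w1, w2, b) satisfies all four constraints on the sign of w1*x1 + w2*x2 + b
# simultaneously: the inequalities are mutually inconsistent, which is exactly
# the linear-separability limitation illustrated above.
```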
Perceptron: Learning Algorithm
Variables and parameters:
  x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]^T
  w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]^T
  b(n) = bias
  y(n) = actual response
  d(n) = desired response
  η = learning rate parameter (more elaboration later)
The Fixed-Increment Learning Algorithm
• Initialization: set w(0) = 0
• Activation: activate the perceptron by applying the input example (vector x(n) and desired response d(n))
• Compute actual response of the perceptron:
  y(n) = sgn[w^T(n)x(n)]
• Adapt the weight vector: if d(n) and y(n) are different, then
  w(n + 1) = w(n) + η[d(n) - y(n)]x(n)
  where d(n) = +1 if x(n) belongs to C1
        d(n) = -1 if x(n) belongs to C2
• Continuation: increment time index n by 1 and go to the Activation step
A Learning Example
Consider a training set C1 ∪ C2, where:
  C1 = {(1,1), (1,-1), (0,-1)}: elements of class 1
  C2 = {(-1,-1), (-1,1), (0,1)}: elements of class -1
Use the perceptron learning algorithm to classify these examples.
• w(0) = [1, 0, 0]^T, η = 1
A Learning Example (Cont'd)
[Figure: the training points of C1 (marked +) and C2 (marked -) in the (x1, x2) plane, separated by the learned decision boundary 2x1 - x2 = 0]
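A minimal sketch of the fixed-increment algorithm applied to this training set (Python/NumPy; the loop structure and variable names are mine, while w(0) and η are taken from the slide; the exact boundary reached can depend on the order in which examples are presented):

```python
import numpy as np

# Training set from the slide: class C1 -> d = +1, class C2 -> d = -1
X = np.array([[ 1,  1], [ 1, -1], [0, -1],    # C1
              [-1, -1], [-1,  1], [0,  1]])   # C2
d = np.array([1, 1, 1, -1, -1, -1])

# Augment inputs with x0 = +1 so that w = [b, w1, w2]
Xa = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([1.0, 0.0, 0.0])   # w(0) = [1, 0, 0]^T as on the slide
eta = 1.0                       # learning rate

for epoch in range(100):
    errors = 0
    for x_n, d_n in zip(Xa, d):
        y_n = 1 if w @ x_n >= 0 else -1          # y(n) = sgn[w^T(n) x(n)]
        if y_n != d_n:
            w = w + eta * (d_n - y_n) * x_n      # fixed-increment update
            errors += 1
    if errors == 0:                              # all examples classified correctly
        break

print("weights [b, w1, w2]:", w)
# The learned line b + w1*x1 + w2*x2 = 0 separates C1 from C2, comparable to
# the boundary 2*x1 - x2 = 0 sketched on the slide.
```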
The Learning Algorithm: Convergence
Let n = number of training samples (set X);
  X1 = set of training samples belonging to class C1;
  X2 = set of training samples belonging to class C2.
For a given sample n:
  x(n) = [+1, x1(n), …, xp(n)]^T = input vector
  w(n) = [b(n), w1(n), …, wp(n)]^T = weight vector
Net activity level: v(n) = w^T(n)x(n)
Output: y(n) = +1 if v(n) >= 0
        y(n) = -1 if v(n) < 0
The Learning Algorithm: Convergence (Cont'd)
The decision hyperplane separates classes C1 and C2.
If the two classes C1 and C2 are linearly separable, then there exists a weight vector w such that
  w^T x >= 0 for all x belonging to class C1
  w^T x < 0 for all x belonging to class C2
Error-Correction Learning
Update rule: w(n + 1) = w(n) + Δw(n)
Learning process:
– If x(n) is correctly classified by w(n), then
  w(n + 1) = w(n)
– Otherwise, the weight vector is updated as follows:
  w(n + 1) = w(n) - η(n)x(n) if w^T(n)x(n) >= 0 and x(n) belongs to C2
  w(n + 1) = w(n) + η(n)x(n) if w^T(n)x(n) < 0 and x(n) belongs to C1
Perceptron Convergence Algorithm
Variables and parameters:
– x(n) = [+1, x1(n), …, xp(n)]; w(n) = [b(n), w1(n), …, wp(n)]
– y(n) = actual response (output); d(n) = desired response
– η = learning rate, a positive number less than 1
Step 1: Initialization
– Set w(0) = 0, then do the following for n = 1, 2, 3, …
Step 2: Activation
– Activate the perceptron by applying input vector x(n) and desired output d(n)
Perceptron Convergence Algorithm (Cont'd)
Step 3: Computation of actual response
  y(n) = sgn[w^T(n)x(n)]
– where sgn(.) is the signum function
Step 4: Adaptation of weight vector
  w(n + 1) = w(n) + η[d(n) - y(n)]x(n)
  where d(n) = +1 if x(n) belongs to C1
        d(n) = -1 if x(n) belongs to C2
Step 5
– Increment n by 1, and go back to Step 2
Learning: Performance Measure
A learning rule is designed to optimize a performance measure
– However, in the development of the perceptron convergence algorithm we did not mention a performance measure
Intuitively, what would be an appropriate performance measure for a classification neural network?
Define the performance measure:
  J = -E[e(n)v(n)]
Learning: Performance Measure
Or, as an instantaneous estimate:
  J'(n) = -e(n)v(n)
The error at iteration n:
• e(n) = d(n) - y(n)
• v(n) = linear combiner output at iteration n
• E[.] = expectation operator
Learning: Performance Measure (Cont'd)
Can we derive our learning rule by minimizing this performance function [Haykin's textbook]?
  ∂J'(n)/∂w(n) = -[d(n) - y(n)] ∂v(n)/∂w(n)
Now v(n) = w^T(n)x(n), thus
  ∂J'(n)/∂w(n) = -[d(n) - y(n)]x(n)
Learning rule:
  w(n + 1) = w(n) - η ∂J'(n)/∂w(n) = w(n) + η[d(n) - y(n)]x(n)
Presentation of Training Examples
Presenting all training examples once to the ANN is called an epoch.
In incremental stochastic gradient descent, training examples can be presented in:
– Fixed order (1, 2, 3, …, M)
– Randomly permuted order (5, 2, 7, …, 3)
– Completely random order (4, 1, 7, 1, 5, 4, …)
Concluding Remarks
A single-layer perceptron can perform pattern classification only on linearly separable patterns, regardless of the type of nonlinearity (hard limiter, sigmoidal)
Papert and Minsky in 1969 elucidated the limitations of Rosenblatt's single-layer perceptron (e.g. the requirement of linear separability, the inability to solve the XOR problem) and cast doubt on the viability of neural networks
However, the multilayer perceptron and the back-propagation algorithm overcome many of the shortcomings of the single-layer perceptron
Adaline: Adaptive Linear Element
The output y is a linear combination of the inputs x:
[Figure: inputs x1, x2, …, xm with weights w1, w2, …, wm are combined linearly to produce the output y]
  y(n) = Σ_{j=0}^{m} wj(n) xj(n)
Adaline: Adaptive Linear Element (Cont'd)
Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm.
The idea: try to minimize the square error, which is a function of the weights:
  E(w(n)) = (1/2) e^2(n)
  e(n) = d(n) - Σ_{j=0}^{m} wj(n) xj(n)
We can find the minimum of the error function E by means of the steepest descent method (an optimization procedure).
Steepest Descent Method: Basics
Start with an arbitrary point,
find the direction in which E is decreasing most rapidly,
and make a small step in that direction:
  w(n + 1) = w(n) - η · (gradient of E(n))
  gradient of E(w) = [∂E/∂w1, …, ∂E/∂wm]
Steepest Descent Method: Basics (Cont'd)
[Figure: a small step in weight space from the point (w1, w2) to (w1 + Δw1, w2 + Δw2)]
Steepest Descent Method: Basics (Cont'd)
[Figure: a 3-D plot of an error surface over (w1, w2) showing a global min and a local min, with the gradient direction indicated]
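As a hedged illustration of the two figures above, the sketch below applies the rule w(n + 1) = w(n) - η · gradient(E) to a toy one-dimensional error function of my own choosing that has both a local and a global minimum; the minimum reached depends on the starting point.

```python
def E(w):       # toy error function with a local and a global minimum
    return w**4 - 3*w**2 + w

def grad_E(w):  # dE/dw
    return 4*w**3 - 6*w + 1

def steepest_descent(w0, eta=0.01, n_steps=500):
    """Iterate w(n+1) = w(n) - eta * gradient(E(w(n)))."""
    w = w0
    for _ in range(n_steps):
        w = w - eta * grad_E(w)
    return w

# The same rule ends in different minima depending on where it starts
print(steepest_descent(+2.0))   # settles near the local minimum  (w ~ +1.1)
print(steepest_descent(-2.0))   # settles near the global minimum (w ~ -1.3)
```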
Least-Mean-Square (LMS) Algorithm (Widrow-Hoff Algorithm)
Approximation of gradient(E):
  ∂E(w(n))/∂w(n) = e(n) ∂e(n)/∂w(n) = -e(n)x^T(n)
Update rule for the weights becomes:
  w(n + 1) = w(n) + η e(n)x(n)
Summary of LMS Algorithm
Training sample:
  Input signal vector x(n)
  Desired response d(n)
User-selected parameter η > 0
Initialization: set ŵ(1) = 0
Computation: for n = 1, 2, … compute
  e(n) = d(n) - ŵ^T(n)x(n)
  ŵ(n + 1) = ŵ(n) + η x(n)e(n)
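A minimal sketch of this LMS recursion fitting a linear unit to synthetic data (the data-generating weights, the value of η, and names such as w_hat are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: d(n) produced by an unknown linear unit plus noise
true_w = np.array([0.5, 2.0, -1.0])                            # [b, w1, w2] to recover
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # x(n) = [+1, x1, x2]
d = X @ true_w + 0.05 * rng.normal(size=200)

eta = 0.05
w_hat = np.zeros(3)                     # initialization: w_hat(1) = 0
for x_n, d_n in zip(X, d):              # computation: for n = 1, 2, ...
    e_n = d_n - w_hat @ x_n             # e(n) = d(n) - w_hat^T(n) x(n)
    w_hat = w_hat + eta * x_n * e_n     # w_hat(n+1) = w_hat(n) + eta x(n) e(n)

print("estimated [b, w1, w2]:", np.round(w_hat, 2))   # close to the generating weights
```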
Neuron with Sigmoid-Function
[Figure: inputs x1, x2, …, xm with weights w1, w2, …, wm feed the activation a, which passes through the sigmoid to give the output y]
  Activation: a = Σ_{j=1}^{m} wj(n) xj(n)
  Output:     y = 1 / (1 + e^(-a))
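A short sketch of this unit's forward computation (the example input and weight values are illustrative): the activation a is the weighted sum of the inputs and the output is y = 1 / (1 + e^(-a)).

```python
import numpy as np

def sigmoid_neuron(x, w):
    """a = sum_j w_j * x_j ; y = 1 / (1 + exp(-a))"""
    a = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.5])
print(sigmoid_neuron(x, w))   # a = -0.8, so y = sigmoid(-0.8) ~ 0.31
```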
Multi-Layer Neural Networks
[Figure: a feed-forward network composed of an input layer, a hidden layer, and an output layer]
Backpropagation Principle
[Figure: inputs xi feed hidden units k through weights wki; hidden units feed output units j through weights wjk, producing outputs yj]
Forward step: propagate activation from the input to the output layer
Backward step: propagate errors from the output to the hidden layer
Backpropagation Algorithm
Initialize each wi to some small random value
Until the termination condition is met, Do
– For each training example <(x1, …, xn), t> Do
  » Input the instance (x1, …, xn) to the network and compute the network outputs yk
  » For each output unit k:
      δk = yk(1 - yk)(tk - yk)
  » For each hidden unit h:
      δh = yh(1 - yh) Σk wh,k δk
  » For each network weight wi,j Do
      wi,j = wi,j + Δwi,j, where Δwi,j = η δj xi,j
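A compact sketch of this procedure for a network with one sigmoid hidden layer, applied to XOR as a toy task (the layer sizes, learning rate, random seed, and helper names such as forward, W1, W2 are my illustrative choices; the delta formulas follow the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Forward step: propagate activation from input to output layer."""
    h = sigmoid(np.concatenate(([1.0], x)) @ W1)   # hidden outputs y_h (bias input +1)
    y = sigmoid(np.concatenate(([1.0], h)) @ W2)   # network outputs y_k
    return h, y

# Toy task: XOR, which a single-layer perceptron cannot model
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

eta = 0.5
W1 = rng.normal(scale=0.5, size=(3, 4))   # input (+bias)  -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden (+bias) -> 1 output unit

for epoch in range(20000):
    for x, t in zip(X, T):
        h, y = forward(x, W1, W2)
        # Backward step: propagate errors from output to hidden layer
        delta_out = y * (1 - y) * (t - y)               # delta_k = y_k(1-y_k)(t_k-y_k)
        delta_hid = h * (1 - h) * (W2[1:] @ delta_out)  # delta_h = y_h(1-y_h) sum_k w_hk delta_k
        # Weight updates: w_ij <- w_ij + eta * delta_j * x_ij
        W2 += eta * np.outer(np.concatenate(([1.0], h)), delta_out)
        W1 += eta * np.outer(np.concatenate(([1.0], x)), delta_hid)

print(np.round([forward(x, W1, W2)[1][0] for x in X], 2))
# Typically approaches [0, 1, 1, 0]; like any gradient descent it may occasionally
# stall in a local minimum (see the next slide).
```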
Backpropagation Algorithm (Cont'd)
Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum
– In practice it often works well (can be invoked multiple times with different initial weights)
Often include a weight momentum term:
  Δwi,j(n) = η δj xi,j + α Δwi,j(n-1), where α is the momentum coefficient
Minimizes error over the training examples
– Will it generalize well to unseen instances (over-fitting)?
Training can be slow: typically 1000-10000 iterations (use Levenberg-Marquardt instead of gradient descent)
Using the network after training is fast
Convergence of Backpropagation
Gradient descent converges to some local minimum, perhaps not the global minimum
– Add a momentum term: Δwki(n) = η δk(n) xi(n) + α Δwki(n-1), with α ∈ [0, 1]
– Stochastic gradient descent
– Train multiple nets with different initial weights
Nature of convergence
– Initialize weights near zero
– Therefore, initial networks are near-linear
– Increasingly non-linear functions become possible as training progresses
Optimization Methods
There are other optimization methods with faster convergence than gradient descent:
– Newton's method uses a quadratic approximation (2nd-order Taylor expansion):
  F(x + Δx) = F(x) + ∇F(x)^T Δx + (1/2) Δx^T ∇²F(x) Δx + …
– Conjugate gradients
– Levenberg-Marquardt algorithm
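A tiny sketch of the Newton step implied by this quadratic approximation, in one dimension, on the same toy error function used in the steepest-descent sketch above (my own choice of function and starting point):

```python
def dE(w):    # first derivative of the toy error E(w) = w**4 - 3*w**2 + w
    return 4*w**3 - 6*w + 1

def d2E(w):   # second derivative (the 1-D "Hessian")
    return 12*w**2 - 6

w = 2.0
for _ in range(6):
    w = w - dE(w) / d2E(w)   # Newton step: minimize the local quadratic model
print(w)   # ~1.13: reaches the nearby stationary point in a handful of steps,
           # versus hundreds of small steps for plain gradient descent
```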
Universal Approximation Property of ANN
Boolean functions:
– Every Boolean function can be represented by a network with a single hidden layer
– But it might require a number of hidden units that is exponential in the number of inputs
Continuous functions:
– Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
– Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Using Weight Derivatives
How often to update?
– After each training case?
– After a full sweep through the training data?
How much to update?
– Use a fixed learning rate?
– Adapt the learning rate?
– Add momentum?
– Don't use steepest descent?
What Next?
Bias Effect
Batch vs. Continuous Learning
Variable Learning Rate (Update Rule?)
Effect of Neurons/Layer
Effect of Hidden Layers