CHAPTER 10
Widrow-Hoff Learning

Objectives
Widrow-Hoff learning is an approximate steepest descent algorithm, in which the performance index is mean square error.
It is widely used today in many signal processing applications.
It is a precursor to the backpropagation algorithm for multilayer networks.

ADALINE Network
The ADALINE (Adaptive Linear Neuron) network and its learning rule, the LMS (Least Mean Square) algorithm, were proposed by Widrow and Marcian Hoff in 1960.
Both the ADALINE network and the perceptron suffer from the same inherent limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes the mean square error (MSE), and therefore tries to move the decision boundaries as far from the training patterns as possible.

ADALINE Network
  n = Wp + b,   a = purelin(Wp + b) = Wp + b
[Figure: the ADALINE network. An input vector p (R x 1) feeds a weight matrix W (S x R) and bias b (S x 1) to give n = Wp + b (S x 1) and the linear output a = purelin(n) (S x 1). The architecture is the same as the single-layer perceptron, except that the transfer function is linear.]
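
The forward computation above is simple enough to state directly in code. The following is a minimal NumPy sketch of the ADALINE output a = purelin(Wp + b); the dimensions and numeric values are illustrative only, not taken from these notes.

    import numpy as np

    def adaline_forward(W, b, p):
        """ADALINE output: a = purelin(Wp + b) = Wp + b, since purelin is the identity."""
        return W @ p + b

    # Illustrative sizes: R = 3 inputs, S = 2 neurons (values are made up)
    W = np.array([[0.5, -1.0, 2.0],
                  [1.0,  0.0, -0.5]])   # S x R weight matrix
    b = np.array([0.1, -0.2])           # S x 1 bias vector
    p = np.array([1.0, 2.0, -1.0])      # R x 1 input vector

    print(adaline_forward(W, b, p))     # net input n and output a coincide for a linear network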

Single ADALINE
Set n = 0; then Wp + b = 0 specifies a decision boundary. The ADALINE can be used to classify objects into two categories if they are linearly separable.
For a single two-input ADALINE:
  a = purelin(n) = n = w11 p1 + w12 p2 + b
[Figure: a single ADALINE with inputs p1, p2, weights w11, w12, bias b, and linear output a, together with its decision boundary Wp + b = 0 in the (p1, p2) plane. The weight vector W is orthogonal to the boundary and points into the region where a > 0; a < 0 on the other side.]

Mean Square Error
The LMS algorithm is an example of supervised training. It adjusts the weights and biases of the ADALINE in order to minimize the mean square error, where the error is the difference between the target output tq and the network output aq produced by the input pq.
Stack the parameters and the inputs as
  x = [ w ; b ],   z = [ p ; 1 ],   so that   a = w^T p + b = x^T z.
MSE:
  F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2],   where E[.] denotes expected value.

Performance Optimization
Develop algorithms to optimize a performance index F(x), where "optimize" means to find the value of x that minimizes F(x).
The optimization algorithms are iterative:
  x_{k+1} = x_k + α_k p_k,   or equivalently   Δx_k = x_{k+1} - x_k = α_k p_k,
where p_k is a search direction, α_k is a positive learning rate that determines the length of the step, and x_0 is the initial guess.

Taylor Series Expansion
Taylor series (scalar case):
  F(x) = F(x*) + dF/dx|_{x=x*} (x - x*) + (1/2!) d^2F/dx^2|_{x=x*} (x - x*)^2
         + (1/3!) d^3F/dx^3|_{x=x*} (x - x*)^3 + ...
Vector case, x = (x1, x2, ..., xn):
  F(x) = F(x*) + Σ_i ∂F/∂xi|_{x=x*} (xi - xi*)
         + (1/2!) Σ_i Σ_j ∂^2F/∂xi∂xj|_{x=x*} (xi - xi*)(xj - xj*) + ...
In matrix form:
  F(x) = F(x*) + ∇F(x)^T|_{x=x*} (x - x*) + (1/2)(x - x*)^T ∇^2F(x)|_{x=x*} (x - x*) + ...

Gradient & Hessian
Gradient:
  ∇F(x) = [ ∂F/∂x1, ∂F/∂x2, ..., ∂F/∂xn ]^T
Hessian:
  ∇^2 F(x) = [ ∂^2F/∂x1^2     ∂^2F/∂x1∂x2   ...  ∂^2F/∂x1∂xn
               ∂^2F/∂x2∂x1    ∂^2F/∂x2^2    ...  ∂^2F/∂x2∂xn
               ...
               ∂^2F/∂xn∂x1    ∂^2F/∂xn∂x2   ...  ∂^2F/∂xn^2 ]

Directional Derivative
The i-th element of the gradient, ∂F(x)/∂xi, is the first derivative of the performance index F along the xi axis.
Let p be a vector in the direction along which we wish to know the derivative. The directional derivative is
  p^T ∇F(x) / ||p||.
Example: find the derivative of F(x) = x1^2 + 2 x2^2 at the point x = [0.5, 0.5]^T in the direction p = [2, -1]^T:
  ∇F(x) = [2 x1, 4 x2]^T = [1, 2]^T at this point, so
  p^T ∇F(x) / ||p|| = (2·1 + (-1)·2) / sqrt(5) = 0 / sqrt(5) = 0.
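
The directional derivative above can be checked numerically. The sketch below compares the analytic value with a finite difference along the unit vector p/||p||; the step size h is an arbitrary choice for illustration.

    import numpy as np

    F = lambda x: x[0]**2 + 2.0 * x[1]**2              # performance index from the example
    grad_F = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])

    x = np.array([0.5, 0.5])
    p = np.array([2.0, -1.0])

    analytic = p @ grad_F(x) / np.linalg.norm(p)        # p^T grad(F) / ||p||

    h = 1e-6                                            # finite-difference step (arbitrary)
    u = p / np.linalg.norm(p)
    numeric = (F(x + h * u) - F(x)) / h

    print(analytic, numeric)                            # both are (approximately) 0 here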

Steepest Descent
Goal: the function F(x) should decrease at each iteration, i.e., F(x_{k+1}) < F(x_k).
Central idea: first-order Taylor series expansion,
  F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k,   where   g_k = ∇F(x)|_{x=x_k}.
For F(x_{k+1}) < F(x_k) we need g_k^T Δx_k = α_k g_k^T p_k < 0. Any vector p_k that satisfies g_k^T p_k < 0 is called a descent direction.
A vector that points in the steepest descent direction is p_k = -g_k.
Steepest descent:  x_{k+1} = x_k - α_k g_k.
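
As a concrete illustration, here is a minimal steepest-descent sketch on the same quadratic used in the directional-derivative example, F(x) = x1^2 + 2 x2^2. The starting point, constant learning rate, and iteration count are arbitrary choices, not values from these notes.

    import numpy as np

    grad_F = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])   # gradient of F(x) = x1^2 + 2 x2^2

    x = np.array([0.5, 0.5])     # initial guess x_0 (arbitrary)
    alpha = 0.1                  # constant positive learning rate (arbitrary)

    for k in range(50):
        g = grad_F(x)            # g_k = gradient at x_k
        x = x - alpha * g        # steepest descent: x_{k+1} = x_k - alpha * g_k

    print(x)                     # approaches the minimum at [0, 0]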

Approximation-Based Formulation
Given input/target training data {p1, t1}, {p2, t2}, ..., {pQ, tQ}, the objective of network training is to find the optimal weights that minimize the squared error between the target values and the actual responses.
Model (network) function:
  a = a(z, x),   with   x = [ w ; b ],   z = [ p ; 1 ].
Least-squares-error function:
  E(z, x) = (t - a(z, x))^2.
The weight vector x can be trained by moving along the negative gradient of the error function:
  ∂E(z, x)/∂x = -2 (t - a(z, x)) ∂a(z, x)/∂x.

Delta Learning Rule
ADALINE:
  a(z, x) = x^T z = Σ_j wj pj + b,   with   x = [ w ; b ],   z = [ p ; 1 ].
Least-squares-error criterion: minimize
  E = (1/2)(t - a)^2.
Gradient:
  ∂E/∂wj = -(t - a) ∂a/∂wj = -(t - a) pj,
  ∂E/∂b  = -(t - a) ∂a/∂b  = -(t - a).
Delta learning rule (gradient descent, x(k+1) = x(k) - α ∂E/∂x):
  wj(k+1) = wj(k) + α (t - a) pj,
  b(k+1)  = b(k)  + α (t - a).
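
A minimal sketch of one delta-rule update for a single-output ADALINE follows. The training pair, learning rate, and number of presentations are illustrative assumptions, not values from these notes.

    import numpy as np

    def delta_rule_step(w, b, p, t, alpha):
        """One delta-rule update: w <- w + alpha*(t - a)*p,  b <- b + alpha*(t - a)."""
        a = w @ p + b              # ADALINE output a = w^T p + b
        e = t - a                  # error
        return w + alpha * e * p, b + alpha * e

    # Illustrative single training pair (made up)
    w, b = np.zeros(2), 0.0
    p, t = np.array([1.0, -1.0]), 1.0

    for _ in range(20):                      # repeated presentation of the same pattern
        w, b = delta_rule_step(w, b, p, t, alpha=0.1)

    print(w, b, w @ p + b)                   # the output approaches the target t = 1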

Mean Square Error
F(x) = E[e^2] = E[(t - x^T z)^2]
     = E[t^2] - 2 x^T E[t z] + x^T E[z z^T] x
     = c - 2 x^T h + x^T R x
where
  c = E[t^2],
  h = E[t z]: the cross-correlation vector between the input vector z and its target t,
  R = E[z z^T]: the input correlation matrix.
Using ∇(h^T x) = ∇(x^T h) = h (h a constant vector) and ∇(x^T R x) = Rx + R^T x = 2Rx (R a symmetric matrix), the gradient is
  ∇F(x) = ∇(c - 2 x^T h + x^T R x) = -2h + 2Rx = 0,
so the stationary point is  x* = R^{-1} h.

Mean Square Error (cont.)
If the correlation matrix R is positive definite, there will be a unique stationary point x* = R^{-1} h, which will be a strong minimum.
Strong minimum: the point x* is a strong minimum of F(x) if a scalar δ > 0 exists such that F(x*) < F(x* + Δx) for all Δx with δ > ||Δx|| > 0.
Global minimum: the point x* is a unique global minimum of F(x) if F(x*) < F(x* + Δx) for all Δx ≠ 0.
Weak minimum: the point x* is a weak minimum of F(x) if it is not a strong minimum and a scalar δ > 0 exists such that F(x*) ≤ F(x* + Δx) for all Δx with δ > ||Δx|| > 0.

LMS Algorithm
The LMS algorithm is designed to locate the minimum point of the mean square error. It uses an approximate steepest descent in which the gradient is estimated from a single sample.
Estimate the mean square error F(x) by the squared error at iteration k:
  F(x) ≈ (t(k) - a(k))^2 = e^2(k).
Estimated gradient:  ∇F(x) ≈ ∇e^2(k), whose elements are
  [∇e^2(k)]_j     = ∂e^2(k)/∂w1j = 2 e(k) ∂e(k)/∂w1j,   for j = 1, 2, ..., R,
  [∇e^2(k)]_{R+1} = ∂e^2(k)/∂b   = 2 e(k) ∂e(k)/∂b.

LMS Algorithm (cont.)
The error at iteration k is
  e(k) = t(k) - a(k) = t(k) - ( w^T p(k) + b ) = t(k) - ( Σ_{i=1}^{R} w1i pi(k) + b ),
so
  ∂e(k)/∂w1j = -pj(k),   ∂e(k)/∂b = -1.
The estimated gradient therefore simplifies to
  ∇e^2(k) = -2 e(k) z(k),   where   z(k) = [ p(k) ; 1 ].

LMS Algorithm (cont.)
The steepest descent algorithm with constant learning rate is
  x_{k+1} = x_k - α ∇F(x)|_{x=x_k}.
Substituting the estimated gradient ∇e^2(k) = -2 e(k) z(k) gives
  x_{k+1} = x_k + 2 α e(k) z(k).
Matrix notation of the LMS algorithm:
  W(k+1) = W(k) + 2 α e(k) p(k)^T,
  b(k+1) = b(k) + 2 α e(k).
The LMS algorithm is also referred to as the delta rule or the Widrow-Hoff learning algorithm.
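
The matrix-form update above translates directly to code. Below is a minimal sketch of one LMS step; the single illustrative call uses made-up numbers (S = 1 output, R = 2 inputs) just to show the shapes involved.

    import numpy as np

    def lms_step(W, b, p, t, alpha):
        """One LMS (Widrow-Hoff) update in matrix form."""
        a = W @ p + b                             # a(k) = purelin(W p(k) + b)
        e = t - a                                 # e(k) = t(k) - a(k)
        W_new = W + 2 * alpha * np.outer(e, p)    # W(k+1) = W(k) + 2 alpha e(k) p(k)^T
        b_new = b + 2 * alpha * e                 # b(k+1) = b(k) + 2 alpha e(k)
        return W_new, b_new

    # One update with illustrative values (numbers are made up)
    W, b = np.zeros((1, 2)), np.zeros(1)
    W, b = lms_step(W, b, p=np.array([1.0, -1.0]), t=1.0, alpha=0.25)
    print(W, b)                                   # -> [[0.5 -0.5]] [0.5]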

Quadratic Functions
General form of a quadratic function:
  F(x) = c + d^T x + (1/2) x^T A x,
  ∇F(x) = d + A x,
  ∇^2 F(x) = A   (A: Hessian matrix).
ADALINE network mean square error:  F(x) = c - 2 x^T h + x^T R x,  so  d = -2h  and  A = 2R.
If the eigenvalues of the Hessian matrix are all positive, then the quadratic function will have one unique global minimum.

Stable Learning Rates
Suppose that the performance index is a quadratic function:
  F(x) = c + d^T x + (1/2) x^T A x,   ∇F(x) = d + A x.
Steepest descent algorithm with constant learning rate:
  x_{k+1} = x_k - α g_k = x_k - α (A x_k + d) = [I - αA] x_k - α d.
A linear dynamic system of this form will be stable if the eigenvalues of the matrix [I - αA] are less than one in magnitude.

Stable Learning Rates (cont.)
Let {λ1, λ2, ..., λn} and {z1, z2, ..., zn} be the eigenvalues and eigenvectors of the Hessian matrix A. Then
  [I - αA] zi = zi - α A zi = zi - α λi zi = (1 - α λi) zi.
The condition for stability of the steepest descent algorithm is therefore
  |1 - α λi| < 1.
If the quadratic function has a strong minimum point, its eigenvalues must be positive numbers, so the condition becomes
  α < 2 / λi.
This must be true for all eigenvalues:  α < 2 / λmax.
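
A small numerical illustration of the α < 2/λmax bound follows. For concreteness it borrows the Hessian A = 2R = [6 -2; -2 6] that appears in Problem P10.6 later in these notes; the starting point and step counts are arbitrary assumptions.

    import numpy as np

    A = np.array([[6.0, -2.0],
                  [-2.0, 6.0]])          # example Hessian (symmetric, positive definite)
    d = np.array([0.0, 0.0])

    lam_max = np.linalg.eigvalsh(A).max()
    alpha_limit = 2.0 / lam_max          # stability bound: alpha < 2 / lambda_max
    print(alpha_limit)                   # 0.25 for this A

    def run(alpha, steps=60):
        x = np.array([1.0, 0.0])         # start with components along both eigenvectors
        for _ in range(steps):
            x = x - alpha * (A @ x + d)  # steepest descent on F = c + d^T x + 0.5 x^T A x
        return x

    print(run(0.9 * alpha_limit))        # converges toward the minimum at [0, 0]
    print(run(1.1 * alpha_limit))        # diverges: the entries grow without bound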

Analysis of Convergence
In the LMS algorithm x_{k+1} = x_k + 2 α e(k) z(k), the weight vector x_k is a function only of z(k-1), z(k-2), ..., z(0).
Assume that successive input vectors are statistically independent; then x_k is independent of z(k).
For a stable learning rate, the expected value of the weight vector converges to x* = R^{-1} h, the minimum mean square error E[e^2(k)] solution.
Writing the expected update in the form x_{k+1} = [I - αA] x_k - α d, with A = 2R and d = -2h, the condition on stability is
  0 < α < 1 / λmax   (λmax: largest eigenvalue of R),
and the steady state solution satisfies
  E[x_ss] = [I - 2αR] E[x_ss] + 2αh,   i.e.,   E[x_ss] = R^{-1} h = x*.

Orange/Apple Example
Training set:  { p1 = [1, -1, -1]^T, t1 = -1 },   { p2 = [1, 1, -1]^T, t2 = 1 }   (no bias, so z = p).
  R = E[p p^T] = (1/2) p1 p1^T + (1/2) p2 p2^T =
    [  1   0  -1
       0   1   0
      -1   0   1 ]
The eigenvalues of R are λ1 = 1.0, λ2 = 0.0, λ3 = 2.0, so the stable learning rate must satisfy
  α < 1 / λmax = 1 / 2.0 = 0.5.
In practical applications it might not be practical to calculate R, so the learning rate is often selected by trial and error.

Orange/Apple Example (cont.)
Start, arbitrarily, with all the weights set to zero (W(0) = [0 0 0]) and a learning rate α = 0.2, then apply the inputs p1, p2, p1, p2, ..., in that order, calculating the new weights after each input is presented.
  a(0) = W(0) p(0) = W(0) p1 = 0,   e(0) = t(0) - a(0) = -1 - 0 = -1
  W(1) = W(0) + 2 α e(0) p(0)^T = [0 0 0] + 2(0.2)(-1)[1 -1 -1] = [-0.4  0.4  0.4]
  a(1) = W(1) p(1) = W(1) p2 = -0.4,   e(1) = t(1) - a(1) = 1 - (-0.4) = 1.4
  W(2) = W(1) + 2 α e(1) p(1)^T = [0.16  0.96  -0.16]

Orange/Apple Example (cont.)
  a(2) = W(2) p(2) = W(2) p1 = -0.64,   e(2) = t(2) - a(2) = -1 - (-0.64) = -0.36
  W(3) = W(2) + 2 α e(2) p(2)^T = [0.016  1.104  -0.016]
  ...
  W(∞) = [0  1  0]
This decision boundary falls halfway between the two reference patterns. The perceptron rule did NOT produce such a boundary: the perceptron rule stops as soon as the patterns are correctly classified, even though some patterns may be close to the boundaries, whereas the LMS algorithm minimizes the mean square error.
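
The iterations above can be reproduced with a short script. The sketch below uses the same setup (no bias, α = 0.2, inputs alternating p1, p2); it is written only as a numerical check of the values listed on this slide.

    import numpy as np

    p1, t1 = np.array([1.0, -1.0, -1.0]), -1.0     # orange
    p2, t2 = np.array([1.0,  1.0, -1.0]),  1.0     # apple
    alpha = 0.2

    W = np.zeros(3)                                 # W(0) = [0 0 0], no bias
    for k in range(60):
        p, t = (p1, t1) if k % 2 == 0 else (p2, t2) # present p1, p2, p1, p2, ...
        e = t - W @ p                               # e(k) = t(k) - a(k)
        W = W + 2 * alpha * e * p                   # W(k+1) = W(k) + 2 alpha e(k) p(k)^T
        if k < 3:
            print(k + 1, W)   # W(1) = [-0.4 0.4 0.4], W(2) = [0.16 0.96 -0.16], W(3) = [0.016 1.104 -0.016]

    print(W)                  # approaches W(inf) = [0 1 0]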

Solved Problem P10.2
Category I:  p1 = [1, 1]^T,  p2 = [-1, -1]^T;   Category II:  p3 = [2, 2]^T.
[Figure: p1, p2 (Category I) and p3 (Category II) with a linear decision boundary between p3 and the Category I patterns.]
Since these categories are linearly separable, we can design an ADALINE network to make the distinction. As shown in the figure, one suitable boundary has b = -3, w11 = 1, w12 = 1.

Category III:  p1 = [1, 1]^T,  p2 = [1, -1]^T;   Category IV:  p3 = [1, 0]^T.
These categories are NOT linearly separable (p3 lies on the segment between p1 and p2), so an ADALINE network CANNOT distinguish between them.

Solved Problem P10.3
Training set:  { p1 = [1, 1]^T, t1 = 1 },   { p2 = [-1, 1]^T, t2 = -1 }.
These patterns occur with equal probability, and they are used to train an ADALINE network with no bias. What does the MSE performance surface look like?
  c = E[t^2] = 0.5 (1)^2 + 0.5 (-1)^2 = 1
  h = E[t z] = 0.5 (1)[1, 1]^T + 0.5 (-1)[-1, 1]^T = [1, 0]^T
  R = E[z z^T] = 0.5 p1 p1^T + 0.5 p2 p2^T = [ 1  0 ; 0  1 ]
  F(x) = c - 2 x^T h + x^T R x = 1 - 2 w11 + w11^2 + w12^2

Solved Problem P10.3 (cont.)
  F(x) = c - 2 x^T h + x^T R x = 1 - 2 w11 + w11^2 + w12^2
  x* = R^{-1} h = [ 1  0 ; 0  1 ]^{-1} [1, 0]^T = [1, 0]^T
The Hessian matrix of F(x), 2R, has both eigenvalues at 2, so the contours of the performance surface will be circular. The center of the contours (the minimum point) is x* = [1, 0]^T.
[Figure: circular contours of F(x) in the (w11, w12) plane, centered at [1, 0].]
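
The quantities in P10.3 can be checked with a few lines of NumPy. The sketch below computes c, h, R, the minimum point x*, and the Hessian eigenvalues directly from the two equally probable training pairs.

    import numpy as np

    # Training pairs from P10.3 (equal probability; no bias, so z = p)
    pairs = [(np.array([1.0, 1.0]),  1.0),
             (np.array([-1.0, 1.0]), -1.0)]

    c = np.mean([t**2 for _, t in pairs])                       # E[t^2] = 1
    h = np.mean([t * p for p, t in pairs], axis=0)              # E[t z] = [1, 0]
    R = np.mean([np.outer(p, p) for p, _ in pairs], axis=0)     # E[z z^T] = identity

    x_star = np.linalg.solve(R, h)        # minimum point x* = R^{-1} h = [1, 0]
    eigvals = np.linalg.eigvalsh(2 * R)   # Hessian 2R: both eigenvalues are 2 -> circular contours
    print(c, h, R, x_star, eigvals)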

Solved Problem P10.4
Training set (as in P10.3):  { p1 = [1, 1]^T, t1 = 1 },   { p2 = [-1, 1]^T, t2 = -1 }.
Train the network using the LMS algorithm, with the initial guess set to zero and a learning rate α = 0.25.
  a(0) = purelin( W(0) p(0) ) = [0  0][1, 1]^T = 0,    e(0) = t(0) - a(0) = 1 - 0 = 1
  W(1) = W(0) + 2 α e(0) p(0)^T = [0  0] + 2(0.25)(1)[1  1] = [1/2  1/2]
  a(1) = purelin( W(1) p(1) ) = [1/2  1/2][-1, 1]^T = 0,    e(1) = t(1) - a(1) = -1 - 0 = -1
  W(2) = W(1) + 2 α e(1) p(1)^T = [1/2  1/2] + 2(0.25)(-1)[-1  1] = [1  0]
Note that W(2) = [1  0] is exactly the minimum point x* found in Problem P10.3.
[Figure: the weight trajectory in the (w11, w12) plane.]

Tapped Delay Line
[Figure: the input signal y(k) feeding a chain of delay blocks (D).]
  p1(k) = y(k),   p2(k) = y(k-1),   ...,   pR(k) = y(k - R + 1)
At the output of the tapped delay line we have an R-dimensional vector, consisting of the input signal at the current time and at delays of from 1 to R-1 time steps.

Adaptive Filter
[Figure: an ADALINE whose input vector is taken from a tapped delay line on y(k), with weights w11, ..., w1R, bias b, and output a(k).]
  a(k) = purelin( Wp + b ) = Σ_{i=1}^{R} w1i y(k - i + 1) + b

Solved Problem P10.1
An adaptive filter ADALINE (no bias) has weights w11 = 2, w12 = -1, w13 = 3 and is driven by the input sequence
  { y(k) } = { ..., 0, 0, 0, 5, -4, 0, 0, 0, ... },   where y(0) = 5 and y(1) = -4.
[Figure: the adaptive filter with a two-stage tapped delay line on y(k).]
Just prior to k = 0 (k < 0): three zeros have entered the filter, i.e., y(-3) = y(-2) = y(-1) = 0, so the output just prior to k = 0 is zero.
k = 0:   a(0) = W p(0) = [2  -1  3][5, 0, 0]^T = 10

Solved Problem P10.1 (cont.)
k = 1:   a(1) = W p(1) = [2  -1  3][-4, 5, 0]^T = -13
k = 2:   a(2) = W p(2) = [2  -1  3][0, -4, 5]^T = 19
k = 3:   a(3) = W p(3) = [2  -1  3][0, 0, -4]^T = -12
k = 4:   a(4) = W p(4) = [2  -1  3][0, 0, 0]^T = 0

Solved Problem P10.1 (cont.)
  a(k) = W p(k) = [ w11  w12  w13 ][ y(k), y(k-1), y(k-2) ]^T
  a(-1) = 0,  a(0) = 10,  a(1) = -13,  a(2) = 19,  a(3) = -12,  a(4) = 0
The effect of y(0) lasts from k = 0 through k = 2, so it has an influence for three time intervals. This corresponds to the length of the impulse response of this filter.
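
The P10.1 outputs can be reproduced by sliding the three-tap window over the input sequence. The sketch below pads with zeros outside k = 0, 1, mirroring the assumption that only zeros entered the filter before k = 0.

    import numpy as np

    w = np.array([2.0, -1.0, 3.0])          # w11, w12, w13 (no bias)
    y = {0: 5.0, 1: -4.0}                   # y(k) = 0 for all other k

    def y_at(k):
        return y.get(k, 0.0)

    for k in range(-1, 5):
        p = np.array([y_at(k), y_at(k - 1), y_at(k - 2)])   # tapped-delay-line output
        a = w @ p                                           # a(k) = W p(k)
        print(k, a)
    # Expected: a(-1)=0, a(0)=10, a(1)=-13, a(2)=19, a(3)=-12, a(4)=0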

Solved Problem P10.6
Application of the ADALINE: adaptive predictor. The purpose of this filter is to predict the next value of the input signal from the two previous values.
[Figure: y(k) passes through two delays to form the ADALINE input; the output a(k) is compared with the target t(k) = y(k) to form the error e(k).]
Suppose that the input signal is a stationary random process with autocorrelation function given by
  C_y(n) = E[ y(k) y(k+n) ],   C_y(0) = 3,   C_y(1) = -1,   C_y(2) = 1.

Solved Problem P10.6 (cont.)
i. Sketch the contour plot of the performance index (MSE).
The input vector and target are
  z(k) = p(k) = [ y(k-1), y(k-2) ]^T,   t(k) = y(k),
so
  c = E[t^2(k)] = E[y^2(k)] = C_y(0) = 3,
  R = E[z z^T] = [ C_y(0)  C_y(1) ; C_y(1)  C_y(0) ] = [ 3  -1 ; -1  3 ],
  h = E[t z] = [ E[y(k) y(k-1)], E[y(k) y(k-2)] ]^T = [ C_y(1), C_y(2) ]^T = [-1, 1]^T.

Solved Problem P10.6 (cont.)
Performance index (MSE):  F(x) = c - 2 x^T h + x^T R x.
The optimal weights are
  x* = R^{-1} h = [ 3  -1 ; -1  3 ]^{-1} [-1, 1]^T = [ 3/8  1/8 ; 1/8  3/8 ][-1, 1]^T = [-1/4, 1/4]^T.
The Hessian matrix is
  ∇^2 F(x) = A = 2R = [ 6  -2 ; -2  6 ].
Eigenvalues: λ1 = 4, λ2 = 8.  Eigenvectors: v1 = [1, 1]^T, v2 = [-1, 1]^T.
The contours of F(x) will be elliptical, with the long axis of each ellipse along the first eigenvector v1, since the first eigenvalue has the smallest magnitude. The ellipses will be centered at x*.
[Figure: elliptical contours of F(x) centered at x*, with v1 and v2 indicated.]
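
The P10.6 quantities above can be checked numerically from the given autocorrelation values. The sketch below computes R, h, the optimal weights x*, the Hessian eigenvalues and eigenvectors, and the maximum stable LMS learning rate.

    import numpy as np

    Cy = {0: 3.0, 1: -1.0, 2: 1.0}            # autocorrelation C_y(n)

    R = np.array([[Cy[0], Cy[1]],
                  [Cy[1], Cy[0]]])            # E[z z^T] for z(k) = [y(k-1), y(k-2)]^T
    h = np.array([Cy[1], Cy[2]])              # E[t z] with target t(k) = y(k)

    x_star = np.linalg.solve(R, h)            # optimal weights x* = R^{-1} h = [-0.25, 0.25]
    A = 2 * R                                 # Hessian of F(x)
    eigvals, eigvecs = np.linalg.eigh(A)      # eigenvalues 4 and 8
    alpha_max = 2.0 / eigvals.max()           # maximum stable LMS learning rate = 0.25

    print(x_star, eigvals, eigvecs, alpha_max)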

Solved Problem P10.6 (cont.)
ii. The maximum stable learning rate for the LMS algorithm:
  α < 2 / λmax = 2 / 8 = 0.25.
iii. The LMS algorithm is approximate steepest descent, so the trajectory for small learning rates will move perpendicular to the contour lines.
[Figure: an LMS weight trajectory superimposed on the contour plot, moving perpendicular to the contours toward x*.]

Applications
Noise cancellation system to remove 60-Hz noise from EEG signal (Fig. 10.6)
Echo cancellation system in long distance telephone lines (Fig. 10.10)
Filtering engine noise from pilot’s voice signal (Fig. P10.8)