Radial Basis Functions

November 4, 2010 Neural Networks Lecture 15: Radial Basis Functions

If we are using such linear interpolation, then our radial basis function (RBF) φ₀, which weights an input vector based on its distance to a neuron's reference (weight) vector, is φ₀(D) = D⁻¹.

(In the following, to keep things simple, we will assume that the network has only one output neuron. However, any number of output neurons could be implemented.)

For the training samples x_p, p = 1, …, P₀, surrounding the new input x, we find for the network's output o:

$$o = \frac{1}{P_0} \sum_{p=1}^{P_0} d_p\, f(\mathbf{x}_p), \qquad \text{where } d_p = \varphi_0(\|\mathbf{x} - \mathbf{x}_p\|)$$
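As an illustration, this inverse-distance interpolation can be sketched in a few lines of Python. The function and variable names are hypothetical, and φ₀(D) = D⁻¹ is passed in as the default weighting function:

```python
import math

def interpolate(x, samples, phi0=lambda D: 1.0 / D):
    """o = (1/P0) * sum_p d_p * f(x_p), with d_p = phi0(||x - x_p||)
    and phi0(D) = 1/D as the default RBF.
    `samples` is a list of (x_p, f(x_p)) pairs surrounding the new input x."""
    P0 = len(samples)
    total = 0.0
    for x_p, f_xp in samples:
        dist = math.dist(x, x_p)        # Euclidean distance ||x - x_p||
        total += phi0(dist) * f_xp      # distance-based weight times sample value
    return total / P0

# Two surrounding samples, both at distance 1, with values 2.0 and 4.0:
o = interpolate((0.0, 0.0), [((1.0, 0.0), 2.0), ((0.0, 1.0), 4.0)])
# d_p = 1 for both samples, so o = (2.0 + 4.0) / 2 = 3.0
```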
Since it is difficult to define what "surrounding" should mean, it is common to consider all P training samples and use any monotonically decreasing RBF φ:

$$o = \frac{1}{P} \sum_{p=1}^{P} d_p\, \varphi(\|\mathbf{x} - \mathbf{x}_p\|)$$

This, however, implies a network that has as many hidden nodes as there are training samples. This is unacceptable because of its computational complexity and likely poor generalization ability – the network resembles a look-up table.
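In code, the all-samples variant is a straight sum over the training set. A unit-width Gaussian stands in for the monotonically decreasing RBF φ (an illustrative choice; the text leaves φ open), and all names are hypothetical:

```python
import math

def rbf_output(x, samples, phi=lambda D: math.exp(-D * D)):
    """o = (1/P) * sum_p d_p * phi(||x - x_p||) over ALL P training samples.
    d_p is the desired output of sample p; phi is any monotonically
    decreasing RBF (here a unit-width Gaussian)."""
    P = len(samples)
    return sum(d_p * phi(math.dist(x, x_p)) for x_p, d_p in samples) / P

# Querying at a training point: the nearby sample dominates, the far one
# contributes almost nothing (5 * exp(-9) is tiny).
o = rbf_output((0.0, 0.0), [((0.0, 0.0), 1.0), ((3.0, 0.0), 5.0)])
```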
It is more useful to have fewer neurons and accept that the training set cannot be learned 100% accurately:

$$o = \frac{1}{N} \sum_{i=1}^{N} w_i\, \varphi(\|\mathbf{x} - \boldsymbol{\mu}_i\|)$$

Here, ideally, each reference vector μ_i of these N neurons should be placed in the center of an input-space cluster of training samples with identical (or at least similar) desired output w_i.

To learn near-optimal values for the reference vectors and the output weights, we can – as usual – employ gradient descent.
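A forward pass through such an N-neuron network is then a short sketch; the Gaussian φ and all parameter values below are illustrative assumptions:

```python
import math

def network_output(x, mus, ws, phi=lambda D: math.exp(-D * D)):
    """o = (1/N) * sum_i w_i * phi(||x - mu_i||), with N << P hidden
    neurons whose reference vectors mu_i replace the raw training samples."""
    N = len(mus)
    return sum(w_i * phi(math.dist(x, mu_i)) for mu_i, w_i in zip(mus, ws)) / N

# Two hidden neurons; the query sits exactly on the first reference vector:
o = network_output((0.0, 0.0), mus=[(0.0, 0.0), (2.0, 0.0)], ws=[2.0, 4.0])
# phi(0) = 1 and phi(2) = exp(-4) ~ 0.018, so o ~ (2 + 0.073) / 2 ~ 1.037
```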
The RBF Network

Example: network function f: ℝ³ → ℝ
[Figure: feed-forward RBF network for f: ℝ³ → ℝ. The input vector (x₀ = 1, x₁, x₂, x₃) feeds an RBF layer of four hidden units with reference vectors and spreads (μ₁, σ₁), …, (μ₄, σ₄); their responses are combined with the weights w₁, …, w₄ (plus bias weight w₀) in the output layer, which produces the output o₁.]
Radial Basis FunctionsRadial Basis FunctionsFor a fixed number of neurons N, we could learn the For a fixed number of neurons N, we could learn the following output weights and reference vectors:following output weights and reference vectors:
To do this, we first have to define an error function E:To do this, we first have to define an error function E:
Taken together, we get:Taken together, we get:
November 4, 2010 Neural Networks Lecture 15: Radial Basis Functions
5
NN
N Nw
Nw
,...,,,..., 11
1
P
p
P
pppp odEE
1 1
2)(
2
1
N
iipipp wdE μx
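The error function is a plain sum of squared per-sample errors; a minimal sketch with hypothetical names and values:

```python
def sse(targets, outputs):
    """E = sum_p E_p = sum_p (d_p - o_p)^2 over the P training samples."""
    return sum((d_p - o_p) ** 2 for d_p, o_p in zip(targets, outputs))

E = sse(targets=[1.0, 0.0, 1.0], outputs=[0.5, 0.5, 1.0])
# E = 0.25 + 0.25 + 0.0 = 0.5
```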
Learning in RBF Networks

Then the error gradient with regard to w₁, …, w_N is:

$$\frac{\partial E_p}{\partial w_i} = -2\,(d_p - o_p)\, \varphi(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|)$$

For μ_{i,j}, the j-th vector component of μ_i, we get:

$$\frac{\partial E_p}{\partial \mu_{i,j}} = -2\,(d_p - o_p)\, w_i\, \frac{\partial\, \varphi(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|)}{\partial \mu_{i,j}}$$
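The w-gradient can be sanity-checked numerically. The sketch below assumes a Gaussian φ(D) = exp(−D²) and small hypothetical parameter values, then compares the analytic gradient −2(d_p − o_p)φ(‖x_p − μ_i‖) against a central finite difference:

```python
import math

phi = lambda D: math.exp(-D * D)   # assumed RBF (the derivation holds for any phi)

def E_p(ws, mus, x_p, d_p):
    """Per-sample error E_p = (d_p - sum_i w_i phi(||x_p - mu_i||))^2."""
    o_p = sum(w * phi(math.dist(x_p, mu)) for w, mu in zip(ws, mus))
    return (d_p - o_p) ** 2

ws, mus = [0.5, -0.3], [(0.0, 0.0), (1.0, 1.0)]
x_p, d_p, i = (0.5, 0.0), 1.0, 0

o_p = sum(w * phi(math.dist(x_p, mu)) for w, mu in zip(ws, mus))
analytic = -2.0 * (d_p - o_p) * phi(math.dist(x_p, mus[i]))

eps = 1e-6
hi = ws.copy(); hi[i] += eps       # perturb w_i up and down
lo = ws.copy(); lo[i] -= eps
numeric = (E_p(hi, mus, x_p, d_p) - E_p(lo, mus, x_p, d_p)) / (2 * eps)
# analytic and numeric agree up to finite-difference error
```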
The vector length (‖…‖) expression is inconvenient, because it is the square root of the given vector multiplied by itself.

To eliminate this difficulty, we introduce a function R with R(D²) = φ(D) and substitute φ(‖x_p − μ_i‖) = R(‖x_p − μ_i‖²).

This leads to a simplified differentiation:

$$\frac{\partial\, \varphi(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|)}{\partial \mu_{i,j}} = \frac{\partial\, R(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2)}{\partial \mu_{i,j}} = R'(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2)\, \frac{\partial \|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2}{\partial \mu_{i,j}}$$
Together with the following derivative …

$$\frac{\partial \|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2}{\partial \mu_{i,j}} = -2\,(x_{p,j} - \mu_{i,j})$$

… we finally get the result for our error gradient:

$$\frac{\partial E_p}{\partial \mu_{i,j}} = 4\, w_i\, (d_p - o_p)\, R'(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2)\, (x_{p,j} - \mu_{i,j})$$
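This μ-gradient can likewise be verified by finite differences. The sketch assumes a Gaussian R(s) = exp(−s) with σ = 1, so R′(s) = −exp(−s); all parameter values are hypothetical:

```python
import math

R  = lambda s: math.exp(-s)    # assumed Gaussian node function, sigma = 1
dR = lambda s: -math.exp(-s)   # its derivative R'(s)

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def E_p(ws, mus, x_p, d_p):
    o_p = sum(w * R(sqdist(x_p, mu)) for w, mu in zip(ws, mus))
    return (d_p - o_p) ** 2

ws, mus = [0.5, -0.3], [[0.0, 0.0], [1.0, 1.0]]
x_p, d_p, i, j = (0.5, 0.0), 1.0, 0, 0

o_p = sum(w * R(sqdist(x_p, mu)) for w, mu in zip(ws, mus))
s = sqdist(x_p, mus[i])
analytic = 4.0 * ws[i] * (d_p - o_p) * dR(s) * (x_p[j] - mus[i][j])

eps = 1e-6
mus[i][j] += eps; e_hi = E_p(ws, mus, x_p, d_p)   # central finite difference
mus[i][j] -= 2 * eps; e_lo = E_p(ws, mus, x_p, d_p)
mus[i][j] += eps                                  # restore mu
numeric = (e_hi - e_lo) / (2 * eps)
# the two values match up to finite-difference error
```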
This gives us the following updating rules:

$$\Delta w_i = \eta_i\, (d_p - o_p)\, \varphi(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|)$$

$$\Delta \mu_{i,j} = -\eta_{i,j}\, w_i\, (d_p - o_p)\, R'(\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2)\, (x_{p,j} - \mu_{i,j})$$

where the (positive) learning rates η_i and η_{i,j} could be chosen individually for each parameter w_i and μ_{i,j}.

As usual, we can start with random parameters and then iterate these rules for learning until a given error threshold is reached.
If the node function is given by a Gaussian, then:

$$R(D^2) = \exp\left( -\frac{D^2}{\sigma^2} \right)$$

As a result:

$$R'(D^2) = -\frac{1}{\sigma^2} \exp\left( -\frac{D^2}{\sigma^2} \right)$$
The specific update rules are now:

$$\Delta w_i = \eta_i\, (d_p - o_p)\, \exp\left( -\frac{\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2}{\sigma^2} \right)$$

and

$$\Delta \mu_{i,j} = \eta_{i,j}\, w_i\, (d_p - o_p)\, \exp\left( -\frac{\|\mathbf{x}_p - \boldsymbol{\mu}_i\|^2}{\sigma^2} \right) (x_{p,j} - \mu_{i,j})$$
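A minimal online training loop with these Gaussian update rules might look as follows. N, σ, the learning rates, the epoch count, and the toy data are all illustrative assumptions, and the constant factors from the derivation are absorbed into the learning rates:

```python
import math, random

def train_gaussian_rbf(samples, N, sigma=1.0, eta_w=0.05, eta_mu=0.05, epochs=500):
    """Online gradient descent on w_i and mu_ij using the Gaussian rules:
    delta w_i   = eta_w  * (d_p - o_p) * exp(-||x_p - mu_i||^2 / sigma^2)
    delta mu_ij = eta_mu * w_i * (d_p - o_p) * exp(...) * (x_pj - mu_ij)"""
    random.seed(0)                       # reproducible random initialization
    dim = len(samples[0][0])
    mus = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(N)]
    ws = [random.uniform(-1.0, 1.0) for _ in range(N)]
    errors = []                          # sum of squared errors per epoch
    for _ in range(epochs):
        sse = 0.0
        for x_p, d_p in samples:
            acts = [math.exp(-sum((a - b) ** 2 for a, b in zip(x_p, mu)) / sigma ** 2)
                    for mu in mus]
            err = d_p - sum(w * a for w, a in zip(ws, acts))
            sse += err * err
            for i in range(N):
                ws[i] += eta_w * err * acts[i]
                for j in range(dim):
                    mus[i][j] += eta_mu * ws[i] * err * acts[i] * (x_p[j] - mus[i][j])
        errors.append(sse)
    return ws, mus, errors

# Toy 1-D problem: fit three points; the squared error shrinks during training.
samples = [((-1.0,), 1.0), ((0.0,), 0.0), ((1.0,), 1.0)]
ws, mus, errors = train_gaussian_rbf(samples, N=3)
```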
It turns out that, particularly for Gaussian RBFs, it is more efficient and typically leads to better results to use partially offline training:

First, we use any clustering procedure (e.g., k-means) to estimate cluster centers, which are then used to set the values of the reference vectors μ_i and their spreads (standard deviations) σ_i.

Then we use the gradient descent method described above to determine the weights w_i.
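The hybrid scheme can be sketched as: cluster the inputs with k-means to place the reference vectors, then learn only the output weights by the gradient rule. Everything here (data, N, σ = 1, learning rate, iteration counts) is an illustrative assumption:

```python
import math, random

def kmeans(points, N, iters=50):
    """Plain k-means: returns N cluster centers for the input points."""
    random.seed(1)
    centers = random.sample(points, N)
    for _ in range(iters):
        clusters = [[] for _ in range(N)]
        for p in points:
            idx = min(range(N), key=lambda k: math.dist(p, centers[k]))
            clusters[idx].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster went empty
                centers[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers

def train_weights(samples, mus, eta=0.1, epochs=200):
    """Centers mu_i are fixed by k-means; only the output weights are
    learned, via delta w_i = eta * (d_p - o_p) * exp(-||x_p - mu_i||^2)."""
    ws = [0.0] * len(mus)
    for _ in range(epochs):
        for x_p, d_p in samples:
            acts = [math.exp(-math.dist(x_p, mu) ** 2) for mu in mus]
            err = d_p - sum(w * a for w, a in zip(ws, acts))
            for i, a in enumerate(acts):
                ws[i] += eta * err * a
    return ws

# Two well-separated clusters with desired outputs 0 and 1:
samples = [((0.0, 0.0), 0.0), ((0.2, 0.0), 0.0),
           ((3.0, 3.0), 1.0), ((3.0, 3.2), 1.0)]
mus = kmeans([x for x, _ in samples], N=2)
ws = train_weights(samples, mus)
predict = lambda x: sum(w * math.exp(-math.dist(x, mu) ** 2) for w, mu in zip(ws, mus))
```

After training, queries near each cluster center reproduce that cluster's desired output, which is exactly the "reference vector in the center of a cluster with similar desired output" picture from the text.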