
October 28, 2010 Neural Networks Lecture 13: Adaptive Networks

Adaptive Networks

As you know, there is no equation that would tell you the ideal number of neurons in a multi-layer network.

Ideally, we would like to use the smallest number of neurons that allows the network to do its task sufficiently accurately, because of:

• the small number of weights in the system,

• fewer training samples being required,

• faster training,

• typically, better generalization for new test samples.


So far, we have determined the number of hidden-layer units in BPNs by “trial and error.”

However, there are algorithmic approaches for adapting the size of a network to a given task.

Some techniques start with a large network and then iteratively prune connections and nodes that contribute little to the network function.

Other methods start with a minimal network and then add connections and nodes until the network reaches a given performance level.

Finally, there are algorithms that combine these “pruning” and “growing” approaches.


Cascade Correlation

None of these algorithms are guaranteed to produce “ideal” networks.

(It is not even clear how to define an “ideal” network.)

However, numerous algorithms exist that have been shown to yield good results for most applications.

We will take a look at one such algorithm named “cascade correlation.”

It is of the “network growing” type and can be used to build multi-layer networks of adequate size.

However, these networks are not strictly feed-forward in a level-by-level manner.


Refresher: Covariance and Correlation

For a dataset (x_i, y_i) with i = 1, …, n the covariance is:

$$\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

[Figure: three plots of y against x, labeled cov(x,y) > 0, cov(x,y) ≈ 0, and cov(x,y) < 0.]
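The covariance of a small dataset can be computed directly from its definition. The following is a minimal sketch in plain Python, assuming the covariance is normalized by n; the function name `covariance` and the sample values are made up for illustration.

```python
def covariance(xs, ys):
    """Covariance of paired samples, normalized by n (assumed convention)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Average product of the deviations from the two means.
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n


# Illustrative data only: y rises with x, falls with x, or does neither.
xs = [1.0, 2.0, 3.0, 4.0]
cov_pos = covariance(xs, [2.0, 4.0, 6.0, 8.0])   # positive covariance
cov_neg = covariance(xs, [8.0, 6.0, 4.0, 2.0])   # negative covariance
cov_zero = covariance(xs, [1.0, 3.0, 3.0, 1.0])  # no linear relationship
```

The three calls mirror the three cases: a positive value when y increases with x, a negative value when it decreases, and zero when there is no linear trend.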


Covariance tells us something about the strength and direction (directly vs. inversely proportional) of the linear relationship between x and y.

For many applications, it is useful to normalize this variable so that it ranges from -1 to 1.

The result is the correlation coefficient r, which for a dataset (x_i, y_i) with i = 1, …, n is given by:

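$$r = \operatorname{corr}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$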


[Figure: six plots of y against x, labeled 0 < r < 1, r ≈ 0, -1 < r < 0, r = 1, r = -1, and r undefined.]
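The correlation coefficient can be sketched in the same way as the covariance. This is a minimal plain-Python illustration; the function name `correlation`, the choice of returning `None` when r is undefined, and the sample data are assumptions made for this example.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r, or None when r is undefined."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # One variable is constant, so the denominator is zero.
        return None
    return sxy / math.sqrt(sxx * syy)


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0]
r_up = correlation(xs, [2.0, 4.0, 6.0, 8.0])    # points on a rising line
r_down = correlation(xs, [8.0, 6.0, 4.0, 2.0])  # points on a falling line
r_none = correlation(xs, [3.0, 3.0, 3.0, 3.0])  # y is constant
```

Points lying exactly on a rising or falling line give r = 1 or r = -1, and a constant variable leaves r undefined because the denominator vanishes.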


In the case of high (close to 1) or low (close to -1) correlation coefficients, we can use one variable as a predictor of the other one.

To quantify the linear relationship between the two variables, we can use linear regression:

[Figure: plot of y against x with the regression line.]
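The slide does not say how the regression line is obtained. A common choice is the least-squares fit, sketched below under that assumption; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (assumed fitting method)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x is constant, so no unique line exists")
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predicted = slope * 5.0 + intercept  # use x = 5 as a predictor of y
```

With a correlation close to 1 or -1, predictions from this line stay close to the observed values; with r near 0 they carry little information.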


Now let us return to the cascade correlation algorithm.

We start with a minimal network consisting of only the input neurons (one of them should be a constant offset = 1) and the output neurons, completely connected as usual.

The output neurons (and later the hidden neurons) typically use output functions that can also produce negative outputs; e.g., we can subtract 0.5 from our sigmoid function for a (-0.5, 0.5) output range.
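As a small illustration of such an output function, the sketch below subtracts 0.5 from the standard logistic sigmoid. The function name `shifted_sigmoid` is made up for this example, and the logistic form of the sigmoid is an assumption.

```python
import math


def shifted_sigmoid(net):
    """Logistic sigmoid shifted down by 0.5, so outputs lie in (-0.5, 0.5)."""
    return 1.0 / (1.0 + math.exp(-net)) - 0.5


# Zero net input now maps to zero output, and the function is odd-symmetric.
at_zero = shifted_sigmoid(0.0)
at_plus = shifted_sigmoid(2.0)
at_minus = shifted_sigmoid(-2.0)
```

Negative net inputs give negative outputs and positive net inputs give positive outputs, which is what allows a node's output to vary in both directions around zero.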

Then we successively add hidden-layer neurons and train them to reduce the network error step by step:


[Figure: initial network with input nodes x1, x2, x3 connected to output node o1. Solid connections are being modified.]


[Figure: the network with input nodes x1, x2, x3 and output node o1 after the first hidden node has been added.]


[Figure: the network with input nodes x1, x2, x3 and output node o1 after the first and second hidden nodes have been added.]
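The connection pattern of the growing network can be sketched as a forward pass. The sketch assumes the usual cascade layout, in which each hidden node receives all input nodes plus all previously added hidden nodes, and each output node receives all inputs and all hidden nodes; this is why the network is not layered level by level. The names `cascade_forward`, `hidden_weights`, and `output_weights`, and all weight values, are hypothetical.

```python
import math


def shifted_sigmoid(net):
    """Sigmoid minus 0.5, giving outputs in (-0.5, 0.5)."""
    return 1.0 / (1.0 + math.exp(-net)) - 0.5


def cascade_forward(inputs, hidden_weights, output_weights):
    """Forward pass through a cascade network (assumed connection layout).

    inputs            -- input values, including the constant offset 1
    hidden_weights[j] -- weights of hidden node j: one per input plus one
                         per earlier hidden node
    output_weights[k] -- weights of output node k: one per input plus one
                         per hidden node
    """
    activations = list(inputs)
    for weights in hidden_weights:
        if len(weights) != len(activations):
            raise ValueError("hidden node needs one weight per earlier node")
        net = sum(w * a for w, a in zip(weights, activations))
        # The new activation becomes an extra input for all later nodes.
        activations.append(shifted_sigmoid(net))
    outputs = []
    for weights in output_weights:
        if len(weights) != len(activations):
            raise ValueError("output node needs one weight per earlier node")
        net = sum(w * a for w, a in zip(weights, activations))
        outputs.append(shifted_sigmoid(net))
    return outputs


# Three inputs (the first is the constant offset), two hidden nodes, one output.
outputs = cascade_forward(
    [1.0, 0.5, -0.5],
    [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.4]],
    [[0.5, -0.5, 0.5, 1.0, -1.0]],
)
```

Note how the weight vectors grow by one entry with every added hidden node: the second hidden node has four weights because it also sees the first hidden node.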


Cascade Correlation

Weights to each new hidden node are trained to maximize the covariance of the node’s output with the current network error.

Covariance:

$$S(\vec{w}_{new}) = \sum_{k=1}^{K} \sum_{p=1}^{P} (x_{new,p} - \bar{x}_{new})(E_{k,p} - \bar{E}_k)$$

$\vec{w}_{new}$: vector of weights to the new node

$x_{new,p}$: output of the new node to p-th input sample

$E_{k,p}$: error of k-th output node for p-th input sample before the new node is added

$\bar{x}_{new}$ and $\bar{E}_k$: averages over the training set
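The score S can be evaluated for a candidate node once its outputs and the current output errors are known for every training sample. The sketch below is a plain-Python illustration; the function name `cascade_score`, the `use_magnitude` flag, and the sample numbers are assumptions. The flag is included because the formulation published by Fahlman and Lebiere sums the magnitude of each output node's term, whereas the formula as recovered here shows a plain sum.

```python
def cascade_score(new_outputs, errors, use_magnitude=False):
    """Covariance score S between a candidate node and the output errors.

    new_outputs[p] -- output of the new node for the p-th input sample
    errors[k][p]   -- error of the k-th output node for the p-th sample,
                      measured before the new node is added
    """
    num_samples = len(new_outputs)
    if num_samples == 0:
        raise ValueError("at least one training sample is required")
    x_mean = sum(new_outputs) / num_samples
    score = 0.0
    for output_errors in errors:
        if len(output_errors) != num_samples:
            raise ValueError("each output node needs one error per sample")
        e_mean = sum(output_errors) / num_samples
        term = sum(
            (x - x_mean) * (e - e_mean)
            for x, e in zip(new_outputs, output_errors)
        )
        # Optionally count each output node's term by its magnitude.
        score += abs(term) if use_magnitude else term
    return score


# Illustrative values: P = 4 samples, K = 2 output nodes.
new_outputs = [0.4, -0.4, 0.2, -0.2]
errors = [
    [0.5, -0.5, 0.25, -0.25],   # varies with the candidate's output
    [-0.1, 0.1, -0.05, 0.05],   # varies against the candidate's output
]
score_plain = cascade_score(new_outputs, errors)
score_magnitude = cascade_score(new_outputs, errors, use_magnitude=True)
```

In the plain sum the two output nodes partly cancel, while the magnitude variant rewards the candidate for tracking the error of either node regardless of sign. Training the candidate then means adjusting its incoming weights to increase this score.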