Self-Organization: Hebbian Learning
CS/CMPE 333 – Neural Networks
Introduction
So far, we have studied neural networks that learn from their environment in a supervised manner
Neural networks can also learn in an unsupervised manner; this is known as self-organized learning
Self-organized learning discovers significant features or patterns in the input data through general rules that operate locally
Self-organizing networks typically consist of two layers with feedforward connections and elements that facilitate ‘local’ learning
Self-Organization
“Global order can arise from local interactions” – Turing (1952)
An input signal produces activity patterns in the network, which in turn modify the weights (a feedback loop)
Principles of self-organization
1. Modifications in weights tend to self-amplify
2. Limitation of resources leads to competition and selection of the most active synapses, with less active synapses disregarded
3. Modifications in weights tend to cooperate
Hebbian Learning
A self-organizing principle was proposed by Hebb in 1949 in the context of biological neurons
Hebb’s principle: when a neuron repeatedly excites another neuron, the threshold of the latter neuron is decreased, or the synaptic weight between the neurons is increased, in effect making it more likely that the second neuron will fire
Hebbian learning rule
Δw_ji = η y_j x_i
There is no desired or target signal required in the Hebbian rule, hence it is unsupervised learning
The update rule is local to the weight
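For concreteness, here is a minimal sketch of the rule for a single linear neuron (not from the slides; the function name, learning rate, and use of NumPy are illustrative assumptions):

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    """One Hebbian update for a single linear neuron.

    w : weight vector (p,), x : input vector (p,).
    The post-synaptic activity is y = w.x, and each weight changes by
    eta * y * x_i -- a purely local rule with no target signal.
    """
    y = np.dot(w, x)           # post-synaptic activity
    return w + eta * y * x     # delta w_i = eta * y_j * x_i
```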
Hebbian Update
Consider the update of a single weight w (x and y are the pre- and post-synaptic activities)
w(n+1) = w(n) + η x(n) y(n)
For a linear activation function, y(n) = w(n) x(n), so
w(n+1) = w(n)[1 + η x²(n)]
The weights therefore increase without bound: a negative initial weight grows ever more negative, and a positive one grows ever more positive
Hebbian learning is intrinsically unstable, unlike error-correction learning with the BP algorithm
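A small numerical sketch (illustrative values, not from the slides) makes the instability visible: iterating w(n+1) = w(n)[1 + η x²(n)] with a constant input drives the weight away from zero in whichever direction it started.

```python
eta, x = 0.1, 1.0                # assumed learning rate and constant input
for w0 in (0.5, -0.5):           # positive and negative initial weights
    w = w0
    for n in range(50):
        y = w * x                # linear activation
        w = w + eta * x * y      # equivalent to w *= (1 + eta * x**2)
    print(w0, "->", round(w, 1)) # grows to roughly +/-58.7 after 50 steps
```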
Geometric Interpretation of Hebbian Learning
Consider a single linear neuron with p inputs
y = w^T x = x^T w
and
Δw = η [x_1 y, x_2 y, …, x_p y]^T
The dot product can be written as
y = |w| |x| cos(α), where α is the angle between the vectors x and w
If α is zero (x and w are ‘close’), y is large; if α is 90° (x and w are ‘far’), y is zero
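A short sketch (the vectors are chosen purely for illustration) showing how the output y tracks the angle between x and w:

```python
import numpy as np

w = np.array([1.0, 0.0])                 # weight vector
x_close = np.array([2.0, 0.1])           # nearly parallel to w (alpha ~ 0)
x_far = np.array([0.0, 2.0])             # orthogonal to w (alpha = 90 degrees)

for x in (x_close, x_far):
    y = w @ x                            # y = |w||x| cos(alpha)
    cos_alpha = y / (np.linalg.norm(w) * np.linalg.norm(x))
    print(f"y = {y:.2f}, cos(alpha) = {cos_alpha:.2f}")
```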
Similarity Measure
A network trained with Hebbian learning creates a similarity measure (the inner product) in its input space according to the information contained in the weights
The weights capture (memorize) the information in the data during training
During operation, when the weights are fixed, a large output y signifies that the present input is "similar" to the inputs x that created the weights during training
Other similarity measures: Hamming distance, correlation
Hebbian Learning as Correlation Learning
Hebbian learning (pattern-by-pattern mode)
Δw(n) = η x(n) y(n) = η x(n) x^T(n) w(n)
Using batch mode
Δw = η [Σ_{n=1}^{N} x(n) x^T(n)] w(0)
The term Σ_{n=1}^{N} x(n) x^T(n) is a sample approximation of the auto-correlation of the input data
Thus Hebbian learning can be thought of as learning the auto-correlation of the input space
Correlation is a well-known operation in signal processing and statistics; in particular, it completely describes signals defined by Gaussian distributions
Applications in signal processing
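A sketch of the batch form (the random data, learning rate, and variable names are assumptions): the sample autocorrelation term is accumulated over all patterns and then applied to the initial weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # N = 200 samples of a 3-dimensional input
eta = 0.001
w0 = rng.normal(size=3)            # initial weight vector w(0)

R_sum = X.T @ X                    # sum_n x(n) x(n)^T  (unnormalized sample autocorrelation)
delta_w = eta * R_sum @ w0         # batch Hebbian update: eta * [sum_n x(n) x(n)^T] w(0)
w1 = w0 + delta_w
```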
Oja’s Rule
The simple Hebbian rule causes the weights to increase (or decrease) without bounds
The weights therefore need to be normalized to unit length:
w_ji(n+1) = [w_ji(n) + η x_i(n) y_j(n)] / √( Σ_i [w_ji(n) + η x_i(n) y_j(n)]² )
This effectively imposes the constraint that the sum of squared weights at a neuron equals 1
Oja approximated the normalization (for small η) as
w_ji(n+1) = w_ji(n) + η y_j(n) [x_i(n) – y_j(n) w_ji(n)]
This is Oja’s rule, or the generalized Hebbian rule
It involves a ‘forgetting term’ that prevents the weights from growing without bounds
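A minimal sketch of Oja’s rule for a single linear neuron (the function name, learning rate, and epoch count are assumptions):

```python
import numpy as np

def oja_train(X, eta=0.01, epochs=50, seed=0):
    """Train a single linear neuron with Oja's rule.

    X : (N, p) zero-mean data matrix.
    The weight vector converges to a unit-length vector along the
    first principal direction of the data.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]) * 0.1
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += eta * y * (x - y * w)   # the -y^2 w 'forgetting term' bounds |w|
    return w
```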
Oja’s Rule – Geometric Interpretation
The simple Hebbian rule aligns the weight vector with the direction of largest variance in the input data; however, the magnitude of the weight vector increases without bound
Oja’s rule has the same interpretation: the normalization only affects the magnitude, leaving the direction of the weight vector unchanged, and the magnitude converges to one
Oja’s rule converges asymptotically, unlike the Hebbian rule, which is unstable
The Maximum Eigenfilter
A linear neuron trained with Oja’s rule produces a weight vector that is the eigenvector of the input auto-correlation matrix associated with the largest eigenvalue, and the variance of its output equals that largest eigenvalue
A linear neuron trained with Oja’s rule solves the following eigenvalue problem
R e_1 = λ_1 e_1
R = auto-correlation matrix of the input data
e_1 = the eigenvector with the largest eigenvalue, which corresponds to the weight vector w obtained by Oja’s rule
λ_1 = the largest eigenvalue, which corresponds to the variance of the network’s output
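As a numerical check (the data, learning rate, and iteration counts are illustrative assumptions), a neuron trained with Oja’s rule can be compared against the eigendecomposition of the sample autocorrelation matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])   # anisotropic input data
X -= X.mean(axis=0)                                     # zero mean

# Eigenvector/eigenvalue of R with the largest eigenvalue.
R = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(R)
e1, lam1 = eigvecs[:, -1], eigvals[-1]

# Single linear neuron trained with Oja's rule.
w, eta = rng.normal(size=2) * 0.1, 0.001
for _ in range(30):
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)

print(abs(w @ e1))                  # ~1.0: the weight vector aligns with e1
print(np.var(X @ w), lam1)          # output variance approaches lambda_1
```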
Principal Component Analysis (1)
Oja’s rule applied to a single neuron extracts a principal component of the input space in the form of the weight vector
How can we find other components in the input space with significant variance?
In statistics, PCA is used to obtain the significant components of data in the form of orthogonal principal axes
PCA is also known as Karhunen-Loève (K-L) filtering in signal processing
PCA was first proposed in 1901; later developments occurred in the 1930s, 1940s and 1960s
A Hebbian network with Oja’s rule can perform PCA
Principal Component Analysis (2)
PCA: consider a set of vectors x with zero mean. There exists an orthogonal transformation y = Q^T x such that the covariance matrix of y, Λ = E[y y^T], is diagonal:
Λ_ij = λ_i if i = j and Λ_ij = 0 otherwise
λ_1 > λ_2 > … > λ_p are the eigenvalues of the covariance matrix of x (C = E[x x^T])
The columns of Q are the corresponding eigenvectors
The components of y are the principal components; the first has the maximum variance, and each subsequent one has the maximum variance among the directions orthogonal to all previous components
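A brief sketch of this transformation using an eigendecomposition (the covariance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[4.0, 1.5], [1.5, 1.0]], size=2000)

C = X.T @ X / len(X)                  # covariance of the zero-mean data, C = E[x x^T]
eigvals, Q = np.linalg.eigh(C)        # columns of Q are eigenvectors of C
order = np.argsort(eigvals)[::-1]     # sort so that lambda_1 > lambda_2 > ...
Q, eigvals = Q[:, order], eigvals[order]

Y = X @ Q                             # y = Q^T x applied to every sample
Lambda = Y.T @ Y / len(Y)             # approximately diag(lambda_1, ..., lambda_p)
```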
PCA – Example
Hebbian Network for PCA
Procedure:
1. Use Oja’s rule to find the principal component
2. Project the data onto the subspace orthogonal to the principal component
3. Use Oja’s rule on the projected data to find the next major component
4. Repeat the above for m <= p components (m = desired components; p = input space dimensionality)
How to find the projection onto the orthogonal direction?
Deflation method: subtract the principal component from the input (see the sketch below)
Oja’s rule can be modified to perform this operation: Sanger’s rule
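A sketch of this deflation procedure (the helper functions, learning rate, and epoch count are assumptions, not part of the slides):

```python
import numpy as np

def oja_component(X, eta=0.001, epochs=30, seed=0):
    """First principal direction of zero-mean data X, via Oja's rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]) * 0.1
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += eta * y * (x - y * w)
    return w / np.linalg.norm(w)

def pca_by_deflation(X, m):
    """Find m principal directions by alternating Oja training and deflation."""
    X = X - X.mean(axis=0)
    components = []
    for _ in range(m):
        w = oja_component(X)
        components.append(w)
        X = X - np.outer(X @ w, w)   # deflation: subtract the found component from the input
    return np.array(components)
```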
Sanger’s Rule
Sanger’s rule is a modification of Oja’s rule that implements the deflation method for PCA
Classical PCA involves matrix operations; Sanger’s rule implements PCA in an iterative fashion suited to neural networks
Consider p inputs and m outputs, where m < p
y_j(n) = Σ_{i=1}^{p} w_ji(n) x_i(n),   j = 1, …, m
and the update (Sanger’s rule) is
Δw_ji(n) = η [ y_j(n) x_i(n) – y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n) ]
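A sketch of Sanger’s rule in this form (the function name, learning rate, and epoch count are assumptions):

```python
import numpy as np

def sanger_train(X, m, eta=0.001, epochs=30, seed=0):
    """Iterative PCA with Sanger's rule.

    X : (N, p) zero-mean data.  Returns W of shape (m, p); row j
    tends toward the j-th principal direction of the data.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1])) * 0.1
    for _ in range(epochs):
        for x in X:
            y = W @ x                                     # y_j = sum_i w_ji x_i
            for j in range(m):
                residual = x - W[: j + 1].T @ y[: j + 1]  # x_i - sum_{k<=j} w_ki y_k
                W[j] += eta * y[j] * residual             # Sanger update for row j
    return W
```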
PCA for Feature Extraction
PCA is the optimal linear feature extractor: no other linear system can provide better features for reconstruction. However, PCA may or may not be the best preprocessing for pattern classification or recognition, since classification requires good discrimination, which PCA might not provide.
Feature extraction: transform the p-dimensional input space to an m-dimensional space (m < p) such that the m dimensions capture the information with minimal loss
The reconstruction error e is given by
e² = Σ_{i=m+1}^{p} λ_i
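A small numerical check of this formula (the data and the choice m = 2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])   # 4-dimensional data
X -= X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, Q = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
Q, eigvals = Q[:, order], eigvals[order]

m = 2
Y = X @ Q[:, :m]                       # keep only the first m principal components
X_rec = Y @ Q[:, :m].T                 # reconstruct from those m components
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(err, eigvals[m:].sum())          # the two values agree: e^2 = sum of discarded lambdas
```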
PCA for Data Compression
PCA identifies an orthogonal coordinate system for the input data such that the variance of the projection on the principal axis is largest, followed by the next major axis, and so on
By discarding some of the minor components, PCA can be used for data compression, where a p-dimensional input is encoded in an m-dimensional space with m < p
Weights are computed by Sanger’s rule on typical inputs
The de-compressor (receiver) must know the weights of the network to reconstruct the original signal: x' = W^T y
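A sketch of the compression/decompression round trip (here W is taken from an eigendecomposition for brevity; in the network it would be learned with Sanger’s rule, and the data dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)) * np.linspace(3.0, 0.1, 8)   # p = 8 dimensional signal
X -= X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, Q = np.linalg.eigh(C)
W = Q[:, np.argsort(eigvals)[::-1][:3]].T    # (m, p) weight matrix, m = 3 kept components

Y = X @ W.T        # sender: compress each p-dimensional input to m numbers, y = W x
X_rec = Y @ W      # receiver: reconstruct with the same weights, x' = W^T y
```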
PCA for Classification (1)
Can PCA enhance classification? In general, no: PCA is good for reconstruction, not for feature discrimination or classification
PCA for Classification (2)
PCA – Some Remarks
Practical uses of PCA: data compression, cluster analysis, feature extraction, and preprocessing for classification/recognition (e.g. preprocessing for MLP training)
Biological basis: it is unlikely that the processing performed by biological neurons in, say, perception involves PCA only; more complex feature extraction processes are involved
Anti-Hebbian Learning
Modifying the Hebbian rule as
Δw_ji(n) = –η x_i(n) y_j(n)
The anti-Hebbian rule finds the direction in space that has the minimum variance; in other words, it is the complement of the Hebbian rule
The anti-Hebbian rule performs de-correlation: it de-correlates the output from the input
The Hebbian rule is unstable, since it tries to maximize the variance; the anti-Hebbian rule, on the other hand, is stable and converges
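A minimal sketch of the anti-Hebbian update (the function name and learning rate are assumptions):

```python
import numpy as np

def anti_hebbian_step(w, x, eta=0.01):
    """One anti-Hebbian update: delta w_i = -eta * y * x_i.

    Repeated application shrinks the neuron's response y, steering the
    weight vector away from high-variance directions of the input and
    de-correlating the output from the input.
    """
    y = w @ x
    return w - eta * y * x
```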