Artificial Neural Networks, Part 8
Neural Learning Application of MLPs
Modelling of Dynamic Systems
Time-variant or time-dependent systems, e.g. financial markets, weather forecasting, power consumption.
The outputs of dynamic systems, at any instant of time, depend on their previous input and output values.
[Figure: MLPs in functional approximation vs. MLPs in dynamic systems]
Time delay neural network (TDNN)
Feedforward architectures with multiple layers,
without any feedback
x(t) = f(x(t-1), x(t-2), ..., x(t-np))
Autoregressive (AR) model
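As an illustrative sketch (the helper name is my own, not from the lecture), the tapped-delay-line inputs of a TDNN/AR model can be built from a time series like this: each input row holds the np previous values and the target is the current value.

```python
import numpy as np

def make_ar_inputs(x, n_p):
    # Each input row is [x(t-1), ..., x(t-n_p)]; the target is x(t),
    # matching x(t) = f(x(t-1), ..., x(t-np)).
    X = [[x[t - k] for k in range(1, n_p + 1)] for t in range(n_p, len(x))]
    y = [x[t] for t in range(n_p, len(x))]
    return np.array(X), np.array(y)
```

A feedforward MLP trained on these (X, y) pairs then implements the AR model without any feedback connections.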
MLP with feedback
It is able to “remember” previous outputs to produce
the current or future response.
x(t) = f(x(t-1), x(t-2), ..., x(t-np), y(t-1), y(t-2), ..., y(t-nq))
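As a sketch of the feedback case (the helper name and the choice of predicted signal are my own assumptions), the regressor now concatenates delayed inputs and delayed outputs:

```python
def make_narx_inputs(x, y, n_p, n_q):
    # Input row: [x(t-1), ..., x(t-n_p), y(t-1), ..., y(t-n_q)];
    # target: x(t) (or y(t), depending on which signal the model predicts).
    start = max(n_p, n_q)
    X = [[x[t - k] for k in range(1, n_p + 1)]
         + [y[t - k] for k in range(1, n_q + 1)]
         for t in range(start, len(x))]
    targets = [x[t] for t in range(start, len(x))]
    return X, targets
```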
You might want to try demo nnd12sd1. It allows you to see the error surface in two-dimensional space and inspect the learning.
Neural Learning
You might want to try demo nnd11gn.
How to decide the most suitable MLP topology?
How large should the network be?
Computation time: the number of hidden nodes directly impacts the computation time required to train the network.
Generalization vs. Memorization
Generalization Introduction
Recall the idea of getting a neural network to learn a classification decision
boundary:
Our goal is for the network to generalize to classify new inputs
correctly.
If the training data contains noise, we don’t want the training data to be
classified totally accurately as that is likely to reduce the generalization
ability.
Similarly if our network is required to recover an underlying function
(curve-fitting) from noisy data:
[Figure: two curve fits of f(x) vs. x from noisy data, one passing through every point and one smooth]
The network can give a more accurate generalization to new inputs if its
output curve does not pass through all the data points. Again, allowing a
larger error on the training data is likely to lead to better generalization.
Given a large network, it is possible that repeated training iterations
successively improve performance of the network on training data, e.g., by
"memorizing" training samples, but the resulting network may perform
poorly on test data. This phenomenon is called over-training.
[Figure: f(x) vs. x, showing an over-trained fit and an under-trained fit]
Generalization Improving generalization
We need to avoid both under-fitting and over-fitting of the training data. There are a number of approaches for improving generalization:
1. Arrange to have the optimum number of free parameters
(independent connection weights) in our model.
• Number of hidden layers
• Number of nodes in each layer
2. Stop the gradient descent training at the appropriate point.
We have talked about training data sets: the data used for training our
networks.
The testing data set is the unseen data that is used to test the network’s
generalization.
What we can do is assume that the training data and the testing data are drawn (randomly) from the same underlying data set.
Dataset (total: inputs & targets) is split into:
• Training data (inputs & corresponding targets)
• Test data (inputs & corresponding targets)
The portion of the data we have available for training that is withheld from
the network training is called the validation data set, and the remainder of
the data is called the training data set.
This approach is called the hold-out method.
Dataset (total: inputs & targets) is split into:
• Training data set, which is further divided into a training data set and a validation data set (each with inputs & corresponding targets)
• Test data (inputs & corresponding targets)
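A minimal sketch of the hold-out method (the function name and the 20% default are my own assumptions): shuffle the available data and withhold a fraction of it as the validation set.

```python
import random

def hold_out_split(data, val_fraction=0.2, seed=0):
    # Shuffle indices, withhold a fraction as the validation set,
    # and keep the remainder as the training set.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(data) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [data[i] for i in train_idx], [data[i] for i in val_idx]
```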
Cross-Validation
A general practical problem concerns how best to split the available training data into distinct training and validation data sets. For example:
• What fraction of the patterns should be in the validation set?
• Should the data be split randomly, or by some systematic algorithm?
Random subsampling cross-validation method
• Around 60-90% of the samples are chosen at random for training.
• This partitioning must be repeated several times during the learning process so that each trial uses different samples in both subsets.
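As a sketch of random subsampling (the function name and defaults are my own), the random partition is simply redrawn for each trial:

```python
import random

def random_subsampling(n_samples, train_fraction=0.8, n_trials=5, seed=0):
    # Repeat the random train/validation partition several times so that
    # each trial places different samples in both subsets.
    rng = random.Random(seed)
    splits = []
    for _ in range(n_trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        n_train = int(n_samples * train_fraction)
        splits.append((idx[:n_train], idx[n_train:]))
    return splits
```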
K-fold cross-validation
We divide all the training data at random into K distinct subsets, train the network using K-1 subsets, and test the network on the remaining subset.
[Diagram: the training data is divided into Subset 1, Subset 2, ..., Subset K; the test data is kept separate]
The process of training and validation is then repeated for each of the K
possible choices of the subset omitted from the training.
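As an illustrative sketch (the helper name is my own), the K train/validate rounds can be generated by dealing the shuffled indices into K folds and leaving out each fold in turn:

```python
import random

def kfold_splits(n_samples, k, seed=0):
    # Shuffle indices, deal them into K folds, then use each fold once
    # as the validation set and the remaining K-1 folds for training.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
            for i in range(k)]
```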
[Diagram: in each round, one subset is left out for validation and the remaining subsets are used for training; the test data stays separate]
The average performance
on the K omitted subsets is
then our estimate of the
generalization performance.
If K is made equal to the full sample size, this is called leave-one-out cross-validation (LOOCV).
Venetian blinds
• Simple and easy to implement.
• Generally safe to use if there are relatively many objects that are not in random order.
We divide all the training data into groups of K members, and corresponding members from each group form a subset.
[Diagram: venetian blinds with K = 4 on 20 samples: samples 1, 5, 9, ... form subset 1, samples 2, 6, 10, ... form subset 2, and so on]
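As a sketch (the function name is my own), the venetian-blinds assignment is just the sample index modulo K:

```python
def venetian_blinds(n_samples, k):
    # Sample i (in its given order) goes to subset i mod k, so each
    # subset takes every k-th sample, like the slats of a blind.
    return [[i for i in range(n_samples) if i % k == s] for s in range(k)]
```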
Contiguous blocks
• Simple and easy to implement.
• Safe to use when there are relatively many objects in random order.
We divide all the training data into K groups of N/K consecutive members (N: number of samples).
[Diagram: contiguous blocks with K = 4 on 20 samples: samples 1-5 form subset 1, samples 6-10 form subset 2, and so on]
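As a sketch (the function name is my own), contiguous blocks slice the ordered samples into K runs of about N/K consecutive members:

```python
def contiguous_blocks(n_samples, k):
    # Split the ordered sample indices into k blocks of ~n_samples/k
    # consecutive members each; any remainder joins the last block.
    size = n_samples // k
    blocks = [list(range(s * size, (s + 1) * size)) for s in range(k)]
    blocks[-1].extend(range(k * size, n_samples))
    return blocks
```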
Weight restriction
The most obvious and simplest way to prevent over-fitting in our neural network models is to restrict the number of free parameters they have.
• Some form of validation scheme can be used to find the best number for each given problem.
[Figure: cross-validation error (RMSECV) plotted against the number of hidden neurons]
For the iterative gradient descent based network training procedures we
have considered (e.g. batch back-propagation), the training set error will
naturally decrease with increasing numbers of epochs of training.
[Figure: training-set and validation-set error vs. epoch, with an under-fitting region early in training and an over-fitting region later]
The error on the unseen
validation and testing data sets,
however, will start off decreasing
as the under-fitting is reduced,
but then it will eventually begin
to increase again as over-fitting
occurs.
Early Stopping
One potential problem with the idea of stopping early is that the validation error may go up and down numerous times during training.
The safest approach is generally to train to convergence (or at least until it is clear that the validation error is unlikely to fall again), and then use the weights from the epoch with the lowest validation error.
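As a sketch of early stopping with a patience window (the function name and the `step` interface are my own assumptions, where `step(epoch)` runs one training epoch and returns the current validation error):

```python
def train_with_early_stopping(step, patience=10, max_epochs=1000):
    # Run training epochs; stop once the validation error has not
    # improved for `patience` consecutive epochs, and report the
    # epoch and error of the best model seen so far.
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        err = step(epoch)
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_err
```

The patience window tolerates the validation error going up and down a few times before stopping, rather than halting at the first increase.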