EE-M016 2005/6: IS L3&4 1/32 v2.0
Lectures 3&4: Linear Machine Learning Algorithms
Dr Martin Brown
Room: E1k
Email: [email protected]
Telephone: 0161 306 4672
http://www.csc.umist.ac.uk/msc/intranet/EE-M016
EE-M016 2005/6: IS L3&4 2/32 v2.0
Lectures 3&4: Outline
Linear classification using the Perceptron
• Classification problem
• Linear classifier and decision boundary
• Perceptron learning rule
• Proof of convergence
Recursive linear regression using LMS
• Modelling and recursive parameter estimation
• Linear models and quadratic performance function
• LMS and NLMS learning rules
• Proof of convergence
EE-M016 2005/6: IS L3&4 3/32 v2.0
Lectures 3&4: Learning Objectives
1. Understand what classification and regression machine learning techniques are and their differences
2. Describe how linear models can be used for both classification and regression problems
3. Prove convergence of the learning algorithms for linear relationships, subject to restrictive conditions
4. Understand the restrictions of these basic proofs
Develop basic framework that will be expanded on in subsequent lectures
EE-M016 2005/6: IS L3&4 4/32 v2.0
Lecture 3&4: Resources
Classification/Perceptron
An introduction to Support Vector Machines and other kernel-based learning methods, N Cristianini, J Shawe-Taylor, CUP, 2000
Regression/LMS
Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985
Many other sources are available (on-line).
EE-M016 2005/6: IS L3&4 5/32 v2.0
What is Classification?
Classification is also known as (statistical) pattern recognition
The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations, that are believed to be informative for the classification task.
Example: Face recognition
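[Figure: the training data $D = \{X, \mathbf{y}\}$ and prior knowledge are used to design/learn a classifier $m(\boldsymbol{\theta}, \mathbf{x})$; for a new pattern $\mathbf{x}$, the classifier predicts the class label $\hat{y}$.]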
EE-M016 2005/6: IS L3&4 6/32 v2.0
Classification Training Data
To supply training data for a classifier, examples must be collected that contain both positive (examples of the class) and negative (examples of other classes) instances. These are qualitative target class values and are stored as +1 and -1, for the positive and negative instances respectively. The target values are generated by an expert or by observation.
The quantitative input features should be informative
The training set should contain enough examples to be able to build statistically significant decisions
How to encode qualitative target and input features?
EE-M016 2005/6: IS L3&4 7/32 v2.0
Bayes Class Priors
Classification is all about decision making using the concept of “minimum risk”
Imagine that the training data contains 100 examples, 70 of them are class 1 (c1), 30 are class 2 (c2)
If I have to decide which class an unknown example belongs to, which decision is optimal?
Errors if decision is class 1: p(c1) =
Errors if decision is class 2: p(c2) =
Minimum risk decision is:
p(c1) and p(c2) are known as the Bayes priors; they represent the baseline performance for any classifier. They are derived from the training data as simple percentages
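The prior calculation for the 70/30 example can be sketched in a few lines of Python. This is a minimal illustration only: the function name `class_priors` and the label strings are assumptions made here, and equal misclassification costs are assumed when picking the minimum-risk decision.

```python
from collections import Counter

def class_priors(labels):
    """Estimate class priors as the fraction of training examples per class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# 100 training labels: 70 of class c1 and 30 of class c2, as in the example
labels = ["c1"] * 70 + ["c2"] * 30
priors = class_priors(labels)

# Always deciding class c is wrong whenever the true class is not c,
# so the error rate of that fixed decision is 1 - p(c).
error_if_decide = {c: 1.0 - p for c, p in priors.items()}

# With equal costs (an assumption), the minimum-risk decision is the
# class with the largest prior.
best = max(priors, key=priors.get)
print(priors, error_if_decide, best)
```

Any classifier that uses the input features should do at least as well as this fixed decision, which is why the priors act as a baseline.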
EE-M016 2005/6: IS L3&4 8/32 v2.0
Structure of a Linear Classifier
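Given a set of quantitative features $\mathbf{x}$, a linear classifier has the form:
$$\hat{y} = \mathrm{sgn}(\mathbf{x}^T\boldsymbol{\theta} + \theta_0)$$
The sgn() function is used to produce the qualitative class label (+/-1)
The class/decision boundary is determined when:
$$\mathbf{x}^T\boldsymbol{\theta} + \theta_0 = 0$$
This is an (n-1)D hyperplane in feature space.
In 2-dimensional feature space:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\theta_1}{\theta_2}x_1 - \frac{\theta_0}{\theta_2}$$
How do the sign and magnitude of $\boldsymbol{\theta}$ affect the decision boundary?
[Figure: data points (+) in the $(x_1, x_2)$ plane with the linear decision boundary $x_2 = -(\theta_1/\theta_2)x_1 - \theta_0/\theta_2$.]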
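The linear classifier and the 2-dimensional boundary formula can be checked with a short Python sketch. The parameter values are illustrative assumptions, as is the convention that sgn(0) returns +1.

```python
def sgn(a):
    """Sign function returning +1 or -1 (0 is mapped to +1 here by assumption)."""
    return 1 if a >= 0 else -1

def predict(theta0, theta, x):
    """Linear classifier: y_hat = sgn(x^T theta + theta0)."""
    return sgn(theta0 + sum(t * xi for t, xi in zip(theta, x)))

def boundary_2d(theta0, theta):
    """Gradient and intercept of the 2-D boundary x2 = m*x1 + c (needs theta2 != 0)."""
    theta1, theta2 = theta
    return -theta1 / theta2, -theta0 / theta2

theta0, theta = -1.0, [1.0, 2.0]   # illustrative values
m, c = boundary_2d(theta0, theta)
print(m, c, predict(theta0, theta, [2.0, 2.0]), predict(theta0, theta, [0.0, 0.0]))

# Negating every parameter keeps the same boundary but swaps the +1 and -1 sides.
neg_theta0, neg_theta = -theta0, [-t for t in theta]
print(boundary_2d(neg_theta0, neg_theta), predict(neg_theta0, neg_theta, [2.0, 2.0]))
```

Scaling all parameters by a positive constant leaves both the boundary and the predicted labels unchanged; only a change of sign swaps the two regions.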
EE-M016 2005/6: IS L3&4 9/32 v2.0
Simple Example: Fisher’s Iris Data
Famous example of building classifiers for a problem with 3 types of Iris flowers and 4 measurements about the flower:
• Sepal length and width
• Petal length and width
150 examples were collected, 50 from each class
Build 3 separate classifiers, one for recognizing examples of each class
Data is shown, plotted against last two features, as well as two linear classifiers for the Setosa and Virginica classes
Calculate in lab 3&4 …
EE-M016 2005/6: IS L3&4 10/32 v2.0
Perceptron Linear Classifier
The Perceptron linear classifier was devised by Rosenblatt in 1956
It comprises a linear classifier (as just discussed) and a simple parameter update rule of the form:
Cyclically present each training pattern $\{\mathbf{x}_k, y_k\}$ to the linear classifier
When an error (misclassification) is made, update the parameters:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k, \qquad \hat{\theta}_{0,k+1} = \hat{\theta}_{0,k} + \eta\, y_k$$
where $\eta > 0$ is the learning rate.
The bias term can be included as $\theta_0$ with an extra feature $x_0 = 1$:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$
Continue until there are no prediction errors
Perceptron convergence theorem: If the data set is linearly separable, the perceptron learning algorithm will converge to an optimal separator in a finite time
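The learning rule can be sketched in Python as follows. This is a minimal illustration, not the course's `perceptron.m`: the function name `train_perceptron`, the zero starting parameters, the `max_epochs` guard and the toy data are all assumptions made here. A pattern lying exactly on the boundary is treated as an error.

```python
def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron rule with the bias folded in as theta[0] (extra feature x0 = 1)."""
    theta = [0.0] * (len(X[0]) + 1)
    for epoch in range(max_epochs):
        errors = 0
        for xk, yk in zip(X, y):           # cyclic presentation of the patterns
            xa = [1.0] + list(xk)          # augmented feature vector
            score = sum(t * xi for t, xi in zip(theta, xa))
            if yk * score <= 0:            # misclassified (or on the boundary)
                theta = [t + eta * yk * xi for t, xi in zip(theta, xa)]
                errors += 1
        if errors == 0:                    # a full pass with no errors: stop
            return theta, epoch + 1
    raise RuntimeError("no error-free pass; data may not be linearly separable")

# Small linearly separable toy set (illustrative values)
X = [[2, 1], [1, 3], [-1, -1], [-2, 0]]
y = [1, 1, -1, -1]
theta, epochs = train_perceptron(X, y)
print(theta, epochs)
```

The loop only stops after a complete pass without updates, so every training pattern is on the correct side of the final boundary.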
EE-M016 2005/6: IS L3&4 11/32 v2.0
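Instantaneous Parameter Update
What does this look like?
Error-driven update:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$
The parameters are updated to make them more like the misclassified feature vector (signed by its target $y_k$).
After updating:
$$\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_{k+1} = \mathbf{x}_k^T\left(\hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k\right) = \mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k^T\mathbf{x}_k = \mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k + \eta\, y_k \|\mathbf{x}_k\|_2^2$$
Updated parameters are closer to the correct decision.
[Figure: left, the vectors $\hat{\boldsymbol{\theta}}_k$, $\eta\, y_k \mathbf{x}_k$ and $\hat{\boldsymbol{\theta}}_{k+1}$ drawn on shared axes $(x_1, \theta_1)$ and $(x_2, \theta_2)$; right, the output $y, \hat{y}$ (levels $+1$ and $-1$) plotted against $\mathbf{x}^T\hat{\boldsymbol{\theta}}$, showing $\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k$ moving by $\eta\|\mathbf{x}_k\|_2^2$ towards the correct side of 0.]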
EE-M016 2005/6: IS L3&4 12/32 v2.0
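Perceptron Convergence Proof Preamble …
Basic aim is to minimise the number of mis-classifications.
This is generally an NP-complete problem
We’ve assumed that there is an optimal solution with 0 errors
This is similar to Least Squares recursive estimation:
Performance = $\sum_i (y_i - \hat{y}_i)^2$ = 4*numberOfErrors
Except that the sgn() makes it a non-quadratic optimization problem
Updating only when there are errors:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$
is the same as:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \frac{\eta}{2}\,(y_k - \hat{y}_k)\,\mathbf{x}_k$$
with or without errors, because $y_k - \hat{y}_k$ equals $2y_k$ on an error and 0 otherwise.
Sometimes drawn as a network:
[Figure: network for “error driven” parameter estimation. The input $\mathbf{x}_k$ produces $\hat{y}_k = \mathrm{sgn}(\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k)$, which is subtracted from the target $y_k$; the error $(y_k - \hat{y}_k)$, scaled by the learning rate and multiplied by $\mathbf{x}_k$, is fed back to adjust $\hat{\boldsymbol{\theta}}$.]
Repeatedly cycle through data set D, drawing out each sample $\{\mathbf{x}_k, y_k\}$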
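The equivalence of the "update only on errors" rule and the "always update with $(\eta/2)(y_k - \hat{y}_k)$" rule can be checked numerically. The sketch below is illustrative: the function names are assumptions made here, and sgn(0) is taken as +1.

```python
def sgn(a):
    """Sign function returning +1 or -1 (0 is mapped to +1 by assumption)."""
    return 1 if a >= 0 else -1

def update_on_error(theta, x, y, eta):
    """theta + eta*y*x, applied only when the pattern is misclassified."""
    y_hat = sgn(sum(t * xi for t, xi in zip(theta, x)))
    if y_hat == y:
        return list(theta)
    return [t + eta * y * xi for t, xi in zip(theta, x)]

def update_always(theta, x, y, eta):
    """theta + (eta/2)*(y - y_hat)*x, applied to every pattern."""
    y_hat = sgn(sum(t * xi for t, xi in zip(theta, x)))
    return [t + 0.5 * eta * (y - y_hat) * xi for t, xi in zip(theta, x)]

theta = [0.5, -1.0]                       # illustrative parameters
patterns = [([1.0, 2.0], 1), ([1.0, 2.0], -1), ([1.0, -3.0], 1), ([1.0, -3.0], -1)]
for x, y in patterns:
    print(update_on_error(theta, x, y, 0.5), update_always(theta, x, y, 0.5))
```

Each printed pair should match: when the prediction is correct the factor $(y_k - \hat{y}_k)$ is zero and the parameters are left unchanged.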
EE-M016 2005/6: IS L3&4 13/32 v2.0
Convergence Analysis of the Perceptron (i)
If a linearly separable data set D is repeatedly presented to a Perceptron, then the learning procedure is guaranteed to converge (no errors) in a finite time
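If the data set is linearly separable, there exist optimal parameters $\boldsymbol{\theta}^*$ such that
$$\mathrm{sgn}(\mathbf{x}_i^T\boldsymbol{\theta}^*) = y_i$$
for all i = 1, …, l
Note that $\alpha\boldsymbol{\theta}^*$, $\alpha > 0$, are also optimal parameter vectors
Consider the positive quantity $\gamma$ defined by, with $\boldsymbol{\theta}^*$ scaled such that $\|\boldsymbol{\theta}^*\|_2 = 1$:
$$\gamma = \min_i\; y_i\, \mathbf{x}_i^T\boldsymbol{\theta}^*$$
This is a concept known as the “classification margin”
Assume also that the feature vectors are bounded by:
$$R^2 = \max_i \|\mathbf{x}_i\|_2^2$$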
EE-M016 2005/6: IS L3&4 14/32 v2.0
Convergence Analysis of the Perceptron (ii)
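To show convergence, we need to establish that at the kth iteration, when an error has occurred:
$$\|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_{k+1}\|_2^2 < \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2$$
Using the update formula:
$$\begin{aligned}
\|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_{k+1}\|_2^2 &= \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k - \eta\, y_k \mathbf{x}_k\|_2^2 \\
&= \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2 + \eta^2\|\mathbf{x}_k\|_2^2 - 2\eta\, y_k \mathbf{x}_k^T(\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k) \\
&\le \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2 + \eta^2\|\mathbf{x}_k\|_2^2 - 2\eta\, y_k \mathbf{x}_k^T\boldsymbol{\theta}^* \qquad (\text{on an error, } y_k \mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k \le 0) \\
&\le \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2 + \eta^2 R^2 - 2\eta\gamma
\end{aligned}$$
To finish proof, select
$$0 < \eta < \frac{2\gamma}{R^2}$$
[Figure: parameter space with axes $\theta_1$ and $\theta_2$, showing the estimates at iterations k and k+1.]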
EE-M016 2005/6: IS L3&4 15/32 v2.0
Convergence Analysis of the Perceptron (iii)
To show this terminates in a finite number of iterations, simply note that:
$$\eta^2 R^2 - 2\eta\gamma < 0$$
is independent of the current training sample, so the squared parameter error must decrease by at least $2\eta\gamma - \eta^2 R^2$ at each update iteration. As the initial error is finite, $\hat{\boldsymbol{\theta}}_0 = \mathbf{0}$, say, and the squared parameter error cannot become negative, there can only be a finite number of update steps.
Note also that this guaranteed decrease depends on the size of the feature vectors ($R^2$) and on the size of the margin ($\gamma$): larger feature vectors and a smaller margin both allow more updates. Both of these will influence the number of update iterations when the Perceptron is learning
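The quantities in the proof can be computed for a small data set to see the argument at work. The sketch below is illustrative and rests on assumptions made here: it uses logical AND data with the bias folded in as $x_0 = 1$, picks one particular separating vector as $\boldsymbol{\theta}^*$ (scaled to unit length), and sets $\eta = \gamma/R^2$, which lies inside the range $0 < \eta < 2\gamma/R^2$.

```python
import math

# Logical AND data with the bias included as x0 = 1 (targets in {-1, +1})
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, -1, -1, 1]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# One separating vector (an assumption for illustration), scaled to unit length
raw = [-3.0, 2.0, 2.0]
norm = math.sqrt(dot(raw, raw))
theta_star = [r / norm for r in raw]

gamma = min(yi * dot(xi, theta_star) for xi, yi in zip(X, y))  # margin
R2 = max(dot(xi, xi) for xi in X)                              # R^2
eta = gamma / R2                      # inside 0 < eta < 2*gamma/R2
eps = 2 * eta * gamma - eta**2 * R2   # guaranteed decrease per update

def sq_dist(theta):
    """Squared distance between theta_star and the current estimate."""
    return sum((s - t) ** 2 for s, t in zip(theta_star, theta))

theta = [0.0, 0.0, 0.0]
decreases = []
changed = True
while changed:
    changed = False
    for xi, yi in zip(X, y):
        if yi * dot(xi, theta) <= 0:  # error: apply the Perceptron update
            before = sq_dist(theta)
            theta = [t + eta * yi * v for t, v in zip(theta, xi)]
            decreases.append(before - sq_dist(theta))
            changed = True

max_updates = sq_dist([0.0, 0.0, 0.0]) / eps
print(gamma, R2, len(decreases), max_updates)
```

Every recorded decrease should be at least `eps`, and the number of updates cannot exceed the initial squared error divided by `eps`.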
EE-M016 2005/6: IS L3&4 16/32 v2.0
Example of Perceptron (i)
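Consider modelling the logical AND data using a Perceptron
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix}$$
Is the data linearly separable?
[Figure: three panels in the $(x_1, x_2)$ plane showing the decision boundary at k=0, $\hat{\boldsymbol{\theta}}$ = [0.01, 0.1, 0.006]; k=5, $\hat{\boldsymbol{\theta}}$ = [-0.98, 1.11, 1.01]; k=18, $\hat{\boldsymbol{\theta}}$ = [-2.98, 2.11, 1.01].]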
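The three parameter snapshots quoted for this example can be checked against the AND data with a short Python sketch. Two things are assumed here rather than taken from the slide: the parameter order is [bias, $\theta_1$, $\theta_2$], and the AND patterns are listed as (0,0), (0,1), (1,0), (1,1) with targets -1, -1, -1, +1.

```python
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]   # logical AND with targets in {-1, +1}

def count_errors(theta):
    """theta = [bias, theta1, theta2]; counts misclassified AND patterns."""
    errors = 0
    for (x1, x2), target in zip(X, y):
        score = theta[0] + theta[1] * x1 + theta[2] * x2
        predicted = 1 if score >= 0 else -1
        errors += predicted != target
    return errors

# Parameter vectors quoted for iterations k = 0, 5 and 18
snapshots = {0: [0.01, 0.1, 0.006], 5: [-0.98, 1.11, 1.01], 18: [-2.98, 2.11, 1.01]}
for k, theta in snapshots.items():
    print(k, count_errors(theta))
```

Under these assumptions the error count falls from three at k=0 to two at k=5 and zero at k=18, so the final parameters separate the AND data.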
EE-M016 2005/6: IS L3&4 17/32 v2.0
Example: Parameter Trajectory (ii)
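Lab exercise: Calculate by hand the first 4 iterations of the learning scheme
[Figure: parameter trajectories $\hat{\theta}_{i,k}$ (the bias $\hat{\theta}_{0,k}$, $\hat{\theta}_{1,k}$ and $\hat{\theta}_{2,k}$) plotted against k, the data presentation index.]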
EE-M016 2005/6: IS L3&4 18/32 v2.0
Classification Margin
In this proof, we assumed that there exists a single, optimal parameter vector.
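In practice, when the data is linearly separable, there are an infinite number of them; simply requiring correct classification results in an ill-posed problem
The classification margin can be defined as the minimum distance of the decision boundary to a point in that class
– Used in deriving Support Vector Machines
[Figure: two plots in the $(x_1, x_2)$ plane. One asks which of several separating boundaries to choose; the other marks the levels $\mathbf{x}^T\boldsymbol{\theta} = -1, 0, 1$ and the direction of $\boldsymbol{\theta}$.]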
EE-M016 2005/6: IS L3&4 19/32 v2.0
Classification Summary
Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups
A linear classifier has a linear decision boundary
The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable
The final boundary is determined by the initial values and the order of presentation of the data
EE-M016 2005/6: IS L3&4 20/32 v2.0
Definition of Regression
Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.
Examples:
• Sales of a product can be predicted by using the relationship between sales volume and amount of advertising
• The performance of an employee can be predicted by using the relationship between performance and aptitude tests
• The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.
EE-M016 2005/6: IS L3&4 21/32 v2.0
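Regression Problem Visualisation
Data generated by:
$$y = f(\mathbf{x}) + N(0, \sigma^2)$$
Estimate model parameters:
$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} f(\mathbf{y}, \hat{\mathbf{y}}), \qquad f(\boldsymbol{\theta}) = \frac{1}{2}\sum_i \left(y_i - m(\mathbf{x}_i, \boldsymbol{\theta})\right)^2$$
Predict a real value (fit a curve to the data):
$$\hat{y} = m(\mathbf{x}, \hat{\boldsymbol{\theta}})$$
Predictive performance: average error (rmse)
[Figure: scatter of data points (+) with the fitted curve; horizontal axis $x$, vertical axis $y, \hat{y}$.]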
EE-M016 2005/6: IS L3&4 22/32 v2.0
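Probabilistic Prediction Output
An output of 12 with rmse/standard deviation = 1.5: within a small region close to the query point, the average target value was 12 and the standard deviation within that region was 1.5 (variance = 2.25)
$\mu(y|\mathbf{x}) = 12$, $\sigma(e) = 1.5$
95% of the data lies in the range $\mu \pm 2\sigma = [12 \pm 2 \times 1.5] = [9, 15]$
[Figure: data points (+) plotted against $x$ with vertical axis $y, \hat{y}$; at the query point the prediction $\hat{y} = 12$ is marked with an interval of half-width $2\sigma_e = 3$.]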
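The interval arithmetic in this example can be reproduced with a short Python sketch. The function names and the two sample targets are assumptions made for illustration; the $\pm 2\sigma$ range is only approximately a 95% range and presumes Gaussian noise.

```python
import math

def local_mean_std(targets):
    """Mean and (population) standard deviation of targets near a query point."""
    n = len(targets)
    mean = sum(targets) / n
    var = sum((t - mean) ** 2 for t in targets) / n
    return mean, math.sqrt(var)

def two_sigma_range(mean, std):
    """Approximate 95% range, mean +/- 2*std, assuming Gaussian noise."""
    return mean - 2 * std, mean + 2 * std

# Two illustrative targets chosen so that the mean is 12 and the std is 1.5
mean, std = local_mean_std([10.5, 13.5])
low, high = two_sigma_range(mean, std)
print(mean, std, low, high)
```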
EE-M016 2005/6: IS L3&4 23/32 v2.0
Structure of a Linear Regression Model
Given a set of features $\mathbf{x}$, a linear predictor has the form:
$$\hat{y} = \mathbf{x}^T\boldsymbol{\theta} + b$$
The output is a real-valued, quantitative variable
The bias term can be included as an extra feature $x_0 = 1$. This renames the bias parameter as $\theta_0$.
Most linear control system models do not explicitly include a bias term, why is this?
Similar to the Toluca example in week 1.
[Figure: linear fit through data; horizontal axis $x$, vertical axis $y, \hat{y}$.]
EE-M016 2005/6: IS L3&4 24/32 v2.0
Least Mean Squares Learning
Least Mean Squares (LMS) proposed by Widrow 1962
This is a (non-optimal) sequential parameter estimation procedure for a linear model:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\,(y_k - \hat{y}_k)\,\mathbf{x}_k$$
NB, compared to classification, both $y_k$ and $\hat{y}_k$ are quantitative variables, so the error/noise signal $(y_k - \hat{y}_k)$ is generally non-zero. Similar to the Perceptron, but no threshold on $\mathbf{x}^T\hat{\boldsymbol{\theta}}$. $\eta$ is again the positive learning rate.
Widely used in filtering/signal processing and adaptive control applications
“Cheap” version of sequential/recursive parameter estimation
The normalised version (NLMS) was developed by Kaczmarz in 1937
EE-M016 2005/6: IS L3&4 25/32 v2.0
Proof of LMS Convergence (i)
If a noise-free data set containing a linear relationship x->y is repeatedly presented to a linear model, then the LMS algorithm is guaranteed to update the parameters so that they converge to their optimal values, assuming the learning rate is sufficiently small.
Note:
1. Assume there is no measurement noise in the target data
2. Assume the data is generated from a linear relationship
3. Parameter estimation will take an infinite time to converge to the optimal values
4. Rate of convergence and stability depend on the learning rate
EE-M016 2005/6: IS L3&4 26/32 v2.0
Proof of Convergence (ii)
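To show convergence, we need to establish that at the kth iteration, when an error has occurred:
$$\|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_{k+1}\|_2^2 < \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2$$
Using the update formula:
$$\begin{aligned}
\|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_{k+1}\|_2^2 &= \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k - \eta\,(y_k - \hat{y}_k)\,\mathbf{x}_k\|_2^2 \\
&= \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2 + \eta^2 (y_k - \hat{y}_k)^2 \|\mathbf{x}_k\|_2^2 - 2\eta\,(y_k - \hat{y}_k)\,\mathbf{x}_k^T(\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k) \\
&= \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2 + \eta^2 (y_k - \hat{y}_k)^2 \|\mathbf{x}_k\|_2^2 - 2\eta\,(y_k - \hat{y}_k)^2 \qquad (\mathbf{x}_k^T\boldsymbol{\theta}^* = y_k,\ \mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k = \hat{y}_k) \\
&< \|\boldsymbol{\theta}^* - \hat{\boldsymbol{\theta}}_k\|_2^2
\end{aligned}$$
when
$$0 < \eta < \min_k \frac{2}{\|\mathbf{x}_k\|_2^2}$$
[Figure: parameter space with axes $\theta_1$ and $\theta_2$, showing the estimates at iterations k and k+1.]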
EE-M016 2005/6: IS L3&4 27/32 v2.0
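Example: LMS Learning
Consider the “target” linear model y = 1 - 2*x, where the inputs are drawn from a normal distribution with zero mean, unit variance
Data set consisted of 25 data points, and involved 10 cycles through the data set
$\eta = 0.1$
[Figure: one panel shows the parameter estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ plotted against the iteration k; the other shows the data $y$ and the model output $\hat{y}$ against $x$ at k=0, k=5 and k=100.]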
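The example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the random seed and the zero starting parameters are choices made here, so the trajectory will not match the original plots point for point.

```python
import random

random.seed(0)                       # fixed seed, chosen arbitrarily
xs = [random.gauss(0.0, 1.0) for _ in range(25)]
ys = [1.0 - 2.0 * x for x in xs]     # noise-free targets from y = 1 - 2x

eta = 0.1
theta = [0.0, 0.0]                   # [bias, slope]; zero start is an assumption
for cycle in range(10):              # 10 cycles through the 25 data points
    for x, y in zip(xs, ys):
        y_hat = theta[0] + theta[1] * x
        err = y - y_hat
        # LMS update with the augmented input [1, x]
        theta = [theta[0] + eta * err, theta[1] + eta * err * x]

print(theta)
```

The estimates should end up close to the target values 1 and -2. Re-running with a larger `eta` is a quick way to explore when the updates stop converging.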
EE-M016 2005/6: IS L3&4 28/32 v2.0
Stability and NLMS
To normalise the LMS algorithm and remove the dependency of $\eta$ on the input vector size, consider:
$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\,(y_k - \hat{y}_k)\,\frac{\mathbf{x}_k}{\|\mathbf{x}_k\|_2^2}$$
This learning algorithm is stable for $0 < \eta < 2$ (exercise).
When $\eta = 1$, the NLMS algorithm has the property that the error, on that datum, after adaptation is zero, ie:
$$\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_{k+1} = y_k$$
Exercise: prove this.
Is this desirable when the target contains (measurement) noise?
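A single NLMS step can be checked numerically before attempting the proof. The sketch below is illustrative: the function name `nlms_step`, the starting parameters and the datum are assumptions made here.

```python
def nlms_step(theta, x, y, eta):
    """One NLMS update: theta + eta * (y - y_hat) * x / ||x||^2 (x must be non-zero)."""
    y_hat = sum(t * xi for t, xi in zip(theta, x))
    norm2 = sum(xi * xi for xi in x)
    return [t + eta * (y - y_hat) * xi / norm2 for t, xi in zip(theta, x)]

def error(theta, x, y):
    """Prediction error y - x^T theta on one datum."""
    return y - sum(t * xi for t, xi in zip(theta, x))

theta = [0.5, -1.0]            # illustrative starting parameters
x, y = [1.0, 3.0], 4.0         # one datum, with x[0] = 1 acting as the bias input

before = error(theta, x, y)
after_full = error(nlms_step(theta, x, y, eta=1.0), x, y)   # eta = 1
after_half = error(nlms_step(theta, x, y, eta=0.5), x, y)   # eta = 0.5
print(before, after_full, after_half)
```

With $\eta = 1$ the error on that datum is removed in one step, while $\eta = 0.5$ removes half of it. This numerical check is not a substitute for the proof asked for in the exercise.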
EE-M016 2005/6: IS L3&4 29/32 v2.0
Regression Summary
Regression is a (statistical) technique for predicting real-valued outputs, given a quantitative feature vector
Typically, it is assumed that the dependent, target variable is corrupted by Gaussian noise, and this is unpredictable.
The aim is then to fit the underlying linear/non-linear signal.
The LMS algorithm is a simple, cheap gradient descent technique for updating the linear parameter estimates
The parameters will converge to their correct values when the target does not contain any noise, otherwise they will oscillate in a zone around the optimum.
Stability of the algorithm depends on the learning rate
EE-M016 2005/6: IS L3&4 30/32 v2.0
Lecture 3&4: Summary
This lecture has looked at basic (linear) classification and regression techniques
– Investigated basic linear model structure
– Proposed simple, “on-line” learning rules
– Proved convergence for simple environments
– Discussed the practicality of the machine learning algorithms
While these algorithms are rarely used in this form, their structure has strongly influenced the development of more advanced techniques
– Support vector machines
– Multi-layer perceptrons
which will be studied in the coming weeks
EE-M016 2005/6: IS L3&4 31/32 v2.0
Laboratory 3&4: Perceptron/LMS
Download the irisClassifier.m & iris.mat Matlab files that contain a simple GUI for displaying the Iris data and entering decision boundaries
– Enter parameters that create suitable decision boundaries for both the Setosa and Virginica classes
– Which of the three classes are linearly separable?
– Make sure you can translate between the classifiers’ parameters, $\boldsymbol{\theta}$, and the gradient/intercept coordinate systems. Also ensure that the output is +1 (rather than -1) in the appropriate region
Download the irisPerceptron.m and perceptron.m Matlab files that contain the Perceptron algorithm for the Iris data
– Run the algorithm and note how the decision boundary changes when a point is correctly/incorrectly classified
– Modify the learning rate and note the effect it has on the convergence rate and final values
EE-M016 2005/6: IS L3&4 32/32 v2.0
Laboratory 3&4: Perceptron/LMS (ii)
Copy and modify the irisPerceptron.m Matlab file so that it runs on the logical AND and OR classification functions (see slides 16 & 17). Each should contain 2 features and four training patterns. Make sure you can calculate the updates by hand, as required on Slide 17.
Create a Matlab implementation of example given in Slide 27 for the LMS algorithm with a simple, single input linear model
What values of $\eta$ cause the LMS algorithm to become unstable?
Can this ever happen with the Perceptron algorithm?
Modify this implementation to use the NLMS training rule
Verify that learning is always stable for $0 < \eta < 2$.
Complete the two (pen and paper) exercises on Slide 28.
How might this insight be used with the Perceptron algorithm to implement a dynamic learning rate?