
EE-M016 2005/6: IS L3&4 1/32 v2.0

Lectures 3&4:Linear Machine Learning Algorithms

Dr Martin Brown

Room: E1k

Email: [email protected]

Telephone: 0161 306 4672

http://www.csc.umist.ac.uk/msc/intranet/EE-M016

EE-M016 2005/6: IS L3&4 2/32 v2.0

Lectures 3&4: Outline

Linear classification using the Perceptron
• Classification problem
• Linear classifier and decision boundary
• Perceptron learning rule
• Proof of convergence

Recursive linear regression using LMS
• Modelling and recursive parameter estimation
• Linear models and quadratic performance function
• LMS and NLMS learning rules
• Proof of convergence

EE-M016 2005/6: IS L3&4 3/32 v2.0

Lectures 3&4: Learning Objectives

1. Understand what classification and regression machine learning techniques are and their differences

2. Describe how linear models can be used for both classification and regression problems

3. Prove convergence of the learning algorithms for linear relationships, subject to restrictive conditions

4. Understand the restrictions of these basic proofs

Develop basic framework that will be expanded on in subsequent lectures

EE-M016 2005/6: IS L3&4 4/32 v2.0

Lecture 3&4: Resources

Classification/Perceptron

An introduction to Support Vector Machines and other kernel-based learning methods, N Cristianini, J Shawe-Taylor, CUP, 2000

Regression/LMS

Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985

Many other sources are available (on-line).

EE-M016 2005/6: IS L3&4 5/32 v2.0

What is Classification?

Classification is also known as (statistical) pattern recognition

The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations, that are believed to be informative for the classification task.

Example: Face recognition

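[Figure: prior knowledge and the training data D = {X, y} are used to design/learn a classifier $m(\boldsymbol{\theta}, \mathbf{x})$; the classifier then predicts a class label $\hat{y}$ for a new pattern $\mathbf{x}$.]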

EE-M016 2005/6: IS L3&4 6/32 v2.0

Classification Training Data

To supply training data for a classifier, examples must be collected that contain both positive (examples of the class) and negative (examples of other classes) instances. These are qualitative target class values and are stored as +1 and -1, for the positive and negative instances respectively. The labels are generated by an expert or by observation.

The quantitative input features should be informative

The training set should contain enough examples to be able to build statistically significant decisions

How to encode qualitative target and input features?

EE-M016 2005/6: IS L3&4 7/32 v2.0

Bayes Class Priors

Classification is all about decision making using the concept of “minimum risk”

Imagine that the training data contains 100 examples, 70 of them are class 1 (c1), 30 are class 2 (c2)

If I have to decide which class an unknown example belongs to, which decision is optimal?

Errors if decision is class 1: p(c1) =

Errors if decision is class 2: p(c2) =

Minimum risk decision is:

p(c1) & p(c2) are known as the Bayes priors. They represent the baseline performance for any classifier and are derived from the training data as simple percentages.
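As a minimal sketch of the 70/30 example, the priors and the baseline decision could be computed as below. The function name `class_priors` and the label strings are illustrative assumptions, not part of the lecture material.

```python
from collections import Counter


def class_priors(labels):
    """Estimate class priors as simple relative frequencies of the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical training labels matching the example: 70 of class c1, 30 of class c2.
labels = ["c1"] * 70 + ["c2"] * 30
priors = class_priors(labels)

# With no other information, always choosing the most frequent class minimises
# the expected number of errors; the error rate is the prior of the other class.
decision = max(priors, key=priors.get)
error_rate = 1.0 - priors[decision]
print(priors, decision, round(error_rate, 2))
```

Any classifier that uses the input features should do at least as well as this baseline error rate.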

EE-M016 2005/6: IS L3&4 8/32 v2.0

Structure of a Linear Classifier

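Given a set of quantitative features $\mathbf{x}$, a linear classifier has the form:

$$\hat{y} = \mathrm{sgn}(\mathbf{x}^T\boldsymbol{\theta} + \theta_0)$$

The sgn() function is used to produce the qualitative class label (+/-1)

The class/decision boundary is determined when:

$$\mathbf{x}^T\boldsymbol{\theta} + \theta_0 = 0$$

This is an (n-1)D hyperplane in feature space.

In 2-dimensional feature space:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \quad\Rightarrow\quad x_2 = -\frac{\theta_1}{\theta_2}x_1 - \frac{\theta_0}{\theta_2}$$

How does the sign and magnitude of $\boldsymbol{\theta}$ affect the decision boundary?

[Figure: the ($x_1$, $x_2$) feature plane, showing the linear decision boundary $x_2 = -\frac{\theta_1}{\theta_2}x_1 - \frac{\theta_0}{\theta_2}$ and training examples marked +.]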
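The following is a small sketch of a linear classifier in two dimensions, together with the conversion from the parameters to the gradient/intercept form of the decision boundary. The helper names (`sgn`, `predict`, `boundary_2d`) and the parameter values are assumptions chosen for illustration; ties at exactly zero are assigned to +1 here, which is one possible convention.

```python
def sgn(value):
    """Return +1 for non-negative input and -1 otherwise (tie-break is an assumption)."""
    return 1 if value >= 0 else -1


def predict(theta0, theta, x):
    """Linear classifier: sign of the bias plus the weighted sum of the features."""
    activation = theta0 + sum(t * xi for t, xi in zip(theta, x))
    return sgn(activation)


def boundary_2d(theta0, theta):
    """Gradient and intercept of the boundary x2 = m * x1 + c (requires theta[1] != 0)."""
    theta1, theta2 = theta
    return -theta1 / theta2, -theta0 / theta2


# Hypothetical parameters: boundary is -1 + x1 + 2 * x2 = 0.
theta0, theta = -1.0, [1.0, 2.0]
gradient, intercept = boundary_2d(theta0, theta)
label_above = predict(theta0, theta, [1.0, 1.0])   # activation = 2.0
label_below = predict(theta0, theta, [0.0, 0.0])   # activation = -1.0
print(gradient, intercept, label_above, label_below)
```

Multiplying all parameters by -1 leaves the boundary line unchanged but swaps which side is labelled +1, which is worth checking when entering boundaries by hand.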

EE-M016 2005/6: IS L3&4 9/32 v2.0

Simple Example: Fisher’s Iris Data

Famous example of building classifiers for a problem with 3 types of Iris flowers and 4 measurements about the flower:
• Sepal length and width
• Petal length and width

150 examples were collected, 50 from each class

Build 3 separate classifiers, one for recognizing examples of each class

Data is shown, plotted against last two features, as well as two linear classifiers for the Setosa and Virginica classes

Calculate in lab 3&4 …


EE-M016 2005/6: IS L3&4 10/32 v2.0

Perceptron Linear Classifier

The Perceptron linear classifier was devised by Rosenblatt in 1956

It comprises a linear classifier (as just discussed) and a simple parameter update rule of the form:

Cyclically present each training pattern {$\mathbf{x}_k$, $y_k$} to the linear classifier

When an error (misclassification) is made, update the parameters:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k, \qquad \hat{\theta}_{0,k+1} = \hat{\theta}_{0,k} + \eta\, y_k$$

where $\eta > 0$ is the learning rate.

The bias term can be included as $\theta_0$ with an extra feature $x_0 = 1$:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$

Continue until there are no prediction errors

Perceptron convergence theorem: If the data set is linearly separable, the perceptron learning algorithm will converge to an optimal separator in a finite time
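The learning rule described on this slide can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `train_perceptron`, the zero initial parameters, the `max_epochs` safety limit and the toy data set are all hypothetical, and a pattern lying exactly on the boundary is treated as an error.

```python
def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Cyclically present each pattern and update the parameters on every error.

    Each input is augmented with x0 = 1, so theta[0] plays the role of the bias.
    Returns the final parameters and the number of passes through the data.
    """
    theta = [0.0] * (len(X[0]) + 1)
    for epoch in range(max_epochs):
        errors = 0
        for x, target in zip(X, y):
            xa = [1.0] + list(x)
            activation = sum(t * xi for t, xi in zip(theta, xa))
            if target * activation <= 0:  # misclassified, or exactly on the boundary
                theta = [t + eta * target * xi for t, xi in zip(theta, xa)]
                errors += 1
        if errors == 0:  # a full pass with no prediction errors: stop
            return theta, epoch + 1
    return theta, max_epochs


# Hypothetical, linearly separable toy data with labels +1 / -1.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
theta, epochs = train_perceptron(X, y)
predictions = [1 if theta[0] + theta[1] * a + theta[2] * b > 0 else -1 for a, b in X]
print(theta, epochs, predictions)
```

The `max_epochs` limit matters in practice because the loop would never terminate on data that is not linearly separable.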

EE-M016 2005/6: IS L3&4 11/32 v2.0

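Instantaneous Parameter Update

What does this look like?

Error-driven update:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$

The parameters are updated to make them more like the incorrectly classified feature vector.

After updating:

$$y_k \mathbf{x}_k^T \hat{\boldsymbol{\theta}}_{k+1} = y_k \mathbf{x}_k^T (\hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k) = y_k \mathbf{x}_k^T \hat{\boldsymbol{\theta}}_k + \eta \|\mathbf{x}_k\|_2^2$$

(using $y_k^2 = 1$). Since $\eta \|\mathbf{x}_k\|_2^2 > 0$, the updated parameters are closer to the correct decision.

[Figure: left, the update in parameter space (axes $x_1, \theta_1$ and $x_2, \theta_2$) showing $\hat{\boldsymbol{\theta}}_k$, $\eta y_k \mathbf{x}_k$ and $\hat{\boldsymbol{\theta}}_{k+1}$; right, the output $y, \hat{y}$ (levels +1 and -1) plotted against $\mathbf{x}^T\hat{\boldsymbol{\theta}}$, showing $\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_k$ moving by $\eta\|\mathbf{x}_k\|_2^2$ to $\mathbf{x}_k^T\hat{\boldsymbol{\theta}}_{k+1}$, relative to 0.]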

EE-M016 2005/6: IS L3&4 12/32 v2.0

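Perceptron Convergence Proof Preamble …

Basic aim is to minimise the number of mis-classifications.

This is generally an NP-complete problem

We’ve assumed that there is an optimal solution with 0 errors

This is similar to Least Squares recursive estimation:

Performance = $\sum_i (y_i - \hat{y}_i)^2$ = 4*numberOfErrors

Except that the sgn() makes it a non-quadratic optimization problem

Updating only when there are errors:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k$$

is the same as:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \frac{\eta}{2} (y_k - \hat{y}_k)\, \mathbf{x}_k$$

with or without errors

Sometimes drawn as a network:

[Figure: network diagram. Repeatedly cycle through data set D, drawing out each sample {$\mathbf{x}_k$, $y_k$}; the input $\mathbf{x}_k$ produces the prediction $\hat{y}_k = \mathrm{sgn}(\mathbf{x}_k^T \hat{\boldsymbol{\theta}}_k)$, which is subtracted from the target $y_k$ (+/-), and the error $(y_k - \hat{y}_k)$ drives the “error driven” parameter estimation.]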
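The claim that updating only on errors matches the error-driven form can be checked numerically. The sketch below assumes labels in {+1, -1}, a shared `sgn` convention (zero mapped to +1) and hypothetical parameter and pattern values; the function names are illustrative.

```python
def sgn(value):
    """Return +1 for non-negative input and -1 otherwise (tie-break is an assumption)."""
    return 1 if value >= 0 else -1


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def update_on_error(theta, x, y, eta):
    """Perceptron rule: change the parameters only when the pattern is misclassified."""
    if sgn(dot(theta, x)) != y:
        return [t + eta * y * xi for t, xi in zip(theta, x)]
    return list(theta)


def update_error_driven(theta, x, y, eta):
    """Equivalent form: always apply (eta / 2) * (y - y_hat) * x, which is zero if correct."""
    y_hat = sgn(dot(theta, x))
    return [t + 0.5 * eta * (y - y_hat) * xi for t, xi in zip(theta, x)]


theta = [0.5, -1.0, 2.0]
patterns = [
    ([1.0, 2.0, 0.5], 1),     # predicted -1, target +1: error
    ([1.0, 2.0, 0.5], -1),    # predicted -1, target -1: correct
    ([1.0, -1.0, -3.0], 1),   # predicted -1, target +1: error
    ([1.0, -1.0, -3.0], -1),  # predicted -1, target -1: correct
]
results = [
    (update_on_error(theta, x, y, 0.1), update_error_driven(theta, x, y, 0.1))
    for x, y in patterns
]
print(all(a == b for a, b in results))
```

The factor of one half appears because $(y_k - \hat{y}_k)$ is either 0 or +/-2 when both labels are +/-1.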

EE-M016 2005/6: IS L3&4 13/32 v2.0

Convergence Analysis of the Perceptron (i)

If a linearly separable data set D is repeatedly presented to a Perceptron, then the learning procedure is guaranteed to converge (no errors) in a finite time

If the data set is linearly separable, there exist optimal parameters $\boldsymbol{\theta}^*$ such that

$$\mathrm{sgn}(\mathbf{x}_i^T \boldsymbol{\theta}^*) = y_i$$

for all i = 1, …, l

Note that $\alpha\boldsymbol{\theta}^*$, $\alpha > 0$, are also optimal parameter vectors

Consider the positive quantity $\gamma$ defined by, such that $\|\boldsymbol{\theta}^*\|_2 = 1$:

$$\gamma = \min_i\, y_i \mathbf{x}_i^T \boldsymbol{\theta}^*$$

This is a concept known as the “classification margin”

Assume also that the feature vectors are bounded by:

$$R^2 = \max_i \|\mathbf{x}_i\|_2^2$$

EE-M016 2005/6: IS L3&4 14/32 v2.0

Convergence Analysis of the Perceptron (ii)

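To show convergence, we need to establish that at the kth iteration, when an error has occurred:

$$\|\hat{\boldsymbol{\theta}}_{k+1} - \alpha\boldsymbol{\theta}^*\|_2^2 < \|\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*\|_2^2$$

[Figure: parameter space with axes $\theta_1$ and $\theta_2$, showing the estimates $\hat{\boldsymbol{\theta}}_k$ and $\hat{\boldsymbol{\theta}}_{k+1}$.]

Using the update formula:

$$\begin{aligned}
\|\hat{\boldsymbol{\theta}}_{k+1} - \alpha\boldsymbol{\theta}^*\|_2^2
&= \|\hat{\boldsymbol{\theta}}_k + \eta\, y_k \mathbf{x}_k - \alpha\boldsymbol{\theta}^*\|_2^2 \\
&= \|\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*\|_2^2 + \eta^2 \|\mathbf{x}_k\|_2^2 + 2\eta\, y_k \mathbf{x}_k^T (\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*) \\
&\le \|\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*\|_2^2 + \eta^2 \|\mathbf{x}_k\|_2^2 - 2\eta\alpha\, y_k \mathbf{x}_k^T \boldsymbol{\theta}^* \\
&\le \|\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*\|_2^2 + \eta^2 R^2 - 2\eta\alpha\gamma
\end{aligned}$$

The third line uses $y_k \mathbf{x}_k^T \hat{\boldsymbol{\theta}}_k \le 0$, which holds whenever an error has occurred.

To finish proof, select

$$\alpha > \frac{\eta R^2}{2\gamma}$$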

EE-M016 2005/6: IS L3&4 15/32 v2.0

Convergence Analysis of the Perceptron (iii)

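To show this terminates in a finite number of iterations, simply note that:

$$\eta^2 R^2 - 2\eta\alpha\gamma < 0$$

is independent of the current training sample, so the squared parameter error $\|\hat{\boldsymbol{\theta}}_k - \alpha\boldsymbol{\theta}^*\|_2^2$ must decrease by at least $2\eta\alpha\gamma - \eta^2 R^2$ at each update iteration. As the initial error is finite, $\hat{\boldsymbol{\theta}}_0 = 0$, say, and the squared error cannot become negative, only a finite number of update steps can occur before there are no more classification errors.

Note also that $\alpha$ (the scaling applied to the unit-length optimal vector $\boldsymbol{\theta}^*$) is proportional to the size of the feature vector ($R^2$) and inversely proportional to the size of the margin ($\gamma$). Both of these will influence the number of update iterations when the Perceptron is learning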
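As a worked step, the argument on this slide can be turned into an explicit bound on the number of updates. The sketch below assumes that each update reduces the squared parameter error by at least $2\eta\alpha\gamma - \eta^2R^2$, takes $\hat{\boldsymbol{\theta}}_0 = 0$ and $\|\boldsymbol{\theta}^*\|_2 = 1$, and makes the particular choice $\alpha = \eta R^2/\gamma$ (an assumption made here for convenience; any larger value than $\eta R^2 / 2\gamma$ would do). The symbol $N$ for the number of updates is introduced here.

```latex
\begin{align*}
\text{decrease per update} &\ge 2\eta\alpha\gamma - \eta^2 R^2 = \eta^2 R^2
  && \text{(using } \alpha = \eta R^2/\gamma\text{)} \\
\text{initial squared error} &= \|\hat{\boldsymbol{\theta}}_0 - \alpha\boldsymbol{\theta}^*\|_2^2
  = \alpha^2 = \frac{\eta^2 R^4}{\gamma^2} \\
N\,\eta^2 R^2 &\le \frac{\eta^2 R^4}{\gamma^2}
  \quad\Rightarrow\quad N \le \frac{R^2}{\gamma^2}
\end{align*}
```

Under these assumptions the bound does not depend on the learning rate, only on the ratio between the size of the feature vectors and the margin.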

EE-M016 2005/6: IS L3&4 16/32 v2.0

Example of Perceptron (i)

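Consider modelling the logical AND data using a Perceptron

$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix}$$

Is the data linearly separable?

[Figure: three panels showing the decision boundary in the ($x_1$, $x_2$) plane at k=0, $\hat{\boldsymbol{\theta}}$ = [0.01, 0.1, 0.006]; k=5, $\hat{\boldsymbol{\theta}}$ = [-0.98, 1.11, 1.01]; and k=18, $\hat{\boldsymbol{\theta}}$ = [-2.98, 2.11, 1.01].]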
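The parameter vectors quoted for this example can be checked directly against the AND data. The sketch below assumes the usual encoding (inputs in {0, 1}, targets -1 except for the pattern (1, 1)), that the first parameter is the bias, and that the quoted values are rounded as displayed; the helper names are illustrative.

```python
import math

# Logical AND with +1 / -1 targets (assumed encoding).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]


def activations(theta, X):
    """Compute theta_0 + theta_1 * x1 + theta_2 * x2 for every pattern."""
    return [theta[0] + theta[1] * x1 + theta[2] * x2 for x1, x2 in X]


def separates(theta, X, y):
    """True if every pattern lies strictly on the correct side of the boundary."""
    return all(t * a > 0 for t, a in zip(y, activations(theta, X)))


def margin(theta, X, y):
    """Smallest value of y * x^T theta after scaling theta (bias included) to unit length."""
    norm = math.sqrt(sum(t * t for t in theta))
    return min(t * a for t, a in zip(y, activations(theta, X))) / norm


theta_k0 = [0.01, 0.1, 0.006]
theta_k5 = [-0.98, 1.11, 1.01]
theta_k18 = [-2.98, 2.11, 1.01]

for name, theta in [("k=0", theta_k0), ("k=5", theta_k5), ("k=18", theta_k18)]:
    print(name, separates(theta, X, y), round(margin(theta, X, y), 3))
```

A positive margin for the k=18 parameters confirms that the data is linearly separable, while the negative margins at k=0 and k=5 show that learning had not yet finished at those points.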

EE-M016 2005/6: IS L3&4 17/32 v2.0

Example: Parameter Trajectory (ii)

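Lab exercise: Calculate by hand the first 4 iterations of the learning scheme

[Figure: parameter trajectories $\hat{\theta}_{i,k}$ (bias $\hat{\theta}_{0,k}$, $\hat{\theta}_{1,k}$ and $\hat{\theta}_{2,k}$) plotted against k, the data presentation index.]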

EE-M016 2005/6: IS L3&4 18/32 v2.0

Classification Margin

In this proof, we assumed that there exists a single, optimal parameter vector.

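In practice, when the data is linearly separable, there are an infinite number of separating parameter vectors, so simply requiring correct classification results in an ill-posed problem

The classification margin can be defined as the minimum distance of the decision boundary to a point in that class
– Used in deriving Support Vector Machines

[Figure: two plots in the ($x_1$, $x_2$) plane: one asking which of several separating boundaries to choose (?), the other showing the parameter vector $\boldsymbol{\theta}$ and the contours $\mathbf{x}^T\boldsymbol{\theta}$ = -1, 0, 1.]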

EE-M016 2005/6: IS L3&4 19/32 v2.0

Classification Summary

Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups

A linear classifier has a linear decision boundary

The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable

The final boundary is determined by the initial values and the order of presentation of the data

EE-M016 2005/6: IS L3&4 20/32 v2.0

Definition of Regression

Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others.

Examples:
• Sales of a product can be predicted by using the relationship between sales volume and amount of advertising
• The performance of an employee can be predicted by using the relationship between performance and aptitude tests
• The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.

EE-M016 2005/6: IS L3&4 21/32 v2.0

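Regression Problem Visualisation

Data generated by: $y = f(\mathbf{x}) + N(0, \sigma^2)$

Estimate model parameters: $\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} f(\mathbf{y}, \hat{\mathbf{y}})$, where $f(\boldsymbol{\theta}) = \frac{1}{2}\sum_i (y_i - m(\mathbf{x}_i, \boldsymbol{\theta}))^2$

Predict a real value (fit a curve to the data): $\hat{y} = m(\mathbf{x}, \hat{\boldsymbol{\theta}})$

Predictive performance: average error (rmse)

[Figure: scatter plot of the training data (+); horizontal axis x, vertical axis y, $\hat{y}$.]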

EE-M016 2005/6: IS L3&4 22/32 v2.0

Probabilistic Prediction Output

An output of 12 with rmse/standard deviation = 1.5: within a small region close to the query point, the average target value was 12 and the standard deviation within that region was 1.5 (variance = 2.25)

$\mu(y|x) = 12$, $\sigma(e) = 1.5$

95% of the data lies in the range $\mu \pm 2\sigma$ = [12 +/- 2*1.5] = [9, 15]

[Figure: training data (+) around the query point x, with the prediction $\hat{y}$ = 12 and an interval of half-width $2\sigma_e = 3$; axes x and y, $\hat{y}$.]

EE-M016 2005/6: IS L3&4 23/32 v2.0

Structure of a Linear Regression Model

Given a set of features $\mathbf{x}$, a linear predictor has the form:

$$\hat{y} = \mathbf{x}^T\boldsymbol{\theta} + b$$

The output is a real-valued, quantitative variable

The bias term can be included as an extra feature $x_0 = 1$. This renames the bias parameter as $\theta_0$.

Most linear control system models do not explicitly include a bias term, why is this?

Similar to the Toluca example in week 1.

[Figure: plot with axes x and y, $\hat{y}$.]

EE-M016 2005/6: IS L3&4 24/32 v2.0

Least Mean Squares Learning

Least Mean Squares (LMS) proposed by Widrow 1962

This is a (non-optimal) sequential parameter estimation procedure for a linear model:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, (y_k - \hat{y}_k)\, \mathbf{x}_k$$

NB, compared to classification, both $y_k$ and $\hat{y}_k$ are quantitative variables, so the error/noise signal $(y_k - \hat{y}_k)$ is generally non-zero. Similar to the Perceptron, but no threshold on $\mathbf{x}^T\hat{\boldsymbol{\theta}}$. $\eta$ is again the positive learning rate.

Widely used in filtering/signal processing and adaptive control applications

“Cheap” version of sequential/recursive parameter estimation

The normalised version (NLMS) was developed by Kaczmarz in 1937

EE-M016 2005/6: IS L3&4 25/32 v2.0

Proof of LMS Convergence (i)

If a noise-free data set containing a linear relationship x->y is repeatedly presented to a linear model, then the LMS algorithm is guaranteed to update the parameters so that they converge to their optimal values, assuming the learning rate is sufficiently small.

Note:
1. Assume there is no measurement noise in the target data
2. Assume the data is generated from a linear relationship
3. Parameter estimation will take an infinite time to converge to the optimal values
4. Rate of convergence and stability depend on the learning rate

EE-M016 2005/6: IS L3&4 26/32 v2.0

Proof of Convergence (ii)

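To show convergence, we need to establish that at the kth iteration, when an error has occurred:

$$\|\hat{\boldsymbol{\theta}}_{k+1} - \boldsymbol{\theta}^*\|_2^2 < \|\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*\|_2^2$$

[Figure: parameter space with axes $\theta_1$ and $\theta_2$, showing the estimates $\hat{\boldsymbol{\theta}}_k$ and $\hat{\boldsymbol{\theta}}_{k+1}$.]

Using the update formula:

$$\begin{aligned}
\|\hat{\boldsymbol{\theta}}_{k+1} - \boldsymbol{\theta}^*\|_2^2
&= \|\hat{\boldsymbol{\theta}}_k + \eta (y_k - \hat{y}_k) \mathbf{x}_k - \boldsymbol{\theta}^*\|_2^2 \\
&= \|\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*\|_2^2 + \eta^2 (y_k - \hat{y}_k)^2 \|\mathbf{x}_k\|_2^2 + 2\eta (y_k - \hat{y}_k)\, \mathbf{x}_k^T (\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*) \\
&= \|\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*\|_2^2 + \eta^2 (y_k - \hat{y}_k)^2 \|\mathbf{x}_k\|_2^2 - 2\eta (y_k - \hat{y}_k)^2 \\
&< \|\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*\|_2^2
\end{aligned}$$

The third line uses the noise-free linear relationship $y_k = \mathbf{x}_k^T \boldsymbol{\theta}^*$, so that $\mathbf{x}_k^T (\hat{\boldsymbol{\theta}}_k - \boldsymbol{\theta}^*) = \hat{y}_k - y_k$. The final inequality holds when

$$0 < \eta < \min_k \frac{2}{\|\mathbf{x}_k\|_2^2}$$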

EE-M016 2005/6: IS L3&4 27/32 v2.0

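Example: LMS Learning

Consider the “target” linear model y = 1 - 2*x, where the inputs are drawn from a normal distribution with zero mean, unit variance

Data set consisted of 25 data points, and involved 10 cycles through the data set

$\eta$ = 0.1

[Figure: left, the parameter estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ plotted against the iteration index k; right, the data and the model output (axes x and y, $\hat{y}$) with the fitted lines at k=0, k=5 and k=100.]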
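A minimal sketch of this experiment is given below, assuming noise-free targets, zero initial parameters and a fixed random seed (all assumptions; the slide does not state them). The function name `lms` is illustrative, and the bias is handled as a parameter on a constant input of 1.

```python
import random


def lms(samples, eta=0.1, cycles=10):
    """Run the LMS rule on (x, y) pairs for a single-input linear model with bias.

    theta[0] is the bias (weight on the constant input 1), theta[1] the slope.
    """
    theta = [0.0, 0.0]
    for _ in range(cycles):
        for x, y in samples:
            y_hat = theta[0] + theta[1] * x
            error = y - y_hat
            # theta <- theta + eta * (y - y_hat) * [1, x]
            theta = [theta[0] + eta * error, theta[1] + eta * error * x]
    return theta


rng = random.Random(0)  # fixed seed so the run is repeatable
inputs = [rng.gauss(0.0, 1.0) for _ in range(25)]
samples = [(x, 1.0 - 2.0 * x) for x in inputs]  # target model y = 1 - 2*x, no noise

theta = lms(samples)
print(theta)
```

With noise-free targets the estimates should approach the target parameters [1, -2]; adding noise to the targets would instead leave them fluctuating around those values.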

EE-M016 2005/6: IS L3&4 28/32 v2.0

Stability and NLMS

To normalise the LMS algorithm and remove the dependency of $\eta$ on the input vector size, consider:

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \eta\, \frac{(y_k - \hat{y}_k)}{\|\mathbf{x}_k\|_2^2}\, \mathbf{x}_k$$

This learning algorithm is stable for $0 < \eta < 2$ (exercise).

When $\eta = 1$, the NLMS algorithm has the property that the error, on that datum, after adaptation is zero, ie:

$$\mathbf{x}_k^T \hat{\boldsymbol{\theta}}_{k+1} = y_k$$

Exercise: prove this.

Is this desirable when the target contains (measurement) noise?
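The zero-error property for a learning rate of 1 can be checked numerically with a single NLMS step. This is only a sketch: the function name `nlms_step` and the numerical values are hypothetical, and the input vector is assumed to be non-zero so that the normalisation is defined.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def nlms_step(theta, x, y, eta=1.0):
    """One NLMS update: the LMS step divided by the squared length of the input."""
    error = y - dot(theta, x)
    norm_sq = dot(x, x)  # assumed non-zero
    scale = eta * error / norm_sq
    return [t + scale * xi for t, xi in zip(theta, x)]


theta = [0.5, -1.0]
x = [1.0, 3.0]
y = 2.0

prior_error = y - dot(theta, x)                       # error before adaptation
updated = nlms_step(theta, x, y, eta=1.0)
posterior_error = y - dot(updated, x)                 # error on the same datum afterwards
half_step_error = y - dot(nlms_step(theta, x, y, eta=0.5), x)
print(prior_error, posterior_error, half_step_error)
```

More generally the error on the presented datum is scaled by (1 - eta) after the update, which is consistent with the stability range quoted on this slide.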

EE-M016 2005/6: IS L3&4 29/32 v2.0

Regression Summary

Regression is a (statistical) technique for predicting real-valued outputs, given a quantitative feature vector

Typically, it is assumed that the dependent, target variable is corrupted by Gaussian noise, and this is unpredictable.

The aim is then to fit the underlying linear/non-linear signal.

The LMS algorithm is a simple, cheap gradient descent technique for updating the linear parameter estimates

The parameters will converge to their correct values when the target does not contain any noise, otherwise they will oscillate in a zone around the optimum.

Stability of the algorithm depends on the learning rate

EE-M016 2005/6: IS L3&4 30/32 v2.0

Lecture 3&4: Summary

This lecture has looked at basic (linear) classification and regression techniques
– Investigated basic linear model structure
– Proposed simple, “on-line” learning rules
– Proved convergence for simple environments
– Discussed the practicality of the machine learning algorithms

While these algorithms are rarely used in this form, their structure has strongly influenced the development of more advanced techniques
– Support vector machines
– Multi-layer perceptrons

which will be studied in the coming weeks

EE-M016 2005/6: IS L3&4 31/32 v2.0

Laboratory 3&4: Perceptron/LMS

Download the irisClassifier.m & iris.mat Matlab files that contain a simple GUI for displaying the Iris data and entering decision boundaries
– Enter parameters that create suitable decision boundaries for both the Setosa and Virginica classes
– Which of the three classes are linearly separable?
– Make sure you can translate between the classifiers’ parameters, $\boldsymbol{\theta}$, and the gradient/intercept coordinate systems. Also ensure that the output is +1 (rather than -1) in the appropriate region

Download the irisPerceptron.m and perceptron.m Matlab files that contain the Perceptron algorithm for the Iris data
– Run the algorithm and note how the decision boundary changes when a point is correctly/incorrectly classified
– Modify the learning rate and note the effect it has on the convergence rate and final values

EE-M016 2005/6: IS L3&4 32/32 v2.0

Laboratory 3&4: Perceptron/LMS (ii)

Copy and modify the irisPerceptron.m Matlab file so that it runs on the logical AND and OR classification functions (see slides 16 & 17). Each should contain 2 features and four training patterns. Make sure you can calculate the updates by hand, as required on Slide 17.

Create a Matlab implementation of example given in Slide 27 for the LMS algorithm with a simple, single input linear model

What values of $\eta$ cause the LMS algorithm to become unstable?

Can this ever happen with the Perceptron algorithm?

Modify this implementation to use the NLMS training rule

Verify that learning is always stable for $0 < \eta < 2$.

Complete the two (pen and paper) exercises on Slide 28.

How might this insight be used with the Perceptron algorithm to implement a dynamic learning rate?