Page 1: Longin Jan Latecki Temple University latecki@temple

Ch. 2: Linear Discriminants
Stephen Marsland, Machine Learning: An Algorithmic Perspective. CRC 2009.
Based on slides from Stephen Marsland, from Romain Thibaux (regression slides), and Moshe Sipper.

Longin Jan Latecki
Temple University
[email protected]

Page 2: Longin Jan Latecki Temple University latecki@temple


McCulloch and Pitts Neurons

[Diagram: inputs x_1, x_2, …, x_m with weights w_1, w_2, …, w_m, summed into h to produce the output o.]

Greatly simplified biological neurons. Sum the inputs.
If the total is greater than some threshold, the neuron fires; otherwise it does not.

Page 3: Longin Jan Latecki Temple University latecki@temple


McCulloch and Pitts Neurons

o = 1 if h = Σ_j w_j x_j > θ, and o = 0 otherwise, for some threshold θ.

The weight w_j can be positive or negative (inhibitory or excitatory).

Use only a linear sum of inputs. Use a simple output instead of a pulse (spike train).

Page 4: Longin Jan Latecki Temple University latecki@temple


Neural Networks

Can put lots of McCulloch & Pitts neurons together

Connect them up in any way we like. In fact, assemblies of the neurons are capable of universal computation: they can perform any computation that a normal computer can. We just have to solve for all the weights w_ij.

Page 5: Longin Jan Latecki Temple University latecki@temple


Training Neurons

Adapting the weights is learning. How does the network know it is right? How do we adapt the weights to make the network right more often?
Training set with target outputs. Learning rule.

Page 6: Longin Jan Latecki Temple University latecki@temple

2.2 The Perceptron

Definition from Wikipedia: the perceptron is a binary classifier which maps its input x (a real-valued vector) to an output value f(x) (a single binary value):

f(x) = 1 if w·x + b > 0, and f(x) = 0 otherwise,

where w is a vector of real-valued weights and b is the bias.

In order to not explicitly write b, we extend the input vector x by one more dimension that is always set to -1, e.g., x=(-1,x_1, …, x_7) with x_0=-1, and extend the weight vector to w=(w_0,w_1, …, w_7). Then adjusting w_0 corresponds to adjusting b.

The perceptron is considered the simplest kind of feed-forward neural network.
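As a concrete illustration of this definition and the bias trick, here is a minimal sketch in Python/NumPy (my own code, not from the slides; the function name is mine):

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron decision: f(x) = 1 if w.x > 0, else 0.

    The bias is folded into w by prepending the constant -1 input
    described above (x_0 = -1, so w_0 takes over the role of the bias b)."""
    x_ext = np.concatenate(([-1.0], x))      # prepend the constant -1 input
    return 1 if np.dot(w, x_ext) > 0 else 0

# Example: w = (w_0, w_1, w_2); the -1 input is added automatically.
w = np.array([-0.05, -0.02, 0.02])
print(perceptron_output(w, np.array([0.0, 0.0])))   # 0.05 > 0, so the output is 1
```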

Page 7: Longin Jan Latecki Temple University latecki@temple


Bias Replaces Threshold

[Diagram: the threshold is replaced by a bias weight attached to an extra input that is fixed at -1; inputs on the left, outputs on the right.]

Page 8: Longin Jan Latecki Temple University latecki@temple
Page 9: Longin Jan Latecki Temple University latecki@temple


Perceptron Decision = Recall

Outputs are y_j = sign( Σ_i w_ij x_i ).

For example, y=(y_1, …, y_5)=(1, 0, 0, 1, 1) is a possible output. We may have a different function g in the place of sign, as in (2.4) in the book.

Page 10: Longin Jan Latecki Temple University latecki@temple


Perceptron Learning = Updating the Weights

We want to change the values of the weights.
Aim: minimise the error at the output. If E = t - y, we want E to be 0.
Use: w_ij ← w_ij + η (t_j - y_j) x_i, where η is the learning rate, (t_j - y_j) is the error, and x_i is the input.

Page 11: Longin Jan Latecki Temple University latecki@temple

Example 1: The Logical OR (with constant bias input -1)

X_1 X_2 | t
 0   0  | 0
 0   1  | 1
 1   0  | 1
 1   1  | 1

Initial values: w_0(0) = -0.05, w_1(0) = -0.02, w_2(0) = 0.02, and η = 0.25.
Take the first row of our training table:
y_1 = sign( -0.05×(-1) + (-0.02)×0 + 0.02×0 ) = 1

w_0(1) = -0.05 + 0.25×(0-1)×(-1) = 0.2
w_1(1) = -0.02 + 0.25×(0-1)×0 = -0.02
w_2(1) = 0.02 + 0.25×(0-1)×0 = 0.02

We continue with the new weights and the second row, and so on. We make several passes over the training data.

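A small sketch of this training loop on the OR data (my own code, not from the slides; it mirrors the hand computation above, using a 0/1 step in place of sign to match the 0/1 targets):

```python
import numpy as np

# OR training data with the constant -1 bias input prepended to each row.
X = np.array([[-1, 0, 0],
              [-1, 0, 1],
              [-1, 1, 0],
              [-1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)

w = np.array([-0.05, -0.02, 0.02])    # w_0(0), w_1(0), w_2(0) from the slide
eta = 0.25

for epoch in range(10):               # several passes over the training data
    for x_i, t_i in zip(X, t):
        y_i = 1.0 if np.dot(w, x_i) > 0 else 0.0   # perceptron output
        w = w + eta * (t_i - y_i) * x_i            # w <- w + eta * (t - y) * x
print(w)   # the learned weights classify all four OR patterns correctly
```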

Page 12: Longin Jan Latecki Temple University latecki@temple


Decision boundary for OR perceptron

Page 13: Longin Jan Latecki Temple University latecki@temple

Perceptron Learning Applet

• http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html

Page 14: Longin Jan Latecki Temple University latecki@temple


Example 2: Obstacle Avoidance with the Perceptron

[Diagram: two sensors LS and RS connected to two motors LM and RM through weights w1, w2, w3, w4.]

η = 0.3; initial weights = -0.01


Page 15: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

LS RS | LM RM
 0  0 |  1  1
 0  1 | -1  1
 1  0 |  1 -1
 1  1 |  X  X

Page 16: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (1-1) × 0 = 0

Page 17: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w2 = 0 + 0.3 × (1-1) × 0 = 0
And the same for w3, w4.

Page 18: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

LS RS | LM RM
 0  0 |  1  1
 0  1 | -1  1
 1  0 |  1 -1
 1  1 |  X  X

Page 19: Longin Jan Latecki Temple University latecki@temple


Example 2: Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (-1-1) × 0 = 0

Page 20: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (-1-1) × 0 = 0
w2 = 0 + 0.3 × (1-1) × 0 = 0

Page 21: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (-1-1) × 0 = 0
w2 = 0 + 0.3 × (1-1) × 0 = 0
w3 = 0 + 0.3 × (-1-1) × 1 = -0.6

Page 22: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (-1-1) × 0 = 0
w2 = 0 + 0.3 × (1-1) × 0 = 0
w3 = 0 + 0.3 × (-1-1) × 1 = -0.6
w4 = 0 + 0.3 × (1-1) × 1 = 0

Page 23: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

LS RS | LM RM
 0  0 |  1  1
 0  1 | -1  1
 1  0 |  1 -1
 1  1 |  X  X

Page 24: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

w1 = 0 + 0.3 × (1-1) × 1 = 0
w2 = 0 + 0.3 × (-1-1) × 1 = -0.6
w3 = -0.6 + 0.3 × (1-1) × 0 = -0.6
w4 = 0 + 0.3 × (-1-1) × 0 = 0

Page 25: Longin Jan Latecki Temple University latecki@temple


Obstacle Avoidance with the Perceptron

[Diagram: the trained network, with weights -0.01, -0.01, -0.60, -0.60 on the connections between the sensors (LS, RS) and the motors (LM, RM).]

Page 26: Longin Jan Latecki Temple University latecki@temple


2.3 Linear Separability

Outputs are y = sign(w·x), where

w·x = ||w|| ||x|| cos α

and α is the angle between the vectors x and w.

Page 27: Longin Jan Latecki Temple University latecki@temple


Geometry of Linear Separability

The equation of a line is w_0 + w_1·x + w_2·y = 0. It also means that the point (x, y) is on the line.

This equation is equivalent to w·x = (w_0, w_1, w_2)·(1, x, y) = 0.

If w·x > 0, then the angle between w and x is less than 90 degrees, which means that x lies on the same side of the line as the direction in which w points.

Each output node of the perceptron tries to separate the training data into two classes (fire or no-fire) with a linear decision boundary, i.e., a straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions.
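A quick numeric check of this sign test (my own sketch, not from the slides):

```python
import numpy as np

w = np.array([-1.0, 1.0, 1.0])          # the line -1 + x + y = 0
for point in [(1.0, 1.0), (0.0, 0.0)]:
    x = np.array([1.0, *point])         # the extended point (1, x, y)
    # positive dot product: same side as w points; negative: the other side
    print(point, np.dot(w, x))          # (1,1) -> 1.0, (0,0) -> -1.0
```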

Page 28: Longin Jan Latecki Temple University latecki@temple


Linear Separability: The Binary AND Function

Page 29: Longin Jan Latecki Temple University latecki@temple


Gradient Descent Learning Rule

• Consider a linear unit without threshold and with continuous output y (not just -1, 1):
  y = w_0 + w_1 x_1 + … + w_n x_n
• Train the w_i's such that they minimize the squared error
  E[w_1, …, w_n] = ½ Σ_{d∈D} (t_d - y_d)²
  where D is the set of training examples.

Page 30: Longin Jan Latecki Temple University latecki@temple


Supervised Learning
• Training and test data sets
• Training set: input & target

Page 31: Longin Jan Latecki Temple University latecki@temple


Gradient Descent

D={<(1,1),1>,<(-1,-1),1>, <(1,-1),-1>,<(-1,1),-1>}

Gradient: ∇E[w] = [∂E/∂w_0, …, ∂E/∂w_n]

Update: Δw = -η ∇E[w], i.e., move from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2)

Δw_i = -η ∂E/∂w_i

∂E/∂w_i = ∂/∂w_i ½ Σ_d (t_d - y_d)²
        = Σ_d ∂/∂w_i ½ (t_d - Σ_i w_i x_i)²
        = Σ_d (t_d - y_d)(-x_i)
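To make the update concrete, here is a small batch gradient descent sketch for a linear unit on the dataset D above (my own illustration, not from the slides):

```python
import numpy as np

# The dataset D from the slide: inputs (x1, x2) and targets t.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
t = np.array([1, 1, -1, -1], dtype=float)
X = np.hstack([np.ones((4, 1)), X])          # constant 1 input paired with w0

w = np.array([0.5, -0.3, 0.2])               # some starting weights
eta = 0.1
for step in range(100):
    y = X @ w                                # linear unit: y = w0 + w1*x1 + w2*x2
    grad = -(t - y) @ X                      # dE/dw_i = sum_d (t_d - y_d)(-x_i)
    w = w - eta * grad                       # batch update: w <- w - eta * grad
print(w, 0.5 * np.sum((t - X @ w) ** 2))
# The weights shrink towards 0: this XOR-like data cannot be fitted by a
# linear unit, so the squared error levels off at 2.0.
```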

Page 32: Longin Jan Latecki Temple University latecki@temple


Gradient Descent

[Figure: error surface over the weights; each step Δw_i = -η ∂E/∂w_i moves downhill on this surface.]

Page 33: Longin Jan Latecki Temple University latecki@temple


Incremental Stochastic Gradient Descent

• Batch mode: gradient descent
  w = w - η ∇E_D[w] over the entire data D
  E_D[w] = ½ Σ_{d∈D} (t_d - y_d)²

• Incremental mode: gradient descent
  w = w - η ∇E_d[w] over individual training examples d
  E_d[w] = ½ (t_d - y_d)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough.

Page 34: Longin Jan Latecki Temple University latecki@temple


Gradient Descent Perceptron Learning

Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x_1, …, x_n), t>, where (x_1, …, x_n) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1).

• Initialize each w_i to some small random value.
• Until the termination condition is met, do:
  – For each <(x_1, …, x_n), t> in training_examples, do:
    • Input the instance (x_1, …, x_n) to the linear unit and compute the output y.
    • For each linear unit weight w_i, do:
      Δw_i = η (t - y) x_i
    • For each linear unit weight w_i, do:
      w_i = w_i + Δw_i
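A minimal Python rendering of the listed procedure (my own sketch; the function and variable names are mine, not from the book):

```python
import numpy as np

def gradient_descent_training(training_examples, eta=0.1, epochs=50):
    """Incremental (delta-rule) training of a linear unit.

    training_examples: list of (x, t) pairs, x a 1-D NumPy array, t a number."""
    n = len(training_examples[0][0])
    w = 0.01 * np.random.randn(n + 1)        # small random initial weights (w_0 is the bias)
    for _ in range(epochs):                  # "until the termination condition is met"
        for x, t in training_examples:
            x1 = np.concatenate(([1.0], x))  # constant input 1 paired with w_0
            y = np.dot(w, x1)                # output of the linear unit
            w = w + eta * (t - y) * x1       # delta_w_i = eta * (t - y) * x_i
    return w

examples = [(np.array([1.0, 1.0]), 1.0), (np.array([-1.0, -1.0]), 1.0),
            (np.array([1.0, -1.0]), -1.0), (np.array([-1.0, 1.0]), -1.0)]
print(gradient_descent_training(examples))   # weights near 0 for this XOR-like data
```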

Page 35: Longin Jan Latecki Temple University latecki@temple


Linear Separability: The Exclusive Or (XOR) Function

A B | Out
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

Limitations of the Perceptron

Page 36: Longin Jan Latecki Temple University latecki@temple


Limitations of the Perceptron

? W1 > 0, W2 > 0, and yet W1 + W2 < 0: these constraints are contradictory, so no single set of weights realizes XOR.

Page 37: Longin Jan Latecki Temple University latecki@temple


Limitations of the Perceptron?

[Diagram: a network with three inputs In1, In2, In3.]

A B C | Out
0 0 1 | 0
0 1 0 | 1
1 0 0 | 1
1 1 0 | 0

Page 38: Longin Jan Latecki Temple University latecki@temple

2.4 Linear regression

[Figure: temperature data plotted against one and against two input variables.]

Given examples (x_i, y_i), i = 1, …, n.
Predict y given a new point x.

Page 39: Longin Jan Latecki Temple University latecki@temple

Linear regression

[Figure: the same temperature data with the fitted regression line (one input) and plane (two inputs), and the resulting predictions.]

Page 40: Longin Jan Latecki Temple University latecki@temple

Ordinary Least Squares (OLS)

[Figure: a fitted line and one observation; the vertical difference between the observation and the prediction is the error or "residual".]

Sum squared error: Σ_i (y_i - w·x_i)²

Page 41: Longin Jan Latecki Temple University latecki@temple

Minimize the sum squared error Σ_i (y_i - w·x_i)².
Setting the derivative with respect to w to zero gives a linear equation; with several weights, a linear system.

Page 42: Longin Jan Latecki Temple University latecki@temple

Alternative derivation

[Figure: the n × d data matrix X and the target vector y.]
Solve the system (it's better not to invert the matrix).

Page 43: Longin Jan Latecki Temple University latecki@temple

Beyond lines and planes

Everything is the same if we replace the input x by a vector of basis functions, e.g. (1, x, x²); the model is still linear in the weights w.

[Figure: a curve fitted to the data using such features.]
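For instance (my own sketch, not from the slides), fitting a quadratic is still ordinary linear least squares once the inputs are expanded into basis functions:

```python
import numpy as np

x = np.linspace(0, 20, 30)
y = 2.0 + 0.5 * x - 0.03 * x**2 + 0.5 * np.random.randn(30)   # noisy quadratic data

Z = np.column_stack([np.ones_like(x), x, x**2])   # basis functions (1, x, x^2)
w, *_ = np.linalg.lstsq(Z, y, rcond=None)         # ordinary least squares, linear in w
print(w)                                          # roughly (2.0, 0.5, -0.03), up to noise
```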

Page 44: Longin Jan Latecki Temple University latecki@temple

Geometric interpretation

[Matlab demo]

[Figure: 3-D plot illustrating the geometric interpretation of least squares.]

Page 45: Longin Jan Latecki Temple University latecki@temple

Ordinary Least Squares [summary]

Given examples (x_i, y_i), i = 1, …, n, with x_i ∈ R^d.

Let X be the n × d matrix whose rows are the x_i, and let y be the vector of targets.

Minimize ||Xw - y||² by solving the normal equations (X^T X) w = X^T y, which gives

w = (X^T X)^(-1) X^T y

Predict ŷ = w·x for a new point x.
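A minimal sketch of this computation (my own code; as the earlier slide notes, it is better to solve the linear system than to invert the matrix explicitly):

```python
import numpy as np

# Toy data: n = 3 examples, d = 2 columns (a constant column for the intercept, one input).
X = np.array([[1.0, 0.0],
              [1.0, 10.0],
              [1.0, 20.0]])
y = np.array([20.0, 22.0, 26.0])

# Solve the normal equations (X^T X) w = X^T y rather than forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)          # intercept and slope of the fitted line
print(X @ w)      # predictions for the training points
```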

Page 46: Longin Jan Latecki Temple University latecki@temple

Probabilistic interpretation

[Figure: data scattered around the fitted line.]

Likelihood: least squares corresponds to maximizing the likelihood of the data under a model with Gaussian noise around the line.

Page 47: Longin Jan Latecki Temple University latecki@temple

Summary
• Perceptron and regression optimize the same target function.
• In both cases we compute the gradient (the vector of partial derivatives).
• In the case of regression, we set the gradient to zero and solve for the vector w. As the solution we have a closed formula for w at which the target function attains its global minimum.
• In the case of the perceptron, we iteratively move toward the minimum by going in the direction of minus the gradient. We do this incrementally, making small steps for each data point.

Page 48: Longin Jan Latecki Temple University latecki@temple

Homework 1

• (Ch. 2.3.3) Implement the perceptron in Matlab and test it on the Pima Indians Diabetes dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

• (Ch. 2.4.1) Implement linear regression in Matlab and apply it to the auto-mpg dataset.

Page 49: Longin Jan Latecki Temple University latecki@temple


From Ch. 3: Testing

How do we evaluate our trained network?
We can't just compute the error on the training data: that is unfair, and we can't see overfitting.
Keep a separate testing set. After training, evaluate on this test set.
How do we check for overfitting? We can't use the training or testing sets.

Page 50: Longin Jan Latecki Temple University latecki@temple


Validation

Keep a third set of data for this.
Train the network on the training data. Periodically, stop and evaluate on the validation set.
After training has finished, test on the test set.
This is getting expensive on data!

Page 51: Longin Jan Latecki Temple University latecki@temple


Hold Out Cross Validation

[Diagram: the inputs and targets are partitioned; most portions are used for training and one portion is held out for validation.]

Page 52: Longin Jan Latecki Temple University latecki@temple


Hold Out Cross Validation

Partition the training data into K subsets.
Train on K-1 of the subsets, validate on the Kth.
Repeat for a new network, leaving out a different subset.
Choose the network that has the best validation error.
We have traded off data for computation.
Extreme version: leave-one-out.
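A small sketch of this procedure (my own code, not from the book; `train_network` and `validation_error` are placeholders for whatever training and evaluation routines are used, and X, t are NumPy arrays):

```python
import numpy as np

def k_fold_indices(n, K, seed=0):
    """Split the indices 0..n-1 into K roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

def cross_validate(X, t, K, train_network, validation_error):
    """Train K networks, each validated on a different held-out fold,
    and return the one with the best validation error."""
    folds = k_fold_indices(len(X), K)
    best_err, best_model = None, None
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train_network(X[train], t[train])
        err = validation_error(model, X[val], t[val])
        if best_err is None or err < best_err:
            best_err, best_model = err, model
    return best_model
```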

Page 53: Longin Jan Latecki Temple University latecki@temple


Early Stopping

When should we stop training?
Could set a minimum training error: danger of overfitting.
Could set a number of epochs: danger of underfitting or overfitting.
Can use the validation set: measure the error on the validation set during training.
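A sketch of validation-based early stopping (my own code; `train_one_epoch` and `error_on` are placeholders for the actual training and evaluation routines):

```python
def train_with_early_stopping(model, train_data, valid_data,
                              train_one_epoch, error_on,
                              max_epochs=1000, patience=5):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # one pass over the training data
        err = error_on(model, valid_data)        # measure error on the validation set
        if err < best_err:
            best_err, best_epoch = err, epoch    # validation error still improving
        elif epoch - best_epoch >= patience:
            break                                # validation error stopped improving: stop
    return model
```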

Page 54: Longin Jan Latecki Temple University latecki@temple


Early Stopping

[Figure: error versus number of epochs for the training and validation sets; the time to stop training is where the validation error starts to rise while the training error keeps decreasing.]

