Page 1: Linear Discrimination Functions

Linear Discrimination Functions

Corso di Apprendimento Automatico
Laurea Magistrale in Informatica

Nicola Fanizzi

Dipartimento di Informatica
Università degli Studi di Bari

November 30, 2008


Page 2: Linear Discrimination Functions

Outline

Linear models
Gradient descent
Perceptron
Minimum square error approach
Linear and logistic regression


Page 3: Linear Discrimination Functions

Linear Discriminant Functions I

A linear discriminant function can be written as

g(~x) = ~w^t ~x + w0 = w0 + w1x1 + · · · + wdxd

where ~w is the weight vector and w0 is the bias (or threshold)

A 2-class linear classifier implements the following decision rule:
Decide ω1 if g(~x) > 0 and ω2 if g(~x) < 0
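
As a concrete illustration, a minimal Python/NumPy sketch of this decision rule; the weight vector and bias values are invented for the example.

    import numpy as np

    def g(x, w, w0):
        """Linear discriminant g(x) = w^t x + w0."""
        return np.dot(w, x) + w0

    def decide(x, w, w0):
        """2-class rule: omega_1 if g(x) > 0, omega_2 if g(x) < 0.
        g(x) = 0 is treated as omega_2 here; in theory it is the boundary."""
        return "omega_1" if g(x, w, w0) > 0 else "omega_2"

    # toy values (illustrative only)
    w = np.array([2.0, -1.0])
    w0 = -0.5
    print(decide(np.array([1.0, 0.5]), w, w0))   # g = 2 - 0.5 - 0.5 = 1 > 0 -> omega_1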


Page 4: Linear Discrimination Functions

Linear Discriminant Functions II

The equation g(x) = 0 defines the decision surface that separates points assigned to ω1 from points assigned to ω2.

When g(x) is linear, this decision surface is a hyperplane (H).


Page 5: Linear Discrimination Functions

Linear Discriminant Functions III

H divides the feature space into 2 half-spaces:
R1 for ω1, and R2 for ω2

If x1 and x2 are both on the decision surface

~w t~x1 + w0 = ~w t~x2 + w0 ⇒ ~w t (~x1 − ~x2) = 0

w is normal to any vector lying in the hyperplane


Page 6: Linear Discrimination Functions

Linear Discriminant Functions IV

If we express ~x as

~x = ~xp + r ~w/||~w||

where ~xp is the normal projection of ~x onto H, and r is the algebraic distance from ~x to the hyperplane

Since g(~xp) = 0, we have g(~x) = ~w^t ~x + w0 = r ||~w||, i.e. r = g(~x)/||~w||

r is a signed distance: r > 0 if ~x falls in R1, r < 0 if ~x falls in R2

The distance from the origin to the hyperplane is w0/||~w||


Page 7: Linear Discrimination Functions

Linear Discriminant Functions V


Page 8: Linear Discrimination Functions

Multicategory Case I

2 approaches to extend the LDF approach to the multicategory case:

ωi / not ωi: reduce the problem to c − 1 two-class problems.
Problem #i: find the function that separates points assigned to ωi from those not assigned to ωi

ωi / ωj: find the c(c − 1)/2 linear discriminants, one for every pair of classes

Both approaches can lead to regions in which the classification is undefined


Page 9: Linear Discrimination Functions

Multicategory Case II


Page 10: Linear Discrimination Functions

Pairwise Classification

Idea: build a model for each pair of classes, using only training data from those classes

Problem: have to solve k(k−1)/2 classification problems for a k-class problem

Turns out not to be a problem in many cases because the training sets become small:
Assume the data are evenly distributed, i.e. 2n/k instances per learning problem for n instances in total
Suppose the learning algorithm is linear in n
Then the runtime of pairwise classification is proportional to k(k−1)/2 × 2n/k = (k − 1)n
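
A minimal Python sketch of pairwise (one-vs-one) training and voting; train_binary stands in for any hypothetical 2-class learner and is not part of the slides.

    import itertools
    import numpy as np
    from collections import Counter

    def train_pairwise(X, y, train_binary):
        """Train one binary model per pair of classes (one-vs-one).
        train_binary(X, y01) is a hypothetical base learner returning a
        predict(x) -> {0, 1} function; any 2-class learner could be plugged in."""
        X, y = np.asarray(X), np.asarray(y)
        models = {}
        for ci, cj in itertools.combinations(sorted(set(y)), 2):
            mask = (y == ci) | (y == cj)             # keep only data from the two classes
            y01 = (y[mask] == ci).astype(int)        # 1 for class ci, 0 for class cj
            models[(ci, cj)] = train_binary(X[mask], y01)
        return models

    def predict_pairwise(x, models):
        """Each pairwise model votes for one of its two classes; the majority wins."""
        votes = Counter()
        for (ci, cj), predict in models.items():
            votes[ci if predict(x) == 1 else cj] += 1
        return votes.most_common(1)[0][0]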


Page 11: Linear Discrimination Functions

Linear Machine I

Define c linear discriminant functions:

gi(~x) = ~wi^t ~x + wi0,   i = 1, . . . , c

Linear Machine classifier: ~x ∈ ωi if gi(~x) > gj(~x) for all j ≠ i
In case of equal scores, the classification is undefined

An LM divides the feature space into c decision regions, with gi(~x) the largest discriminant if ~x is in Ri

If Ri and Rj are contiguous, the boundary between them is a portion of the hyperplane Hij defined by:

gi(~x) = gj(~x), or (~wi − ~wj)^t ~x + (wi0 − wj0) = 0
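
A small Python sketch of the linear-machine decision rule; the weight matrix and bias values are illustrative only.

    import numpy as np

    def linear_machine_predict(x, W, w0):
        """Linear machine: assign x to the class with the largest g_i(x) = w_i^t x + w_i0.
        W is a (c, d) matrix of weight vectors, w0 a length-c vector of biases."""
        scores = W @ x + w0            # g_1(x), ..., g_c(x)
        return int(np.argmax(scores))  # ties broken arbitrarily here (undefined in theory)

    # toy 3-class example (illustrative values)
    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    w0 = np.array([0.0, 0.0, 0.5])
    print(linear_machine_predict(np.array([2.0, 1.0]), W, w0))  # scores 2.0, 1.0, -2.5 -> class 0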


Page 12: Linear Discrimination Functions

Linear Machine II

It follows that ~wi − ~wj is normal to Hij
The signed distance from ~x to Hij is:

(gi(~x) − gj(~x)) / ||~wi − ~wj||

There are c(c − 1)/2 pairs of (convex) regions
Not all regions are contiguous, and the total number of segments in the surfaces is often less than c(c − 1)/2

(Figure: 3- and 5-class problems)


Page 13: Linear Discrimination Functions

Generalized LDF I

The LDF can be written g(~x) = w0 + ∑_{i=1}^d wi xi

By adding d(d + 1)/2 terms involving the products of pairs of components of ~x, we obtain the quadratic discriminant function:

g(~x) = w0 + ∑_{i=1}^d wi xi + ∑_{i=1}^d ∑_{j=1}^d wij xi xj

The separating surface defined by g(~x) = 0 is a second-degree or hyperquadric surface
By continuing to add terms such as wijk xi xj xk we obtain the class of polynomial discriminant functions


Page 14: Linear Discrimination Functions

Generalized LDF II

The generalized LDF is defined as

g(~x) = ∑_{i=1}^{d̂} ai yi(~x) = ~a^t ~y

where ~a is a d̂-dimensional weight vector, and the yi(~x) are arbitrary functions of ~x

The resulting discriminant function is not linear in ~x, but it is linear in ~y
The functions yi(~x) map points in the d-dimensional ~x-space to points in the d̂-dimensional ~y-space
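
For instance, a quadratic feature map yi(~x) could look like the following Python sketch (the helper names are ours, not from the slides):

    import numpy as np

    def quadratic_features(x):
        """Map x = (x1, ..., xd) to y(x) = (1, x1, ..., xd, x1*x1, x1*x2, ..., xd*xd):
        the discriminant a^t y(x) is quadratic in x but linear in y."""
        d = len(x)
        cross = [x[i] * x[j] for i in range(d) for j in range(i, d)]
        return np.concatenate(([1.0], x, cross))

    def generalized_ldf(x, a):
        """g(x) = a^t y(x); a must match the length of the feature vector."""
        return float(np.dot(a, quadratic_features(x)))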


Page 15: Linear Discrimination Functions

Generalized LDF III

Example: let the quadratic discriminant function be g(x) = a1 + a2 x + a3 x²

The 3-dimensional vector is then ~y = (1, x, x²)^t


Page 16: Linear Discrimination Functions

2-class Linearly-Separable Case I

g(~x) = ∑_{i=0}^d wi xi = ~a^t ~y

where x0 = 1 and
~y^t = [1 ~x] = [1 x1 · · · xd] is an augmented feature vector and
~a^t = [w0 ~w] = [w0 w1 · · · wd] is an augmented weight vector

The hyperplane decision surface H defined by ~a^t ~y = 0 passes through the origin in ~y-space

The distance from any point ~y to H is given by ~a^t ~y / ||~a|| = g(~x)/||~a||

Because ||~a|| = √(w0² + ||~w||²) ≥ ||~w||, this distance is at most the distance from ~x to H


Page 17: Linear Discrimination Functions

2-class Linearly-Separable Case II

Problem: find ~a = [w0 ~w]

Suppose that we have a set of n examples {~y1, . . . , ~yn} labeled ω1 or ω2

Look for a weight vector ~a that classifies all the examples correctly:

~a^t ~yi > 0 if ~yi is labeled ω1, and
~a^t ~yi < 0 if ~yi is labeled ω2

If ~a exists, the examples are linearly separable


Page 18: Linear Discrimination Functions

2-class Linearly-Separable Case III: Solutions

Replacing all the examples labeled ω2 by their negatives, one can look for a weight vector ~a such that ~a^t ~yi > 0 for all the examples
~a is a.k.a. the separating vector or solution vector
Each example ~yi places a constraint on the possible location of a solution vector
~a^t ~yi = 0 defines a hyperplane through the origin having ~yi as a normal vector
The solution vector (if it exists) must be on the positive side of every hyperplane
Solution Region = intersection of the n half-spaces


Page 19: Linear Discrimination Functions

2-class Linearly-Separable Case IV

Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique
Additional requirements can be imposed to find a solution vector closer to the middle of the region (a solution that is more likely to classify new examples correctly)
e.g. seek a unit-length weight vector that maximizes the minimum distance from the examples to the separating plane


Page 20: Linear Discrimination Functions

2-class Linearly-Separable Case V

Seek the minimum-length weight vector satisfying ~a^t ~yi ≥ b, for a margin b > 0
The solution region shrinks by margins b/||~yi||


Page 21: Linear Discrimination Functions

Gradient Descent I

Define a criterion function J(~a) that is minimized if ~a is a solution vector (~a^t ~yi ≥ 0, ∀i = 1, . . . , n)
Start with some arbitrary vector ~a(1)

Compute the gradient vector ∇J(~a(1))

The next value ~a(2) is obtained by moving a distance from ~a(1) in the direction of steepest descent, i.e. along the negative of the gradient

In general, ~a(k + 1) is obtained from ~a(k) using

~a(k + 1)← ~a(k)− η(k)∇J(~a(k))

where η(k) is the learning rate
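
A generic Python sketch of this iteration, assuming the gradient is supplied as a function and using a constant learning rate for simplicity:

    import numpy as np

    def gradient_descent(grad_J, a0, eta=0.1, n_steps=100):
        """Generic gradient descent: a(k+1) = a(k) - eta(k) * grad J(a(k)).
        grad_J(a) returns the gradient; eta is kept constant here for simplicity."""
        a = np.asarray(a0, dtype=float)
        for _ in range(n_steps):
            a = a - eta * grad_J(a)
        return a

    # example: minimize J(a) = ||a - c||^2, whose gradient is 2(a - c)
    c = np.array([1.0, -2.0])
    print(gradient_descent(lambda a: 2 * (a - c), a0=[0.0, 0.0]))   # converges towards c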


Page 22: Linear Discrimination Functions

Gradient Descent II


Page 23: Linear Discrimination Functions

Gradient Descent & Delta Rule I

To understand, consider a simpler linear machine (a.k.a. unit), where

o = w0 + w1x1 + · · · + wnxn

Let's learn the wi's that minimize the squared error

E[~w] ≡ (1/2) ∑_{d∈D} (td − od)²

where:
D is the set of training examples 〈~x, t〉
t is the target output value


Page 24: Linear Discrimination Functions

Gradient Descent & Delta Rule II

Gradient

∇E[~w] ≡ [∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn]

Training rule:

∆~w = −η ∇E[~w]

i.e.,

∆wi = −η ∂E/∂wi

Note that η is constant


Page 25: Linear Discrimination Functions

Gradient Descent & Delta Rule III

∂E/∂wi = ∂/∂wi (1/2) ∑_d (td − od)²
= (1/2) ∑_d ∂/∂wi (td − od)²
= (1/2) ∑_d 2 (td − od) ∂/∂wi (td − od)
= ∑_d (td − od) ∂/∂wi (td − ~w · ~xd)

∂E/∂wi = ∑_d (td − od)(−xid)


Page 26: Linear Discrimination Functions

Basic GRADIENT-DESCENT Algorithm

GRADIENT-DESCENT(D, η)
D: training set, η: learning rate (e.g. 0.5)

Initialize each wi to some small random value
until the termination condition is met do
    Initialize each ∆wi to zero
    for each 〈~x, t〉 ∈ D do
        Input the instance ~x to the unit and compute the output o
        for each wi do
            ∆wi ← ∆wi + η(t − o)xi
    for each weight wi do
        wi ← wi + ∆wi
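
A possible runnable Python/NumPy transcription of the pseudocode above; the fixed epoch budget used as the termination condition is our own choice:

    import numpy as np

    def gradient_descent_unit(D, eta=0.05, n_epochs=100):
        """Batch training of a linear unit o = w^t x with the delta rule
        (a direct transcription of the pseudocode above; each x is assumed
        to include the constant attribute x0 = 1)."""
        X = np.array([x for x, t in D], dtype=float)
        t = np.array([t for x, t in D], dtype=float)
        w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
        for _ in range(n_epochs):                        # termination: fixed epoch budget here
            delta_w = np.zeros_like(w)
            for x_d, t_d in zip(X, t):
                o = np.dot(w, x_d)                       # unit output
                delta_w += eta * (t_d - o) * x_d         # accumulate delta-rule contributions
            w = w + delta_w                              # one weight update per pass over D
        return w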


Page 27: Linear Discrimination Functions

Incremental (Stochastic) GRADIENT DESCENT I

Approximation of the standard GRADIENT-DESCENT

Batch GRADIENT-DESCENT:
Do until satisfied
    1 Compute the gradient ∇ED[~w]
    2 ~w ← ~w − η ∇ED[~w]

Incremental GRADIENT-DESCENT:
Do until satisfied
    For each training example d in D
        1 Compute the gradient ∇Ed[~w]
        2 ~w ← ~w − η ∇Ed[~w]


Page 28: Linear Discrimination Functions

Incremental (Stochastic) GRADIENT DESCENT II

ED[~w] ≡ (1/2) ∑_{d∈D} (td − od)²

Ed[~w] ≡ (1/2) (td − od)²

Training rule (delta rule):

∆wi ← η(t − o)xi

similar to the perceptron training rule, yet unthresholded
convergence is only asymptotically guaranteed
linear separability is no longer needed!
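
A minimal Python sketch of the incremental version with this delta rule; the starting weights, learning rate and epoch count are arbitrary choices:

    import numpy as np

    def stochastic_delta_rule(X, t, eta=0.05, n_epochs=50):
        """Incremental (stochastic) gradient descent: the weights are updated
        after each example with delta w_i = eta * (t - o) * x_i."""
        X = np.asarray(X, dtype=float)
        t = np.asarray(t, dtype=float)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_d, t_d in zip(X, t):
                o = np.dot(w, x_d)             # unthresholded output
                w = w + eta * (t_d - o) * x_d  # update immediately on this example
        return w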


Page 29: Linear Discrimination Functions

Standard vs. Stochastic GRADIENT-DESCENT

Incremental-GD can approximate Batch-GD arbitrarily closely if η is made small enough

in standard GD the error is summed over all examples before updating; in stochastic GD the weights are updated upon each example
standard GD is more costly per update step and can employ a larger η
stochastic GD may avoid falling into local minima because it uses Ed instead of ED


Page 30: Linear Discrimination Functions

Newton’s Algorithm

J(~a) ≃ J(~a(k)) + ∇J^t (~a − ~a(k)) + (1/2)(~a − ~a(k))^t H (~a − ~a(k))

where H = [∂²J/∂ai∂aj] is the Hessian matrix

Choose ~a(k + 1) to minimize this function:
~a(k + 1) ← ~a(k) − H⁻¹ ∇J(~a)

Greater improvement per step than GD, but not applicable when H is singular
Time complexity: O(d³)
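
An illustrative Python sketch of the Newton step, assuming the gradient and Hessian are available as functions (the quadratic test function is ours):

    import numpy as np

    def newton_minimize(grad_J, hess_J, a0, n_steps=20):
        """Newton's algorithm: a(k+1) = a(k) - H^{-1} grad J(a(k)).
        The linear system is solved instead of inverting H explicitly;
        this fails (raises) when H is singular, as noted above."""
        a = np.asarray(a0, dtype=float)
        for _ in range(n_steps):
            a = a - np.linalg.solve(hess_J(a), grad_J(a))
        return a

    # example: a quadratic J(a) = 0.5 a^t Q a - b^t a is minimized in a single step
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 0.0])
    print(newton_minimize(lambda a: Q @ a - b, lambda a: Q, a0=[5.0, 5.0], n_steps=1))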


Page 31: Linear Discrimination Functions

Perceptron I

Assumption: data is linearly separable

Hyperplane: ∑_{i=0}^d wi xi = 0
assuming that there is a constant attribute x0 = 1 (bias)

Algorithm for learning the separating hyperplane: perceptron learning rule

Classifier: if ∑_{i=0}^d wi xi > 0 then predict ω1 (or +1), otherwise predict ω2 (or −1)


Page 32: Linear Discrimination Functions

Perceptron II

Thresholded output

o(x1, . . . , xn) = +1 if w0 + w1x1 + · · · + wnxn > 0, −1 otherwise

Simpler vector notation: o(~x) = sgn(~w · ~x) = +1 if ~w · ~x > 0, −1 otherwise

Space of the hypotheses: {~w | ~w ∈ R^(n+1)}


Page 33: Linear Discrimination Functions

Decision Surface of a Perceptron

Can represent some useful functions
What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable
e.g., those not linearly separable (XOR)
Therefore, we'll want networks of these...


Page 34: Linear Discrimination Functions

Perceptron Training Rule I

Perceptron criterion function: J(~a) = ∑_{~y∈Y(~a)} (−~a^t ~y)

where Y(~a) is the set of examples misclassified by ~a
If no samples are misclassified, Y(~a) is empty and J(~a) = 0 (i.e. ~a is a solution vector)
J(~a) ≥ 0, since ~a^t ~yi ≤ 0 if ~yi is misclassified

Geometrically, J(~a) is proportional to the sum of the distances from the misclassified samples to the decision boundary

Since ∇J = ∑_{~y∈Y(~a)} (−~y), the update rule becomes

~a(k + 1) ← ~a(k) + η(k) ∑_{~y∈Yk(~a)} ~y

where Yk(~a) is the set of examples misclassified by ~a(k)
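
A possible Python rendering of this batch update; it assumes the ω2 samples have already been negated, as described earlier:

    import numpy as np

    def batch_perceptron(Y, eta=1.0, n_epochs=100):
        """Batch perceptron rule: a(k+1) = a(k) + eta * sum of the misclassified
        augmented samples (omega_2 rows of Y are assumed already negated, so
        a^t y > 0 is required for every row y of Y)."""
        Y = np.asarray(Y, dtype=float)
        a = np.zeros(Y.shape[1])
        for _ in range(n_epochs):
            misclassified = Y[Y @ a <= 0]          # samples with a^t y <= 0
            if len(misclassified) == 0:
                break                              # a is a solution vector
            a = a + eta * misclassified.sum(axis=0)
        return a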


Page 35: Linear Discrimination Functions

Perceptron Training Rule II

wi ← wi + ∆wi

where ∆wi = η(t − o)xi

with:
t = c(~x), the target value
o, the perceptron output
η, a small constant (e.g., 0.1), the learning rate


Page 36: Linear Discrimination Functions

Perceptron Training Rule III

Perceptron Learning Rule

Set all weights wi to zero
do
    for each instance x in the training data
        if x is classified incorrectly by the perceptron
            if x belongs to ω1, add it to ~w
            else subtract it from ~w
until all instances in the training data are classified correctly
return ~w

Can prove it will converge
if the training data is linearly separable
and η is sufficiently small


Page 37: Linear Discrimination Functions

Perceptron Training Rule IV

η = 1. Sequence of misclassified samples: ~y2, ~y3, ~y1, ~y3


Page 38: Linear Discrimination Functions

Perceptron Training Rule V

Why does this work?
Consider the situation where an instance ~x pertaining to the first class has been added:

(w0 + x0)x0 + (w1 + x1)x1 + (w2 + x2)x2 + . . . + (wd + xd)xd

This means the output for ~x has increased by:

x0x0 + x1x1 + x2x2 + . . . + xdxd

which is always positive,
thus the hyperplane has moved in the correct direction
(and the output decreases for instances of the other class)


Page 39: Linear Discrimination Functions

Fixed-Increment Single-Sample Perceptron

Perceptron({~y^(k)}_{k=1}^n): returns a weight vector

input: {~y^(k)}_{k=1}^n training examples
begin initialize ~a, k = 0
    do k ← (k + 1) mod n
        if ~y^(k) is misclassified by the model based on ~a
        then ~a ← ~a + ~y^(k)
    until all examples properly classified
    return ~a

end
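
A runnable Python sketch corresponding to this pseudocode; the pass limit is an added safeguard for nonseparable data and is not part of the original algorithm:

    import numpy as np

    def fixed_increment_perceptron(Y, max_epochs=1000):
        """Fixed-increment single-sample perceptron: cycle through the samples
        and add any misclassified y^(k) to a. Assumes the y's are augmented and
        the omega_2 samples have already been negated."""
        Y = np.asarray(Y, dtype=float)
        a = np.zeros(Y.shape[1])
        for _ in range(max_epochs):
            errors = 0
            for y_k in Y:                   # k <- (k+1) mod n: cycle through the samples
                if np.dot(a, y_k) <= 0:     # y^(k) misclassified by the current a
                    a = a + y_k             # fixed-increment update
                    errors += 1
            if errors == 0:                 # all examples properly classified
                return a
        return a                            # guard against nonseparable data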


Page 40: Linear Discrimination Functions

Comments

The perceptron algorithm adjusts the parameters only when it encounters an error, i.e. a misclassified training example
Correctly classified examples can be ignored
The learning rate η can be chosen arbitrarily; it only affects the norm of the final ~w (and the corresponding magnitude of w0)
The final weight vector ~w is a linear combination of training points


Page 41: Linear Discrimination Functions

Linear Models: WINNOW

Another mistake-driven algorithm for finding a separating hyperplane
Assumes binary data (i.e. attribute values are either zero or one)
Difference: multiplicative updates instead of additive updates
Weights are multiplied by a user-specified parameter α > 1 (or its inverse)
Another difference: user-specified threshold parameter θ
Predict the first class if w0 + w1x1 + w2x2 + · · · + wkxk > θ


Page 42: Linear Discrimination Functions

The Algorithm I

WINNOW

while some instances are misclassified
    for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
            if a belongs to the first class
                for each xi that is 1, multiply wi by α
                (if xi is 0, leave wi unchanged)
            otherwise
                for each xi that is 1, divide wi by α
                (if xi is 0, leave wi unchanged)
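
A Python sketch of WINNOW as described above; the default θ and the epoch limit are illustrative choices, and the bias w0 can be handled by adding a constant attribute x0 = 1:

    import numpy as np

    def winnow(X, y, alpha=2.0, theta=None, max_epochs=100):
        """WINNOW with multiplicative updates. X is binary (0/1), y holds 1 for
        the first class and 0 otherwise; theta defaults to the number of
        attributes, a common illustrative choice."""
        X = np.asarray(X, dtype=float)
        d = X.shape[1]
        theta = float(d) if theta is None else theta
        w = np.ones(d)                                   # weights start at 1
        for _ in range(max_epochs):
            mistakes = 0
            for x, target in zip(X, y):
                predicted = 1 if np.dot(w, x) > theta else 0
                if predicted != target:
                    mistakes += 1
                    if target == 1:                      # belongs to the first class
                        w[x == 1] *= alpha               # promote the active attributes
                    else:
                        w[x == 1] /= alpha               # demote the active attributes
            if mistakes == 0:
                break
        return w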


Page 43: Linear Discrimination Functions

The Algorithm II

WINNOW is very effective in homing in on relevant features (it is attribute efficient)

Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)


Page 44: Linear Discrimination Functions

Balanced WINNOW I

WINNOW doesn't allow negative weights, and this can be a drawback in some applications
BALANCED WINNOW maintains two weight vectors, one for each class: w⁺ and w⁻

An instance is classified as belonging to the first class (of two classes) if:

(w0⁺ − w0⁻) + (w1⁺ − w1⁻)x1 + (w2⁺ − w2⁻)x2 + · · · + (wk⁺ − wk⁻)xk > θ


Page 45: Linear Discrimination Functions

Balanced WINNOW II

BALANCED WINNOW

while some instances are misclassified
    for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
            if a belongs to the first class
                for each xi that is 1, multiply wi⁺ by α and divide wi⁻ by α
                (if xi is 0, leave wi⁺ and wi⁻ unchanged)
            otherwise
                for each xi that is 1, multiply wi⁻ by α and divide wi⁺ by α
                (if xi is 0, leave wi⁺ and wi⁻ unchanged)


Page 46: Linear Discrimination Functions

Nonseparable Case

The Perceptron is an error-correcting procedure that converges when the examples are linearly separable
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data
To ensure that the performance on training and test data will be similar, many training samples should be used
Sufficiently large training samples are almost certainly not linearly separable
No weight vector can correctly classify every example in a nonseparable set

The corrections may never cease if the set is nonseparable


Page 47: Linear Discrimination Functions

Learning rate

If we choose η(k) → 0 as k → ∞, then performance can be acceptable on nonseparable problems while preserving the ability to find a solution on separable problems
The rate at which η(k) approaches zero is important:

Too slow: the result will be sensitive to those examples that render the set nonseparable

Too fast: may converge prematurely with sub-optimal results

η(k) can be considered as a function of recent performance, decreasing it as performance improves: e.g. η(k) ← η/k


Page 48: Linear Discrimination Functions

Minimum Squared Error Approach I

Minimum Squared Error (MSE)
It trades the ability to obtain a separating vector for good performance on both separable and nonseparable problems
Previously, we sought a weight vector ~a making all of the inner products ~a^t ~yi ≥ 0
In the MSE procedure, one tries to make ~a^t ~yi = bi, where the bi are some arbitrarily specified positive constants

Using matrix notation: Y~a = ~b
If Y is nonsingular, then ~a = Y⁻¹~b
Unfortunately Y is not a square matrix, usually with more rows than columns
When there are more equations than unknowns, ~a is overdetermined, and ordinarily no exact solution exists


Page 49: Linear Discrimination Functions

Minimum Squared Error Approach II

We can seek a weight vector ~a that minimizes some function of an error vector ~e = Y~a − ~b
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function

J(~a) = ||Y~a − ~b||² = ∑_{i=1}^n (~a^t ~yi − bi)²

whose gradient is

∇J = 2 ∑_{i=1}^n (~a^t ~yi − bi) ~yi = 2Y^t(Y~a − ~b)

Setting the gradient equal to zero, the following necessary condition holds: Y^t Y ~a = Y^t ~b


Page 50: Linear Discrimination Functions

Minimum Squared Error Approach III

Y^t Y is a square matrix which is often nonsingular. Therefore, solving for ~a:

~a = (Y^t Y)⁻¹ Y^t ~b = Y⁺ ~b

where Y⁺ = (Y^t Y)⁻¹ Y^t is the pseudoinverse of Y

Y⁺ can also be written as lim_{ε→0} (Y^t Y + εI)⁻¹ Y^t

and it can be shown that this limit always exists; hence

~a = Y⁺ ~b

is the MSE solution to the problem Y~a = ~b
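
In NumPy this is a one-liner; np.linalg.pinv computes Y⁺, and the toy data below is invented for illustration:

    import numpy as np

    def mse_weights(Y, b):
        """MSE solution a = Y^+ b of Y a = b; np.linalg.pinv computes the
        pseudoinverse (and also handles the case where Y^t Y is singular)."""
        return np.linalg.pinv(Y) @ b

    # toy example: 4 augmented samples (omega_2 rows already negated), margins b = 1
    Y = np.array([[ 1.0,  1.0,  2.0],
                  [ 1.0,  2.0,  1.0],
                  [-1.0,  1.0, -2.0],   # negated omega_2 sample
                  [-1.0, -2.0,  1.0]])  # negated omega_2 sample
    b = np.ones(4)
    print(mse_weights(Y, b))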


Page 51: Linear Discrimination Functions

Widrow-Hoff Procedure a.k.a. LMS

The criterion function J(~a) = ||Y~a − ~b||² could be minimized by a gradient descent procedure
Advantages:
Avoids the problems that arise when Y^t Y is singular
Avoids the need for working with large matrices

Since ∇J = 2Y^t(Y~a − ~b), a simple update rule would be
~a(1) arbitrary
~a(k + 1) = ~a(k) + η(k) Y^t(~b − Y~a(k))

or, if we consider the samples sequentially
~a(1) arbitrary
~a(k + 1) = ~a(k) + η(k) [bk − ~a(k)^t ~y^(k)] ~y^(k)


Page 52: Linear Discrimination Functions

Widrow-Hoff or LMS Algorithm

LMS({~yi}_{i=1}^n)

input {~yi}_{i=1}^n: training examples
begin
    Initialize ~a, ~b, θ, η(·), k ← 0
    do
        k ← (k + 1) mod n
        ~a ← ~a + η(k)(bk − ~a(k)^t ~y^(k)) ~y^(k)
    until ||η(k)(bk − ~a(k)^t ~y^(k)) ~y^(k)|| < θ
    return ~a
end
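
A possible Python transcription of the LMS procedure, with η(k) = η(1)/k as suggested earlier and an added epoch limit as a guard:

    import numpy as np

    def lms(Y, b, eta0=0.1, theta=1e-6, max_epochs=1000):
        """Widrow-Hoff (LMS): sequential updates a <- a + eta(k) (b_k - a^t y^(k)) y^(k),
        with a decreasing rate eta(k) = eta0 / k and a stopping threshold theta."""
        Y = np.asarray(Y, dtype=float)
        b = np.asarray(b, dtype=float)
        a = np.zeros(Y.shape[1])
        k = 0
        for _ in range(max_epochs):
            for y_k, b_k in zip(Y, b):
                k += 1
                eta = eta0 / k
                correction = eta * (b_k - np.dot(a, y_k)) * y_k
                a = a + correction
                if np.linalg.norm(correction) < theta:   # ||eta(k)(b_k - a^t y^(k)) y^(k)|| < theta
                    return a
        return a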


Page 53: Linear Discrimination Functions

Linear Regression

Standard technique for numeric prediction
Outcome is a linear combination of the attributes:

x = w0 + w1x1 + w2x2 + · · · + wdxd

Weights ~w are calculated from the training data by standard numerical algorithms

Predicted value for the first training instance ~x^(1):

w0 + w1x1^(1) + w2x2^(1) + · · · + wdxd^(1) = ∑_{j=0}^d wj xj^(1)

assuming extended vectors with x0 = 1


Page 54: Linear Discrimination Functions

Probabilistic Classification

Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't
Prediction: predict the class corresponding to the model with the largest output value (membership value)

Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates


Page 55: Linear Discrimination Functions

Logistic Regression I

Logit transformation

Builds a linear model for a transformed target variable

Assume we have two classes

Logistic regression replaces the target

Pr(1 | ~x)

by this target

log( Pr(1 | ~x) / (1 − Pr(1 | ~x)) )

The transformation maps [0,1] to (−∞,+∞)


Page 56: Linear Discrimination Functions

Logistic Regression II


Page 57: Linear Discrimination Functions

Example: Logistic Regression Model

Resulting model: Pr(1 | ~x) = 1 / (1 + e^−(w0+w1x1+w2x2+···+wdxd))

Example: Model with w0 = 0.5 and w1 = 1:

Parameters induced from data using maximum likelihood
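
A two-line Python sketch of this model, evaluated at the slide's example parameters w0 = 0.5, w1 = 1:

    import numpy as np

    def pr_class1(x, w, w0):
        """Logistic model: Pr(1 | x) = 1 / (1 + exp(-(w0 + w^t x)))."""
        return 1.0 / (1.0 + np.exp(-(w0 + np.dot(w, x))))

    # the 1-dimensional example above (w0 = 0.5, w1 = 1), evaluated at x = 0
    print(pr_class1(np.array([0.0]), np.array([1.0]), 0.5))   # = 1/(1 + e^-0.5) ~ 0.62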


Page 58: Linear Discrimination Functions

Maximum Likelihood

Aim: maximize the probability of the training data w.r.t. the parameters
Can use logarithms of probabilities and maximize the log-likelihood of the model:

∑_{i=1}^n (1 − x^(i)) log(1 − Pr(1 | ~x^(i))) + x^(i) log Pr(1 | ~x^(i))

where the x^(i) are either 0 or 1
Weights wi need to be chosen to maximize the log-likelihood

a relatively simple method: iteratively re-weighted least squares


Page 59: Linear Discrimination Functions

Summary

Perceptron training rule guaranteed to succeed if:
the training examples are linearly separable
and the learning rate η is sufficiently small

Linear unit training rule uses gradient descent
Guaranteed to converge to the hypothesis with minimum squared error
given a sufficiently small learning rate η,
even when the training data contains noise,
even when the training data is not separable by H


Page 60: Linear Discrimination Functions

Credits

R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann
