Linear Methods for Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?

[Figure: piecewise linear decision boundaries in a 2D input space $(X_1, X_2)$, partitioning it into regions $R_1, R_2, R_3, R_4$]
Linear Classification…
• There is a discriminant function $\delta_k(x)$ for each class k
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In higher-dimensional spaces the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
Linear Classification…
• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear
– An example:

$$\Pr(G=1 \mid X=x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

So that

$$\log\left[\frac{\Pr(G=1 \mid X=x)}{\Pr(G=2 \mid X=x)}\right] = \beta_0 + \beta^T x \qquad \text{(linear)}$$
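A quick numerical check of this example (the values of $\beta_0$ and $\beta$ below are illustrative, not fitted): the log-odds of the two posteriors recovers the linear function $\beta_0 + \beta^T x$ exactly.

```python
import numpy as np

# Illustrative (assumed) parameter values for the two-class logistic model.
beta0 = -1.0
beta = np.array([2.0, -0.5])

def posterior_class1(x):
    """Pr(G=1 | X=x) = exp(beta0 + beta^T x) / (1 + exp(beta0 + beta^T x))."""
    z = beta0 + beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.array([0.3, 1.2])
p1 = posterior_class1(x)
p2 = 1.0 - p1                     # Pr(G=2 | X=x)
log_odds = np.log(p1 / p2)

# The monotone transform (log-odds) is linear in x:
print(np.isclose(log_odds, beta0 + beta @ x))  # prints True
```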
Linear Classification as a Linear Regression

2D input space: X = (X1, X2). Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3). Training sample of size N = 5.

Indicator matrix $\mathbf{Y}$ (each row has exactly one 1, indicating the category/class) and design matrix $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{bmatrix}$$

Regression output:

$$\hat{Y}((x_1, x_2)) = (1\;\; x_1\;\; x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \left(\hat{Y}_1((x_1,x_2)),\; \hat{Y}_2((x_1,x_2)),\; \hat{Y}_3((x_1,x_2))\right)$$

$$\hat{Y}_k((x_1, x_2)) = \beta_{0k} + \beta_{1k}\,x_1 + \beta_{2k}\,x_2, \qquad k = 1, 2, 3$$

Classification rule:

$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
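A minimal numpy sketch of this construction (the function names are mine): build the N-by-K indicator matrix, solve the least-squares problem, and classify by the largest fitted output.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Fit linear regression of the indicator matrix.
    X: N x p inputs, g: N integer labels in {0, ..., K-1}."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                 # each row has exactly one 1
    Xa = np.hstack([np.ones((N, 1)), X])     # prepend the constant feature
    # Coefficients B = (X^T X)^{-1} X^T Y, via a stable least-squares solve.
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return B

def predict_indicator_regression(B, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xa @ B, axis=1)         # argmax_k Y_hat_k(x)

# Usage on toy 2D data with two well-separated classes (K=3 works the same way):
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
g = np.array([0, 0, 1, 1])
B = fit_indicator_regression(X, g, K=2)
pred = predict_indicator_regression(B, X)
```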
The Masking

$$\hat{Y}_1((x_1,x_2)) = \beta_{01} + \beta_{11}\,x_1 + \beta_{21}\,x_2$$
$$\hat{Y}_2((x_1,x_2)) = \beta_{02} + \beta_{12}\,x_1 + \beta_{22}\,x_2$$
$$\hat{Y}_3((x_1,x_2)) = \beta_{03} + \beta_{13}\,x_1 + \beta_{23}\,x_2$$

Linear regression of the indicator matrix can lead to masking. LDA can avoid this masking.

[Figure: 2D input space with three classes, viewed along the indicated direction; the fitted $\hat{Y}$ of the middle class is never the maximum, so that class is masked]
Linear Discriminant Analysis

Posterior probability (application of Bayes' rule):

$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$\pi_k$ is the prior probability for class k; $f_k(x)$ is the class-conditional density, or likelihood:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1} (x - \mu_k)\right)$$

Essentially the minimum-error Bayes classifier. Assumes that the class-conditional densities are (multivariate) Gaussian, with equal covariance for every class.
LDA…

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k + x^T\mathbf{\Sigma}^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k\right) - \left(\log\pi_l + x^T\mathbf{\Sigma}^{-1}\mu_l - \tfrac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l\right) = \delta_k(x) - \delta_l(x),$$

where

$$\delta_k(x) = \log\pi_k + x^T\mathbf{\Sigma}^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k$$

Classification rule:

$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$

is equivalent to:

$$\hat{G}(x) = \arg\max_k \delta_k(x)$$

The good old Bayes classifier!
LDA…

When are we going to use the training data? Training data $(x_i, g_i),\ i = 1, \dots, N$ (total N input-output pairs, $N_k$ pairs in class k, total number of classes K) are utilized to estimate:

Prior probabilities: $\hat{\pi}_k = N_k / N$

Means: $\hat{\mu}_k = \sum_{g_i = k} x_i \,/\, N_k$

Covariance matrix: $\hat{\mathbf{\Sigma}} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \,/\, (N - K)$
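These estimates, combined with the discriminant $\delta_k(x)$ above, give a complete LDA classifier. A minimal numpy sketch (the function names are mine):

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate priors, class means, and the pooled covariance (the formulas above)."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])        # N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])  # sum of x_i / N_k
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - K)                                              # pooled estimate
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """delta_k(x) = log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    lin = X @ Sinv @ means.T                              # N x K linear terms
    quad = 0.5 * np.sum((means @ Sinv) * means, axis=1)   # K quadratic terms
    return np.argmax(lin - quad + np.log(priors), axis=1)

# Usage on two well-separated Gaussian clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
g = np.repeat([0, 1], 50)
priors, means, Sigma = lda_fit(X, g, 2)
pred = lda_predict(X, priors, means, Sigma)
```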
LDA: Example
LDA was able to avoid masking here
Quadratic Discriminant Analysis
• Relaxes the equal-covariance assumption
– Class-conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices
• The class decision boundaries are then not linear but quadratic:

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k - \tfrac{1}{2}\log|\mathbf{\Sigma}_k| - \tfrac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)\right) - \left(\log\pi_l - \tfrac{1}{2}\log|\mathbf{\Sigma}_l| - \tfrac{1}{2}(x-\mu_l)^T\mathbf{\Sigma}_l^{-1}(x-\mu_l)\right) = \delta_k(x) - \delta_l(x)$$
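A short sketch of the QDA discriminant $\delta_k(x) = \log\pi_k - \tfrac{1}{2}\log|\mathbf{\Sigma}_k| - \tfrac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)$ (the function name is mine):

```python
import numpy as np

def qda_discriminants(x, priors, means, covs):
    """Evaluate delta_k(x) for each class; covs holds one covariance per class."""
    scores = []
    for pi_k, mu_k, S_k in zip(priors, means, covs):
        d = x - mu_k
        sign, logdet = np.linalg.slogdet(S_k)   # numerically stable log|Sigma_k|
        scores.append(np.log(pi_k) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(S_k, d))
    return np.array(scores)

# Usage: a point near class 0's mean should score highest for class 0.
priors = [0.5, 0.5]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]      # unequal covariances, as QDA allows
scores = qda_discriminants(np.array([0.1, -0.2]), priors, means, covs)
```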
QDA and Masking

Better than linear regression in terms of handling masking. Usually computationally more expensive than LDA.
Fisher's Linear Discriminant [DHS]

From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.
Fisher's LD…

Projection of a vector x on a unit vector w: $w^T x$

[Figure: geometric interpretation of the projection $w^T x$ of x onto w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projected classes is small.
Fisher's LD…

Class means:

$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:

$$\tilde{m}_2 - \tilde{m}_1 = w^T (m_2 - m_1)$$

Scatter of the projected data (this will indicate overlap between the classes):

$$\tilde{s}_1^2 = \sum_{x_i \in R_1} (w^T x_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1} \left(w^T x_i - w^T m_1\right)^2 = w^T \mathbf{S}_1 w, \qquad \mathbf{S}_1 = \sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T$$

$$\tilde{s}_2^2 = \sum_{x_i \in R_2} (w^T x_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2} \left(w^T x_i - w^T m_2\right)^2 = w^T \mathbf{S}_2 w, \qquad \mathbf{S}_2 = \sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T$$
Fisher's LD…

Ratio of the squared difference of projected means over the total scatter (a Rayleigh quotient):

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T \mathbf{S}_B w}{w^T \mathbf{S}_W w}$$

where

$$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2, \qquad \mathbf{S}_B = (m_2 - m_1)(m_2 - m_1)^T$$

We want to maximize r(w). The solution is

$$w = \mathbf{S}_W^{-1}(m_2 - m_1)$$
Fisher's LD: Classifier

So far so good. However, how do we get the classifier? All we know at this point is that the direction $w = \mathbf{S}_W^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x is in $R_2$ if $y(x) > 0$, else x is in $R_1$, where

$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T \mathbf{S}_W^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
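The direction and threshold above can be computed directly; a short numpy sketch (the function names are mine):

```python
import numpy as np

def fisher_ld(X1, X2):
    """w = S_W^{-1}(m2 - m1); threshold = average of the two projected means."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)            # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)            # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, m2 - m1)   # solve S_W w = m2 - m1
    threshold = 0.5 * (w @ m1 + w @ m2)
    return w, threshold

def fisher_classify(x, w, threshold):
    """Return 2 if y(x) = w^T x - threshold > 0, else 1."""
    return 2 if w @ x - threshold > 0 else 1

# Usage on two separated clusters:
rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 0.5, (40, 2))
X2 = rng.normal(3.0, 0.5, (40, 2))
w, t = fisher_ld(X1, X2)
label = fisher_classify(np.array([3.0, 3.0]), w, t)
```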
Fisher's LD and LDA

They become the same when:
(1) Prior probabilities are equal
(2) There is a common covariance matrix for the class-conditional densities
(3) Both class-conditional densities are multivariate Gaussian

Ex. Show that Fisher's LD classifier and LDA produce the same classification rule given the above assumptions.

Note: (1) Fisher's LD does not assume Gaussian densities. (2) Fisher's LD can be used for dimension reduction in a multiple-class scenario.
Logistic Regression

• The output of regression is the posterior probability, i.e., Pr(output | input)
• Always ensures that the output variables sum to 1 and each output is non-negative
• A linear classification method
• We need to know two concepts to understand logistic regression:
– Newton-Raphson method
– Maximum likelihood estimation
Newton-Raphson Method

A technique for solving the non-linear equation f(x) = 0.

Taylor series:

$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n)\,f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$. After rearrangement:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Rule for iteration; need an initial guess $x_0$.
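The iteration rule in a few lines of Python (a generic sketch, with the derivative supplied by the caller):

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:     # stop once the update is negligible
            break
    return x

# Usage: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```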
Newton-Raphson in Multi-dimensions

We want to solve the N equations:

$$f_1(x_1, x_2, \dots, x_N) = 0, \quad f_2(x_1, x_2, \dots, x_N) = 0, \quad \dots, \quad f_N(x_1, x_2, \dots, x_N) = 0$$

Taylor series:

$$f_j(\mathbf{x} + \delta\mathbf{x}) \approx f_j(\mathbf{x}) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \dots, N$$

After some rearrangement etc., the rule for iteration (need an initial guess):

$$\begin{bmatrix} x_1^{n+1} \\ x_2^{n+1} \\ \vdots \\ x_N^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ x_2^{n} \\ \vdots \\ x_N^{n} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_N} \\ \vdots & & & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \dfrac{\partial f_N}{\partial x_2} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{bmatrix}^{-1} \begin{bmatrix} f_1(x_1^n, x_2^n, \dots, x_N^n) \\ f_2(x_1^n, x_2^n, \dots, x_N^n) \\ \vdots \\ f_N(x_1^n, x_2^n, \dots, x_N^n) \end{bmatrix}$$

The matrix of partial derivatives is the Jacobian matrix.
Newton-Raphson: Example

Solve a 2-by-2 non-linear system $f_1(x_1, x_2) = 0$, $f_2(x_1, x_2) = 0$ (the slide's specific example, mixing polynomial, sine and cosine terms in $x_1$ and $x_2$, is not recoverable here). The iteration rule applies the 2-by-2 Jacobian at the current iterate and needs an initial guess:

$$\begin{bmatrix} x_1^{n+1} \\ x_2^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ x_2^{n} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{bmatrix}^{-1} \begin{bmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{bmatrix}$$
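The same update in code. Since the slide's exact system did not survive extraction, the usage below solves an assumed illustrative system, $f_1 = x_1^2 + x_2^2 - 4$ and $f_2 = x_1 x_2 - 1$:

```python
import numpy as np

def newton_raphson_nd(f, jacobian, x0, tol=1e-12, max_iter=100):
    """x^{n+1} = x^n - J(x^n)^{-1} f(x^n); solve J step = f rather than inverting J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jacobian(x), f(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Assumed example system (not the slide's exact one):
f = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0])
jacobian = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]],
                               [x[1], x[0]]])
solution = newton_raphson_nd(f, jacobian, [2.0, 0.5])
```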
Maximum Likelihood Parameter Estimation

Let's start with an example. We want to find the unknown parameters, the mean $\mu$ and standard deviation $\sigma$, of a Gaussian pdf, given N independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \dots, x_N$. Form the likelihood function:

$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:

$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu, \sigma} L(\mu, \sigma)$$

Let's find out $(\hat{\mu}, \hat{\sigma})$.
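Working the maximization out (take the log of L, differentiate, set to zero) gives the sample mean for $\hat{\mu}$ and the 1/N sample standard deviation for $\hat{\sigma}$; a quick simulation check:

```python
import numpy as np

# Draw N independent samples from a Gaussian with known parameters.
rng = np.random.default_rng(0)
mu_true, sigma_true, N = 3.0, 2.0, 100_000
x = rng.normal(mu_true, sigma_true, N)

# Maximum likelihood estimates (note the 1/N factor, not 1/(N-1)):
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
```

With N this large, both estimates land very close to the true values 3.0 and 2.0.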
Logistic Regression Model

The method directly models the posterior probabilities as the output of regression:

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \dots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$

x is a p-dimensional input vector; $\beta_k$ is a p-dimensional vector for each k; the total number of parameters is (K-1)(p+1).

Note that the class boundaries are linear. How can we show this linear nature? What is the discriminant function for every class in this model?
Logistic Regression Computation

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: $(x_i, g_i),\ i = 1, \dots, N$. Here the $x_i$ are (p+1)-dimensional input vectors with leading entry 1; $\beta$ is a (p+1)-dimensional vector; $y_i = 1$ if $g_i = 1$, and $y_i = 0$ if $g_i = 2$.

Log-likelihood:

$$l(\beta) = \sum_{i=1}^{N} \log \Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N} \left( y_i \log \Pr(G = 1 \mid X = x_i) + (1 - y_i) \log \Pr(G = 0 \mid X = x_i) \right)$$

$$= \sum_{i=1}^{N} \left( y_i \log\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} + (1 - y_i)\log\frac{1}{1 + \exp(\beta^T x_i)} \right) = \sum_{i=1}^{N} \left( y_i\,\beta^T x_i - \log\!\left(1 + \exp(\beta^T x_i)\right) \right)$$

We want to maximize the log-likelihood in order to estimate $\beta$.
Newton-Raphson for LR

Setting the gradient of the log-likelihood to zero gives (p+1) non-linear equations to solve for the (p+1) unknowns $\beta$:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = 0$$

Solve by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left[\mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right)\right]^{-1} \frac{\partial l(\beta)}{\partial \beta},$$

where

$$\mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\sum_{i=1}^{N} x_i x_i^T\, \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \left( \frac{1}{1 + \exp(\beta^T x_i)} \right)$$
Newton-Raphson for LR…

In matrix notation:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$

So the NR rule becomes:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})$$

where

$$\mathbf{X} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \;\; (N \times (p+1)), \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \;\; (N \times 1), \qquad \mathbf{p} = \begin{bmatrix} \exp(\beta^T x_1)/(1 + \exp(\beta^T x_1)) \\ \exp(\beta^T x_2)/(1 + \exp(\beta^T x_2)) \\ \vdots \\ \exp(\beta^T x_N)/(1 + \exp(\beta^T x_N)) \end{bmatrix} \;\; (N \times 1),$$

and $\mathbf{W}$ is an N-by-N diagonal matrix with ith diagonal entry:

$$\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\left( 1 - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right)$$
Newton-Raphson for LR…

• Newton-Raphson:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p}) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\left(\mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})\right) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$$

• Adjusted response:

$$\mathbf{z} = \mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$$

• Iteratively reweighted least squares (IRLS): each NR step solves a weighted least-squares problem

$$\beta^{new} = \arg\min_{\beta}\,(\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta)$$
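The full IRLS loop is only a few lines in numpy. A sketch under the setup above ($\mathbf{X}$ already contains the leading column of 1s, and y is 0/1); the function name is mine:

```python
import numpy as np

def irls_logistic(X, y, max_iter=50, tol=1e-10):
    """Two-class logistic regression via the Newton-Raphson / IRLS update."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # vector of Pr(G=1 | x_i)
        w = p * (1.0 - p)                     # diagonal entries of W
        # Newton step (X^T W X)^{-1} X^T (y - p); algebraically the same as the
        # weighted least-squares solve with the adjusted response z.
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Usage on simulated data with true beta = (0.5, 2.0):
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x1)))
y = (rng.random(500) < p_true).astype(float)
beta_hat = irls_logistic(X, y)
```

Note that on perfectly separable data the likelihood has no finite maximizer and the iteration diverges; simulated data with overlap, as here, is safe.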
Example: South African Heart Disease
Example: South African Heart Disease…
After data fitting in the logistic regression model:

$$\Pr(\mathrm{MI}=\mathrm{yes} \mid x) = \frac{\exp(-4.130 + 0.006\,x_{sbp} + 0.080\,x_{tobacco} + 0.185\,x_{ldl} + 0.939\,x_{famhist} - 0.035\,x_{obesity} + 0.001\,x_{alcohol} + 0.043\,x_{age})}{1 + \exp(-4.130 + 0.006\,x_{sbp} + 0.080\,x_{tobacco} + 0.185\,x_{ldl} + 0.939\,x_{famhist} - 0.035\,x_{obesity} + 0.001\,x_{alcohol} + 0.043\,x_{age})}$$
Coefficient Std. Error Z Score
(Intercept) -4.130 0.964 -4.285
sbp 0.006 0.006 1.023
tobacco 0.080 0.026 3.034
ldl 0.185 0.057 3.219
famhist 0.939 0.225 4.178
obesity -0.035 0.029 -1.187
alcohol 0.001 0.004 0.136
age 0.043 0.010 4.184
Example: South African Heart Disease…
After ignoring negligible coefficients:

$$\Pr(\mathrm{MI}=\mathrm{yes} \mid x) = \frac{\exp(-4.204 + 0.081\,x_{tobacco} + 0.168\,x_{ldl} + 0.924\,x_{famhist} + 0.044\,x_{age})}{1 + \exp(-4.204 + 0.081\,x_{tobacco} + 0.168\,x_{ldl} + 0.924\,x_{famhist} + 0.044\,x_{age})}$$
What happened to systolic blood pressure? Obesity?
Multi-Class Logistic Regression

Stack the K-1 coefficient vectors and replicate the design matrix:

$$\tilde{\beta} = \begin{bmatrix} \beta_{10} \\ \beta_1 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{bmatrix} \;\; ((K-1)(p+1) \times 1), \qquad \tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} & & \\ & \ddots & \\ & & \mathbf{X} \end{bmatrix} \;\; (N(K-1) \times (K-1)(p+1)),$$

where $\mathbf{X}$ is the N-by-(p+1) matrix with rows $x_i^T$, as before. NR update:

$$\tilde{\beta}^{new} = \tilde{\beta}^{old} + (\tilde{\mathbf{X}}^T \tilde{\mathbf{W}} \tilde{\mathbf{X}})^{-1}\, \tilde{\mathbf{X}}^T (\tilde{\mathbf{y}} - \tilde{\mathbf{p}})$$
Multi-Class LR…

$\tilde{\mathbf{y}}$ is an N(K-1)-dimensional vector:

$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{bmatrix}, \qquad \mathbf{y}_k = \begin{bmatrix} \theta(g_1 = k) \\ \theta(g_2 = k) \\ \vdots \\ \theta(g_N = k) \end{bmatrix}, \quad 1 \le k \le K-1,$$

where $\theta(\cdot)$ is an indicator function: 1 if its argument holds, 0 otherwise.

$\tilde{\mathbf{p}}$ is an N(K-1)-dimensional vector:

$$\tilde{\mathbf{p}} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{bmatrix}, \qquad \mathbf{p}_k = \begin{bmatrix} \exp(\beta_{k0} + \beta_k^T x_1)\,/\,\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_1)\big) \\ \vdots \\ \exp(\beta_{k0} + \beta_k^T x_N)\,/\,\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_N)\big) \end{bmatrix}, \quad 1 \le k \le K-1.$$
MC-LR…

$$\tilde{\mathbf{W}} = \begin{bmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & & & \vdots \\ \mathbf{W}_{(K-1)1} & \mathbf{W}_{(K-1)2} & \cdots & \mathbf{W}_{(K-1)(K-1)} \end{bmatrix} \;\; (N(K-1) \times N(K-1)),$$

where each $\mathbf{W}_{km}$, $1 \le k, m \le K-1$, is an N-by-N diagonal matrix.

If $k = m$, the ith diagonal entry is

$$\frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)} \left( 1 - \frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)} \right).$$

If $k \ne m$, the ith diagonal entry is

$$-\,\frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)}\;\cdot\;\frac{\exp(\beta_{m0} + \beta_m^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)}.$$
LDA vs. Logistic Regression

• LDA (Generative model)
– Assumes Gaussian class-conditional densities and a common covariance
– Model parameters are estimated by maximizing the full log-likelihood; parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K-1) parameters
– Makes use of the marginal density information Pr(X)
– Easier to train, low variance, more efficient if the model is correct
– Higher asymptotic error, but converges faster

• Logistic Regression (Discriminative model)
– Assumes the class-conditional densities are members of the (same) exponential family distribution
– Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; (K-1)(p+1) parameters
– Ignores the marginal density information Pr(X)
– Harder to train, robust to uncertainty about the data-generation process
– Lower asymptotic error, but converges more slowly
Generative vs. Discriminative Learning

                       Generative                            Discriminative
Example                Linear Discriminant Analysis          Logistic Regression
Objective function     Full log-likelihood:                  Conditional log-likelihood:
                       $\sum_i \log p(x_i, y_i)$             $\sum_i \log p(y_i \mid x_i)$
Model assumptions      Class densities $p(x \mid y = k)$,    Discriminant functions $\delta_k(x)$
                       e.g. Gaussian in LDA
Parameter estimation   “Easy” – one single sweep             “Hard” – iterative optimization
Advantages             More efficient if model correct,      More flexible, robust because
                       borrows strength from p(x)            fewer assumptions
Disadvantages          Bias if model is incorrect            May also be biased. Ignores
                                                             information in p(x)