Linear Methods for Classification
Lecture Notes for CMPUT 466/551
Nilanjan Ray
Linear Classification
• What is meant by linear classification?
– The decision boundaries in the feature (input) space are linear
• Should the regions be contiguous?

[Figure: piecewise linear decision boundaries in a 2D input space $(X_1, X_2)$, partitioning it into regions $R_1, R_2, R_3, R_4$]
Linear Classification…
• There is a discriminant function $\delta_k(x)$ for each class k
• Classification rule: $R_k = \{x : k = \arg\max_j \delta_j(x)\}$
• In higher-dimensional spaces the decision boundaries are piecewise hyperplanar
• Remember that the 0-1 loss function led to the classification rule: $R_k = \{x : k = \arg\max_j \Pr(G = j \mid X = x)\}$
• So, $\Pr(G = k \mid X = x)$ can serve as $\delta_k(x)$
Linear Classification…
• All we require here is that the class boundaries $\{x : \delta_k(x) = \delta_j(x)\}$ be linear for every (k, j) pair
• One can achieve this if the $\delta_k(x)$ themselves are linear, or if any monotone transform of $\delta_k(x)$ is linear
– An example:

$$\Pr(G=1 \mid X=x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}, \qquad \Pr(G=2 \mid X=x) = \frac{1}{1 + \exp(\beta_0 + \beta^T x)}$$

So that

$$\log\left[\frac{\Pr(G=1 \mid X=x)}{\Pr(G=2 \mid X=x)}\right] = \beta_0 + \beta^T x \qquad \text{(linear)}$$
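A quick numerical check of this example (the values of $\beta_0$ and $\beta$ below are illustrative, not fitted): the log-odds of the two posteriors recovers the linear function $\beta_0 + \beta^T x$ exactly.

```python
import numpy as np

# Illustrative (assumed) parameter values for the two-class logistic model.
beta0 = -1.0
beta = np.array([2.0, -0.5])

def posterior_class1(x):
    """Pr(G=1 | X=x) = exp(beta0 + beta^T x) / (1 + exp(beta0 + beta^T x))."""
    z = beta0 + beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.array([0.3, 1.2])
p1 = posterior_class1(x)
p2 = 1.0 - p1                     # Pr(G=2 | X=x)
log_odds = np.log(p1 / p2)

# The monotone transform (log-odds) is linear in x:
print(np.isclose(log_odds, beta0 + beta @ x))  # prints True
```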
Linear Classification as a Linear Regression

2D input space: X = (X1, X2). Number of classes/categories K = 3, so the output is Y = (Y1, Y2, Y3). Training sample of size N = 5.

Indicator matrix $\mathbf{Y}$ (each row has exactly one 1, indicating the category/class) and design matrix $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \\ 1 & x_{51} & x_{52} \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \\ y_{41} & y_{42} & y_{43} \\ y_{51} & y_{52} & y_{53} \end{bmatrix}$$

Regression output:

$$\hat{Y}((x_1, x_2)) = (1\;\; x_1\;\; x_2)\,(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \left(\hat{Y}_1((x_1,x_2)),\; \hat{Y}_2((x_1,x_2)),\; \hat{Y}_3((x_1,x_2))\right)$$

$$\hat{Y}_k((x_1, x_2)) = \beta_{0k} + \beta_{1k}\,x_1 + \beta_{2k}\,x_2, \qquad k = 1, 2, 3$$

Classification rule:

$$\hat{G}((x_1, x_2)) = \arg\max_k \hat{Y}_k((x_1, x_2))$$
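A minimal numpy sketch of this construction (the function names are mine): build the N-by-K indicator matrix, solve the least-squares problem, and classify by the largest fitted output.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Fit linear regression of the indicator matrix.
    X: N x p inputs, g: N integer labels in {0, ..., K-1}."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                 # each row has exactly one 1
    Xa = np.hstack([np.ones((N, 1)), X])     # prepend the constant feature
    # Coefficients B = (X^T X)^{-1} X^T Y, via a stable least-squares solve.
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return B

def predict_indicator_regression(B, X):
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(Xa @ B, axis=1)         # argmax_k Y_hat_k(x)

# Usage on toy 2D data with two well-separated classes (K=3 works the same way):
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
g = np.array([0, 0, 1, 1])
B = fit_indicator_regression(X, g, K=2)
pred = predict_indicator_regression(B, X)
```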
The Masking

$$\hat{Y}_1((x_1,x_2)) = \beta_{01} + \beta_{11}\,x_1 + \beta_{21}\,x_2$$
$$\hat{Y}_2((x_1,x_2)) = \beta_{02} + \beta_{12}\,x_1 + \beta_{22}\,x_2$$
$$\hat{Y}_3((x_1,x_2)) = \beta_{03} + \beta_{13}\,x_1 + \beta_{23}\,x_2$$

Linear regression of the indicator matrix can lead to masking. LDA can avoid this masking.

[Figure: 2D input space with three classes, viewed along the indicated direction; the fitted $\hat{Y}$ of the middle class is never the maximum, so that class is masked]
Linear Discriminant Analysis

Posterior probability (application of Bayes' rule):

$$\Pr(G = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

$\pi_k$ is the prior probability for class k; $f_k(x)$ is the class-conditional density, or likelihood:

$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \mathbf{\Sigma}^{-1} (x - \mu_k)\right)$$

Essentially the minimum-error Bayes classifier. Assumes that the class-conditional densities are (multivariate) Gaussian, with equal covariance for every class.
LDA…

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k + x^T\mathbf{\Sigma}^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k\right) - \left(\log\pi_l + x^T\mathbf{\Sigma}^{-1}\mu_l - \tfrac{1}{2}\mu_l^T\mathbf{\Sigma}^{-1}\mu_l\right) = \delta_k(x) - \delta_l(x),$$

where

$$\delta_k(x) = \log\pi_k + x^T\mathbf{\Sigma}^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k$$

Classification rule:

$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x)$$

is equivalent to:

$$\hat{G}(x) = \arg\max_k \delta_k(x)$$

The good old Bayes classifier!
LDA…

When are we going to use the training data? Training data $(x_i, g_i),\ i = 1, \dots, N$ (total N input-output pairs, $N_k$ pairs in class k, total number of classes K) are utilized to estimate:

Prior probabilities: $\hat{\pi}_k = N_k / N$

Means: $\hat{\mu}_k = \sum_{g_i = k} x_i \,/\, N_k$

Covariance matrix: $\hat{\mathbf{\Sigma}} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \,/\, (N - K)$
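These estimates, combined with the discriminant $\delta_k(x)$ above, give a complete LDA classifier. A minimal numpy sketch (the function names are mine):

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate priors, class means, and the pooled covariance (the formulas above)."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])        # N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])  # sum of x_i / N_k
    Sigma = np.zeros((p, p))
    for k in range(K):
        D = X[g == k] - means[k]
        Sigma += D.T @ D
    Sigma /= (N - K)                                              # pooled estimate
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """delta_k(x) = log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    lin = X @ Sinv @ means.T                              # N x K linear terms
    quad = 0.5 * np.sum((means @ Sinv) * means, axis=1)   # K quadratic terms
    return np.argmax(lin - quad + np.log(priors), axis=1)

# Usage on two well-separated Gaussian clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
g = np.repeat([0, 1], 50)
priors, means, Sigma = lda_fit(X, g, 2)
pred = lda_predict(X, priors, means, Sigma)
```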
LDA: Example
LDA was able to avoid masking here
Quadratic Discriminant Analysis
• Relaxes the equal-covariance assumption
– Class-conditional probability densities (still multivariate Gaussians) are allowed to have different covariance matrices
• The class decision boundaries are then not linear but quadratic:

$$\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\frac{\pi_k}{\pi_l} + \log\frac{f_k(x)}{f_l(x)} = \left(\log\pi_k - \tfrac{1}{2}\log|\mathbf{\Sigma}_k| - \tfrac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)\right) - \left(\log\pi_l - \tfrac{1}{2}\log|\mathbf{\Sigma}_l| - \tfrac{1}{2}(x-\mu_l)^T\mathbf{\Sigma}_l^{-1}(x-\mu_l)\right) = \delta_k(x) - \delta_l(x)$$
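A short sketch of the QDA discriminant $\delta_k(x) = \log\pi_k - \tfrac{1}{2}\log|\mathbf{\Sigma}_k| - \tfrac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}_k^{-1}(x-\mu_k)$ (the function name is mine):

```python
import numpy as np

def qda_discriminants(x, priors, means, covs):
    """Evaluate delta_k(x) for each class; covs holds one covariance per class."""
    scores = []
    for pi_k, mu_k, S_k in zip(priors, means, covs):
        d = x - mu_k
        sign, logdet = np.linalg.slogdet(S_k)   # numerically stable log|Sigma_k|
        scores.append(np.log(pi_k) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(S_k, d))
    return np.array(scores)

# Usage: a point near class 0's mean should score highest for class 0.
priors = [0.5, 0.5]
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]      # unequal covariances, as QDA allows
scores = qda_discriminants(np.array([0.1, -0.2]), priors, means, covs)
```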
QDA and Masking

Better than linear regression in terms of handling masking. Usually computationally more expensive than LDA.
Fisher's Linear Discriminant [DHS]

From the training set we want to find a direction along which the separation between the class means is high and the overlap between the classes is small.
Fisher's LD…

Projection of a vector x on a unit vector w: $w^T x$

[Figure: geometric interpretation of the projection $w^T x$ of x onto w]

From the training set we want to find a direction w along which the separation between the projections of the class means is high and the overlap between the projected classes is small.
Fisher's LD…

Class means:

$$m_1 = \frac{1}{N_1}\sum_{x_i \in R_1} x_i, \qquad m_2 = \frac{1}{N_2}\sum_{x_i \in R_2} x_i$$

Projected class means:

$$\tilde{m}_1 = \frac{1}{N_1}\sum_{x_i \in R_1} w^T x_i = w^T m_1, \qquad \tilde{m}_2 = \frac{1}{N_2}\sum_{x_i \in R_2} w^T x_i = w^T m_2$$

Difference between projected class means:

$$\tilde{m}_2 - \tilde{m}_1 = w^T (m_2 - m_1)$$

Scatter of the projected data (this will indicate overlap between the classes):

$$\tilde{s}_1^2 = \sum_{x_i \in R_1} (w^T x_i - \tilde{m}_1)^2 = \sum_{x_i \in R_1} \left(w^T x_i - w^T m_1\right)^2 = w^T \mathbf{S}_1 w, \qquad \mathbf{S}_1 = \sum_{x_i \in R_1} (x_i - m_1)(x_i - m_1)^T$$

$$\tilde{s}_2^2 = \sum_{x_i \in R_2} (w^T x_i - \tilde{m}_2)^2 = \sum_{x_i \in R_2} \left(w^T x_i - w^T m_2\right)^2 = w^T \mathbf{S}_2 w, \qquad \mathbf{S}_2 = \sum_{x_i \in R_2} (x_i - m_2)(x_i - m_2)^T$$
Fisher's LD…

Ratio of the squared difference of projected means over the total scatter (a Rayleigh quotient):

$$r(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{w^T \mathbf{S}_B w}{w^T \mathbf{S}_W w}$$

where

$$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2, \qquad \mathbf{S}_B = (m_2 - m_1)(m_2 - m_1)^T$$

We want to maximize r(w). The solution is

$$w = \mathbf{S}_W^{-1}(m_2 - m_1)$$
Fisher's LD: Classifier

So far so good. However, how do we get the classifier? All we know at this point is that the direction $w = \mathbf{S}_W^{-1}(m_2 - m_1)$ separates the projected data very well.

Since we know that the projected class means are well separated, we can choose the average of the two projected means as a threshold for classification.

Classification rule: x is in $R_2$ if $y(x) > 0$, else x is in $R_1$, where

$$y(x) = w^T x - \frac{1}{2}(\tilde{m}_1 + \tilde{m}_2) = (m_2 - m_1)^T \mathbf{S}_W^{-1}\left(x - \frac{1}{2}(m_1 + m_2)\right)$$
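The direction and threshold above can be computed directly; a short numpy sketch (the function names are mine):

```python
import numpy as np

def fisher_ld(X1, X2):
    """w = S_W^{-1}(m2 - m1); threshold = average of the two projected means."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)            # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)            # within-class scatter of class 2
    w = np.linalg.solve(S1 + S2, m2 - m1)   # solve S_W w = m2 - m1
    threshold = 0.5 * (w @ m1 + w @ m2)
    return w, threshold

def fisher_classify(x, w, threshold):
    """Return 2 if y(x) = w^T x - threshold > 0, else 1."""
    return 2 if w @ x - threshold > 0 else 1

# Usage on two separated clusters:
rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 0.5, (40, 2))
X2 = rng.normal(3.0, 0.5, (40, 2))
w, t = fisher_ld(X1, X2)
label = fisher_classify(np.array([3.0, 3.0]), w, t)
```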
Fisher's LD and LDA

They become the same when:
(1) Prior probabilities are equal
(2) There is a common covariance matrix for the class-conditional densities
(3) Both class-conditional densities are multivariate Gaussian

Ex. Show that Fisher's LD classifier and LDA produce the same classification rule given the above assumptions.

Note: (1) Fisher's LD does not assume Gaussian densities. (2) Fisher's LD can be used for dimension reduction in a multiple-class scenario.
Logistic Regression

• The output of regression is the posterior probability, i.e., Pr(output | input)
• Always ensures that the output variables sum to 1 and each output is non-negative
• A linear classification method
• We need to know two concepts to understand logistic regression:
– Newton-Raphson method
– Maximum likelihood estimation
Newton-Raphson Method

A technique for solving the non-linear equation f(x) = 0.

Taylor series:

$$f(x_{n+1}) \approx f(x_n) + (x_{n+1} - x_n)\,f'(x_n)$$

If $x_{n+1}$ is a root or very close to the root, then $f(x_{n+1}) \approx 0$. After rearrangement:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Rule for iteration; need an initial guess $x_0$.
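The iteration rule in a few lines of Python (a generic sketch, with the derivative supplied by the caller):

```python
def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x = x - step
        if abs(step) < tol:     # stop once the update is negligible
            break
    return x

# Usage: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```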
Newton-Raphson in Multi-dimensions

We want to solve the N equations:

$$f_1(x_1, x_2, \dots, x_N) = 0, \quad f_2(x_1, x_2, \dots, x_N) = 0, \quad \dots, \quad f_N(x_1, x_2, \dots, x_N) = 0$$

Taylor series:

$$f_j(\mathbf{x} + \delta\mathbf{x}) \approx f_j(\mathbf{x}) + \sum_{k=1}^{N} \frac{\partial f_j}{\partial x_k}\,\delta x_k, \qquad j = 1, \dots, N$$

After some rearrangement etc., the rule for iteration (need an initial guess):

$$\begin{bmatrix} x_1^{n+1} \\ x_2^{n+1} \\ \vdots \\ x_N^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ x_2^{n} \\ \vdots \\ x_N^{n} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_N} \\ \vdots & & & \vdots \\ \dfrac{\partial f_N}{\partial x_1} & \dfrac{\partial f_N}{\partial x_2} & \cdots & \dfrac{\partial f_N}{\partial x_N} \end{bmatrix}^{-1} \begin{bmatrix} f_1(x_1^n, x_2^n, \dots, x_N^n) \\ f_2(x_1^n, x_2^n, \dots, x_N^n) \\ \vdots \\ f_N(x_1^n, x_2^n, \dots, x_N^n) \end{bmatrix}$$

The matrix of partial derivatives is the Jacobian matrix.
Newton-Raphson: Example

Solve a 2-by-2 non-linear system $f_1(x_1, x_2) = 0$, $f_2(x_1, x_2) = 0$ (the slide's specific example, mixing polynomial, sine and cosine terms in $x_1$ and $x_2$, is not recoverable here). The iteration rule applies the 2-by-2 Jacobian at the current iterate and needs an initial guess:

$$\begin{bmatrix} x_1^{n+1} \\ x_2^{n+1} \end{bmatrix} = \begin{bmatrix} x_1^{n} \\ x_2^{n} \end{bmatrix} - \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{bmatrix}^{-1} \begin{bmatrix} f_1(x_1^n, x_2^n) \\ f_2(x_1^n, x_2^n) \end{bmatrix}$$
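The same update in code. Since the slide's exact system did not survive extraction, the usage below solves an assumed illustrative system, $f_1 = x_1^2 + x_2^2 - 4$ and $f_2 = x_1 x_2 - 1$:

```python
import numpy as np

def newton_raphson_nd(f, jacobian, x0, tol=1e-12, max_iter=100):
    """x^{n+1} = x^n - J(x^n)^{-1} f(x^n); solve J step = f rather than inverting J."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jacobian(x), f(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Assumed example system (not the slide's exact one):
f = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0])
jacobian = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]],
                               [x[1], x[0]]])
solution = newton_raphson_nd(f, jacobian, [2.0, 0.5])
```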
Maximum Likelihood Parameter Estimation

Let's start with an example. We want to find the unknown parameters, the mean $\mu$ and standard deviation $\sigma$, of a Gaussian pdf, given N independent samples from it:

$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Samples: $x_1, \dots, x_N$. Form the likelihood function:

$$L(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Estimate the parameters that maximize the likelihood function:

$$(\hat{\mu}, \hat{\sigma}) = \arg\max_{\mu, \sigma} L(\mu, \sigma)$$

Let's find out $(\hat{\mu}, \hat{\sigma})$.
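Working the maximization out (take the log of L, differentiate, set to zero) gives the sample mean for $\hat{\mu}$ and the 1/N sample standard deviation for $\hat{\sigma}$; a quick simulation check:

```python
import numpy as np

# Draw N independent samples from a Gaussian with known parameters.
rng = np.random.default_rng(0)
mu_true, sigma_true, N = 3.0, 2.0, 100_000
x = rng.normal(mu_true, sigma_true, N)

# Maximum likelihood estimates (note the 1/N factor, not 1/(N-1)):
mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
```

With N this large, both estimates land very close to the true values 3.0 and 2.0.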
Logistic Regression Model

The method directly models the posterior probabilities as the output of regression:

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \dots, K-1$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$

x is a p-dimensional input vector; $\beta_k$ is a p-dimensional vector for each k; the total number of parameters is (K-1)(p+1).

Note that the class boundaries are linear. How can we show this linear nature? What is the discriminant function for every class in this model?
Logistic Regression Computation

Let's fit the logistic regression model for K = 2, i.e., the number of classes is 2.

Training set: $(x_i, g_i),\ i = 1, \dots, N$. Here the $x_i$ are (p+1)-dimensional input vectors with leading entry 1; $\beta$ is a (p+1)-dimensional vector; $y_i = 1$ if $g_i = 1$, and $y_i = 0$ if $g_i = 2$.

Log-likelihood:

$$l(\beta) = \sum_{i=1}^{N} \log \Pr(G = y_i \mid X = x_i) = \sum_{i=1}^{N} \left( y_i \log \Pr(G = 1 \mid X = x_i) + (1 - y_i) \log \Pr(G = 0 \mid X = x_i) \right)$$

$$= \sum_{i=1}^{N} \left( y_i \log\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} + (1 - y_i)\log\frac{1}{1 + \exp(\beta^T x_i)} \right) = \sum_{i=1}^{N} \left( y_i\,\beta^T x_i - \log\!\left(1 + \exp(\beta^T x_i)\right) \right)$$

We want to maximize the log-likelihood in order to estimate $\beta$.
Newton-Raphson for LR

Setting the gradient of the log-likelihood to zero gives (p+1) non-linear equations to solve for the (p+1) unknowns $\beta$:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = 0$$

Solve by the Newton-Raphson method:

$$\beta^{new} = \beta^{old} - \left[\mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right)\right]^{-1} \frac{\partial l(\beta)}{\partial \beta},$$

where

$$\mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\sum_{i=1}^{N} x_i x_i^T\, \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \left( \frac{1}{1 + \exp(\beta^T x_i)} \right)$$
Newton-Raphson for LR…

In matrix notation:

$$\frac{\partial l(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right) = \mathbf{X}^T(\mathbf{y} - \mathbf{p}), \qquad \mathrm{Jacobian}\!\left(\frac{\partial l(\beta)}{\partial \beta}\right) = -\mathbf{X}^T \mathbf{W} \mathbf{X}$$

So the NR rule becomes:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})$$

where

$$\mathbf{X} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \;\; (N \times (p+1)), \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \;\; (N \times 1), \qquad \mathbf{p} = \begin{bmatrix} \exp(\beta^T x_1)/(1 + \exp(\beta^T x_1)) \\ \exp(\beta^T x_2)/(1 + \exp(\beta^T x_2)) \\ \vdots \\ \exp(\beta^T x_N)/(1 + \exp(\beta^T x_N)) \end{bmatrix} \;\; (N \times 1),$$

and $\mathbf{W}$ is an N-by-N diagonal matrix with ith diagonal entry:

$$\frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}\left( 1 - \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \right)$$
Newton-Raphson for LR…

• Newton-Raphson:

$$\beta^{new} = \beta^{old} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p}) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\left(\mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})\right) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{z}$$

• Adjusted response:

$$\mathbf{z} = \mathbf{X}\beta^{old} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$$

• Iteratively reweighted least squares (IRLS): each NR step solves a weighted least-squares problem

$$\beta^{new} = \arg\min_{\beta}\,(\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta)$$
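The full IRLS loop is only a few lines in numpy. A sketch under the setup above ($\mathbf{X}$ already contains the leading column of 1s, and y is 0/1); the function name is mine:

```python
import numpy as np

def irls_logistic(X, y, max_iter=50, tol=1e-10):
    """Two-class logistic regression via the Newton-Raphson / IRLS update."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # vector of Pr(G=1 | x_i)
        w = p * (1.0 - p)                     # diagonal entries of W
        # Newton step (X^T W X)^{-1} X^T (y - p); algebraically the same as the
        # weighted least-squares solve with the adjusted response z.
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Usage on simulated data with true beta = (0.5, 2.0):
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x1)))
y = (rng.random(500) < p_true).astype(float)
beta_hat = irls_logistic(X, y)
```

Note that on perfectly separable data the likelihood has no finite maximizer and the iteration diverges; simulated data with overlap, as here, is safe.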
Example: South African Heart Disease
Example: South African Heart Disease…
After data fitting in the logistic regression model:

$$\Pr(\mathrm{MI}=\mathrm{yes} \mid x) = \frac{\exp(-4.130 + 0.006\,x_{sbp} + 0.080\,x_{tobacco} + 0.185\,x_{ldl} + 0.939\,x_{famhist} - 0.035\,x_{obesity} + 0.001\,x_{alcohol} + 0.043\,x_{age})}{1 + \exp(-4.130 + 0.006\,x_{sbp} + 0.080\,x_{tobacco} + 0.185\,x_{ldl} + 0.939\,x_{famhist} - 0.035\,x_{obesity} + 0.001\,x_{alcohol} + 0.043\,x_{age})}$$
Coefficient Std. Error Z Score
(Intercept) -4.130 0.964 -4.285
sbp 0.006 0.006 1.023
tobacco 0.080 0.026 3.034
ldl 0.185 0.057 3.219
famhist 0.939 0.225 4.178
obesity -0.035 0.029 -1.187
alcohol 0.001 0.004 0.136
age 0.043 0.010 4.184
Example: South African Heart Disease…
After ignoring negligible coefficients:

$$\Pr(\mathrm{MI}=\mathrm{yes} \mid x) = \frac{\exp(-4.204 + 0.081\,x_{tobacco} + 0.168\,x_{ldl} + 0.924\,x_{famhist} + 0.044\,x_{age})}{1 + \exp(-4.204 + 0.081\,x_{tobacco} + 0.168\,x_{ldl} + 0.924\,x_{famhist} + 0.044\,x_{age})}$$
What happened to systolic blood pressure? Obesity?
Multi-Class Logistic Regression

Stack the K-1 coefficient vectors and replicate the design matrix:

$$\tilde{\beta} = \begin{bmatrix} \beta_{10} \\ \beta_1 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{bmatrix} \;\; ((K-1)(p+1) \times 1), \qquad \tilde{\mathbf{X}} = \begin{bmatrix} \mathbf{X} & & \\ & \ddots & \\ & & \mathbf{X} \end{bmatrix} \;\; (N(K-1) \times (K-1)(p+1)),$$

where $\mathbf{X}$ is the N-by-(p+1) matrix with rows $x_i^T$, as before. NR update:

$$\tilde{\beta}^{new} = \tilde{\beta}^{old} + (\tilde{\mathbf{X}}^T \tilde{\mathbf{W}} \tilde{\mathbf{X}})^{-1}\, \tilde{\mathbf{X}}^T (\tilde{\mathbf{y}} - \tilde{\mathbf{p}})$$
Multi-Class LR…

$\tilde{\mathbf{y}}$ is an N(K-1)-dimensional vector:

$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{K-1} \end{bmatrix}, \qquad \mathbf{y}_k = \begin{bmatrix} \theta(g_1 = k) \\ \theta(g_2 = k) \\ \vdots \\ \theta(g_N = k) \end{bmatrix}, \quad 1 \le k \le K-1,$$

where $\theta(\cdot)$ is an indicator function: 1 if its argument holds, 0 otherwise.

$\tilde{\mathbf{p}}$ is an N(K-1)-dimensional vector:

$$\tilde{\mathbf{p}} = \begin{bmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \vdots \\ \mathbf{p}_{K-1} \end{bmatrix}, \qquad \mathbf{p}_k = \begin{bmatrix} \exp(\beta_{k0} + \beta_k^T x_1)\,/\,\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_1)\big) \\ \vdots \\ \exp(\beta_{k0} + \beta_k^T x_N)\,/\,\big(1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_N)\big) \end{bmatrix}, \quad 1 \le k \le K-1.$$
MC-LR…

$$\tilde{\mathbf{W}} = \begin{bmatrix} \mathbf{W}_{11} & \mathbf{W}_{12} & \cdots & \mathbf{W}_{1(K-1)} \\ \mathbf{W}_{21} & \mathbf{W}_{22} & \cdots & \mathbf{W}_{2(K-1)} \\ \vdots & & & \vdots \\ \mathbf{W}_{(K-1)1} & \mathbf{W}_{(K-1)2} & \cdots & \mathbf{W}_{(K-1)(K-1)} \end{bmatrix} \;\; (N(K-1) \times N(K-1)),$$

where each $\mathbf{W}_{km}$, $1 \le k, m \le K-1$, is an N-by-N diagonal matrix.

If $k = m$, the ith diagonal entry is

$$\frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)} \left( 1 - \frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)} \right).$$

If $k \ne m$, the ith diagonal entry is

$$-\,\frac{\exp(\beta_{k0} + \beta_k^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)}\;\cdot\;\frac{\exp(\beta_{m0} + \beta_m^T x_i)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x_i)}.$$
LDA vs. Logistic Regression

• LDA (Generative model)
– Assumes Gaussian class-conditional densities and a common covariance
– Model parameters are estimated by maximizing the full log-likelihood; parameters for each class are estimated independently of the other classes; Kp + p(p+1)/2 + (K-1) parameters
– Makes use of the marginal density information Pr(X)
– Easier to train, low variance, more efficient if the model is correct
– Higher asymptotic error, but converges faster

• Logistic Regression (Discriminative model)
– Assumes the class-conditional densities are members of the (same) exponential family distribution
– Model parameters are estimated by maximizing the conditional log-likelihood, with simultaneous consideration of all other classes; (K-1)(p+1) parameters
– Ignores the marginal density information Pr(X)
– Harder to train, robust to uncertainty about the data-generation process
– Lower asymptotic error, but converges more slowly
Generative vs. Discriminative Learning

                       Generative                            Discriminative
Example                Linear Discriminant Analysis          Logistic Regression
Objective function     Full log-likelihood:                  Conditional log-likelihood:
                       $\sum_i \log p(x_i, y_i)$             $\sum_i \log p(y_i \mid x_i)$
Model assumptions      Class densities $p(x \mid y = k)$,    Discriminant functions $\delta_k(x)$
                       e.g. Gaussian in LDA
Parameter estimation   “Easy” – one single sweep             “Hard” – iterative optimization
Advantages             More efficient if model correct,      More flexible, robust because
                       borrows strength from p(x)            fewer assumptions
Disadvantages          Bias if model is incorrect            May also be biased. Ignores
                                                             information in p(x)