Lecture 5: LDA and Logistic Regression
Hao Helen Zhang
Fall, 2017
Outline
Two Popular Linear Models for Classification
Linear Discriminant Analysis (LDA)
Logistic Regression Models
Take-home message:
Both LDA and logistic regression rely on the linear-odds assumption, directly or indirectly. However, they estimate the coefficients in a different manner.
Linear Classifier
Linear methods: the decision boundary is linear.
Common linear classification methods:
Linear regression methods (covered in Lecture 3)
Linear log-odds (logit) models
Linear logistic models
Linear discriminant analysis (LDA)
Separating hyperplanes (introduced later)
Perceptron model (Rosenblatt, 1958)
Optimal separating hyperplane (Vapnik, 1996) – SVMs
From now on, we assume equal costs (by default).
Odds, Logit, and Linear Odds Models
Some terminology:
The ratio Pr(Y = 1|X = x) / Pr(Y = 0|X = x) is called the odds.
The quantity log [Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] is called the log of the odds, or the logit function.
Linear odds models assume the logit is linear in x, i.e.,
log [Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] = β0 + β1ᵀx.
Examples: LDA, Logistic regression
Classifier Based on Linear Odds Models
From the linear odds, we can obtain the posterior class probabilities:
Pr(Y = 1|x) = exp(β0 + β1ᵀx) / [1 + exp(β0 + β1ᵀx)],
Pr(Y = 0|x) = 1 / [1 + exp(β0 + β1ᵀx)].
Assuming equal costs, the decision boundary is given by
{x | Pr(Y = 1|x) = 0.5} = {x | β0 + β1ᵀx = 0},
which can be interpreted as “zero log-odds”.
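A minimal R sketch of this rule, using hypothetical coefficients beta0 and beta1 (illustrative values, not estimated from any data):
beta0 <- -0.5                    # assumed intercept
beta1 <- c(1.2, -0.8)            # assumed slope vector
x <- c(0.3, 0.1)                 # a new observation
eta <- beta0 + sum(beta1 * x)    # the linear logit beta0 + beta1' x
p1 <- exp(eta) / (1 + exp(eta))  # Pr(Y = 1 | x)
p0 <- 1 / (1 + exp(eta))         # Pr(Y = 0 | x)
yhat <- ifelse(eta > 0, 1, 0)    # classify to "1" iff the logit > 0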
[Figure: left panel, the logit function g(s) = log[s/(1 − s)]; right panel, the logistic curve p(t) = exp(t)/(1 + exp(t)).]
Linear Discriminant Analysis (LDA)
LDA assumes:
Each class density is multivariate Gaussian, i.e.,
X | Y = j ∼ N(µj, Σj), j = 0, 1.
Equal covariances:
Σj = Σ, j = 0, 1.
In other words, both classes are Gaussian and they have the same covariance matrix.
Linear Discriminant Function
Under the mixture Gaussian assumption, the log-odds is
log [Pr(Y = 1|X = x) / Pr(Y = 0|X = x)]
= log(π1/π0) − (1/2)(µ1 + µ0)ᵀΣ⁻¹(µ1 − µ0) + xᵀΣ⁻¹(µ1 − µ0).
Under equal costs, LDA classifies to “1” if and only if
[log(π1/π0) − (1/2)(µ1 + µ0)ᵀΣ⁻¹(µ1 − µ0)] + xᵀΣ⁻¹(µ1 − µ0) > 0.
It has a linear boundary {x : β0 + xᵀβ1 = 0}, with
β0 = log(π1/π0) − (1/2)(µ1 + µ0)ᵀΣ⁻¹(µ1 − µ0),
β1 = Σ⁻¹(µ1 − µ0).
Parameter Estimation in LDA
In practice, π1, π0, µ0, µ1, Σ are unknown.
We estimate the parameters from the training data, using the MLE or the moment estimators:
π̂j = nj/n, where nj is the size of class j.
µ̂j = ∑_{Yi=j} xi / nj, for j = 0, 1.
The sample covariance matrix of class j is
Sj = (1/(nj − 1)) ∑_{Yi=j} (xi − µ̂j)(xi − µ̂j)ᵀ.
The (unbiased) pooled sample covariance is a weighted average:
Σ̂ = [(n0 − 1)S0 + (n1 − 1)S1] / [(n0 − 1) + (n1 − 1)]
  = ∑_{j=0}^{1} ∑_{Yi=j} (xi − µ̂j)(xi − µ̂j)ᵀ / (n − 2).
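These plug-in estimates are easy to code directly. A minimal R sketch (a hypothetical helper, not the MASS implementation) that returns β̂0 and β̂1 from a predictor matrix X and 0/1 labels y:
lda_fit <- function(X, y) {
  n  <- nrow(X)
  n0 <- sum(y == 0); n1 <- sum(y == 1)
  pi0 <- n0 / n; pi1 <- n1 / n                 # class proportions
  mu0 <- colMeans(X[y == 0, , drop = FALSE])   # class means
  mu1 <- colMeans(X[y == 1, , drop = FALSE])
  # pooled (unbiased) covariance, as on this slide
  Sigma <- ((n0 - 1) * cov(X[y == 0, , drop = FALSE]) +
            (n1 - 1) * cov(X[y == 1, , drop = FALSE])) / (n - 2)
  beta1 <- solve(Sigma, mu1 - mu0)             # Sigma^{-1} (mu1 - mu0)
  beta0 <- log(pi1 / pi0) - 0.5 * sum((mu1 + mu0) * beta1)
  list(beta0 = beta0, beta1 = beta1)           # boundary: beta0 + x' beta1 = 0
}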
R code for LDA Fitting (I)
There are two ways to call the function “lda”. The first way is to use a formula and an optional data frame.
library(MASS)
lda(formula, data, subset)
Arguments:
formula: of the form “groups ∼ x1 + x2 + . . .”, where the response is the grouping factor and the right-hand side specifies the (non-factor) discriminators.
data: the data frame from which the variables specified in the formula are taken.
subset: an index vector specifying the cases to be used in the training sample.
Output:
an object of class “lda” with multiple components
R code for LDA Fitting (II)
The second way is to use a matrix and a grouping factor as the first two arguments.
library(MASS)
lda(x, grouping, prior = proportions, CV = FALSE)
Arguments:
x: a matrix or data frame or Matrix containing predictors.
grouping: a factor specifying the class for each observation.
prior: the prior probabilities of class membership. If unspecified, the class proportions of the training set are used.
Output:
If CV = TRUE, the return value is a list with components “class” (the MAP classification, a factor) and “posterior” (posterior probabilities for the classes).
R code for LDA Prediction
We use the “predict” (or “predict.lda”) function to classify multivariate observations based on a fitted “lda” object.
predict(object, newdata, ...)
Arguments:
object: object of class “lda”
newdata: data frame of cases to be classified or, if “object” has a formula, a data frame with columns of the same names as the variables used.
Output:
a list with the components “class” (the MAP classification, a factor) and “posterior” (posterior probabilities for the classes)
Fisher’s Iris Data (A Three-Class Classification Problem)
Fisher (1936), “The use of multiple measurements in taxonomic problems”.
Three species: Iris setosa, Iris versicolor, Iris virginica
Four features: the length and the width of the sepals and petals, in centimeters.
50 samples from each species
In the following analysis, we randomly select 50% of the data points as the training set, and the rest as the test set.
Illustration 1
# Stack the three species (50 plants each) into one data frame with labels
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
# Randomly select 75 of the 150 rows as the training set
train <- sample(1:150, 75)
table(Iris$Sp[train])   # class counts in the training set
# Fit LDA on the training subset, with equal priors
z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
Training Error and Test Error
#training error rate
ytrain <- predict(z, Iris[train, ])$class
table(ytrain, Iris$Sp[train])
train_err <- mean(ytrain!=Iris$Sp[train])
#test error rate
ytest <- predict(z, Iris[-train, ])$class
table(ytest, Iris$Sp[-train])
test_err <- mean(ytest!=Iris$Sp[-train])
Illustration 2
tr <- sample(1:50, 25)   # 25 training indices within each species
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))   # training labels
z <- lda(train, cl)      # matrix-plus-grouping-factor interface
ytest <- predict(z, test)$class
Logistic Regression
Model assumption: the log-odds is linear in x,
log [Pr(Y = 1|X = x) / Pr(Y = 0|X = x)] = β0 + β1ᵀx.
Define
p(x; β) ≡ exp(β0 + β1ᵀx) / [1 + exp(β0 + β1ᵀx)],
and write β = (β0, β1). The posterior class probabilities can be calculated as
p1(x) = Pr(Y = 1|x) = p(x; β),
p0(x) = Pr(Y = 0|x) = 1 − p(x; β).
The classification boundary is given by {x : β0 + β1ᵀx = 0}.
Model Interpretation
µ = E(Y|X) = Pr(Y = 1|X).
The logit transform
g(µ) = log[µ/(1 − µ)] = β0 + β1ᵀX
connects µ with the linear predictor β0 + β1ᵀX. We call g the link function.
Var(Y|X) = µ(1 − µ).
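In R, the logit link and its inverse are carried by the binomial family object; a quick sanity check that g, g⁻¹, and the variance function behave as above:
fam <- binomial(link = "logit")
fam$linkfun(0.5)          # g(0.5) = log(0.5/0.5) = 0
fam$linkinv(0)            # g^{-1}(0) = 0.5
mu <- fam$linkinv(1.5)    # mean implied by linear predictor 1.5
fam$variance(mu)          # Var(Y|X) = mu * (1 - mu)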
Maximum Likelihood Estimate (MLE) for Logistic Models
The joint conditional likelihood of yi given xi is
l(β) = ∑_{i=1}^{n} log p_{yi}(xi; β),
where
p_y(x; β) = p(x; β)^y [1 − p(x; β)]^{1−y}.
In detail,
l(β) = ∑_{i=1}^{n} { yi log p(xi; β) + (1 − yi) log[1 − p(xi; β)] }
  = ∑_{i=1}^{n} { yi(β0 + β1ᵀxi) − log[1 + exp(β0 + β1ᵀxi)] }.
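A direct R transcription of l(β) (a sketch; the design matrix X is assumed to carry a leading column of 1’s so that β enters as Xβ):
loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)            # linear logits
  sum(y * eta - log(1 + exp(eta)))   # sum of y_i * eta_i - log(1 + e^{eta_i})
}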
Score Equations
For simplicity, now assume xi has 1 as its first component (so β0 is absorbed into β). Setting the gradient to zero,
∂l(β)/∂β = ∑_{i=1}^{n} xi [yi − p(xi; β)] = 0,
gives (d + 1) nonlinear equations in total. The first equation,
∑_{i=1}^{n} yi = ∑_{i=1}^{n} p(xi; β),
says the expected number of 1’s equals the observed number in the sample.
Notation used by the algorithm below:
y = (y1, · · · , yn)ᵀ,
p = (p(x1; β^old), · · · , p(xn; β^old))ᵀ,
W = diag{ p(xi; β^old)(1 − p(xi; β^old)) }.
Newton-Raphson Algorithm
The second-derivative (Hessian) matrix is
∂²l(β)/∂β∂βᵀ = −∑_{i=1}^{n} xi xiᵀ p(xi; β)[1 − p(xi; β)].
1. Choose an initial value β^old.
2. Update β by
β^new = β^old − [∂²l(β)/∂β∂βᵀ]⁻¹ ∂l(β)/∂β,
with both derivatives evaluated at β^old.
In matrix notation, we have
∂l(β)/∂β = Xᵀ(y − p),
∂²l(β)/∂β∂βᵀ = −XᵀWX.
Iteratively Re-weighted Least Squares (IRLS)
The Newton-Raphson step can be rewritten as
β^new = β^old + (XᵀWX)⁻¹Xᵀ(y − p)
  = (XᵀWX)⁻¹XᵀW (Xβ^old + W⁻¹(y − p))
  = (XᵀWX)⁻¹XᵀW z,
where we define the adjusted response
z = Xβ^old + W⁻¹(y − p).
Repeatedly solve the weighted least squares problem until convergence:
β^new = argmin_β (z − Xβ)ᵀW(z − Xβ).
The weight W, response z, and p change at each iteration.
The algorithm can be generalized to the K ≥ 3 case.
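A minimal R sketch of IRLS under these conventions (a hypothetical helper; X is assumed to include an intercept column and y to be a 0/1 vector):
irls <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                  # beta = 0 as the starting value
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    p <- plogis(eta)                       # p(x_i; beta_old)
    W <- p * (1 - p)                       # diagonal of the weight matrix
    z <- eta + (y - p) / W                 # adjusted response
    beta_old <- beta
    # weighted least squares step: (X'WX)^{-1} X'W z
    beta <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}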
Quadratic Approximations
β̂ satisfies a self-consistency relationship: it solves a weighted least squares fit with response
zi = xiᵀβ̂ + (yi − p̂i) / [p̂i(1 − p̂i)]
and weight wi = p̂i(1 − p̂i).
β = 0 seems to be a good starting value.
Typically the algorithm converges, but convergence is never guaranteed.
The weighted residual sum of squares is the Pearson chi-square statistic
∑_{i=1}^{n} (yi − p̂i)² / [p̂i(1 − p̂i)],
a quadratic approximation to the deviance.
Statistical Inferences
Using the weighted least squares formulation, we have:
Asymptotic likelihood theory says that if the model is correct, then β̂ is consistent.
By the central limit theorem, the distribution of β̂ converges to N(β, (XᵀWX)⁻¹).
Model building can be costly due to the iterations; popular shortcuts:
For inclusion of a term, use the Rao score test.
For exclusion of a term, use the Wald test.
Neither of these tests requires iterative fitting; both are based on the maximum likelihood fit of the current model.
R Code for Logistic Regression
logist <- glm(formula, family, data, subset, ...)
predict(logist)
Arguments:
formula: an object of class “formula”
family: a description of the error distribution and link function to be used in the model.
data: an optional data frame, list or environment containingthe variables in the model.
subset: an optional vector specifying a subset of observationsto be used in the fitting process.
Output:
returns an object of class inheriting from “glm” and “lm”
R Code for Logistic Prediction
logist <- glm(y~x, data, family=binomial(link="logit"))
predict(logist, newdata)
summary(logist)
anova(logist)
The function “predict” gives the predicted values; newdata is a data frame of new cases.
The function “summary” is used to obtain or print a summary of the results.
The function “anova” produces an analysis of variance table.
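A runnable end-to-end example of these calls on simulated data (the data frame dat and its columns are hypothetical, made up for illustration):
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- rbinom(100, 1, plogis(1 + 2 * dat$x1 - dat$x2))  # true logit is linear
logist <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))
summary(logist)                      # coefficient estimates and Wald tests
phat <- predict(logist, newdata = dat, type = "response")  # Pr(Y = 1 | x)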
Relationship between LDA and Least Squares (LS)
For two-class problems, both LDA and least squares fit a linear boundary β0 + βᵀx. Their solutions have the following relationship:
The least squares regression coefficient β̂ is proportional to the LDA direction, i.e.,
β̂ ∝ Σ̂⁻¹(µ̂1 − µ̂0),
where Σ̂ is the pooled sample covariance matrix and µ̂k is the sample mean of the points from class k, k = 0, 1. In other words, their slope coefficients are identical, up to a scalar multiple. (Exercise 4.2 in the textbook.)
The LS intercept β̂0 is generally different from that of LDA, unless n1 = n0.
In general, they have different decision rules (unless n1 = n0).
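A quick numerical check of this proportionality in R (a sketch on simulated two-class data; all names are ours):
set.seed(2)
n <- 100
y <- rep(0:1, each = n/2)                   # two classes coded 0/1
X <- matrix(rnorm(n * 2), n, 2) + 1.5 * y   # shift the class-1 means
beta_ls <- coef(lm(y ~ X))[-1]              # LS slope coefficients
mu0 <- colMeans(X[y == 0, ]); mu1 <- colMeans(X[y == 1, ])
Sigma <- ((n/2 - 1) * cov(X[y == 0, ]) +
          (n/2 - 1) * cov(X[y == 1, ])) / (n - 2)   # pooled covariance
beta_lda <- solve(Sigma, mu1 - mu0)         # LDA direction
beta_ls / beta_lda                          # elementwise ratio: one constant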
Common Feature of LDA and Logistic Regression Models
For both LDA and logistic regression, the logit has a linear form:
log [Pr(Y = 1|x) / Pr(Y = 0|x)] = β0 + β1ᵀx.
Or equivalently, for both estimators, the posterior class probability can be expressed in the form
Pr(Y = 1|x) = exp(β0 + β1ᵀx) / [1 + exp(β0 + β1ᵀx)].
They have exactly the same form. Are they the same estimator?
Major Differences between LDA and Logistic Regression
The main differences between the two estimators:
Where does the linear logit come from?
What assumptions are made on the data distribution? (difference in distributional assumptions)
How are the linear coefficients estimated? (difference in parameter estimation)
Is any assumption made on the marginal density of X? (flexibility of the model)
Where Does the Linear Logit Come From?
For LDA, the linear logit follows from the equal-covariance Gaussian assumption on the data:
log [Pr(Y = 1|x) / Pr(Y = 0|x)]
= log(π1/π0) − (1/2)(µ1 + µ0)ᵀΣ⁻¹(µ1 − µ0) + xᵀΣ⁻¹(µ1 − µ0)
= β0 + β1ᵀx.
For the logistic model, the linear logit holds by construction:
log [Pr(Y = 1|x) / Pr(Y = 0|x)] = β0 + β1ᵀx.
Difference in Marginal Density Assumption
The assumptions on Pr(X):
The logistic model leaves the marginal density of X arbitrary and unspecified.
The LDA model assumes a Gaussian mixture density:
Pr(X) = ∑_{j=0}^{1} πj φ(X; µj, Σ).
Conclusion: the logistic model makes fewer assumptions about the data, and hence is more general.
Difference in Parameter Estimation
Logistic regression:
Maximizes the conditional likelihood of Y given X, the multinomial likelihood with probabilities Pr(Y = k|X).
The marginal density Pr(X) is ignored entirely (it is treated fully nonparametrically, using the empirical distribution function, which places mass 1/n at each observation).
LDA:
Maximizes the full log-likelihood based on the joint density
Pr(X, Y = j) = φ(X; µj, Σ) πj.
Standard MLE theory leads to the estimators µ̂j, Σ̂, π̂j.
The marginal density does play a role.
More Comments
LDA is easier to compute than logistic regression.
If the true class densities fk(x) are Gaussian, LDA is better.
Logistic regression may lose around 30% efficiency asymptotically in the error rate (Efron, 1975).
Robustness?
LDA uses all the points to estimate the covariance matrix: more information, but not robust against outliers.
Logistic regression down-weights points far from the decision boundary (recall the weight is pi(1 − pi)): more robust and safer.
In practice, these two methods often give similar results (for approximately normally distributed data).
Two-dimensional Linear Example
Consider the following scenario for a two-class problem:
π1 = π0 = 0.5.
Class 1: X ∼ N2((1, 1)ᵀ, 4I).
Class 0: X ∼ N2((−1, −1)ᵀ, 4I).
The Bayes boundary is
{x : x1 + x2 = 0}.
We generate
a “training set” of n = 200 or 400 to fit the classifiers, and
a “testing set” of n′ = 2000 to evaluate their prediction performance.
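A minimal R sketch of this simulation, fitting LDA and logistic regression on one training set and comparing their test errors with the Bayes rule (function and object names are ours):
library(MASS)
set.seed(3)
gen <- function(n) {
  y <- rbinom(n, 1, 0.5)             # equal priors
  mu <- ifelse(y == 1, 1, -1)        # mean (1,1) for class 1, (-1,-1) for class 0
  data.frame(x1 = rnorm(n, mu, 2),   # sd = 2 per coordinate, i.e. covariance 4I
             x2 = rnorm(n, mu, 2),
             y = factor(y))
}
train <- gen(200)
test <- gen(2000)
fit_lda <- lda(y ~ x1 + x2, data = train)
fit_log <- glm(y ~ x1 + x2, data = train, family = binomial)
err_lda <- mean(predict(fit_lda, test)$class != test$y)
err_log <- mean((predict(fit_log, test, type = "response") > 0.5) != (test$y == "1"))
err_bayes <- mean((test$x1 + test$x2 > 0) != (test$y == "1"))   # Bayes rule
c(bayes = err_bayes, logistic = err_log, lda = err_lda)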
[Figure (first training set): four panels plotting x2 against x1, titled “Training Scatter Plot”, “Bayes and Linear Classifiers”, “Logistic and LDA Classifiers”, and “All Linear Classifiers”, with legend entries “bay”, “lin”, “log”, and “lda” for the Bayes, linear regression, logistic, and LDA boundaries.]
Performance of Various Classifiers (n=200)
             Bayes   Linear  Logistic  LDA
Train Error  0.250   0.255   0.255     0.255
Test Error   0.240   0.247   0.242     0.247
Fitted classification boundaries:
Linear & LDA: x2 = −0.48 − 1.73 x1
Logistic: x2 = −0.32 − 1.80 x1
[Figure (second training set): the same four panels, “Training Scatter Plot”, “Bayes and Linear Classifiers”, “Logistic and LDA Classifiers”, and “All Linear Classifiers”, with legend entries “bay”, “lin”, “log”, and “lda”.]