Linear Classification Rules
Stat 894, Spring 2005
Linear Methods
- Features $X = (X_1, X_2, \dots, X_p)$
- Output $Y$: codes, labels $\{1, 0\}$
- Linear decision boundary in feature space
  - Could be non-linear in the original space (sketch below)
- Features: any arbitrary (known) functions of the measured attributes
  - Transformations of quantitative attributes
  - Basis expansions
    - Polynomials, radial basis functions
- $f(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$
- $f(x) = 0$ partitions the feature space into two regions
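As a small illustration (with made-up coefficients and a hypothetical basis map $h$), the sketch below builds a rule that is linear in the expanded features but quadratic, hence non-linear, in the original inputs:

```python
import numpy as np

# Hypothetical sketch: a rule that is linear in the expanded features
# h(x) = (x1, x2, x1^2, x2^2, x1*x2) but quadratic in the original x.
def basis_expansion(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

beta0 = -1.0                                    # assumed coefficients
beta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])      # f(x) = x1^2 + x2^2 - 1

def f(x):
    # f(x) = beta0 + sum_j beta_j * h_j(x); f(x) = 0 is the decision boundary
    return beta0 + beta @ basis_expansion(x)

print(f(np.array([0.5, 0.5])) > 0)   # inside the unit circle  -> False (class 0)
print(f(np.array([2.0, 0.0])) > 0)   # outside the unit circle -> True  (class 1)
```

Here the boundary $f(x) = 0$ is a line in the five-dimensional feature space but a circle in the original two-dimensional space.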
Generative Model
- Given $Y = y$, the probability distribution (class conditional) $f_y(x)$ of the inputs $x$
- Class conditionals and prior probabilities known
- Optimal rule (Bayes rule): classify a pattern with input $x$ to class 1 if (sketch below)
  $$\frac{f_1(x)}{f_0(x)} > t$$
- Threshold $t$ depends on the misclassification costs and the prior probabilities
- Bayes rule: posterior probability $G(j \mid x) = P(Y = j \mid X = x)$
- Given misclassification costs, the optimal decision minimizes the expected posterior loss
- All misclassification costs equal: maximize the posterior probability
- Class conditionals not fully specified: learn a proxy for the Bayes rule from training-set examples
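A minimal sketch of the Bayes rule, assuming two univariate Gaussian class conditionals and the priors shown below (scipy is an assumed dependency):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities and priors for a two-class problem.
f1 = norm(loc=2.0, scale=1.0).pdf    # density of x given class 1
f0 = norm(loc=0.0, scale=1.0).pdf    # density of x given class 0
prior1, prior0 = 0.3, 0.7

# With equal misclassification costs the likelihood-ratio threshold is
# t = prior0 / prior1; unequal costs would rescale t accordingly.
t = prior0 / prior1

def bayes_classify(x):
    # classify to class 1 if f1(x) / f0(x) > t
    return 1 if f1(x) / f0(x) > t else 0

print(bayes_classify(0.5), bayes_classify(2.5))   # -> 0 1
```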
Class Conditionals Known: Hypothetical Densities
FIGURE 1: Probability density of measuring a particular feature value $x$ given the pattern is in category $\omega_i$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Posterior Probabilities: Bayes Rule (2)
FIGURE 2: Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$, and for the class-conditional probability densities shown in Fig. 1. Given a pattern measured to have feature value $x = 14$, $P(\omega_1 \mid x = 14) = 0.92$ and $P(\omega_2 \mid x = 14) = 0.08$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Likelihood Ratio Threshold (3)
FIGURE 3: The likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$ for the distributions in Fig. 1. For the zero-one loss, our decision boundaries are determined by the threshold $\theta_a$. If our loss function penalizes misclassifying $\omega_2$ patterns as $\omega_1$ more than the converse, the threshold increases to $\theta_b$, and hence $R_1$ becomes smaller. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Decision Boundaries
FIGURE 4: If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions. The boundary is an affine hyperplane of $d-1$ dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate $p(x \mid \omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2)$. In the three-dimensional case, the grid plane separates $R_1$ from $R_2$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Decision Boundaries: Prior Probability Effect
FIGURE 5: As the priors are changed, the decision boundary shifts. For sufficiently disparate priors, the boundary will not lie between the means of these one- and two-dimensional spherical Gaussian distributions. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Global Linear Rules – 2 Classes
- Linear regression
- LDA: Bayes rule
  - Normal: different means, same covariance matrix
  - $f_1/f_0 = e^{\beta_0 + \beta^T x}$
- QDA: Bayes rule
  - Normal: different means and covariance matrices
- Logistic regression
  - Same (LDA) format: model $G(1 \mid x)/G(2 \mid x)$, or a monotone function of it, as a linear function of $x$
  - Estimate coefficients using a generalized linear model
  - Iterative algorithm finds the MLE of the parameters
Linear Boundaries in Feature Space: Non-Linear in Original Space
Estimation Method: Regression
- Linear regression
  - Use $y = 1$ for class 1, $0$ otherwise
  - Estimate the coefficient vector $\beta$
- Minimize
  $$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$
- Classify $x$ as class 1 if $\hat{f}(x) = x^T\hat{\beta} > t_0$ (sketch below)
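A minimal sketch of the regression approach on synthetic two-class data, thresholding the fitted value at $t_0 = 1/2$:

```python
import numpy as np

# Synthetic data: class 0 around (0,0), class 1 around (2,2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

Xd = np.column_stack([np.ones(len(X)), X])          # add intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # minimizes RSS(beta)

f_hat = Xd @ beta_hat
y_pred = (f_hat > 0.5).astype(int)                  # class 1 if f_hat(x) > 1/2
print("training error:", np.mean(y_pred != y))
```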
Linear Discriminant Analysis
- Classify $x$ as class 1 (0-1 loss, prior ½) if
  $$(x - \mu_1)^T W^{-1} (x - \mu_1) - (x - \mu_0)^T W^{-1} (x - \mu_0) < t_0$$
- If $W = I$, Euclidean distance between $x$ and the centroids
- If not, Karhunen-Loève transformation $x \mapsto W^{-1/2}x$:
  - De-correlates the inputs, equalizes variances
  - Spherical coordinates
  - Euclidean distance in the new coordinates
- Decision boundary orthogonal to the vector joining the centroids
- If $t_0 = 0$, the boundary bisects the line joining the two centroids
- Classify $x$ to the class with the closest centroid (sketch below)
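A minimal sketch of the two-class rule above on simulated Gaussian data, using the pooled within-class covariance and the closest-centroid (Mahalanobis distance) decision:

```python
import numpy as np

def lda_fit(X0, X1):
    # centroids and pooled within-class covariance W
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    W = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    return mu0, mu1, np.linalg.inv(W)

def lda_classify(x, mu0, mu1, W_inv):
    # Mahalanobis distances to the two centroids; t0 = 0 bisects the segment
    d1 = (x - mu1) @ W_inv @ (x - mu1)
    d0 = (x - mu0) @ W_inv @ (x - mu0)
    return 1 if d1 < d0 else 0

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], 100)
X1 = rng.multivariate_normal([2, 1], [[1, .5], [.5, 1]], 100)
mu0, mu1, W_inv = lda_fit(X0, X1)
print(lda_classify(np.array([1.8, 0.9]), mu0, mu1, W_inv))
```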
LDA: Canonical Variables / Best Discriminating Directions
- $B$ = between-class covariance matrix
  - Covariance matrix of the class means
  - Measures the pair-wise distances between the centroids
- $W$ = common within-class covariance matrix
  - Measures the variability and the extent of ellipsoidal shape (departure from spherical) of the inputs within a class
- The K-L transformation converts these inputs into a spherical point cloud (normalized and de-correlated)
- Best discriminating direction $a \in \mathbb{R}^p$: maximize
  $$\frac{a^T B a}{a^T W a}$$
  or maximize $a^T B a$ subject to $a^T W a = 1$
- Optimal solution: first PC of $W^{-1/2} B W^{-1/2}$ (sketch below)
- If $W = I$, the first PC of $B$
- Maximal separation of the data along $a$, the direction orthogonal to the decision boundary
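A sketch of computing the best discriminating direction on simulated classes: sphere the data with $W^{-1/2}$, take the leading eigenvector of $W^{-1/2} B W^{-1/2}$, and map it back to the original coordinates:

```python
import numpy as np

def best_direction(class_samples):
    mus = np.array([X.mean(axis=0) for X in class_samples])
    B = np.cov(mus, rowvar=False)                   # between-class covariance of the centroids
    n_total = sum(len(X) for X in class_samples)
    W = sum((len(X) - 1) * np.cov(X, rowvar=False) for X in class_samples)
    W = W / (n_total - len(class_samples))          # pooled within-class covariance
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # K-L sphering transform
    _, evecs = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    b_star = evecs[:, -1]                           # first PC of W^{-1/2} B W^{-1/2}
    return W_inv_sqrt @ b_star                      # direction a in the original coordinates

rng = np.random.default_rng(2)
classes = [rng.multivariate_normal(m, np.eye(2), 50) for m in ([0, 0], [3, 1], [1, 3])]
print(best_direction(classes))
```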
Centroids Spread vs Best Discriminant Direction
Multiple Classes K ≥ 3
- Linear regression
  - Extend $Y$ to an indicator vector of the $K$ classes
  - Same as one class versus the others
  - Classify into the class with the largest fitted component
  - Masking problem: some class may never dominate (sketch below)
- Pair-wise classification
  - Find the boundaries between each pair of classes
  - Label each region
  - Some boundaries get removed
- LDA or QDA
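A sketch of the indicator-matrix regression approach on simulated one-dimensional data with three equally spaced classes, where the middle class tends to be masked:

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 3, 100
X = np.concatenate([rng.normal(2 * k, 0.7, (n, 1)) for k in range(K)])
labels = np.repeat(np.arange(K), n)
Y = np.eye(K)[labels]                                  # n*K indicator matrix

Xd = np.column_stack([np.ones(len(X)), X])
B, *_ = np.linalg.lstsq(Xd, Y, rcond=None)             # one coefficient vector per class
pred = np.argmax(Xd @ B, axis=1)                       # class with the largest fitted component
print("predictions per class:", np.bincount(pred, minlength=K))
# The middle class (label 1) typically receives few or no predictions: masking.
```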
LDA for K ≥ 3 Classes
- Pick $(K-1)$ canonical variates
  - Each successively orthogonal to the previous ones and maximizing the ratio of between- to within-class sums of squares (variances)
- Project the means (centroids) and the examples orthogonally onto the space determined by these directions
- Form decision boundaries from the centroids expressed in the canonical coordinates
- An unlabeled point is classified into the class with the closest centroid (sketch below)
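A sketch using scikit-learn (an assumed dependency) that projects simulated three-class data onto the $K-1 = 2$ canonical variates and classifies in that subspace:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
means = [[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]]
X = np.vstack([rng.multivariate_normal(m, np.eye(4), 60) for m in means])
y = np.repeat([0, 1, 2], 60)

lda = LinearDiscriminantAnalysis(n_components=2)       # (K-1) discriminant directions
Z = lda.fit_transform(X, y)                            # examples in canonical coordinates
print(Z.shape)                                         # (180, 2)
print(lda.predict(X[:3]))                              # closest-centroid classification
```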
Masking Effect in Linear Regression Approach
LDA: 11 Classes Example
Bayes Decision Boundaries vs. Boundaries Learned from Examples
LDA in Feature Space vs QDA
Two-Dimensional Projections of LDA Directions
LDA and Dimension Reduction
LDA Classification Boundaries in Reduced Subspace
Regularization of QDA Covariance Matrices
- If the number of classes $K$ is large, the number of unknown parameters ($= K\,p(p+1)/2$) in the $K$ covariance matrices $\Sigma_k$ is very large.
- May get better predictions by shrinking the within-class covariance matrix estimates toward the common covariance matrix $\hat{\Sigma}$ used in LDA:
  $$\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$$
- The shrunken estimates are known to perform better than the unregularized estimates (the usual MLEs)
- Estimate the mixing coefficient $\alpha$ by cross-validation (sketch below)
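A sketch of the shrinkage formula with $\alpha$ fixed rather than cross-validated; $\alpha = 0$ recovers the pooled LDA covariance and $\alpha = 1$ recovers the per-class QDA covariances:

```python
import numpy as np

def regularized_covariances(class_samples, alpha):
    # Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    Sigmas = [np.cov(X, rowvar=False) for X in class_samples]
    n_k = [len(X) for X in class_samples]
    pooled = sum((n - 1) * S for n, S in zip(n_k, Sigmas)) / (sum(n_k) - len(n_k))
    return [alpha * S + (1 - alpha) * pooled for S in Sigmas]

rng = np.random.default_rng(5)
classes = [rng.normal(size=(30, 3)), 2.0 * rng.normal(size=(25, 3))]
for alpha in (0.0, 0.5, 1.0):
    Sk = regularized_covariances(classes, alpha)
    print(alpha, np.round(np.trace(Sk[1]), 2))   # class-2 estimate moves toward QDA as alpha grows
```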
Regularized QDA: Vowel Data
Logistic Regression
- Model the posterior probabilities of the $K$ classes via linear functions of $x$ that sum to 1
- The $(K-1)$ log-odds of each class compared to a reference class (say $K$) are modeled as linear functions of $x$, with unknown parameters:
  $$\log \frac{G(j \mid x)}{G(K \mid x)} = \beta_{j0} + \beta_j^T x, \qquad j = 1, \dots, K-1$$
- Thus
  $$G(K \mid x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$
- Given the class probabilities, a multinomial distribution for the training set
- Estimate the unknown parameters
  - Maximum likelihood
  - Generalized linear model
- Classify the object into the class with the maximum posterior probability (sketch below)
- $K = 2$: the model is especially simple (a single linear function with parameters $\beta_0$, $\beta$)
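A sketch of the posterior probabilities implied by the $(K-1)$ linear log-odds, with class $K$ as the reference; the parameter values here are hypothetical:

```python
import numpy as np

def posteriors(x, beta0, beta):
    # beta0: (K-1,) intercepts, beta: (K-1, p) slopes for the log-odds vs. class K
    eta = beta0 + beta @ x
    denom = 1.0 + np.sum(np.exp(eta))
    return np.append(np.exp(eta), 1.0) / denom     # sums to 1; last entry is class K

beta0 = np.array([0.5, -1.0])                      # hypothetical parameters, K = 3
beta = np.array([[1.0, -0.5], [0.2, 0.3]])
p = posteriors(np.array([1.0, 2.0]), beta0, beta)
print(p, p.sum(), "predicted class:", np.argmax(p) + 1)
```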
Model (Variable) Selection
- Best model selection via
  - Sequential likelihood ratios (~deviance)
  - Methods based on information criteria (AIC or BIC)
- Significance of "t-values" of coefficients can sometimes lead to meaningless conclusions
  - Correlated inputs can lead to a "non-monotone" t-statistic in logistic regression
- Graphical techniques can be very helpful
South African Heart Disease Data
LDA vs. Logistic Regression
- Both models are similar
  - Linear posterior log-odds:
    $$\log \frac{G(j \mid x)}{G(K \mid x)} = \beta_{j0} + \beta_j^T x$$
  - Posterior probabilities from the same linear functions:
    $$G(k \mid x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$
- LDA uses generative class-conditional models
  - Maximizes the log-likelihood based on the joint density
- Logistic regression
  - Fewer assumptions
  - Directly models the posterior log-odds
  - The marginal density of $X$ is left unspecified
  - Maximizes the conditional log-likelihood
- When the class conditionals are actually Gaussian
  - The additional assumption provides better estimates
  - Loss of efficiency for logistic regression is about 30%
Local Linear Rules vs. Separating Hyperplanes
- Rules that minimize the misclassification error on the training data (e.g., 1-NN)
  - Computationally hard
  - Typically not great on test data
- If two classes are perfectly separable with a linear boundary in feature space, different algorithms can find this boundary
  - Perceptron: early form of neural networks
  - Maximal margin method: SVM principle
Hyperplanes?
- The green line defines a hyperplane (affine set) $L = \{x : f(x) = \beta_0 + \beta^T x = 0\}$ in $\mathbb{R}^2$
- For $x_1, x_2 \in L$: $\beta^T(x_1 - x_2) = 0$
- Vector normal to the surface $L$: $\beta^* = \beta / \|\beta\|$
- For any $x_0 \in L$: $\beta^T x_0 = -\beta_0$
- (Signed) distance of any $x$ to $L$ (sketch below):
  $$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\,(\beta^T x + \beta_0) = \frac{f(x)}{\|f'(x)\|}$$
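A minimal sketch of the signed distance for an assumed hyperplane $x_1 + x_2 = 4$; the sign indicates which side of $L$ the point lies on:

```python
import numpy as np

beta0, beta = -4.0, np.array([1.0, 1.0])       # assumed hyperplane x1 + x2 = 4

def signed_distance(x):
    # f(x) / ||beta|| = (beta0 + beta'x) / ||beta||
    return (beta0 + beta @ x) / np.linalg.norm(beta)

print(signed_distance(np.array([3.0, 3.0])))   # positive side, distance sqrt(2)
print(signed_distance(np.array([0.0, 0.0])))   # negative side
```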
Perceptron Algorithm: A Toy Example
Perceptron Algorithm
- Given
  - Linearly separable training set $\{(x_i, y_i)\}$, $i = 1, 2, \dots, n$; $y_i = 1$ or $-1$
  - $R = \max_i \|x_i\|$; learning rate $\rho > 0$
- Find: a hyperplane $w^T x + b = 0$ such that $y_i(w^T x_i + b) > 0$ for $i = 1, 2, \dots, n$
- Initialize
  - $w_0 = 0$ (normal vector to the hyperplane); $b_0 = 0$ (intercept of the hyperplane)
  - $k = 0$ (counts updates of the hyperplane)
- Repeat
  - For $i = 1$ to $n$
    - If $y_i(w_k^T x_i + b_k) \le 0$ (mistake), then
      - $w_{k+1} = w_k + \rho\, y_i x_i$ (tilt the hyperplane toward, or past, the misclassified point)
      - $b_{k+1} = b_k + \rho\, y_i R^2$
      - $k = k + 1$
    - End If
  - End For
- Until no mistakes
- Return $(w_k, b_k)$ (sketch below)
- Novikoff: the algorithm converges in at most $(2R/\gamma)^2$ steps ($\gamma$ = margin between the sets)
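A direct transcription of the updates above, run on a tiny made-up separable set:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    # labels y_i in {-1, +1}; R = max ||x_i||
    R = np.max(np.linalg.norm(X, axis=1))
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(max_epochs):                     # "Repeat ... until no mistakes"
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:              # misclassified (or on the boundary)
                w = w + rho * yi * xi               # tilt hyperplane toward the point
                b = b + rho * yi * R ** 2
                k += 1
                mistakes += 1
        if mistakes == 0:
            return w, b, k                          # all training points correct
    raise RuntimeError("data may not be linearly separable")

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```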
Deficiencies of the Perceptron
- Many possible solutions
  - Depend on the order of the observations in the training set
- If $\gamma$ is small, the stopping time can be large
- When the data are not separable, the algorithm does not converge but goes into cycles
  - Cycles may be long and hard to recognize
Optimal Separating Hyperplane – Basis for the Support Vector Machine
- Maximize the linear gap (margin) between the two sets
- Found by quadratic programming (Vapnik) (sketch below)
- The solution is determined by just a few points (support vectors) near the boundary
  - Sparse solution in the dual space
- May be modified to maximize a margin $\gamma$ that allows for a fixed number of misclassifications
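A sketch using scikit-learn's SVC (an assumed dependency): with a linear kernel and a large cost parameter C, the fitted classifier approximates the maximal-margin separating hyperplane on separable data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.r_[-np.ones(40), np.ones(40)]

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # large C ~ hard-margin solution
print("support vectors per class:", svm.n_support_)
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("margin width:", 2 / np.linalg.norm(svm.coef_[0]))   # gap between the two sets
```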
Toy Example: SVM