Linear Classification Rules
Stat 894, Spring 2005
Linear Methods
- Features $X = (X_1, X_2, \dots, X_p)$
- Output $Y$: codes, labels $\{1, 0\}$
- Linear decision boundary in feature space
  - Could be non-linear in the original space (sketch below)
- Features: any arbitrary (known) functions of the measured attributes
  - Transformations of quantitative attributes
  - Basis expansions
    - Polynomials, radial basis functions
- $f(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$
- $f(x) = 0$ partitions the feature space into two regions
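As a small illustration (with made-up coefficients and a hypothetical basis map $h$), the sketch below builds a rule that is linear in the expanded features but quadratic, hence non-linear, in the original inputs:

```python
import numpy as np

# Hypothetical sketch: a rule that is linear in the expanded features
# h(x) = (x1, x2, x1^2, x2^2, x1*x2) but quadratic in the original x.
def basis_expansion(x):
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

beta0 = -1.0                                    # assumed coefficients
beta = np.array([0.0, 0.0, 1.0, 1.0, 0.0])      # f(x) = x1^2 + x2^2 - 1

def f(x):
    # f(x) = beta0 + sum_j beta_j * h_j(x); f(x) = 0 is the decision boundary
    return beta0 + beta @ basis_expansion(x)

print(f(np.array([0.5, 0.5])) > 0)   # inside the unit circle  -> False (class 0)
print(f(np.array([2.0, 0.0])) > 0)   # outside the unit circle -> True  (class 1)
```

Here the boundary $f(x) = 0$ is a line in the five-dimensional feature space but a circle in the original two-dimensional space.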
Generative Model
- Given $Y = y$, the probability distribution (class conditional) $f_y(x)$ of the inputs $x$
- Class conditionals and prior probabilities known
- Optimal rule (Bayes rule): classify a pattern with input $x$ to class 1 if (sketch below)
  $$\frac{f_1(x)}{f_0(x)} > t$$
- Threshold $t$ depends on the misclassification costs and the prior probabilities
- Bayes rule: posterior probability $G(j \mid x) = P(Y = j \mid X = x)$
- Given misclassification costs, the optimal decision minimizes the expected posterior loss
- All misclassification costs equal: maximize the posterior probability
- Class conditionals not fully specified: learn a proxy for the Bayes rule from training-set examples
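A minimal sketch of the Bayes rule, assuming two univariate Gaussian class conditionals and the priors shown below (scipy is an assumed dependency):

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities and priors for a two-class problem.
f1 = norm(loc=2.0, scale=1.0).pdf    # density of x given class 1
f0 = norm(loc=0.0, scale=1.0).pdf    # density of x given class 0
prior1, prior0 = 0.3, 0.7

# With equal misclassification costs the likelihood-ratio threshold is
# t = prior0 / prior1; unequal costs would rescale t accordingly.
t = prior0 / prior1

def bayes_classify(x):
    # classify to class 1 if f1(x) / f0(x) > t
    return 1 if f1(x) / f0(x) > t else 0

print(bayes_classify(0.5), bayes_classify(2.5))   # -> 0 1
```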
Class Conditionals Known: Hypothetical Densities
FIGURE 1: Probability density of measuring a particular feature value $x$ given the pattern is in category $\omega_i$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Posterior Probabilities: Bayes Rule (2)
FIGURE 2: Posterior probabilities for the particular priors $P(\omega_1) = 2/3$ and $P(\omega_2) = 1/3$, and for the class-conditional probability densities shown in Fig. 1. Given a pattern measured to have feature value $x = 14$, $P(\omega_1 \mid x = 14) = 0.92$ and $P(\omega_2 \mid x = 14) = 0.08$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Likelihood Ratio Threshold (3)
FIGURE 3: The likelihood ratio $p(x \mid \omega_1)/p(x \mid \omega_2)$ for the distributions in Fig. 1. For the zero-one loss, our decision boundaries are determined by the threshold $\theta_a$. If our loss function penalizes misclassifying $\omega_2$ patterns as $\omega_1$ more than the converse, the threshold increases to $\theta_b$, and hence $R_1$ becomes smaller. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Decision Boundaries
FIGURE 4: If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions. The boundary is an affine hyperplane of $d-1$ dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate $p(x \mid \omega_i)$ and the boundaries for the case $P(\omega_1) = P(\omega_2)$. In the three-dimensional case, the grid plane separates $R_1$ from $R_2$. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Decision Boundaries: Prior Probability Effect
FIGURE 5: As the priors are changed, the decision boundary shifts. For sufficiently disparate priors, the boundary will not lie between the means of these one- and two-dimensional spherical Gaussian distributions. [From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]
Global Linear Rules – 2 Classes
- Linear regression
- LDA: Bayes rule
  - Normal: different means, same covariance matrix
  - $f_1/f_0 = e^{\beta_0 + \beta^T x}$
- QDA: Bayes rule
  - Normal: different means and covariance matrices
- Logistic regression
  - Same (LDA) format: model $G(1 \mid x)/G(2 \mid x)$, or a monotone function of it, as a linear function of $x$
  - Estimate coefficients using a generalized linear model
  - Iterative algorithm finds the MLE of the parameters
Linear Boundaries in Feature Space: Non-Linear in Original Space
Estimation Method: Regression
- Linear regression
  - Use $y = 1$ for class 1, $0$ otherwise
  - Estimate the coefficient vector $\beta$
- Minimize
  $$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$
- Classify $x$ as class 1 if $\hat{f}(x) = x^T\hat{\beta} > t_0$ (sketch below)
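A minimal sketch of the regression approach on synthetic two-class data, thresholding the fitted value at $t_0 = 1/2$:

```python
import numpy as np

# Synthetic data: class 0 around (0,0), class 1 around (2,2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

Xd = np.column_stack([np.ones(len(X)), X])          # add intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)   # minimizes RSS(beta)

f_hat = Xd @ beta_hat
y_pred = (f_hat > 0.5).astype(int)                  # class 1 if f_hat(x) > 1/2
print("training error:", np.mean(y_pred != y))
```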
Linear Discriminant Analysis
- Classify $x$ as class 1 (0-1 loss, prior ½) if
  $$(x - \mu_1)^T W^{-1} (x - \mu_1) - (x - \mu_0)^T W^{-1} (x - \mu_0) < t_0$$
- If $W = I$, Euclidean distance between $x$ and the centroids
- If not, Karhunen-Loève transformation $x \mapsto W^{-1/2}x$:
  - De-correlates the inputs, equalizes variances
  - Spherical coordinates
  - Euclidean distance in the new coordinates
- Decision boundary orthogonal to the vector joining the centroids
- If $t_0 = 0$, the boundary bisects the line joining the two centroids
- Classify $x$ to the class with the closest centroid (sketch below)
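A minimal sketch of the two-class rule above on simulated Gaussian data, using the pooled within-class covariance and the closest-centroid (Mahalanobis distance) decision:

```python
import numpy as np

def lda_fit(X0, X1):
    # centroids and pooled within-class covariance W
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    W = ((n0 - 1) * np.cov(X0, rowvar=False) +
         (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    return mu0, mu1, np.linalg.inv(W)

def lda_classify(x, mu0, mu1, W_inv):
    # Mahalanobis distances to the two centroids; t0 = 0 bisects the segment
    d1 = (x - mu1) @ W_inv @ (x - mu1)
    d0 = (x - mu0) @ W_inv @ (x - mu0)
    return 1 if d1 < d0 else 0

rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], 100)
X1 = rng.multivariate_normal([2, 1], [[1, .5], [.5, 1]], 100)
mu0, mu1, W_inv = lda_fit(X0, X1)
print(lda_classify(np.array([1.8, 0.9]), mu0, mu1, W_inv))
```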
LDA: Canonical Variables / Best Discriminating Directions
- $B$ = between-class covariance matrix
  - Covariance matrix of the class means
  - Measures the pair-wise distances between the centroids
- $W$ = common within-class covariance matrix
  - Measures the variability and the extent of ellipsoidal shape (departure from spherical) of the inputs within a class
- The K-L transformation converts these inputs into a spherical point cloud (normalized and de-correlated)
- Best discriminating direction $a \in \mathbb{R}^p$: maximize
  $$\frac{a^T B a}{a^T W a}$$
  or maximize $a^T B a$ subject to $a^T W a = 1$
- Optimal solution: first PC of $W^{-1/2} B W^{-1/2}$ (sketch below)
- If $W = I$, the first PC of $B$
- Maximal separation of the data along $a$, the direction orthogonal to the decision boundary
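A sketch of computing the best discriminating direction on simulated classes: sphere the data with $W^{-1/2}$, take the leading eigenvector of $W^{-1/2} B W^{-1/2}$, and map it back to the original coordinates:

```python
import numpy as np

def best_direction(class_samples):
    mus = np.array([X.mean(axis=0) for X in class_samples])
    B = np.cov(mus, rowvar=False)                   # between-class covariance of the centroids
    n_total = sum(len(X) for X in class_samples)
    W = sum((len(X) - 1) * np.cov(X, rowvar=False) for X in class_samples)
    W = W / (n_total - len(class_samples))          # pooled within-class covariance
    vals, vecs = np.linalg.eigh(W)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T   # K-L sphering transform
    _, evecs = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    b_star = evecs[:, -1]                           # first PC of W^{-1/2} B W^{-1/2}
    return W_inv_sqrt @ b_star                      # direction a in the original coordinates

rng = np.random.default_rng(2)
classes = [rng.multivariate_normal(m, np.eye(2), 50) for m in ([0, 0], [3, 1], [1, 3])]
print(best_direction(classes))
```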
Centroids Spread vs Best Discriminant Direction
Multiple Classes K ≥ 3
- Linear regression
  - Extend $Y$ to an indicator vector of the $K$ classes
  - Same as one class versus the others
  - Classify into the class with the largest fitted component
  - Masking problem: some class may never dominate (sketch below)
- Pair-wise classification
  - Find the boundaries between each pair of classes
  - Label each region
  - Some boundaries get removed
- LDA or QDA
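A sketch of the indicator-matrix regression approach on simulated one-dimensional data with three equally spaced classes, where the middle class tends to be masked:

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 3, 100
X = np.concatenate([rng.normal(2 * k, 0.7, (n, 1)) for k in range(K)])
labels = np.repeat(np.arange(K), n)
Y = np.eye(K)[labels]                                  # n*K indicator matrix

Xd = np.column_stack([np.ones(len(X)), X])
B, *_ = np.linalg.lstsq(Xd, Y, rcond=None)             # one coefficient vector per class
pred = np.argmax(Xd @ B, axis=1)                       # class with the largest fitted component
print("predictions per class:", np.bincount(pred, minlength=K))
# The middle class (label 1) typically receives few or no predictions: masking.
```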
LDA for K ≥ 3 Classes
- Pick $(K-1)$ canonical variates
  - Each successively orthogonal to the previous ones and maximizing the ratio of between- to within-class sums of squares (variances)
- Project the means (centroids) and the examples orthogonally onto the space determined by these directions
- Form decision boundaries from the centroids expressed in the canonical coordinates
- An unlabeled point is classified into the class with the closest centroid (sketch below)
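A sketch using scikit-learn (an assumed dependency) that projects simulated three-class data onto the $K-1 = 2$ canonical variates and classifies in that subspace:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
means = [[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]]
X = np.vstack([rng.multivariate_normal(m, np.eye(4), 60) for m in means])
y = np.repeat([0, 1, 2], 60)

lda = LinearDiscriminantAnalysis(n_components=2)       # (K-1) discriminant directions
Z = lda.fit_transform(X, y)                            # examples in canonical coordinates
print(Z.shape)                                         # (180, 2)
print(lda.predict(X[:3]))                              # closest-centroid classification
```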
Masking Effect in Linear Regression Approach
LDA: 11 Classes Example
Bayes Decision Boundaries vs. Boundaries Learned from Examples
LDA in Feature Space vs QDA
Two-Dimensional Projections of LDA Directions
LDA and Dimension Reduction
LDA Classification Boundaries in Reduced Subspace
Regularization of QDA Covariance Matrices
- If the number of classes $K$ is large, the number of unknown parameters ($= K\,p(p+1)/2$) in the $K$ covariance matrices $\Sigma_k$ is very large.
- May get better predictions by shrinking the within-class covariance matrix estimates toward the common covariance matrix $\hat{\Sigma}$ used in LDA:
  $$\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}$$
- The shrunken estimates are known to perform better than the unregularized estimates (the usual MLEs)
- Estimate the mixing coefficient $\alpha$ by cross-validation (sketch below)
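A sketch of the shrinkage formula with $\alpha$ fixed rather than cross-validated; $\alpha = 0$ recovers the pooled LDA covariance and $\alpha = 1$ recovers the per-class QDA covariances:

```python
import numpy as np

def regularized_covariances(class_samples, alpha):
    # Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled
    Sigmas = [np.cov(X, rowvar=False) for X in class_samples]
    n_k = [len(X) for X in class_samples]
    pooled = sum((n - 1) * S for n, S in zip(n_k, Sigmas)) / (sum(n_k) - len(n_k))
    return [alpha * S + (1 - alpha) * pooled for S in Sigmas]

rng = np.random.default_rng(5)
classes = [rng.normal(size=(30, 3)), 2.0 * rng.normal(size=(25, 3))]
for alpha in (0.0, 0.5, 1.0):
    Sk = regularized_covariances(classes, alpha)
    print(alpha, np.round(np.trace(Sk[1]), 2))   # class-2 estimate moves toward QDA as alpha grows
```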
Regularized QDA: Vowel Data
Logistic Regression
- Model the posterior probabilities of the $K$ classes via linear functions of $x$ that sum to 1
- The $(K-1)$ log-odds of each class compared to a reference class (say $K$) are modeled as linear functions of $x$, with unknown parameters:
  $$\log \frac{G(j \mid x)}{G(K \mid x)} = \beta_{j0} + \beta_j^T x, \qquad j = 1, \dots, K-1$$
- Thus
  $$G(K \mid x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$
- Given the class probabilities, a multinomial distribution for the training set
- Estimate the unknown parameters
  - Maximum likelihood
  - Generalized linear model
- Classify the object into the class with the maximum posterior probability (sketch below)
- $K = 2$: the model is especially simple (a single linear function with parameters $\beta_0$, $\beta$)
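A sketch of the posterior probabilities implied by the $(K-1)$ linear log-odds, with class $K$ as the reference; the parameter values here are hypothetical:

```python
import numpy as np

def posteriors(x, beta0, beta):
    # beta0: (K-1,) intercepts, beta: (K-1, p) slopes for the log-odds vs. class K
    eta = beta0 + beta @ x
    denom = 1.0 + np.sum(np.exp(eta))
    return np.append(np.exp(eta), 1.0) / denom     # sums to 1; last entry is class K

beta0 = np.array([0.5, -1.0])                      # hypothetical parameters, K = 3
beta = np.array([[1.0, -0.5], [0.2, 0.3]])
p = posteriors(np.array([1.0, 2.0]), beta0, beta)
print(p, p.sum(), "predicted class:", np.argmax(p) + 1)
```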
Model (Variable) Selection
- Best model selection via
  - Sequential likelihood ratios (~deviance)
  - Methods based on information criteria (AIC or BIC)
- Significance of "t-values" of coefficients can sometimes lead to meaningless conclusions
  - Correlated inputs can lead to a "non-monotone" t-statistic in logistic regression
- Graphical techniques can be very helpful
South African Heart Disease Data
LDA vs. Logistic Regression
- Both models are similar
  - Linear posterior log-odds:
    $$\log \frac{G(j \mid x)}{G(K \mid x)} = \beta_{j0} + \beta_j^T x$$
  - Posterior probabilities from the same linear functions:
    $$G(k \mid x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$$
- LDA uses generative class-conditional models
  - Maximizes the log-likelihood based on the joint density
- Logistic regression
  - Fewer assumptions
  - Directly models the posterior log-odds
  - The marginal density of $X$ is left unspecified
  - Maximizes the conditional log-likelihood
- When the class conditionals are actually Gaussian
  - The additional assumption provides better estimates
  - Loss of efficiency for logistic regression is about 30%
Local Linear Rules vs. Separating Hyperplanes
- Rules that minimize the misclassification error on the training data (e.g., 1-NN)
  - Computationally hard
  - Typically not great on test data
- If two classes are perfectly separable with a linear boundary in feature space, different algorithms can find this boundary
  - Perceptron: early form of neural networks
  - Maximal margin method: SVM principle
Hyperplanes?
- The green line defines a hyperplane (affine set) $L = \{x : f(x) = \beta_0 + \beta^T x = 0\}$ in $\mathbb{R}^2$
- For $x_1, x_2 \in L$: $\beta^T(x_1 - x_2) = 0$
- Vector normal to the surface $L$: $\beta^* = \beta / \|\beta\|$
- For any $x_0 \in L$: $\beta^T x_0 = -\beta_0$
- (Signed) distance of any $x$ to $L$ (sketch below):
  $$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\,(\beta^T x + \beta_0) = \frac{f(x)}{\|f'(x)\|}$$
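A minimal sketch of the signed distance for an assumed hyperplane $x_1 + x_2 = 4$; the sign indicates which side of $L$ the point lies on:

```python
import numpy as np

beta0, beta = -4.0, np.array([1.0, 1.0])       # assumed hyperplane x1 + x2 = 4

def signed_distance(x):
    # f(x) / ||beta|| = (beta0 + beta'x) / ||beta||
    return (beta0 + beta @ x) / np.linalg.norm(beta)

print(signed_distance(np.array([3.0, 3.0])))   # positive side, distance sqrt(2)
print(signed_distance(np.array([0.0, 0.0])))   # negative side
```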
Perceptron Algorithm: A Toy Example
Perceptron Algorithm
- Given
  - Linearly separable training set $\{(x_i, y_i)\}$, $i = 1, 2, \dots, n$; $y_i = 1$ or $-1$
  - $R = \max_i \|x_i\|$; learning rate $\rho > 0$
- Find: a hyperplane $w^T x + b = 0$ such that $y_i(w^T x_i + b) > 0$ for $i = 1, 2, \dots, n$
- Initialize
  - $w_0 = 0$ (normal vector to the hyperplane); $b_0 = 0$ (intercept of the hyperplane)
  - $k = 0$ (counts updates of the hyperplane)
- Repeat
  - For $i = 1$ to $n$
    - If $y_i(w_k^T x_i + b_k) \le 0$ (mistake), then
      - $w_{k+1} = w_k + \rho\, y_i x_i$ (tilt the hyperplane toward, or past, the misclassified point)
      - $b_{k+1} = b_k + \rho\, y_i R^2$
      - $k = k + 1$
    - End If
  - End For
- Until no mistakes
- Return $(w_k, b_k)$ (sketch below)
- Novikoff: the algorithm converges in at most $(2R/\gamma)^2$ steps ($\gamma$ = margin between the sets)
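A direct transcription of the updates above, run on a tiny made-up separable set:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    # labels y_i in {-1, +1}; R = max ||x_i||
    R = np.max(np.linalg.norm(X, axis=1))
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(max_epochs):                     # "Repeat ... until no mistakes"
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:              # misclassified (or on the boundary)
                w = w + rho * yi * xi               # tilt hyperplane toward the point
                b = b + rho * yi * R ** 2
                k += 1
                mistakes += 1
        if mistakes == 0:
            return w, b, k                          # all training points correct
    raise RuntimeError("data may not be linearly separable")

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```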
Deficiencies of the Perceptron
- Many possible solutions
  - Depend on the order of the observations in the training set
- If $\gamma$ is small, the stopping time can be large
- When the data are not separable, the algorithm does not converge but goes into cycles
  - Cycles may be long and hard to recognize
Optimal Separating Hyperplane – Basis for the Support Vector Machine
- Maximize the linear gap (margin) between the two sets
- Found by quadratic programming (Vapnik) (sketch below)
- The solution is determined by just a few points (support vectors) near the boundary
  - Sparse solution in the dual space
- May be modified to maximize a margin $\gamma$ that allows for a fixed number of misclassifications
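A sketch using scikit-learn's SVC (an assumed dependency): with a linear kernel and a large cost parameter C, the fitted classifier approximates the maximal-margin separating hyperplane on separable data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.r_[-np.ones(40), np.ones(40)]

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # large C ~ hard-margin solution
print("support vectors per class:", svm.n_support_)
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("margin width:", 2 / np.linalg.norm(svm.coef_[0]))   # gap between the two sets
```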
Toy Example: SVM