Lecture 8: Multiclass Classification (I)
Hao Helen Zhang
Multiclass Classification
General setup
Bayes rule for multiclass problems
under equal costs
under unequal costs
Linear multiclass methods
Linear regression models
Linear discriminant analysis (LDA)
Multiple logistic regression models
Multiclass Problems
Class label Y ∈ {1, ...,K},K ≥ 3.
The classifier f : R^d → {1, . . . , K}.
Examples: zip code classification, multiple-type cancer classification, vowel recognition.
The loss function
C(k, l) = cost of classifying a sample from class k into class l.
In general, C(k, k) = 0 for any k = 1, · · · , K. The loss can be described as a K × K matrix. For example, for K = 3,
C = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}.
Terminology
Prior class probabilities (marginal distribution of Y ):
πk = P(Y = k), k = 1, · · · ,K
Conditional distribution of X given Y : assume that
gk is the conditional density of X given Y = k , k = 1, · · · ,K .
Marginal density of X (a mixture):
g(x) = \sum_{k=1}^{K} \pi_k g_k(x).
Posterior class probabilities (conditional dist. of Y given X):
P(Y = k \mid X = x) = \frac{\pi_k g_k(x)}{g(x)}, \quad k = 1, \cdots, K.
Bayes Rule
The optimal (Bayes) rule aims to minimize the average loss over the population:
\phi_B(x) = \arg\min_f E_{X,Y} C(Y, f(X)) = \arg\min_f E_X E_{Y|X} C(Y, f(X)),
where
E_{Y|X} C(Y, f(X)) = \sum_{k=1}^{K} I(f(X) = k) \sum_{l=1}^{K} C(l, k) P(Y = l \mid X).
Bayes Rule Under Equal Costs
If C(k, l) = I(k ≠ l), then R[f] = E_X E_{Y|X} I(Y ≠ f(X)) and
E_{Y|X} C(Y, f(X)) = \sum_{k=1}^{K} I(f(X) = k) \sum_{l \neq k} P(Y = l \mid X)
= \sum_{k=1}^{K} I(f(X) = k) \, [1 − P(Y = k \mid X)].
Given x, the minimizer of E_{Y|X=x} C(Y, f(X)) is f(x) = k^*, where
k^* = \arg\min_{k=1,\dots,K} \, [1 − P(Y = k \mid x)].
The Bayes rule is
\phi_B(x) = k^* \text{ if } P(Y = k^* \mid x) = \max_{k=1,\dots,K} P(Y = k \mid x),
which assigns x to the most probable class under Pr(Y = k | X = x).
Bayes Rule Under Unequal Costs
For the general loss function, the Bayes rule can be derived as
\phi_B(x) = k^* \text{ if } k^* = \arg\min_{k=1,\dots,K} \sum_{l=1}^{K} C(l, k) P(Y = l \mid x).
There is not a simple analytic form for the solution.
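To make the rule concrete, here is a minimal numpy sketch that evaluates it for a given cost matrix and posterior vector; the layout C[l, k] = C(l, k) and the 1-based class labels are assumptions matching the slides:

    import numpy as np

    def bayes_rule(C, post):
        # Expected cost of predicting class k: sum_l C(l, k) P(Y = l | x).
        expected_cost = C.T @ post
        return int(np.argmin(expected_cost)) + 1   # back to 1-based labels

    # The 0-1 cost matrix from the earlier slide (K = 3):
    C = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])
    post = np.array([0.2, 0.5, 0.3])               # P(Y = l | x)
    print(bayes_rule(C, post))                     # 2: the most probable class

Under equal costs the expected cost of class k is 1 − P(Y = k | x), so the rule reduces to the most-probable-class rule above.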
Decision (Discriminating) Functions
For multiclass problems, we generally need to estimate multiple discriminant functions fk(x), k = 1, · · · , K.
Each fk(x) is associated with class k.
fk(x) represents the strength of evidence that a sample belongs to class k.
The decision rule constructed from the fk's is
f(x) = k^*, \quad \text{where } k^* = \arg\max_{k=1,\dots,K} f_k(x).
The decision boundary of the classification rule f between class k and class l is defined as
{x : fk(x) = fl(x)}.
Traditional Methods: Divide and Conquer
Main ideas:
(i) Decompose the multiclass classification problem into multiple binary classification problems.
(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.
Common approaches: simple but effective
One-vs-rest (one-vs-all) approaches
Pairwise (one-vs-one, all-vs-all) approaches
One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
(i) Solve K different binary problems: classify “class k” versus “the rest” for k = 1, · · · , K.
(ii) Assign a test sample to the class giving the largest (most positive) fk(x) value, where fk(x) is the solution of the kth problem.
Properties:
Very simple to implement; performs well in practice.
Not asymptotically optimal: the decision rule is not Fisher consistent if there is no dominating class (i.e., max_k p_k(x) < 1/2).
Read: Rifkin and Klautau (2004), “In Defense of One-vs-all Classification”.
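A minimal one-vs-rest sketch in numpy; fit_binary is a hypothetical routine (not defined here) that takes (X, z) with z ∈ {0, 1} and returns a real-valued scoring function, standing in for any binary classifier with continuous outputs:

    import numpy as np

    def one_vs_rest_fit(X, y, K, fit_binary):
        # Solve K binary problems: class k (coded 1) versus the rest (coded 0).
        return [fit_binary(X, (y == k).astype(float)) for k in range(1, K + 1)]

    def one_vs_rest_predict(scorers, X):
        # Assign each sample to the class whose score f_k(x) is largest.
        scores = np.column_stack([f(X) for f in scorers])
        return np.argmax(scores, axis=1) + 1       # labels in {1, ..., K}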
Pairwise Approach
Also known as the all-vs-all (AVA) approach.
(i) Solve \binom{K}{2} different binary problems: classify “class k” versus “class j” for all j ≠ k. Call the resulting classifier g_{kj}.
(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.
Properties:
Training is efficient, since each binary problem is small.
If K is big, there are too many problems to solve: for K = 10, we already need to train 45 binary classifiers.
Simple to implement; performs competitively in practice.
Read: Park and Fürnkranz (2007), “Efficient Pairwise Classification”.
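A voting sketch for the pairwise approach, under the same hypothetical fit_binary interface as in the one-vs-rest sketch; thresholding the score at 0.5 to cast a vote is an additional assumption (scores treated as probabilities):

    import numpy as np
    from itertools import combinations

    def pairwise_fit(X, y, K, fit_binary):
        # One binary problem per unordered pair: class k (coded 1) vs class j (coded 0).
        g = {}
        for k, j in combinations(range(1, K + 1), 2):
            mask = (y == k) | (y == j)
            g[(k, j)] = fit_binary(X[mask], (y[mask] == k).astype(float))
        return g

    def pairwise_predict(g, X, K):
        votes = np.zeros((X.shape[0], K))
        for (k, j), f in g.items():
            s = f(X)                       # each classifier queried once
            votes[:, k - 1] += (s >= 0.5)  # vote for class k
            votes[:, j - 1] += (s < 0.5)   # vote for class j
        return np.argmax(votes, axis=1) + 1  # class with the most votes wins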
Linear Classifier for Multiclass Problems
Assume the decision boundaries are linear. Common linear classification methods:
Multivariate linear regression methods
Linear log-odds (logit) models:
Linear discriminant analysis (LDA)
Multiple logistic regression
Separating hyperplanes – explicitly model the boundaries as linear:
Perceptron model, SVMs
Coding for Linear Regression
For each class k, code it using an indicator variable Zk:
Zk = 1 if Y = k;
Zk = 0 if Y ≠ k.
The response y is coded using a vector z = (z1, ..., zK). If Y = k, only the kth component of z equals 1.
The entire training sample can then be coded using an n × K matrix, denoted by Z.
We call Z the indicator response matrix.
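A one-hot construction of the indicator response matrix Z as a short numpy sketch (labels assumed coded 1, . . . , K as in the slides):

    import numpy as np

    def indicator_response_matrix(y, K):
        # y: length-n array of labels in {1, ..., K}; returns the n x K matrix Z.
        n = len(y)
        Z = np.zeros((n, K))
        Z[np.arange(n), y - 1] = 1.0     # row i has a single 1 in column y_i - 1
        return Z

    print(indicator_response_matrix(np.array([1, 3, 2, 3]), K=3))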
Multivariate Linear Regression
Consider the multivariate model
Z = XB + E, or Zk = fk ≡ Xbk + εk , k = 1, · · · ,K .
Zk is the kth column of the indicator response matrix Z.
X is the n × (d + 1) input matrix.
E is the n × K matrix of errors.
B is the (d + 1) × K matrix of parameters; bk is the kth column of B.
Ordinary Least Squares Estimates
Minimize the residual sum of squares
RSS(B) = \sum_{k=1}^{K} \sum_{i=1}^{n} \Big[ z_{ik} − \beta_{0k} − \sum_{j=1}^{d} \beta_{jk} x_{ij} \Big]^2 = \mathrm{tr}\big[(Z − XB)^T (Z − XB)\big].
The LS estimate has the same form as in the univariate case:
\hat{B} = (X^T X)^{−1} X^T Z, \quad \hat{Z} = X (X^T X)^{−1} X^T Z \equiv X \hat{B}.
The coefficient \hat{b}_k for the kth response is the same as the univariate OLS estimate (multiple outputs do not affect the OLS estimates).
For x, compute \hat{f}(x) = \big[(1, x^T)\hat{B}\big]^T and classify it as
\hat{Y}(x) = \arg\max_{k=1,\dots,K} \hat{f}_k(x).
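The whole procedure (indicator coding, least squares fit, argmax classification) as a minimal numpy sketch; np.linalg.lstsq replaces the explicit normal equations for numerical stability:

    import numpy as np

    def fit_indicator_regression(X, y, K):
        # X: n x d inputs; y: labels in {1, ..., K}.
        n = X.shape[0]
        X1 = np.hstack([np.ones((n, 1)), X])        # intercept column of 1's
        Z = np.zeros((n, K))
        Z[np.arange(n), y - 1] = 1.0                # indicator response matrix
        B, *_ = np.linalg.lstsq(X1, Z, rcond=None)  # (d+1) x K coefficient matrix
        return B

    def predict_indicator_regression(B, X):
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])
        F = X1 @ B                                  # fitted f_k(x), one column per class
        return np.argmax(F, axis=1) + 1             # labels in {1, ..., K}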
About Multiple Linear Regression
Justification: each \hat{f}_k(x) is an estimate of the conditional expectation of Yk, which equals the conditional probability that (X, Y) belongs to class k:
E(Y_k \mid X = x) = Pr(Y = k \mid X = x).
If the intercept is in the model (a column of 1's in X), then \sum_{k=1}^{K} \hat{f}_k(x) = 1 for any x.
If the linear assumption is appropriate, then \hat{f}_k gives a consistent estimate of the class probability for k = 1, · · · , K, as the sample size n goes to infinity.
Limitations of Linear Regression
There is no guarantee that all \hat{f}_k ∈ [0, 1]; some can be negative or greater than 1.
The linear structure in the X's can be rigid.
There is a serious masking problem for K ≥ 3 due to the rigid linear structure.
[Figure 4.2 (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 4); left panel: Linear Regression, right panel: Linear Discriminant Analysis; axes X1 and X2. Caption: The data come from three classes in R^2 and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).]
[Figure 4.3 (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 4); left panel: Degree = 1; Error = 0.25, right panel: Degree = 2; Error = 0.03. Caption: The effects of masking on linear regression in R for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the red class, y_red is 1 for the red observations, and 0 for the green and blue. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]
About Masking Problems
One-dimensional example: (see the previous figure)
The data are projected onto the line joining the three class centroids.
There is no information in the northwest–southeast direction.
The three curves are the fitted regression lines.
The left panel shows the linear regression fit: the middle class is never dominant!
The right panel shows the quadratic fit: the masking problem is resolved.
How to Solve Masking Problems
In the previous example with K = 3:
Use quadratic regression rather than linear regression.
Linear regression error 0.25; quadratic regression error 0.03; Bayes error 0.025.
In general,
If K ≥ 3 classes are lined up, we need polynomial terms up to degree K − 1 to resolve the masking problem.
In the worst case, we need O(d^{K−1}) terms.
Though LDA is also based on linear functions, it does not suffer from masking problems.
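As a rough sketch of the fix, one can augment the inputs with quadratic terms before running the indicator regression; fit_indicator_regression refers to the sketch given earlier:

    import numpy as np

    def quadratic_features(X):
        # Augment [x_1, ..., x_d] with all squares and pairwise products.
        n, d = X.shape
        cross = [X[:, [i]] * X[:, [j]] for i in range(d) for j in range(i, d)]
        return np.hstack([X] + cross)

    # B = fit_indicator_regression(quadratic_features(X), y, K=3)
    # resolves the masking seen with the purely linear fit in the figure above.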
Large K , Small d Problem
Masking often occurs for large K and small d. Example: vowel data with K = 11 and d = 10 (two-dimensional view of the training data using the LDA projection).
A difficult classification problem, as the class overlap is considerable:
the best methods achieve about 40% test error.
Method                Training Error   Test Error
Linear regression     0.48             0.67
LDA                   0.32             0.56
QDA                   0.01             0.53
Logistic regression   0.22             0.51
[Figure (Elements of Statistical Learning, Chapter 4): Linear Discriminant Analysis. A two-dimensional view of the vowel training data, plotted as Coordinate 2 vs. Coordinate 1 of the LDA projection, with the projected class means marked.]
Model Assumptions for Linear Discriminant Analysis
Model Setup
Let πk be the prior probability of class k.
Let gk(x) be the class-conditional density of X in class k.
The posterior probability
Pr(Y = k \mid X = x) = \frac{g_k(x)\,\pi_k}{\sum_{l=1}^{K} g_l(x)\,\pi_l}.
Model Assumptions:
Assume each class density is multivariate Gaussian N(µk, Σk).
Further, we assume Σk = Σ for all k (equal covariance is required!).
Linear Discriminant Function
The log-ratio between class k and l is
\log \frac{Pr(Y = k \mid X = x)}{Pr(Y = l \mid X = x)} = \log \frac{\pi_k}{\pi_l} + \log \frac{g_k(x)}{g_l(x)}
= \Big[ x^T \Sigma^{−1} \mu_k − \tfrac{1}{2} \mu_k^T \Sigma^{−1} \mu_k + \log \pi_k \Big] − \Big[ x^T \Sigma^{−1} \mu_l − \tfrac{1}{2} \mu_l^T \Sigma^{−1} \mu_l + \log \pi_l \Big].
For each class k, its discriminant function is
f_k(x) = x^T \Sigma^{−1} \mu_k − \tfrac{1}{2} \mu_k^T \Sigma^{−1} \mu_k + \log \pi_k.
The quadratic term cancels because of the equal covariance matrix. The decision rule is
\hat{Y}(x) = \arg\max_{k=1,\dots,K} f_k(x).
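These discriminant functions in numpy, as a minimal sketch; the class means mu (a K x d array), the common covariance Sigma, and the priors are assumed to be given (or estimated as on a later slide):

    import numpy as np

    def lda_discriminants(X, mu, Sigma, prior):
        # f_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
        A = np.linalg.solve(Sigma, mu.T)                 # columns: Sigma^{-1} mu_k
        const = -0.5 * np.sum(mu.T * A, axis=0) + np.log(prior)
        return X @ A + const                             # n x K matrix of f_k(x)

    def lda_predict(X, mu, Sigma, prior):
        return np.argmax(lda_discriminants(X, mu, Sigma, prior), axis=1) + 1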
[Figure 4.5 (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 4). Caption: The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.]
Interpretation of LDA
\log Pr(Y = k \mid X = x) = −\tfrac{1}{2} (x − \mu_k)^T \Sigma^{−1} (x − \mu_k) + \log \pi_k + \text{const}.
The constant term does not involve x.
If the prior probabilities are the same, LDA classifies x to the class whose centroid is closest to x in the squared Mahalanobis distance dΣ(x, µ) based on the common covariance matrix.
If dΣ(x, µk) = dΣ(x, µl), then the priors determine the classification.
Special Case: Σk = I for all k
f_k(x) = −\tfrac{1}{2} \|x − \mu_k\|^2 + \log \pi_k:
only the Euclidean distance is needed!
Mahalanobis Distance
The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.
Def: The Mahalanobis distance of x = (x1, . . . , xd)^T from a set of points with mean µ = (µ1, . . . , µd)^T and covariance matrix Σ is defined as
d_\Sigma(x, \mu) = \big[ (x − \mu)^T \Sigma^{−1} (x − \mu) \big]^{1/2}.
It differs from Euclidean distance in that
it takes into account the correlations of the data set
it is scale-invariant.
It can also be defined as a dissimilarity measure between two points x and x′ of the same distribution with covariance matrix Σ:
d_\Sigma(x, x′) = \big[ (x − x′)^T \Sigma^{−1} (x − x′) \big]^{1/2}.
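A small numpy sketch of the definition (np.linalg.solve is used instead of forming Sigma^{-1} explicitly):

    import numpy as np

    def mahalanobis(x, mu, Sigma):
        # d_Sigma(x, mu) = [(x - mu)^T Sigma^{-1} (x - mu)]^{1/2}
        diff = x - mu
        return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    print(mahalanobis(np.array([1.0, 2.0]), np.zeros(2), Sigma))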
Special Cases
If Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
If Σ is diagonal, diag(σ_1^2, · · · , σ_d^2), then the resulting distance measure is called the normalized Euclidean distance:
d_\Sigma(x, x′) = \Big[ \sum_{i=1}^{d} \frac{(x_i − x′_i)^2}{\sigma_i^2} \Big]^{1/2},
where σi is the standard deviation of xi and x′i.
Normalized Distance
Suppose we want to decide whether x belongs to a class. Intuitively, the decision relies on the distance of x to the class center µ: the closer x is to µ, the more likely it belongs to the class.
To decide whether a given distance is large or small, we need to know whether the points in the class are spread over a large range or a small range. Statistically, we measure the spread by the standard deviation of the distances of the sample points to µ.
If dΣ(x, µ) is less than one standard deviation, then it is highly probable that x belongs to the class. The further away it is, the more likely it is that x does not belong to the class. This is the concept of the normalized distance.
Motivation of Mahalanobis Distance
The drawback of the normalized distance is that the sample points are assumed to be distributed about µ in a spherical manner.
For non-spherical situations, for instance ellipsoidal ones, the probability of x belonging to the class depends not only on the distance from µ but also on the direction.
In directions where the ellipsoid has a short axis, x must be closer; in directions where the axis is long, x can be further away from the center.
The ellipsoid that best represents the class's probability distribution can be estimated by the covariance matrix of the samples.
Linear Decision Boundary of LDA
Linear discriminant functions
f_k(x) = \beta_{1k}^T x + \beta_{0k}, \quad f_j(x) = \beta_{1j}^T x + \beta_{0j}.
The decision boundary function between class k and class j is
(\beta_{1k} − \beta_{1j})^T x + (\beta_{0k} − \beta_{0j}) = \beta_1^T x + \beta_0 = 0.
Since \beta_{1k} = \Sigma^{−1} \mu_k and \beta_{1j} = \Sigma^{−1} \mu_j, the decision boundary has the directional vector
\beta_1 = \Sigma^{−1} (\mu_k − \mu_j).
\beta_1 is generally not in the direction of \mu_k − \mu_j.
The discriminant direction \beta_1 minimizes the class overlap for Gaussian data.
[Figure 4.9 (Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 4). Caption: Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).]
Parameter Estimation in LDA
In practice, we estimate the parameters from the training data:
\hat{\pi}_k = n_k / n, where n_k is the sample size of class k.
\hat{\mu}_k = \sum_{Y_i = k} x_i / n_k.
The sample covariance matrix for the kth class:
S_k = \frac{1}{n_k − 1} \sum_{Y_i = k} (x_i − \hat{\mu}_k)(x_i − \hat{\mu}_k)^T.
The (unbiased) pooled sample covariance is a weighted average:
\hat{\Sigma} = \sum_{k=1}^{K} \frac{n_k − 1}{\sum_{l=1}^{K} (n_l − 1)} S_k = \sum_{k=1}^{K} \sum_{Y_i = k} (x_i − \hat{\mu}_k)(x_i − \hat{\mu}_k)^T / (n − K).
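These estimates as a minimal numpy sketch (labels y assumed coded 1, . . . , K):

    import numpy as np

    def lda_estimates(X, y, K):
        n, d = X.shape
        prior = np.array([np.mean(y == k) for k in range(1, K + 1)])  # pi_k = n_k / n
        mu = np.array([X[y == k].mean(axis=0) for k in range(1, K + 1)])
        Sigma = np.zeros((d, d))
        for k in range(1, K + 1):
            R = X[y == k] - mu[k - 1]
            Sigma += R.T @ R            # within-class scatter, summed over classes
        return prior, mu, Sigma / (n - K)   # pooled (unbiased) covariance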
Implementation of LDA
Compute the eigendecomposition
\hat{\Sigma} = U D U^T.
Sphere the data using \hat{\Sigma}, via the transformation
X^* = D^{−1/2} U^T X.
The common covariance estimate of X ∗ is the identity matrix.
In the transformed space, classify the point to the closest transformed centroid, with an adjustment using the \hat{\pi}_k's.
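A sketch of the sphering step, reusing the hypothetical lda_estimates above; np.linalg.eigh assumes the pooled covariance is symmetric positive definite:

    import numpy as np

    def sphere(V, Sigma):
        # Sigma = U D U^T; map each row v to D^{-1/2} U^T v.
        D, U = np.linalg.eigh(Sigma)
        return (V @ U) / np.sqrt(D)

    def lda_predict_sphered(X, mu, Sigma, prior):
        Xs, mus = sphere(X, Sigma), sphere(mu, Sigma)
        # Closest transformed centroid, adjusted by the class priors.
        d2 = ((Xs[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        return np.argmax(-0.5 * d2 + np.log(prior), axis=1) + 1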
Connection between LDA and Linear Regression for K = 2
Suppose we code the targets in the two classes as +1 and −1, and fit the linear model
y_i = \beta_0 + \beta_1^T x_i, \quad i = 1, \cdots, n.
Then the OLS estimates (\hat{\beta}_0, \hat{\beta}_1) satisfy
\hat{\beta}_1 \propto \hat{\Sigma}^{−1} (\hat{\mu}_1 − \hat{\mu}_2).
\hat{\beta}_0 differs from the LDA intercept unless n_1 = n_2.
Hence linear regression gives a decision rule different from LDA, unless n_1 = n_2.
Multiple Logistic Models
Model the K posterior probabilities by linear functions of x. We use logit transformations:
\log \frac{Pr(Y = 1 \mid X = x)}{Pr(Y = K \mid X = x)} = \beta_{10} + \beta_1^T x
\log \frac{Pr(Y = 2 \mid X = x)}{Pr(Y = K \mid X = x)} = \beta_{20} + \beta_2^T x
... ...
\log \frac{Pr(Y = K − 1 \mid X = x)}{Pr(Y = K \mid X = x)} = \beta_{(K−1)0} + \beta_{K−1}^T x
The choice of denominator in the odds ratios is arbitrary. The parameter vector is
\theta = \{\beta_{10}, \beta_1, \dots, \beta_{(K−1)0}, \beta_{K−1}\}.
Probability Estimates
We use the logit transformation to ensure that
the total probabilities sum to one
each estimated probability lies in [0, 1]
p_k(x) \equiv Pr(Y = k \mid x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K−1} \exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1, \dots, K − 1,
p_K(x) \equiv Pr(Y = K \mid x) = \frac{1}{1 + \sum_{l=1}^{K−1} \exp(\beta_{l0} + \beta_l^T x)}.
Check: \sum_{k=1}^{K} p_k(x) = 1 for any x.
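These probabilities as a short numpy sketch; beta is assumed to be a (K-1) x (d+1) array whose rows are [beta_k0, beta_k^T]:

    import numpy as np

    def multilogit_probs(X, beta):
        # Linear predictors for classes 1, ..., K-1; class K is the baseline.
        eta = beta[:, 0] + X @ beta[:, 1:].T               # n x (K-1)
        expeta = np.exp(eta)
        denom = 1.0 + expeta.sum(axis=1, keepdims=True)
        return np.hstack([expeta, np.ones((X.shape[0], 1))]) / denom  # rows sum to 1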
Maximum Likelihood Estimate for Logistic Models
The joint conditional log-likelihood of the yi given the xi is
l(\theta) = \sum_{i=1}^{n} \log p_{y_i}(\theta; x_i).
Using the Newton–Raphson method, we:
solve iteratively via re-weighted least squares;
obtain consistent estimates if the model is specified correctly.
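As a rough sketch, the log-likelihood and one plain gradient-ascent step (Newton–Raphson/IRLS would additionally use the Hessian); multilogit_probs is the hypothetical helper from the previous slide:

    import numpy as np

    def loglik(beta, X, y):
        # l(theta) = sum_i log p_{y_i}(theta; x_i), labels y in {1, ..., K}.
        p = multilogit_probs(X, beta)
        return float(np.sum(np.log(p[np.arange(len(y)), y - 1])))

    def gradient_step(beta, X, y, K, lr=0.1):
        # d l / d beta_k = sum_i (I(y_i = k) - p_k(x_i)) (1, x_i^T), for k < K.
        p = multilogit_probs(X, beta)
        Z = np.eye(K)[y - 1][:, :K - 1]          # indicators for classes 1..K-1
        X1 = np.hstack([np.ones((X.shape[0], 1)), X])
        return beta + lr * (Z - p[:, :K - 1]).T @ X1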