Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics
Spring 2018, Week 9
Lecturer: Beate [email protected]
Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.
Topics of today
• Bayes rule for decision boundaries: optimal if data distribution is known ;-)
• Linear Discriminant Analysis (LDA): estimate Bayes decision boundaries
– LDA in uni-variate (p = 1) and multi-variate (p≥2) cases
– LDA versus logistic regression
– extending to quadratic discriminant analysis (QDA)
• Performance evaluation
– Cross validation schemes
– ROC curve
Bayes rule for balanced 2-group classification with a class-conditionally Normal distributed univariate feature with equal variances
Assumptions: a) balanced 2-class classification problem (both class prevalences/priors are 0.5). b) 1 predictor x with known class-conditional distributions $X_A \sim N(\mu_A, \sigma^2)$ and $X_B \sim N(\mu_B, \sigma^2)$, same variance!
At which x-value is the optimal split?
Bayes rule: the decision boundary is at $s = (\mu_A + \mu_B)/2$ if we classify to the class with the larger posterior probability $P(\text{class} \mid X = x)$.
[Figure: decision boundary]
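Why the midpoint? A short derivation sketch, assuming equal priors and equal variances, and writing $\mu_A$, $\mu_B$ for the class means and $\sigma$ for the common standard deviation (notation assumed here). With equal priors the posteriors are equal exactly where the two class-conditional densities are equal:

```latex
\frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(s-\mu_A)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(s-\mu_B)^2}{2\sigma^2}\right)
\;\Longrightarrow\; (s-\mu_A)^2 = (s-\mu_B)^2
\;\Longrightarrow\; s = \frac{\mu_A + \mu_B}{2} \quad (\mu_A \neq \mu_B)
```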
Why is it called Bayes rule?
$$P(\text{class } C \mid X = x) = \frac{P(X = x \mid \text{class } C)\, P(\text{class } C)}{P(X = x)}, \quad C \in \{A, B\}$$
Each class has a prior probability, P(A) and P(B) ⟹ without any feature information we classify to the class with the higher prior.
After observing the feature value x we determine the posterior probabilities P(A|x) and P(B|x) ⟹ we then classify to the class with the higher posterior.
We use the Bayes theorem to determine the posterior:
Decide for class A if $P(\text{class A} \mid X = x) > P(\text{class B} \mid X = x)$, i.e. if
$$\frac{P(X = x \mid \text{class A})\, P(\text{class A})}{P(X = x)} > \frac{P(X = x \mid \text{class B})\, P(\text{class B})}{P(X = x)}$$
with $P(X = x) = P(X = x \mid \text{class A})\, P(\text{class A}) + P(X = x \mid \text{class B})\, P(\text{class B})$
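The Bayes theorem step can be sketched in R for two Normal class-conditional densities. The function name `posterior_A` and all parameter values are illustrative assumptions, not part of the lecture material.

```r
# Posterior P(class A | X = x) for two Normal class-conditional densities.
# All parameter values used below are illustrative assumptions.
posterior_A <- function(x, mu_A, mu_B, sigma, prior_A = 0.5) {
  num_A <- dnorm(x, mean = mu_A, sd = sigma) * prior_A        # P(x | A) P(A)
  num_B <- dnorm(x, mean = mu_B, sd = sigma) * (1 - prior_A)  # P(x | B) P(B)
  num_A / (num_A + num_B)                                     # divide by P(x)
}

# At the midpoint of the two means the posterior is 0.5 for balanced classes
posterior_A(0, mu_A = -1.25, mu_B = 1.25, sigma = 1)
# Left of the midpoint class A is the more probable class
posterior_A(-1, mu_A = -1.25, mu_B = 1.25, sigma = 1) > 0.5
```

Changing `prior_A` moves the point where the posterior crosses 0.5, which is how unbalanced classes shift the decision boundary.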
Decision boundary from Bayes rule in the case of class-conditional multi-variate equal-variance Gaussian features
[Figure: class 1 and class 2 separated by the decision boundary, panels for p=1, p=2 and p=3]
p=1 -> decision boundary is a point
p=2 -> decision boundary is a line
p=3 -> decision boundary is a 2D plane
p=L -> decision boundary is an (L-1)-dimensional hyperplane
Accuracy of the Bayes Classifier
                      PREDICTED CLASS
                      Class=1    Class=2
ACTUAL     Class=1
CLASS      Class=2
It holds that:
> pnorm(0, -1.25, 1)
[1] 0.8943502
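A minimal sketch of how the accuracy of the Bayes classifier follows from the Normal distribution function. The setting is an assumption: class 1 ~ N(-1.25, 1) as in the `pnorm` call above, class 2 ~ N(+1.25, 1) placed symmetrically, and equal priors.

```r
# Assumed setting: class 1 ~ N(-1.25, 1), class 2 ~ N(+1.25, 1), equal priors.
# The +1.25 mean of class 2 is an assumption (symmetric around the boundary at 0).
mu1 <- -1.25; mu2 <- 1.25; sigma <- 1
s <- (mu1 + mu2) / 2                          # Bayes decision boundary (midpoint)

acc1 <- pnorm(s, mean = mu1, sd = sigma)      # class 1 correctly left of s
acc2 <- 1 - pnorm(s, mean = mu2, sd = sigma)  # class 2 correctly right of s
bayes_accuracy <- 0.5 * acc1 + 0.5 * acc2     # weight by the priors
bayes_error <- 1 - bayes_accuracy
round(c(accuracy = bayes_accuracy, error = bayes_error), 4)
# accuracy 0.8944, error 0.1056
```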
Linear Discriminant Analysis
Book: Introduction to Statistical Learning
Videos (from ISLR):
• Linear Discriminant Analysis and Bayes Theorem (7:12)
• Univariate Linear Discriminant Analysis (7:37)
• Multivariate Linear Discriminant Analysis and ROC Curves (17:42)
Principle idea of univariate LDA
[Figure panels: sample with 20 observations per class; assumed underlying distribution]
Dashed vertical line: theoretical Bayes’ decision boundary (known since it is a simulation study).
Solid vertical line: the LDA decision boundary.
1) Estimate the priors of the classes by the class frequencies in the training data set.
2) Estimate the parameters of the assumed underlying Normal distributions (class means $\mu_1$, $\mu_2$ and common standard deviation $\sigma$).
3) Determine the decision boundary according to the Bayes rule.
Default LDA assumes that the predictor follows a Normal distribution in each class, where all classes have the same variance and differ only in their mean values.
Use Training Data set for Estimation
• The mean μk could be estimated by the average of all training observations from the kth class.
• The variance σ² could be estimated as the weighted average of the variances of all K classes.
• And, πk is estimated as the proportion of the training observations that belong to the kth class.
$$\hat\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$$
$$\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2$$
$$\hat\pi_k = n_k / n$$
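The three estimators can be sketched in a few lines of R. The simulated data (20 observations per class, means -1.25 and 1.25, standard deviation 1) are an assumption chosen for illustration.

```r
set.seed(1)
# Simulated training data (assumption): 20 observations per class,
# class 1 ~ N(-1.25, 1), class 2 ~ N(1.25, 1)
x <- c(rnorm(20, mean = -1.25, sd = 1), rnorm(20, mean = 1.25, sd = 1))
y <- rep(1:2, each = 20)

n <- length(x); K <- length(unique(y))
n_k    <- as.numeric(tapply(x, y, length))   # class sizes
pi_hat <- n_k / n                            # estimated priors
mu_hat <- as.numeric(tapply(x, y, mean))     # estimated class means
# pooled variance: within-class sums of squares divided by n - K
sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)

# LDA boundary for K = 2: midpoint of the means plus a shift from the priors
# (the shift is zero here because both estimated priors are 0.5)
boundary <- mean(mu_hat) +
  sigma2_hat * log(pi_hat[2] / pi_hat[1]) / (mu_hat[1] - mu_hat[2])
boundary
```

With unequal class frequencies the log-prior term moves the boundary towards the mean of the rarer class, enlarging the region of the more frequent class.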
Example
• The dashed vertical line is the Bayes’ decision boundary
• The solid vertical line is the LDA decision boundary
– Bayes’ error rate: 10.6%
– LDA error rate: 11.1%
[Figure panels: theoretical curve (Bayes classifier); 20 observations randomly drawn from each distribution]
LDA reaches the Bayes classifier for n → ∞
Multi-variate Normal distribution (recap)
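Density:
$$X = (X_1, X_2, \dots, X_p)^T \sim N_p(\mu, \Sigma)$$
$$f(x) = \frac{1}{\sqrt{(2\pi)^p\, |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
Examples for p=2:
• x1 and x2 are not correlated: $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
• Correlation between x1 and x2: $\Sigma = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$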
Mahalanobis distance is the multivariate z-score (recap)
$$MD(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of standard deviations.
[Figure: examples with MD=1 and MD=2]
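A small sketch of the Mahalanobis distance in R, assuming a zero mean vector and the correlated covariance matrix with 0.7 off the diagonal. Note that `mahalanobis()` from the stats package returns the squared distance; the helper `md` is a name introduced here for illustration.

```r
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)   # correlated example, rho = 0.7

# stats::mahalanobis() returns the SQUARED distance, so take the square root
md <- function(x) sqrt(mahalanobis(x, center = mu, cov = Sigma))

md(c(1, 1))     # along the correlation direction: about 1.08
md(c(1, -1))    # against the correlation direction: about 2.58
```

Both points have the same Euclidean distance to the mean, but the point lying against the direction of the correlation is much more unusual under this distribution.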
LDA for p>1
• We still require that the multivariate feature distribution in each class k is a multivariate Normal distribution with mean vector $\mu_k$ and variance-covariance matrix $\Sigma$.
• In LDA, Σ must be the same for each class (but it may contain off-diagonal elements), which leads to hyperplanes as decision boundaries between classes.
• For a new observation we measure the Mahalanobis distance to all class centres and classify it to the nearest class centre (with unequal priors, the log prior of each class additionally shifts the boundaries).
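The nearest-centre rule can be sketched by hand on the iris data. This is an illustration under the assumption of equal priors (the iris classes are balanced), not the internal implementation of `lda()`; the variable names are chosen here.

```r
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
n <- nrow(X); K <- nlevels(y)

# class centres: one row per class, one column per feature
centres <- do.call(rbind, lapply(split(iris[, 1:4], y), colMeans))
# pooled within-class covariance matrix (shared by all classes)
resid <- X - centres[as.character(y), ]
Sigma <- crossprod(resid) / (n - K)

# squared Mahalanobis distance of every observation to every class centre
d2 <- sapply(levels(y), function(k) mahalanobis(X, centres[k, ], Sigma))
# classify to the nearest class centre
pred <- factor(levels(y)[apply(d2, 1, which.min)], levels = levels(y))
mean(pred == y)   # training accuracy
```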
Example: p = 2, K = 3
Example LDA: Predicting Species
[Images: Iris Setosa, Iris Versicolor, Iris Virginica]
Iris Data Set: Fisher 1936. In R: iris
Example Iris
[Figure: iris data with decision boundary]
LDA as classifier and method for dimension reduction
Visualization of the distances to the centres
Possibility for dimension reduction:
2 classes can be shown in 1D (since 2 points can be placed on a line)
3 classes can be shown in 2D (since 3 points can be placed on a plane)
…
K classes can be shown in (K-1)D
Important for classification is only the (possibly not isotropic) Mahalanobis distance to the class centres μk.
Example Iris Data
Projection with LDA
Projection with PCA
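The two projections can be reproduced with a short sketch, assuming the MASS package is installed. LDA uses the class labels to find the projection, PCA does not.

```r
library(MASS)
data(iris)

# LDA projection: supervised, uses the class labels
fit_lda  <- lda(Species ~ ., data = iris)
proj_lda <- predict(fit_lda, iris)$x          # columns LD1, LD2

# PCA projection: unsupervised, ignores the class labels
proj_pca <- prcomp(iris[, 1:4])$x[, 1:2]      # columns PC1, PC2

op <- par(mfrow = c(1, 2))
plot(proj_lda, col = iris$Species, main = "Projection with LDA")
plot(proj_pca, col = iris$Species, main = "Projection with PCA")
par(op)
```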
When to prefer LDA over logistic regression?
• If the classes are well separated, the parameter estimates of logistic regression are very unstable; LDA does not have this problem
• LDA is more stable than logistic regression if n is small
• LDA is also appropriate for more than two response classes
• LDA provides low-dimensional views of the data
LDA requires that the distribution of the predictors X is approximately normal in each of the classes.
Extension to Quadratic Discriminant Analysis (QDA)
image credits: http://www.pnas.org/content/94/2/565
In QDA each class is allowed its own variance-covariance matrix for its class-specific multivariate Gaussian (there are many more parameters to estimate!) -> the class is still essentially determined by the smallest Mahalanobis distance to one of the class centres, now measured with the class-specific covariance matrix -> class boundaries can now have a quadratic shape
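For reference, the standard QDA discriminant function (in the notation of ISLR; $\Sigma_k$, $\mu_k$ and $\pi_k$ denote the covariance matrix, mean vector and prior of class k) shows where the quadratic shape and the extra terms come from. An observation is assigned to the class with the largest $\delta_k(x)$:

```latex
\delta_k(x) = -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)
              - \tfrac{1}{2} \log |\Sigma_k| + \log \pi_k
```

The first term is the squared Mahalanobis distance to the centre of class k; LDA estimates one covariance matrix with p(p+1)/2 parameters, QDA estimates K of them.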
https://www.projectrhea.org/rhea/index.php/Bayes_Rule_for_1-dimensional_and_N-dimensional_feature_spaces
The error of a classifier based on the Bayes rule
Without proof: Bayes rule classifier has the smallest possible error.
Challenge: We often do not know the class priors and class-conditional distributions of the observed feature(s)
LDA and QDA in R
library(MASS)
data(iris)

######### LDA #######################
fit = lda(Species ~ ., data = iris)
res = predict(fit, iris)
# train accuracy of lda
sum(res$class == iris$Species)/nrow(iris) # 0.98

# 3 classes -> 2D Plot in LD2 vs LD1
plot(res$x, col=iris$Species)

# do leave-one-out cv
fit.cv = lda(formula = Species ~ .,
             data = iris, prior = c(1,1,1)/3, CV = TRUE) # leave-one-out cv

head(fit.cv$posterior, 3)
#   setosa   versicolor    virginica
# 1      1 5.087494e-22 4.385241e-42
# 2      1 9.588256e-18 8.888069e-37
# 3      1 1.983745e-19 8.606982e-39

######## QDA #########################
fit2 = qda(Species ~ ., data = iris)
res2 = predict(fit2, iris)
# train accuracy of qda
sum(res2$class == iris$Species)/nrow(iris) # 0.98
[Figure: iris data in the LDA projection, LD2 versus LD1]
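A short sketch of how the leave-one-out output of `lda(..., CV = TRUE)` can be turned into an accuracy estimate. It assumes MASS is installed; `loocv_acc` is a name chosen here.

```r
library(MASS)
data(iris)

# CV = TRUE makes lda() return leave-one-out predictions instead of a fit object
fit.cv <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3, CV = TRUE)

# confusion matrix of leave-one-out predictions versus true species
table(predicted = fit.cv$class, actual = iris$Species)
# leave-one-out estimate of the accuracy
loocv_acc <- mean(fit.cv$class == iris$Species)
loocv_acc
```

Unlike the training accuracy, every prediction here comes from a model that has not seen the predicted observation.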
Cross-validation
Questions for X-Validation
• Model Selection (which model to take)
– Often there is a knob controlling the complexity of a classifier
  • Number of NN in KNN?
– Number of features to use?
– Square / log the features and add them?
– Shall I use Logistic Regression or LDA?
• Model Evaluation
– How good is the performance on new unseen data?
The Validation Set Approach
[Figure: examples (rows of the data matrix)]
Leave-One-Out Cross Validation (LOOCV)
Fit without the red sample and predict the red sample. Average over all n repeats.
K-fold Cross Validation
Fit without the red samples and predict the red samples. Average over all k repeats. Do a weighted average if the folds do not have the same size.
Question: What happens if k=n, what if k=2?
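K-fold cross-validation can be written out by hand in a few lines. This is a minimal sketch with LDA on the iris data, assuming MASS is installed; k = 10 and the seed are illustrative choices.

```r
library(MASS)
data(iris)
set.seed(42)

k <- 10
n <- nrow(iris)
# random assignment of every observation to one of k folds
fold <- sample(rep(1:k, length.out = n))

correct <- logical(n)
for (j in 1:k) {
  test_idx <- which(fold == j)
  fit  <- lda(Species ~ ., data = iris[-test_idx, ])   # fit without fold j
  pred <- predict(fit, iris[test_idx, ])$class         # predict fold j
  correct[test_idx] <- pred == iris$Species[test_idx]
}
# pooling over observations equals a fold-size-weighted average of fold accuracies
cv_acc <- mean(correct)
cv_acc
```

Setting `k <- n` turns this into leave-one-out cross-validation, where the random fold assignment no longer affects the result.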
What is the best cross-validation scheme?
• Example (3 Datasets)
• Accuracy gets better (or stays constant) the larger the training set
• Kohavi, R. (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in 'IJCAI', pp. 1137-1145.
[Figure: accuracy versus # folds and # left out; large training set (low bias), small training set (sometimes larger bias); bar position: estimated true accuracy, bar width: 95% of true accuracy]
What is the best cross-validation scheme?
• LOOCV
– Not random
– Sometimes slow (there are built-in procedures for some classifiers)
– Higher variance (between completely new data sets)
– Lowest possible bias (nearly the whole data set is seen)
• K-fold cross-validation
– Random result
– Usually faster
– Lower variance
– Higher bias (usually not problematic)
Standard today: 10-fold cross-validation (sometimes repeated after permutation)
Moving the cutoff
Operating on different levels of trust (motivation)
• LDA makes 252 + 23 mistakes on 10000 predictions (2.75% misclassification error rate)
• Great?
• But LDA mispredicts 252/333 = 75.7% of defaulters!
• LDA gives the posterior probability of belonging to each class.
• Perhaps we shouldn’t use 0.5 as the threshold for predicting default?
Operating on different levels of trust (motivation)
• Now the total number of mistakes is 235 + 138 = 373 (3.73% misclassification error rate)
• But we only mispredicted 138/333 = 41.4% of defaulters
• We can examine the error rate with other thresholds
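Moving the cutoff can be sketched on the Default data from the ISLR package. The model formula (`balance` and `student` as predictors) and the lower cutoff of 0.2 are assumptions made for this illustration; the slides do not state them. `confusion` is a helper name chosen here.

```r
library(MASS)
library(ISLR)   # provides the Default data set

# Assumption: predictors balance and student; the slides do not state the model
fit  <- lda(default ~ balance + student, data = Default)
post <- predict(fit, Default)$posterior[, "Yes"]   # P(default = Yes | x)

# confusion matrix for a given posterior threshold
confusion <- function(threshold) {
  pred <- factor(ifelse(post > threshold, "Yes", "No"), levels = c("No", "Yes"))
  table(predicted = pred, actual = Default$default)
}

confusion(0.5)   # default cutoff
confusion(0.2)   # lower cutoff (illustrative): fewer missed defaulters, more false alarms
```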
Different levels in one plot
• Black solid: overall error rate
• Blue dashed: fraction of defaulters missed
• Orange dotted: non-defaulters incorrectly classified
[Figure annotation: normal operating point of LDA]
Decision errors in a 2 class problem
Confusion matrix:
                                true class
                          D+                D-
predicted class   T+   True Positive     False Positive
                  T-   False Negative    True Negative

The sensitivity (true positive rate) of a test or classifier is the ability of the test to identify correctly the class 1 or diseased individuals:
sens = TP / P = TP / (TP + FN)
The specificity (true negative rate or 1 - false positive rate) of a test is the ability of the classifier to identify correctly the 0-classes or healthy individuals:
spec = TN / N = TN / (FP + TN)
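As a worked example, sensitivity and specificity can be computed from the counts quoted earlier for the 0.5 threshold (333 defaulters, 252 of them missed, 23 false alarms, 10000 predictions). Treating "default" as the positive class is the assumption made here.

```r
# Counts derived from the numbers quoted for the 0.5 threshold:
# 333 defaulters, 252 of them missed, 23 false alarms, 10000 predictions
FN <- 252; TP <- 333 - FN; FP <- 23; TN <- 10000 - 333 - FP

sens <- TP / (TP + FN)     # true positive rate
spec <- TN / (FP + TN)     # true negative rate
round(c(sensitivity = sens, specificity = spec), 3)
# sensitivity 0.243, specificity 0.998
```

The very high specificity and low sensitivity are two views of the same problem: at this cutoff almost nobody is flagged as a defaulter.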
ROC curves
BMJ 1994; 309:188
Determine the sensitivity and specificity for the 2 indicated cut-offs.
For each cutoff we get a classification rule and a corresponding confusion matrix, and can determine sensitivity and specificity.
We can use a continuous score such as a probability to construct a ROC curve.
Use the ROC curve as performance measure: the area under the curve (AUC) quantifies the discrimination power.
The larger the AUC, the better the performance of the diagnostic test. A useless test has an AUC = 0.5. A perfect test has an AUC = 1.
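How a ROC curve arises from sweeping the cutoff can be sketched in base R. The scores are simulated (an assumption for illustration): positives are drawn from N(1, 1), negatives from N(0, 1).

```r
set.seed(1)
# Simulated scores (assumption): positives tend to score higher than negatives
truth <- rep(c(1, 0), each = 100)
score <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))

# one (true positive rate, false positive rate) pair per cutoff
cutoffs <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(cutoffs, function(cut) mean(score[truth == 1] >= cut))
fpr <- sapply(cutoffs, function(cut) mean(score[truth == 0] >= cut))

plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)     # diagonal of a useless test (AUC = 0.5)

# AUC as the probability that a random positive scores above a random negative
auc <- mean(outer(score[truth == 1], score[truth == 0], ">"))
auc
```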
ROC-Curve in R (using ROCR)
############################# ROC Curves in R (insample)
library(MASS)   # for lda()
library(ROCR)
library(ISLR)
fit = lda(default ~ ., data=Default)
preds = predict(fit, Default)$posterior
head(preds)
# second column of the posterior is P(default = Yes | x)
pred <- prediction(preds[,2], Default$default)
plot(performance(pred,"tpr","fpr"))
performance(pred, "auc") # the AUC value is stored in the slot y.values