
Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics
Spring 2018, Week 9

Lecturer: Beate [email protected]

Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

Topics of today

• Bayes rule for decision boundaries: optimal if data distribution is known ;-)

• Linear Discriminant Analysis (LDA): estimate Bayes decision boundaries

– LDA in uni-variate (p = 1) and multi-variate (p≥2) cases

– LDA versus logistic regression

– extension to quadratic discriminant analysis (QDA)

• Performance evaluation

– Cross validation schemes

– ROC curve


Bayes rule for balanced two-group classification with a class-conditional univariate Normally distributed feature with equal variances

Assumptions: a) balanced 2-class classification problem (both class prevalences/priors are 0.5); b) 1 predictor x with known class-conditional distributions $X_A \sim N(\mu_A, \sigma^2)$ and $X_B \sim N(\mu_B, \sigma^2)$, same variance!

At which x-value is the optimal split value?

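Bayes rule: classify to the class with the larger posterior probability $P(\text{class} \mid X = x)$. With equal priors and equal variances, the decision boundary is the midpoint $s = (\mu_A + \mu_B)/2$.

[Figure: the two class-conditional densities with the decision boundary marked]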

Why is it called Bayes rule?

Each class has a prior probability P(A) and P(B) ⟹ we classify to class with higher prior.

Observing feature value x, we determine the posterior probabilities P(A|x) and P(B|x) ⟹ we then classify to class with higher posterior.

We use the Bayes theorem to determine the posterior:

$$P(\text{class } C \mid X = x) = \frac{P(X = x \mid \text{class } C)\, P(\text{class } C)}{P(X = x)}, \quad C \in \{A, B\}$$

Decide for class A if:

$$P(\text{class } A \mid X = x) > P(\text{class } B \mid X = x)$$

$$\iff \frac{P(X = x \mid \text{class } A)\, P(\text{class } A)}{P(X = x)} > \frac{P(X = x \mid \text{class } B)\, P(\text{class } B)}{P(X = x)}$$

$$\iff P(X = x \mid \text{class } A)\, P(\text{class } A) > P(X = x \mid \text{class } B)\, P(\text{class } B)$$

With $P(X = x) = P(X = x \mid \text{class } A)\, P(\text{class } A) + P(X = x \mid \text{class } B)\, P(\text{class } B)$.
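The posterior computation can be sketched in a few lines of R. This is only an illustration: the means of -1.25 and 1.25 and the standard deviation of 1 are assumed values, and `posterior_A` and `boundary` are hypothetical names introduced here.

```r
# Assumed illustrative parameters (two Normal classes with equal variance)
mu_A <- -1.25; mu_B <- 1.25; sigma <- 1
prior_A <- 0.5; prior_B <- 0.5

# Posterior P(class A | X = x) via Bayes' theorem;
# the denominator is P(X = x), the sum over both classes
posterior_A <- function(x) {
  num_A <- dnorm(x, mu_A, sigma) * prior_A
  num_B <- dnorm(x, mu_B, sigma) * prior_B
  num_A / (num_A + num_B)
}

posterior_A(c(-2, 0, 2))

# Decision boundary: the x at which both posteriors are equal (0.5 each)
boundary <- uniroot(function(x) posterior_A(x) - 0.5, interval = c(-5, 5))$root
boundary
```

Changing `prior_A` and `prior_B` (keeping their sum at 1) moves the boundary away from the midpoint, towards the mean of the less likely class.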

Decision boundary from Bayes rule in the case of class-conditional multi-variate equal-variance Gaussian features

[Figure: panels for p=1, p=2 and p=3 showing class 1, class 2 and the decision plane]

p=1 -> decision boundary is a point

p=2 -> decision boundary is a line

p=3 -> decision boundary is a 2D plane

p=L -> decision boundary is an (L-1)-dimensional hyperplane

Accuracy of the Bayes Classifier

Confusion matrix layout: rows ACTUAL CLASS (Class=1, Class=2), columns PREDICTED CLASS (Class=1, Class=2).

It holds:

> pnorm(0, -1.25, 1)
[1] 0.8943502
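The `pnorm()` call can be read as the accuracy of the Bayes classifier in a balanced two-class setting. The sketch below assumes class means of -1.25 and 1.25 with standard deviation 1 (inferred from the call, not stated explicitly on the slide) and adds a simulation check; all variable names are illustrative.

```r
mu_1 <- -1.25; mu_2 <- 1.25; sigma <- 1   # assumed from the pnorm() call
boundary <- (mu_1 + mu_2) / 2             # midpoint, since priors and variances are equal

# Class 1 is classified correctly if x < boundary, class 2 if x > boundary
acc_1 <- pnorm(boundary, mu_1, sigma)
acc_2 <- 1 - pnorm(boundary, mu_2, sigma)
bayes_accuracy <- 0.5 * acc_1 + 0.5 * acc_2
bayes_accuracy        # same value as pnorm(0, -1.25, 1)
1 - bayes_accuracy    # Bayes error rate

# Monte Carlo check on simulated data
set.seed(1)
n <- 1e5
cls <- sample(1:2, n, replace = TRUE)
x <- rnorm(n, mean = ifelse(cls == 1, mu_1, mu_2), sd = sigma)
pred <- ifelse(x < boundary, 1, 2)
sim_accuracy <- mean(pred == cls)
sim_accuracy
```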


Linear Discriminant Analysis

Book: Introduction to Statistical Learning

Videos (from ISLR):
• Linear Discriminant Analysis and Bayes Theorem (7:12)
• Univariate Linear Discriminant Analysis (7:37)
• Multivariate Linear Discriminant Analysis and ROC Curves (17:42)

Basic idea of univariate LDA

[Figure panels: "Assumed underlying distribution" and "Sample with 20 observations per class"]

Dashed vertical line: theoretical Bayes’ decision boundary (known since it is a simulation study).
Solid vertical line: the LDA decision boundary.

1) Estimate priors of classes by class frequencies in train data set

2) Estimate parameters for assumed underlying Normal distributions ($\mu_1$, $\mu_2$, $\sigma$).

3) Determine decision boundary according to Bayes rule.

Default LDA assumes that the predictor follows a Normal distribution in each class; these distributions all have the same variance and differ only in their mean values.

Use Training Data set for Estimation

• The mean μk is estimated by the average of all training observations from the kth class.

• The variance σ² is estimated as the weighted average of the sample variances of all K classes.

• And, πk is estimated as the proportion of the training observations that belong to the kth class.

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$$

$$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)^2$$

$$\hat{\pi}_k = n_k / n$$
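The three estimators can be applied by hand to a small simulated sample. This is a minimal sketch under assumed simulation settings (20 observations per class, means -1.25 and 1.25, standard deviation 1); `delta` and `lda_boundary` are hypothetical helper names, and the discriminant score used is the standard linear score for a single predictor.

```r
set.seed(42)
n_per_class <- 20
x <- c(rnorm(n_per_class, -1.25, 1), rnorm(n_per_class, 1.25, 1))
y <- rep(1:2, each = n_per_class)
n <- length(x); K <- 2

pi_hat     <- as.numeric(table(y)) / n               # class proportions
mu_hat     <- as.numeric(tapply(x, y, mean))         # class means
sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)       # pooled variance

# Linear discriminant score of class k at point x0
delta <- function(x0, k) {
  x0 * mu_hat[k] / sigma2_hat - mu_hat[k]^2 / (2 * sigma2_hat) + log(pi_hat[k])
}

# Boundary where delta(x, 1) == delta(x, 2)
lda_boundary <- (mu_hat[1] + mu_hat[2]) / 2 +
  sigma2_hat * log(pi_hat[1] / pi_hat[2]) / (mu_hat[2] - mu_hat[1])
lda_boundary

# Training error of the resulting rule
pred <- ifelse(x < lda_boundary, 1, 2)
mean(pred != y)
```

With balanced classes the log-prior term vanishes and the estimated boundary is simply the midpoint of the two estimated means.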

Example

• The dashed vertical line is the Bayes’ decision boundary
• The solid vertical line is the LDA decision boundary
 – Bayes’ error rate: 10.6%
 – LDA error rate: 11.1%

[Figure panels: "Theoretical Curve (Bayes Classifier)" and "20 randomly drawn from each distribution"]

LDA reaches the Bayes classifier for n → ∞

Multi-variate Normal distribution (recap)

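$$X = (X_1, X_2, \dots, X_p)^t \sim N_p(\mu, \Sigma)$$

Density:

$$f(x) = \frac{1}{(2\pi)^{p/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^t\, \Sigma^{-1} (x - \mu)\right)$$

Examples for p=2:

x1 and x2 are not correlated: $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Correlation between x1 and x2: $\Sigma = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$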

Mahalanobis distance is the multivariate z-score (recap)

[Figure: contour lines of bivariate Normal distributions at MD=1 and MD=2]

$$MD(x) = \sqrt{(x - \mu)^t\, \Sigma^{-1} (x - \mu)}$$

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of standard deviations.
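A short sketch of the Mahalanobis distance in R, using the correlated covariance matrix from the recap (correlation 0.7). The point `x` and the zero mean are assumed for illustration. Note that base R's `mahalanobis()` returns the squared distance.

```r
mu    <- c(0, 0)                                 # assumed mean vector
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)     # covariance with correlation 0.7
x     <- c(1, 1)                                 # assumed example point

# Direct implementation of the formula
md_manual <- sqrt(drop(t(x - mu) %*% solve(Sigma) %*% (x - mu)))

# stats::mahalanobis() returns the SQUARED distance, hence the sqrt()
md_builtin <- sqrt(mahalanobis(x, center = mu, cov = Sigma))

c(md_manual, md_builtin)

# The multivariate Normal density depends on x only through MD(x)
p <- length(mu)
dens <- exp(-0.5 * md_manual^2) / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
dens
```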

LDA for p>1

• We still require that the multivariate feature distribution in each class is a multivariate Normal distribution with mean vector $\mu_k$ and variance-covariance matrix $\Sigma$.

• In LDA Σ must be the same for each class (but it may contain off-diagonal elements) leading to hyper-planes as decision boundaries between classes

• For a new observation we measure the Mahalanobis distance to all class centres and classify it to the nearest class centre (this holds for equal priors; with unequal priors, $\log \pi_k$ additionally shifts the boundaries).
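The nearest-centre view can be checked against `lda()` from the MASS package on the iris data. This is a sketch, not the internal algorithm of `lda()`: it assumes equal priors, and `S_pooled`, `d2` and `pred_md` are names introduced here.

```r
library(MASS)

X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)
n <- nrow(X); K <- length(classes)

# Class centres and pooled within-class covariance matrix
centres <- lapply(classes, function(k) colMeans(X[y == k, , drop = FALSE]))
S_pooled <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (n - K)

# Squared Mahalanobis distance of every observation to every class centre
d2 <- sapply(centres, function(m) mahalanobis(X, center = m, cov = S_pooled))
pred_md <- factor(classes[apply(d2, 1, which.min)], levels = classes)

# Compare with lda() using equal priors
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
agreement <- mean(pred_md == predict(fit, iris)$class)
agreement
```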

Example: p = 2, K = 3

Example LDA: Predicting Species

[Figure: photos of Iris Setosa, Iris Versicolor and Iris Virginica, with an excerpt of the data table]

Iris Data Set: Fisher 1936. In R: iris

Example Iris


decision boundary

LDA as classifier and method for dimension reduction

Visualization of the distances to the centres: possibility for dimension reduction

2 classes can be shown in 1D (since 2 points can be placed on a line)

3 classes can be shown in 2D (since 3 points can be placed on a plane)

…

K classes can be shown in (K-1)D

Only the (possibly non-isotropic) Mahalanobis distance to the class centres μk matters for classification.

Example Iris Data

Projection with LDA

Projection with PCA
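The two projections can be reproduced roughly as follows. This is a sketch: whether the original figure used scaled variables for the PCA is not known, so `scale. = TRUE` is an assumption.

```r
library(MASS)

# Supervised projection: linear discriminants (at most K - 1 = 2 for 3 classes)
fit_lda  <- lda(Species ~ ., data = iris)
proj_lda <- predict(fit_lda, iris)$x          # columns LD1, LD2

# Unsupervised projection: first two principal components
fit_pca  <- prcomp(iris[, 1:4], scale. = TRUE)
proj_pca <- fit_pca$x[, 1:2]                  # columns PC1, PC2

op <- par(mfrow = c(1, 2))
plot(proj_lda, col = iris$Species, main = "Projection with LDA")
plot(proj_pca, col = iris$Species, main = "Projection with PCA")
par(op)
```

LDA uses the class labels to find directions that separate the class centres, whereas PCA only looks for directions of maximal variance.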


When to prefer LDA over logistic regression?

• LDA works better than logistic regression if the classes are well separated, because the logistic regression estimates are then very unstable

• LDA is more stable than logistic regression if n is small

• LDA is also appropriate for more than two response classes

• LDA provides low-dimensional views of the data

LDA requires that the distribution of the predictors X is approximately normal in each of the classes.
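The instability of logistic regression for well-separated classes can be illustrated on two iris species that are perfectly separated by petal length. The choice of species and predictor is an assumption made for this sketch.

```r
library(MASS)

# Two perfectly separated classes: setosa vs versicolor on Petal.Length
d <- droplevels(subset(iris, Species != "virginica"))

# Logistic regression: the maximum likelihood estimate does not exist here,
# so glm() warns and returns very large coefficients and standard errors
fit_glm <- suppressWarnings(glm(Species ~ Petal.Length, data = d, family = binomial))
summary(fit_glm)$coefficients

# LDA: estimates are class means and a pooled variance, so they stay well-behaved
fit_lda <- lda(Species ~ Petal.Length, data = d)
fit_lda$means
lda_acc <- mean(predict(fit_lda, d)$class == d$Species)
lda_acc
```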

Extension to Quadratic Discriminant Analysis (QDA)

image credits: http://www.pnas.org/content/94/2/565

In QDA we allow each class to have its own covariance matrix for its class-specific multivariate Gaussian (we have many more parameters to estimate!) -> the class is still essentially determined by the smallest Mahalanobis distance to one of the class centres, now measured with the class-specific covariance matrix -> class boundaries can now have quadratic shape
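For reference, the standard discriminant scores behind both methods can be written side by side. The notation $\delta_k$ is introduced here and follows common textbook usage; an observation is assigned to the class with the largest score.

```latex
% LDA: common covariance matrix \Sigma, score is linear in x after expansion
\delta_k^{\mathrm{LDA}}(x) = -\tfrac{1}{2} (x - \mu_k)^t\, \Sigma^{-1} (x - \mu_k) + \log \pi_k

% QDA: class-specific covariance matrix \Sigma_k, score is quadratic in x
\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2} \log |\Sigma_k|
  - \tfrac{1}{2} (x - \mu_k)^t\, \Sigma_k^{-1} (x - \mu_k) + \log \pi_k

% Number of mean and covariance parameters for K classes and p predictors
\text{LDA: } Kp + \tfrac{p(p+1)}{2}, \qquad \text{QDA: } Kp + K\,\tfrac{p(p+1)}{2}
```

In LDA the quadratic term $x^t \Sigma^{-1} x$ is the same for all classes and cancels when two scores are compared, which is why the boundaries are hyperplanes. In QDA it does not cancel.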

https://www.projectrhea.org/rhea/index.php/Bayes_Rule_for_1-dimensional_and_N-dimensional_feature_spaces

The error of a classifier based on the Bayes rule

Without proof: Bayes rule classifier has the smallest possible error.

Challenge: We often do not know the class priors and class-conditional distributions of the observed feature(s)

LDA and QDA in R

library(MASS)
data(iris)

######### LDA #######################
fit = lda(Species ~ ., data = iris)
res = predict(fit, iris)
# train accuracy of lda
sum(res$class == iris$Species)/nrow(iris) # 0.98

# 3 classes -> 2D plot of LD2 vs LD1
plot(res$x, col=iris$Species)

# do leave-one-out cv
fit.cv = lda(formula = Species ~ ., data = iris, prior = c(1,1,1)/3, CV = TRUE) # leave-one-out cv

head(fit.cv$posterior, 3)
#   setosa   versicolor    virginica
# 1      1 5.087494e-22 4.385241e-42
# 2      1 9.588256e-18 8.888069e-37
# 3      1 1.983745e-19 8.606982e-39

######## QDA #########################
fit2 = qda(Species ~ ., data = iris)
res2 = predict(fit2, iris)
# train accuracy of qda
sum(res2$class == iris$Species)/nrow(iris) # 0.98

[Figure: iris observations plotted in LD1 (x-axis) versus LD2 (y-axis), coloured by species]

Cross-validation

Questions for Cross-Validation

• Model Selection (which model to take)
 – Often there is a knob controlling the complexity of a classifier
   • Number of NN in KNN?
 – Number of features to use?
 – Square / log the features and add them?
 – Shall I use Logistic Regression or LDA?

• Model Evaluation
 – How good is the performance on new unseen data?

The Validation Set Approach

[Figure: the examples (rows of the data matrix) are split into a training set and a validation set]

Leave-One-Out Cross Validation (LOOCV)

Fit without the red sample and predict the red sample. Average over all n repeats.
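A minimal sketch of LOOCV written as an explicit loop, using LDA on the iris data with equal priors (both choices are assumptions made for illustration). The built-in `CV = TRUE` option of `lda()` is shown for comparison.

```r
library(MASS)

n <- nrow(iris)
pred <- character(n)
for (i in seq_len(n)) {
  # fit without sample i, then predict sample i
  fit_i <- lda(Species ~ ., data = iris[-i, ], prior = c(1, 1, 1) / 3)
  pred[i] <- as.character(predict(fit_i, iris[i, ])$class)
}
loocv_acc <- mean(pred == iris$Species)
loocv_acc

# Built-in leave-one-out shortcut of lda()
fit_cv <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3, CV = TRUE)
builtin_acc <- mean(fit_cv$class == iris$Species)
builtin_acc
```

The explicit loop refits the model n times, which is why LOOCV can be slow for classifiers without such a shortcut.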

K-fold Cross Validation

Fit w/o red samples and predict the red samples. Average over all k repeats. Do a weighted average if folds do not have the same size.

Question: What happens if k=n, what if k=2?
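K-fold cross-validation can be sketched with a random fold label per row. The value k = 10, the seed and the use of LDA on iris are assumptions for this example.

```r
library(MASS)

set.seed(1)
k <- 10
n <- nrow(iris)
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

correct <- logical(n)
for (j in seq_len(k)) {
  test_idx <- which(fold == j)
  fit_j <- lda(Species ~ ., data = iris[-test_idx, ])   # fit without fold j
  correct[test_idx] <- predict(fit_j, iris[test_idx, ])$class == iris$Species[test_idx]
}

# Pooling all predictions equals the fold-size-weighted average of the fold accuracies
cv_acc <- mean(correct)
cv_acc
```

Setting `k <- n` puts every observation in its own fold, and `k <- 2` trains each model on only half of the data.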

What is the best cross-validation scheme?

• Example (3 Datasets)

• Accuracy gets better (or stays constant) the larger the training set

• Kohavi, R. (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in 'IJCAI', pp. 1137-1145.

[Figure: estimated accuracy versus number of folds and number of left-out samples. Annotations: large training set (low bias); small training set (sometimes larger bias). Bar position: estimated true accuracy; bar width: 95% interval for the true accuracy.]

What is the best cross-validation scheme?

• LOOCV
 – Not random
 – Sometimes slow (there are built-in procedures for some classifiers)
 – Higher variance (between completely new data sets)
 – Lowest possible bias (nearly the whole data seen)

• K-fold cross-validation
 – Random result
 – Usually faster
 – Lower variance
 – Higher bias (usually not problematic)

Standard today: k=10 fold cross-validation (sometimes repeated after permutation)

Moving the cutoff


Operating on different levels of trust (motivation)

• LDA makes 252 + 23 = 275 mistakes on 10000 predictions (2.75% misclassification error rate)

• Great?
• But LDA mispredicts 252/333 = 75.7% of defaulters!
• LDA gives the probability of belonging to one class.

• Perhaps we shouldn’t use 0.5 as threshold for predicting default?

Operating on different levels of trust (motivation)

• Now the total number of mistakes is 235 + 138 = 373 (3.73% misclassification error rate)

• But we only mispredicted 138/333 = 41.4% of defaulters
• We can examine the error rate with other thresholds
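The effect of the threshold can be explored with the `Default` data from the ISLR package. This sketch assumes the predictors `balance` and `student` and a lowered cutoff of 0.2, as in the ISLR textbook example. The slides do not state either, so the resulting counts may differ from the numbers quoted above. `confusion_at` is a hypothetical helper.

```r
library(MASS)
library(ISLR)   # provides the Default data set

# Assumption: predictors balance and student
fit <- lda(default ~ balance + student, data = Default)
post_yes <- predict(fit, Default)$posterior[, "Yes"]

# Confusion matrix when predicting "Yes" for posterior > cutoff
confusion_at <- function(cutoff) {
  pred <- factor(ifelse(post_yes > cutoff, "Yes", "No"), levels = c("No", "Yes"))
  table(predicted = pred, actual = Default$default)
}

for (cutoff in c(0.5, 0.2)) {
  tab <- confusion_at(cutoff)
  cat("cutoff:", cutoff,
      " overall error:", 1 - sum(diag(tab)) / sum(tab),
      " defaulters missed:", tab["No", "Yes"] / sum(tab[, "Yes"]), "\n")
}
```

Lowering the cutoff trades a higher overall error rate for fewer missed defaulters.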

Different levels in one plot

• Black solid: overall error rate
• Blue dashed: fraction of defaulters missed
• Orange dotted: non-defaulters incorrectly classified

[Figure: error rates as a function of the threshold; the normal operation point of LDA is marked]

Decision errors in a 2 class problem

Confusion matrix:

                          true class
                          D+                D-
predicted class   T+      True Positive     False Positive
                  T-      False Negative    True Negative

sens = TP / P = TP / (TP + FN)

spec = TN / N = TN / (FP + TN)

The sensitivity (true positive rate) of a test or classifier is its ability to correctly identify the class 1 or diseased individuals.

The specificity (true negative rate or 1 - false positive rate) of a test or classifier is its ability to correctly identify the class 0 or healthy individuals.
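The two formulas translate directly into a small helper. The function name `sens_spec` and the toy vectors are made up for illustration.

```r
# Sensitivity and specificity from logical vectors:
# diseased = true class is D+, test_positive = prediction is T+
sens_spec <- function(diseased, test_positive) {
  tp <- sum(test_positive & diseased)
  fn <- sum(!test_positive & diseased)
  fp <- sum(test_positive & !diseased)
  tn <- sum(!test_positive & !diseased)
  c(sens = tp / (tp + fn), spec = tn / (fp + tn))
}

# Toy example: 4 diseased and 6 healthy individuals
diseased      <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
test_positive <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
res <- sens_spec(diseased, test_positive)
res   # sens = 3/4, spec = 5/6
```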

ROC curves

BMJ 1994; 309:188

Determine the sensitivity and specificity for the 2 indicated cut-offs.

For each cutoff we get a classification rule and a corresponding confusion matrix, from which we can determine sensitivity and specificity.

We can use a continuous score such as a probability to construct a ROC curve.

Use the ROC curve as performance measure: the area under the curve (AUC) quantifies discrimination power.

The larger the AUC, the better the performance of the diagnostic test. A useless test has an AUC = 0.5. A perfect test has an AUC = 1.
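To make the construction concrete, a ROC curve and its AUC can be computed in base R by sweeping the cutoff over all observed scores. The helpers `roc_points` and `auc_trapezoid` and the simulated scores are assumptions for this sketch, and the AUC uses the trapezoidal rule.

```r
# One (fpr, tpr) point per cutoff; an observation is called positive if score >= cutoff
roc_points <- function(score, is_positive) {
  cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(cutoffs, function(cut) mean(score[is_positive] >= cut))    # sensitivity
  fpr <- sapply(cutoffs, function(cut) mean(score[!is_positive] >= cut))   # 1 - specificity
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Simulated scores: positives tend to have higher scores than negatives
set.seed(1)
is_positive <- rep(c(TRUE, FALSE), each = 200)
score <- rnorm(400, mean = ifelse(is_positive, 1.25, -1.25))

roc <- roc_points(score, is_positive)
plot(roc$fpr, roc$tpr, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)   # diagonal of a useless test
auc <- auc_trapezoid(roc)
auc
```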

ROC-Curve in R (using ROCR)

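############################
# ROC Curves in R (in-sample)
library(MASS)   # provides lda()
library(ROCR)
library(ISLR)   # provides the Default data set
fit = lda(default ~ ., data = Default)
preds = predict(fit, Default)$posterior
head(preds)
# column 2 holds the posterior probability of default = "Yes"
pred <- prediction(preds[,2], Default$default)
plot(performance(pred, "tpr", "fpr"))
# extract the AUC as a number from the performance object
performance(pred, "auc")@y.values[[1]]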
