Page 1

Advanced Studies in Applied Statistics (WBL), ETHZ
Applied Multivariate Statistics
Spring 2018, Week 9

Lecturer: Beate Sick, [email protected]

Remark: Much of the material has been developed together with Oliver Dürr for different lectures at ZHAW.

Page 2

Topics of today

• Bayes rule for decision boundaries: optimal if data distribution is known ;-)

• Linear Discriminant Analysis (LDA): estimate Bayes decision boundaries

– LDA in uni-variate (p = 1) and multi-variate (p≥2) cases

– LDA versus logistic regression

– extending to quadratic discriminant analysis (QDA)

• Performance evaluation

– Cross validation schemes

– ROC curve


Page 3

Bayes rule for balanced 2-group classification with a class-conditionally Normal-distributed univariate feature with equal variances

Assumptions: a) balanced 2-class classification problem (both class prevalences/priors are 0.5). b) one predictor x with known class-conditional distributions x|A ~ N(μ_A, σ²) and x|B ~ N(μ_B, σ²), i.e. with the same variance!

At which x-value is the optimal split?

Bayes rule: the decision boundary is s = (μ_A + μ_B) / 2; classify to the class with the larger posterior probability P(class | X = x).
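A one-line justification (not spelled out on the slide): with equal priors and equal variances, the boundary lies where the two class densities intersect:

exp(−(s − μ_A)² / (2σ²)) = exp(−(s − μ_B)² / (2σ²))  ⟹  (s − μ_A)² = (s − μ_B)²  ⟹  s = (μ_A + μ_B) / 2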

[Figure: the two class-conditional densities with the decision boundary marked]


Page 4

Why is it called Bayes rule?

P(class = C | X = x) = P(X = x | class = C) · P(class = C) / P(X = x),   C ∈ {A, B}

Each class has a prior probability P(A) and P(B) ⟹ before observing a feature we would classify to the class with the higher prior.

Observing a feature value x, we determine the posterior probabilities P(A | x) and P(B | x) ⟹ we then classify to the class with the higher posterior.

We use the Bayes theorem to determine the posterior:

Decide for class A if:
P(class = A | X = x) > P(class = B | X = x)
⟺ P(X = x | class = A) · P(class = A) > P(X = x | class = B) · P(class = B)

with P(X = x) = P(X = x | class = A) · P(class = A) + P(X = x | class = B) · P(class = B)
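As a small illustration of this rule (a sketch only; the means, variance and feature value are placeholder assumptions, not from the slides), the posterior for class A can be computed in R:

# Sketch: posterior P(A | X = x) for two Gaussian classes with equal variance
mu.A <- -1.25; mu.B <- 1.25; sigma <- 1   # illustrative parameter values
prior.A <- 0.5; prior.B <- 0.5            # balanced priors
x <- 0.3                                  # observed feature value
lik.A <- dnorm(x, mu.A, sigma)            # P(X = x | class A)
lik.B <- dnorm(x, mu.B, sigma)            # P(X = x | class B)
post.A <- lik.A * prior.A / (lik.A * prior.A + lik.B * prior.B)
post.A                                    # decide for class A if post.A > 0.5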


Page 5

Decision boundary from Bayes rule in the case of class-conditional multi-variate equal-variance Gaussian features

[Figure: class-conditional Gaussian densities for class 1 and class 2, with the decision point (p = 1), decision line (p = 2) and decision plane (p = 3)]


p = 1 -> decision boundary is a point

p = 2 -> decision boundary is a line

p = 3 -> decision boundary is a 2D plane ...

p = L -> decision boundary is an (L−1)-dimensional hyperplane


Page 6

Accuracy of the Bayes Classifier

[Confusion matrix: ACTUAL CLASS (rows: Class=1, Class=2) vs. PREDICTED CLASS (columns: Class=1, Class=2), with the counts shown on the slide]

It holds:
> pnorm(0, -1.25, 1)
[1] 0.8943502
i.e. P(X < 0) ≈ 0.894 for X ~ N(-1.25, 1): the probability that an observation from the class with mean -1.25 is classified correctly when the boundary is at 0.
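Reading the class means ±1.25 and unit variance off the pnorm call above (the symmetric two-class setup is an assumption), the overall Bayes accuracy is the prior-weighted average of the per-class accuracies:

# Sketch: overall accuracy of the Bayes classifier with boundary s = 0
acc.1 <- pnorm(0, mean = -1.25, sd = 1)      # P(correctly classified | class 1)
acc.2 <- 1 - pnorm(0, mean = 1.25, sd = 1)   # P(correctly classified | class 2)
0.5 * acc.1 + 0.5 * acc.2                    # overall accuracy, approx. 0.894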


Page 7

Linear Discriminant Analysis

Book: Introduction to Statistical Learning
Videos (from ISLR):
• Linear Discriminant Analysis and Bayes Theorem (7:12)
• Univariate Linear Discriminant Analysis (7:37)
• Multivariate Linear Discriminant Analysis and ROC Curves (17:42)


Page 8

Principal idea of univariate LDA

Dashed vertical line: the theoretical Bayes decision boundary (known, since it is a simulation study)
Solid vertical line: the LDA decision boundary

[Figure panels: a sample with 20 observations per class, and the assumed underlying distributions]

1) Estimate the priors of the classes by the class frequencies in the training data set.

2) Estimate the parameters of the assumed underlying Normal distributions (μ̂1, μ̂2, σ̂²).

3) Determine the decision boundary according to the Bayes rule.

By default, LDA assumes that the predictor follows a Normal distribution in each class; these distributions all have the same variance and differ only in their means.


Page 9

Use Training Data set for Estimation

• The mean μk could be estimated by the average of all training observations from the kth class.

• The variance σ2 could be estimated as the weighted average of the variances of all K classes.

• And, πk is estimated as the proportion of the training observations that belong to the kth class.

µ̂_k = (1 / n_k) · Σ_{i: y_i = k} x_i

σ̂² = (1 / (n − K)) · Σ_{k=1}^{K} Σ_{i: y_i = k} (x_i − µ̂_k)²

π̂_k = n_k / n
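A minimal R sketch of these plug-in estimates (using iris with Sepal.Length as the single predictor is an illustrative assumption, not the lecture's example):

# Sketch: plug-in estimates for univariate LDA
x <- iris$Sepal.Length                 # single predictor
y <- iris$Species                      # class labels
n <- length(x); K <- nlevels(y)
nk <- table(y)                         # class counts n_k
pi.hat <- nk / n                       # estimated priors
mu.hat <- tapply(x, y, mean)           # estimated class means
# pooled variance: weighted average of the within-class variances
ss.within  <- tapply(x, y, function(v) sum((v - mean(v))^2))
sigma2.hat <- sum(ss.within) / (n - K)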

Page 10

Example

• The dashed vertical line is the Bayes decision boundary
• The solid vertical line is the LDA decision boundary
  – Bayes error rate: 10.6%
  – LDA error rate: 11.1%

[Figure panels: 20 observations randomly drawn from each distribution, and the theoretical curves (Bayes classifier)]

LDA reaches the Bayes boundary for n → ∞


Page 11

Multi-variate Normal distribution (recap)

Density:

X = (X_1, …, X_p)ᵗ ~ N_p(μ, Σ)

f(x) = (2π)^(−p/2) · |Σ|^(−1/2) · exp( −(1/2) · (x − μ)ᵗ Σ⁻¹ (x − μ) )

Examples for p = 2:

x1 and x2 are not correlated:
Σ = ( 1 0 ; 0 1 )

x1 and x2 are correlated:
Σ = ( 1 0.7 ; 0.7 1 )

[Figure: density/contour plots for the two cases]
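A quick sketch to visualize the two cases with simulated data (MASS::mvrnorm; the sample size of 500 is arbitrary):

# Sketch: samples from the two bivariate Normal distributions above
library(MASS)                                    # for mvrnorm
set.seed(1)
Sigma.uncor <- diag(2)                           # (1 0; 0 1)
Sigma.cor   <- matrix(c(1, 0.7, 0.7, 1), 2, 2)   # (1 0.7; 0.7 1)
par(mfrow = c(1, 2))
plot(mvrnorm(500, mu = c(0, 0), Sigma = Sigma.uncor),
     xlab = "x1", ylab = "x2", main = "uncorrelated")
plot(mvrnorm(500, mu = c(0, 0), Sigma = Sigma.cor),
     xlab = "x1", ylab = "x2", main = "correlated")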

Page 12

Mahalanobis distance is the multivariate z-score (recap)

MD(x) = sqrt( (x − μ)ᵗ Σ⁻¹ (x − μ) )

The Mahalanobis distance MD(x) measures the distance of x to the mean of the multivariate Normal distribution in units of standard deviations.

[Figure: contour ellipses with points at MD = 1 and MD = 2]
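In R, stats::mahalanobis() returns the squared distance, so take the square root; a small sketch with assumed values:

# Sketch: Mahalanobis distance of a point x to the centre mu
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)          # illustrative covariance
x     <- c(1, 2)
md2 <- mahalanobis(x, center = mu, cov = Sigma)   # squared distance
sqrt(md2)                                         # MD(x), in units of standard deviations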

Page 13

LDA for p>1

• We still require that the multivariate feature distribution in each class is given by a multivariate Normal distribution with multivariate mean μk and variance-covariance matrix Σ.

• In LDA, Σ must be the same for each class (but it may contain off-diagonal elements), leading to hyperplanes as decision boundaries between classes.

• For a new observation we measure the Mahalanobis distance to all class centres and classify it to the nearest class centre (a small sketch follows below).
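A minimal sketch of this nearest-centre rule with equal priors (iris as illustrative data; the pooled covariance is the LDA estimate of the common Σ):

# Sketch: classify an observation to the class centre with the
# smallest Mahalanobis distance (pooled covariance)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
centres <- apply(X, 2, function(col) tapply(col, y, mean))    # one row per class
# pooled within-class covariance matrix
S <- Reduce(`+`, lapply(levels(y), function(k) {
  Xk <- X[y == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (nrow(X) - nlevels(y))
x.new <- X[1, ]                                               # "new" observation (illustration)
d2 <- apply(centres, 1, function(m) mahalanobis(x.new, m, S)) # squared distances
names(which.min(d2))                                          # class of the nearest centre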


Page 14

Example: p = 2, K = 3


Page 15

Example LDA: Predicting Species

Iris Setosa, Iris Versicolor, Iris Virginica

[Table: excerpt of the iris measurements]

Iris Data Set: Fisher 1936. In R: iris

Page 16

Example Iris


Page 17

LDA as classifier and method for dimension reduction

[Figure panels: visualization of the distances to the class centres (with decision boundary), and the possibility for dimension reduction]

2 classes can be shown in 1D (since 2 points can be placed on a line)

3 classes can be shown in 2D (since 3 points can be placed on a plane)

…
K classes can be shown in (K−1)D


Important for classification is only the (possibly non-isotropic) Mahalanobis distance to the class centres μk.



Page 18

Example Iris Data

[Figure: projection with LDA vs. projection with PCA]
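A sketch of how such a comparison can be produced in R (using prcomp on the four scaled features for the PCA side is an assumption):

# Sketch: project iris onto the first two LDA and PCA directions
library(MASS)
lda.fit  <- lda(Species ~ ., data = iris)
proj.lda <- predict(lda.fit)$x                            # LD1, LD2
proj.pca <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:2]   # PC1, PC2
par(mfrow = c(1, 2))
plot(proj.lda, col = iris$Species, main = "Projection with LDA")
plot(proj.pca, col = iris$Species, main = "Projection with PCA")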


Page 19

When to prefer LDA over logistic regression?

• LDA works better than logistic regression if classes are well separated, since then logistic regression estimates are very unstable

• LDA is more stable than logistic regression if n is small

• LDA is also appropriate for more than two response classes

• LDA provides low-dimensional views of the data


LDA requires that the distribution of the predictors X is approximately normal in each of the classes.

Page 20

Extension to Quadratic Discriminant Analysis (QDA)

image credits: http://www.pnas.org/content/94/2/565

In QDA we allow each class to have a different covariance matrix Σk for its class-specific multivariate Gaussian (we have many more parameters to estimate!)
-> the smallest Mahalanobis distance to one of the class centres still determines the class
-> class boundaries can now have quadratic shape

Page 21

https://www.projectrhea.org/rhea/index.php/Bayes_Rule_for_1-dimensional_and_N-dimensional_feature_spaces

The error of a classifier based on the Bayes rule

Without proof: the Bayes rule classifier has the smallest possible error.

Challenge: We often do not know the class priors and the class-conditional distributions of the observed feature(s).


Page 22

LDA and QDA in R

library(MASS)
data(iris)

######### LDA #######################
fit = lda(Species ~ ., data = iris)
res = predict(fit, iris)
# train accuracy of lda
sum(res$class == iris$Species) / nrow(iris)  # 0.98

# 3 classes -> 2D plot of LD2 vs LD1
plot(res$x, col = iris$Species)

# do leave-one-out cv
fit.cv = lda(formula = Species ~ .,
             data = iris, prior = c(1, 1, 1) / 3,
             CV = TRUE)  # leave-one-out cv

head(fit.cv$posterior, 3)
#   setosa   versicolor    virginica
# 1      1 5.087494e-22 4.385241e-42
# 2      1 9.588256e-18 8.888069e-37
# 3      1 1.983745e-19 8.606982e-39

######## QDA #########################
fit2 = qda(Species ~ ., data = iris)
res2 = predict(fit2, iris)
# train accuracy of qda
sum(res2$class == iris$Species) / nrow(iris)  # 0.98

[Plot: LD2 vs. LD1 for the three iris species]

Page 23

Cross-validation


Page 24

Questions for X-Validation

• Model Selection (which model to take)
  – Often there is a knob controlling the complexity of a classifier
  – Number of nearest neighbours in KNN?
  – Number of features to use?
  – Square / log the features and add them?
  – Shall I use Logistic Regression or LDA?

• Model Evaluation
  – How good is the performance on new, unseen data?


Page 25

The Validation Set Approach

[Figure: the examples (rows of the data matrix) are split into a training and a validation set]


Page 26

Leave-One-Out Cross Validation (LOOCV)

Fit without the red sample and predict the red sample. Average over all n repeats.


Page 27

K-fold Cross Validation

Fit without the red samples and predict the red samples. Average over all k repeats. Do a weighted average if the folds do not have the same size.

Question: What happens if k = n? What if k = 2?

Page 28

What is the best cross-validation scheme?

• Example (3 Datasets)

• Accuracy gets better (or stays constant) the larger the training set
• Kohavi, R. (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in IJCAI, pp. 1137-1145.

[Figure: estimated accuracy vs. number of folds / number left out, for 3 datasets. Bar position: estimated true accuracy; bar width: 95% of true accuracy. Large training set: low bias; small training set: sometimes larger bias.]

Page 29

What is the best cross-validation scheme?

• LOOCV
  – Not random
  – Sometimes slow (there are built-in procedures for some classifiers)
  – Higher variance (between completely new data sets)
  – Lowest possible bias (nearly the whole data seen)

• K-fold cross-validation
  – Random result
  – Usually faster
  – Lower variance
  – Higher bias (usually not problematic)

Standard today: k = 10-fold cross-validation (sometimes repeated after permutation); a minimal sketch follows below.
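A minimal sketch of 10-fold cross-validation for LDA on iris (hand-rolled fold assignment; an illustrative assumption, not the lecture's code):

# Sketch: k = 10-fold cross-validated accuracy of LDA
library(MASS)
set.seed(42)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))   # random fold labels
acc <- sapply(1:k, function(f) {
  fit  <- lda(Species ~ ., data = iris[folds != f, ])   # fit without fold f
  pred <- predict(fit, iris[folds == f, ])$class        # predict fold f
  mean(pred == iris$Species[folds == f])                # accuracy in fold f
})
mean(acc)                                            # cross-validated accuracy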


Page 30

Moving the cutoff


Page 31

Operating on different levels of trust (motivation)

• LDA makes 252 + 23 mistakes on 10000 predictions (2.75% misclassification error rate)

• Great?
• But LDA mis-predicts 252/333 = 75.7% of defaulters!
• LDA gives the probability of belonging to one class.

• Perhaps we shouldn’t use 0.5 as the threshold for predicting default?


Page 32

Operating on different levels of trust (motivation)

• Now the total number of mistakes is 235 + 138 = 373 (3.73% misclassification error rate)

• But we only mis-predicted 138/333 = 41.4% of defaulters
• We can examine the error rate with other thresholds


Page 33

Different levels in one plot

• Black solid: overall error rate
• Blue dashed: fraction of defaulters missed
• Orange dotted: fraction of non-defaulters incorrectly classified

[Figure: the three error rates as a function of the threshold; the normal operating point of LDA (threshold 0.5) is marked]


Page 34

Decision errors in a 2-class problem

Confusion matrix:

                        true class
                        D+               D−
predicted    T+    True Positive    False Positive
class        T−    False Negative   True Negative

The sensitivity (true positive rate) of a test or classifier is its ability to correctly identify the class-1 or diseased individuals:

sens = TP / P = TP / (TP + FN)

The specificity (true negative rate, or 1 − false positive rate) of a test is its ability to correctly identify the class-0 or healthy individuals:

spec = TN / N = TN / (FP + TN)
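A small R sketch of these quantities (assumption: LDA on the Default data from ISLR, the example used on the neighbouring slides):

# Sketch: confusion matrix, sensitivity and specificity
library(MASS); library(ISLR)
fit  <- lda(default ~ ., data = Default)
pred <- predict(fit, Default)$class
cm <- table(predicted = pred, true = Default$default)   # confusion matrix
sens <- cm["Yes", "Yes"] / sum(cm[, "Yes"])             # TP / (TP + FN)
spec <- cm["No",  "No"]  / sum(cm[, "No"])              # TN / (FP + TN)
c(sensitivity = sens, specificity = spec)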


Page 35

ROC curves


Page 36

BMJ 1994; 309:188

Determine the sensitivity and specificity for the two indicated cut-offs.


For each cutoff we get a classification rule and a corresponding confusion matrix, and can determine sensitivity and specificity.

We can use a continuous score, such as a probability, to construct a ROC curve.

Page 37

Use the ROC curve as performance measure:
the area under the curve (AUC) quantifies discrimination power


The larger the AUC, the better the performance of the diagnostic test. A useless test has an AUC = 0.5; a perfect test has an AUC = 1.


Page 38

ROC-Curve in R (using ROCR)

#############################
# ROC curves in R (in-sample)
library(ROCR)
library(ISLR)
library(MASS)  # needed for lda(), missing in the original snippet

fit = lda(default ~ ., data = Default)
preds = predict(fit, Default)$posterior   # posterior probabilities per class
head(preds)

pred <- prediction(preds[, 2], Default$default)   # score = P(default = Yes)
plot(performance(pred, "tpr", "fpr"))             # ROC curve: TPR vs. FPR
performance(pred, "auc")                          # AUC (stored in the y.values slot)


