Computational BioMedical Informatics


1

Computational BioMedical Informatics

SCE 5095: Special Topics Course

Instructor: Jinbo Bi
Computer Science and Engineering Dept.

2

Course Information

Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: jinbo@engr.uconn.edu
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Tue/Thur 3:30pm – 4:45pm
– Location: ITEB 127
– Office hours: Tue/Thur 4:45–5:30pm

HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration

3

Regression and classification

Both regression and classification problems are typically supervised learning problems.

The main property of supervised learning:
– A training example contains the input variables and the corresponding target label.
– The goal is to find a good mapping from the input variables to the target variable.

4

Classification: Definition

Given a collection of examples (a training set):
– Each example contains a set of variables (features) and the target variable, the class.

Find a model for the class attribute as a function of the values of the other variables.

Goal: previously unseen examples should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

5

Classification Application 1

Fraud detection. Goal: predict fraudulent cases in credit card transactions.

Training set (past transaction records, labeled):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)

Test set (current data, for which we want to use the model to predict):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the model is then applied to the test set.

6

Classification: Application 2

Handwritten digit recognition. Goal: identify the digit of a handwritten number.
– Approach: align all images to derive the features, then model the class (identity) based on these features.

7

Illustrating Classification Task

(Figure: the classification workflow. The training set is fed to a learning algorithm, which learns a model by induction; the model is then applied to the test set by deduction.)

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

8

Classification algorithms

K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision trees
Logistic regression
Graphical models

9

Regression: Definition

Goal: predict the value of one or more continuous target attributes given the values of the input attributes.

The difference between classification and regression lies only in the target attribute:
– Classification: discrete or categorical target
– Regression: continuous target

Regression has been studied extensively in statistics and in the neural networks field.

10

Regression application 1

Goal: predict the possible loss from a customer.

Training set (past transaction records, labeled with the continuous target Loss):

Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90

(Refund and Marital Status are categorical, Taxable Income is continuous, Loss is the continuous target.)

Test set (current data, for which we want to use the model to predict):

Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Regressor → Model; the model is then applied to the test set.

11

Regression applications

Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.

12

Regression algorithms

Least squares methods
Regularized linear regression (ridge regression)
Neural networks
Support vector machines (SVM)
Bayesian linear regression

13

Practical issues in the training

Underfitting

Overfitting

Before introducing these important concepts, let us study a simple regression algorithm: linear regression.

14

Least squares

We wish to use some real-valued input variables x to predict the value of a target y.

We collect training data of pairs (x_i, y_i), i = 1, …, N.

Suppose we have a model f that maps each example x to a predicted value y'.

Sum-of-squares function:
– the sum of the squares of the deviations between the observed target value y and the predicted value y':
$$E = \sum_{i=1}^{N} (y_i - y'_i)^2 = \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2$$

15

Least squares

Find a function f such that the sum of squares is minimized:
$$\min_f \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2$$

For example, suppose the function is a linear function, f(x) = wᵀx.

Least squares with a function that is linear in the parameters w is called "linear regression":
$$\min_w \sum_{i=1}^{N} \bigl(y_i - w^T x_i\bigr)^2$$

16

Linear regression

Linear regression has a closed-form solution for w. Writing the error in matrix form,
$$E(w) = \sum_{i=1}^{N} \bigl(y_i - w^T x_i\bigr)^2 = (y - Xw)^T (y - Xw),$$
the minimum is attained at the zero derivative:
$$\frac{\partial E(w)}{\partial w} = -2 X^T (y - Xw) = 0 \quad\Longrightarrow\quad w = (X^T X)^{-1} X^T y.$$
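As a concrete illustration (our sketch, not part of the original slides), the closed-form solution can be computed in a few lines of NumPy; the data and variable names below are invented for the example:

```python
import numpy as np

# Toy data: N examples, d features (illustrative values, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # design matrix, one row per example
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)    # noisy linear target

# Closed-form least squares: w = (X^T X)^{-1} X^T y.
# Solving the normal equations is preferred over forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                   # should be close to w_true
```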

17

x is drawn uniformly from [0, 1]; y = f(x) + random error, specifically y = sin(2πx) + ε with ε ~ N(0, σ).

Polynomial Curve Fitting

18

Polynomial Curve Fitting

19

Sum-of-Squares Error Function

20

0th Order Polynomial

21

1st Order Polynomial

22

3rd Order Polynomial

23

9th Order Polynomial

24

Over-fitting

Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(w^*)/N}$
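To make the over-fitting behavior above concrete, here is a small illustrative sketch (ours, assuming the sinusoidal data of the previous slides) that fits polynomials of orders 0, 1, 3, and 9 and compares training and test RMS error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)                       # 10 training points
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
x_test = rng.uniform(0, 1, 100)                 # a larger held-out test set
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 100)

for order in (0, 1, 3, 9):
    coeffs = np.polyfit(x, y, order)            # least-squares polynomial fit
    rms = lambda xs, ys: np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    # Order 9 nearly interpolates the training data: tiny training RMS, large test RMS.
    print(order, rms(x, y), rms(x_test, y_test))
```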

25

Polynomial Coefficients

26

Data Set Size:

9th Order Polynomial

27

Data Set Size:

9th Order Polynomial

28

Regularization

Penalize large coefficient values by adding a penalty term to the sum-of-squares error:
$$\tilde{E}(w) = \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2 + \lambda\,\|w\|^2$$

This penalized form is known as ridge regression.

29

Regularization:

30

Regularization:

31

Regularization: vs.

32

Polynomial Coefficients

33

Ridge Regression

Derive the analytic solution to the optimization problem for ridge regression:
$$\min_w \; \|y - Xw\|^2 + \lambda\,\|w\|^2$$
$$\min_w \; (y - Xw)^T (y - Xw) + \lambda\, w^T w$$
$$\min_w \; y^T y - 2\, w^T X^T y + w^T (X^T X + \lambda I)\, w$$

Using the KKT condition (first-order derivative = 0):
$$(X^T X + \lambda I)\, w = X^T y \quad\Longrightarrow\quad w = (X^T X + \lambda I)^{-1} X^T y$$
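A minimal sketch of this closed-form ridge solution (ours; `lam` stands for the regularization parameter λ):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))
```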

34

Neural networks

Introduction
Different designs of NN
Feed-forward network (MLP)
Network training
Error back-propagation
Regularization

35

Introduction

Neuroscience studies how networks of neurons produce intellectual behavior, cognition, emotion and physiological responses

Computer science studies how to simulate knowledge in cognitive science, including the way neurons process signals

Artificial neural networks simulate the connectivity of the neural system and the way it passes signals, mimicking the massively parallel operations of the human brain.

36

Common features

Dendrites

37

Different types of NN

Adaptive NN: has a set of adjustable parameters that can be tuned

Topological NN

Recurrent NN

38

Different types of NN

Feed-forward NN: multi-layer perceptron, linear perceptron.

(Figure: input layer → hidden layer → output layer.)

39

Different types of NN

Radial basis function NN (RBFN)

40

Multi-Layer Perceptron

Layered perceptron networks can realize any logical function; however, there is no simple way to estimate their parameters or to generalize the (single-layer) perceptron convergence procedure.

Multi-layer perceptron (MLP) networks are a class of models formed from layers of sigmoidal nodes, which can be used for regression or classification purposes.

They are commonly trained using gradient descent on a mean-squared-error performance function, using a technique known as error back-propagation to calculate the gradients.

They have been widely applied to many prediction and classification problems over the past 15 years.

41

Linear perceptron

(Figure: inputs x1, x2, …, xt with weights w1, w2, …, wt feed a summation node Σ, which outputs y.)

y = w1·x1 + w2·x2 + … + wt·xt

Many functions cannot be approximated using a perceptron.

42

Multi-Layer Perceptron

XOR (exclusive OR) problem:
0+0 = 0; 1+1 = 2 = 0 (mod 2); 1+0 = 1; 0+1 = 1

The perceptron does not work here! A single layer generates only a linear decision boundary.

43

Multi-Layer Perceptron

(Figure: input layer x1, x2, …, xt; a hidden layer of units f(Σ) reached through weights w^{(1)}; an output y produced through weights w^{(2)}.)

Each link is associated with a weight, and these weights are the tuning parameters to be learned. Each neuron, except those in the input layer, receives inputs from the previous layer and reports an output to the next layer.

Input layer → hidden layer → output layer

44

Each neuron

(Figure: a single neuron with inputs p1, …, pn, weights w1, …, wn, a summation S, and an activation function f.)

Summation:
$$SUM_{out} = \sum_{i=1}^{n} w_i\, p_i$$

Output, where f is the activation function (here the sigmoid):
$$OUT_j = \frac{1}{1 + \exp(-SUM_{out})}$$

The activation function f can be the identity function f(x) = x, the sigmoid function, or the hyperbolic tangent.

45

(Figure: a network with a 1st, 2nd, and 3rd layer.)

Universal approximation: a three-layer network can in principle approximate any function with any accuracy!

Universal Approximation of MLP

46

Feed-forward network function

The output from each hidden node:
$$o_j^{(1)} = f\Bigl(\sum_{i=1}^{N} w_{ij}^{(1)}\, x_i\Bigr)$$

The final output:
$$y_k = f\Bigl(\sum_{j=1}^{M} w_{jk}^{(2)}\, o_j^{(1)}\Bigr)$$

(Figure: the signal flows from the N input nodes x1, …, xt through the M hidden nodes to the output y.)
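A minimal sketch of this forward pass (ours), assuming sigmoid activations f at both layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass of a one-hidden-layer network.
    W1: (N, M) input-to-hidden weights; W2: (M, K) hidden-to-output weights."""
    o = sigmoid(x @ W1)   # hidden outputs: o_j = f(sum_i w_ij^(1) x_i)
    y = sigmoid(o @ W2)   # final outputs:  y_k = f(sum_j w_jk^(2) o_j)
    return o, y
```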

47

Network Training

A supervised neural network is a function h(x;w) that maps from inputs x to target y

Usually training a NN does not involve the change of NN structures (such as how many hidden layers or how many hidden nodes)

Training NN refers to adjusting the values of connection weights so that h(x;w) adapts to the problem

Use the sum of squares as the error metric,
$$E(w) = \sum_{i=1}^{L} \bigl(y_i - h(x_i; w)\bigr)^2,$$
minimized over the weights w = (w_{ij}) using gradient descent.

48

Gradient descent

Review of gradient descent: it is an iterative algorithm with many iterations. In each iteration, the weights w receive a small update:
$$w_{ij}^{new} = w_{ij} - \eta\, \frac{\partial E}{\partial w_{ij}}$$

Terminate
– when the network is stable, in other words, when the training error cannot be reduced further (E(w^{new}) < E(w) no longer holds), or
– when the error on a validation set starts to climb up (early stopping).

49

Error Back-propagation

The update of the weights goes backwards because we have to use the chain rule to evaluate the gradient of E(w): the signal flows forwards, while learning propagates backwards.

(Figure: inputs x1, …, xt flow forwards through weights W_ij, the hidden nodes, and weights W_jk to the output y = h(x; w); learning moves in the opposite direction.)

50

Error Back-propagation

Update the weights in the output layer first, then propagate the errors from the higher layer to the lower layer. Recall
$$o_j^{(1)} = f\Bigl(\sum_{i=1}^{N} w_{ij}^{(1)}\, x_i\Bigr), \qquad y = f\Bigl(\sum_{j=1}^{M} w_j^{(2)}\, o_j^{(1)}\Bigr).$$

(Figure: the same network as before; learning is backwards.)

51

Evaluate gradient

First compute the partial derivatives for the weights in the output layer, then for the weights in the hidden layer. Recall
$$o_{pj}^{(1)} = f\Bigl(\sum_{i=1}^{N} w_{ij}^{(1)}\, x_{pi}\Bigr), \qquad \hat{y}_p = f\Bigl(\sum_{j=1}^{M} w_j^{(2)}\, o_{pj}^{(1)}\Bigr),$$
and write the error over the L training examples as
$$E(w) = \sum_{p=1}^{L} (y_p - \hat{y}_p)^2 = \sum_{p=1}^{L} E_p.$$

For the output layer,
$$\frac{\partial E}{\partial w_j^{(2)}} = -2 \sum_{p=1}^{L} (y_p - \hat{y}_p)\, \frac{\partial \hat{y}_p}{\partial w_j^{(2)}} = -2 \sum_{p=1}^{L} (y_p - \hat{y}_p)\, f'\Bigl(\sum_{j'} w_{j'}^{(2)} o_{pj'}^{(1)}\Bigr)\, o_{pj}^{(1)}.$$

For the hidden layer, the chain rule continues through $o_{pj}^{(1)}$:
$$\frac{\partial E}{\partial w_{ij}^{(1)}} = -2 \sum_{p=1}^{L} (y_p - \hat{y}_p)\, f'\Bigl(\sum_{j'} w_{j'}^{(2)} o_{pj'}^{(1)}\Bigr)\, w_j^{(2)}\, f'\Bigl(\sum_{i'} w_{i'j}^{(1)} x_{pi'}\Bigr)\, x_{pi}.$$

52

Back-propagation algorithm

Design the structure of the NN and initialize all connection weights. For t = 1 to T:
– Present training examples, propagate forwards from the input layer to the output layer, compute y, and evaluate the errors.
– Pass errors backwards through the network to recursively compute the derivatives, and use them to update the weights:
$$w_{ij}^{t+1} = w_{ij}^{t} - \eta\, \frac{\partial E}{\partial w_{ij}}$$
– If the termination rule is met, stop; otherwise continue.
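Here is a minimal end-to-end sketch of this algorithm (ours, with illustrative hyperparameters). It assumes sigmoid hidden units and, as a simplifying assumption on our part, a linear output node; the gradients follow the chain-rule expressions of the previous slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, M = 1, 8                                   # input and hidden layer sizes
W1 = rng.normal(scale=0.5, size=(N, M))       # input-to-hidden weights
w2 = rng.normal(scale=0.5, size=M)            # hidden-to-output weights

X = rng.uniform(0, 1, size=(50, N))           # sinusoidal toy data, as earlier
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=50)

eta = 0.1
for t in range(5000):
    # Forward pass.
    o = sigmoid(X @ W1)                       # hidden outputs, shape (50, M)
    y_hat = o @ w2                            # linear output node

    # Backward pass (chain rule) for E = sum (y_hat - y)^2.
    err = y_hat - y
    grad_w2 = 2 * o.T @ err                   # output-layer gradient
    delta = 2 * np.outer(err, w2) * o * (1 - o)   # error propagated backwards
    grad_W1 = X.T @ delta                     # hidden-layer gradient

    # Gradient-descent updates (averaged over the batch).
    w2 -= eta * grad_w2 / len(X)
    W1 -= eta * grad_W1 / len(X)
```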

53

Notes on back-propagation

Note that these rules apply to different kinds of feed-forward networks: it is possible for connections to skip layers, or to have mixtures. However, errors always start at the highest layer and propagate backwards.

54

Activation and Error back-propagation

55

Two schemes of training

There are two schemes of updating weights:
– Batch: update weights after all examples have been presented (one epoch).
– Online: update weights after each example is presented.

Although the batch update scheme implements the true gradient descent, the online scheme is often preferred since
– it requires less storage, and
– it has more noise, hence it is less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions).

In the online update scheme, the order of presentation matters!

56

Problems of back-propagation

It is extremely slow, if it converges at all. It may get stuck in a local minimum. It is sensitive to initial conditions. It may start oscillating.

57

Overfitting – number of hidden units

Over-fitting

Sinusoidal data set used in the polynomial curve-fitting example.

58

Regularization (1)

How can we adjust the number of hidden units to get the best performance while avoiding over-fitting? Add a penalty term to the error function. The simplest regularizer is weight decay:
$$\tilde{E}(w) = E(w) + \frac{\lambda}{2}\, w^T w,$$
which modifies the gradient-descent update to
$$w_{ij}^{new} = (1 - \eta\lambda)\, w_{ij} - \eta\, \frac{\partial E}{\partial w_{ij}}.$$

59

Regularization (2)

Early stopping is a method to
– obtain good generalization performance, and
– control the effective complexity of the network.

Instead of iteratively reducing the error until a minimum on the training data set has been reached, we keep a validation set of data available and stop when the NN achieves the smallest error w.r.t. the validation data set.

60

Effect of early stopping

(Figure: error vs. number of iterations for the training set and the validation set; after its minimum, the validation-set error shows a slight increase.)

61

Classification

Underfitting or overfitting can also happen in classification approaches.

We will illustrate these practical issues on a classification problem.

Before the illustration, we introduce a simple classification technique: the K-nearest-neighbor method.

62

K-nearest neighbor (K-NN)

K-NN is one of the simplest machine learning algorithms.

K-NN is a method for classifying test examples based on the closest training examples in the feature space: an example is classified by a majority vote of its neighbors.

k is a positive integer, typically small. If k = 1, the example is simply assigned to the class of its nearest neighbor.
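A minimal sketch of the method (ours), using Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```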

63

K-NN

(Figure: the same query point classified with K = 1 and with K = 3.)

64

K-NN on real problem data

• Oil data set.
• K acts as a smoother; choosing K is model selection.
• For N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

65

Limitation of K-NN

K-NN is a nonparametric model (no particular functional form is fitted).

Nonparametric models require storing, and computing with, the entire data set.

Parametric models, once fitted, are much more efficient in terms of storage and computation.

66

Probabilistic interpretation of K-NN

Given a data set with N_k data points from class C_k, so that $\sum_k N_k = N$, draw a sphere of volume V around the test point x containing K points, K_k of them from class C_k. We then have
$$p(x \mid C_k) = \frac{K_k}{N_k V}$$
and correspondingly
$$p(x) = \frac{K}{N V}.$$

Since $p(C_k) = \frac{N_k}{N}$, Bayes' theorem gives
$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}.$$

67

Underfit and Overfit (Classification)

500 circular and 500 triangular data points.

Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1.

Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.

68

Underfit and Overfit (Classification)

500 circular and 500 triangular data points.

Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1.

Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.

69

Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large.

(Figure: training and test error vs. number of iterations; overfitting is the regime where test error rises while training error keeps decreasing.)

70

Overfitting due to Noise

The decision boundary is distorted by a noise point.

71

Overfitting due to Insufficient Examples

A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region: the insufficient number of training records there causes the neural net to predict the test examples using other training records that are irrelevant to the classification task.

72

Notes on Overfitting

Overfitting results in classifiers (a neural net, or a support vector machine) that are more complex than necessary

Training error no longer provides a good estimate of how well the classifier will perform on previously unseen records

Need new ways for estimating errors

73

Occam’s Razor

Given two models of similar generalization errors, one should prefer the simpler model over the more complex model

For a complex model, there is a greater chance that it was fitted accidentally by errors in the data.

Therefore, one should include model complexity when evaluating a model

74

How to Address Overfitting

Minimizing training error no longer guarantees a good model (a classifier or a regressor).

We need a better estimate of the error on the true population: the generalization error $P_{population}(f(x) \ne y)$.

In practice, we design a procedure that gives a better estimate of the error than the training error.

In theoretical analysis, we derive an analytical bound on the generalization error or use the Bayesian formula.

75

Model Evaluation (pp. 295–304 of the data mining textbook)

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

76

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

77

Metrics for Performance Evaluation

Regression:
– Sum of squares
– Sum of deviations
– Exponential function of the deviation

78

Metrics for Performance Evaluation

Focus on the predictive capability of a model
– rather than how fast it classifies or builds models, scalability, etc.

Confusion matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

79

Metrics for Performance Evaluation…

Most widely-used metric:
$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
(with a = TP, b = FN, c = FP, d = TN in the confusion matrix above).

80

Limitation of Accuracy

Consider a 2-class problem:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%.
– Accuracy is misleading because the model does not detect any class 1 example.

81

Cost Matrix

                      PREDICTED CLASS
C(i|j)                Class=Yes     Class=No
ACTUAL   Class=Yes    C(Yes|Yes)    C(No|Yes)
CLASS    Class=No     C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i

82

Computing Cost of Classification

Cost matrix:
                 PREDICTED CLASS
C(i|j)            +      -
ACTUAL    +      -1     100
CLASS     -       1      0

Model M1:
                 PREDICTED CLASS
                  +      -
ACTUAL    +      150     40
CLASS     -       60    250

Accuracy = 80%, Cost = 3910

Model M2:
                 PREDICTED CLASS
                  +      -
ACTUAL    +      250     45
CLASS     -        5    200

Accuracy = 90%, Cost = 4255

83

Cost vs Accuracy

Count matrix:
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d

Cost matrix:
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    p           q
CLASS    Class=No     q           p

N = a + b + c + d

Accuracy = (a + d)/N

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N[q - (q - p) Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

84

Cost-Sensitive Measures

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
(using the count matrix as before: a = TP, b = FN, c = FP, d = TN)

Precision is biased towards C(Yes|Yes) & C(Yes|No); recall is biased towards C(Yes|Yes) & C(No|Yes).

A model that declares every record to be the positive class has b = d = 0, so its recall is high.

A model that assigns the positive class only to (sure) test records keeps c small, so its precision is high.

85

Cost-Sensitive Measures (Cont’d)

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

F-measure is biased towards all cells of the confusion matrix except C(No|No).

Weighted accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
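These metrics are direct functions of the four confusion-matrix counts; a small sketch (ours):

```python
def classification_metrics(a, b, c, d):
    """Metrics from confusion-matrix counts: a=TP, b=FN, c=FP, d=TN."""
    accuracy  = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall    = a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f_measure

# Example with model M1 from the cost slide: TP=150, FN=40, FP=60, TN=250.
print(classification_metrics(150, 40, 60, 250))   # accuracy = 0.8, as on the slide
```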

86

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

87

Methods for Performance Evaluation

How to obtain a reliable estimate of performance?

The performance of a model may depend on factors other than the learning algorithm:
– class distribution
– cost of misclassification
– size of training and test sets

88

Learning Curve

A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule for creating the curve:
– arithmetic sampling (Langley et al.)
– geometric sampling (Provost et al.)

Effect of small sample size:
– bias in the estimate
– variance of the estimate

89

Methods of Estimation

Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross-validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
– oversampling vs. undersampling
Bootstrap
– Sampling with replacement
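As an illustration of the k-fold scheme in this list, here is a minimal sketch (ours); `fit` and `score` are hypothetical callables standing in for any learner and accuracy function:

```python
import numpy as np

def cross_validate(X, y, fit, score, k=10, seed=0):
    """k-fold cross-validation: train on k-1 folds, test on the held-out fold.
    fit(X, y) returns a model; score(model, X, y) returns its accuracy."""
    idx = np.random.default_rng(seed).permutation(len(X))
    scores = []
    for test_idx in np.array_split(idx, k):        # k disjoint folds
        train_idx = np.setdiff1d(idx, test_idx)
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    return np.mean(scores)
```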

90

A Useful Link: http://dlib.net/ml_guide.svg

91

Methods of Estimation (Cont'd)

Holdout method
– The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
– Random subsampling: a variation of holdout. Repeat holdout k times; accuracy = average of the accuracies obtained.

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D_1, …, D_k, each of approximately equal size.
– At the i-th iteration, use D_i as the test set and the others as the training set.
– Leave-one-out: k folds where k = # of tuples; for small data sets.
– Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.

92

Methods of Estimation (Cont'd)

Bootstrap
– Works well with small data sets.
– Samples the given training tuples uniformly with replacement: each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.

There are several bootstrap methods; a common one is the .632 bootstrap:
– Suppose we are given a data set of d examples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data points that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 - 1/d)^d ≈ e^{-1} = 0.368).
– Repeat the sampling procedure k times; the overall accuracy of the model is
$$acc(M) = \sum_{i=1}^{k} \bigl(0.632 \cdot acc(M_i)_{test\_set} + 0.368 \cdot acc(M_i)_{train\_set}\bigr)$$

93

Model Evaluation

Metrics for Performance Evaluation
– How to evaluate the performance of a model?

Methods for Performance Evaluation
– How to obtain reliable estimates?

Methods for Model Comparison
– How to compare the relative performance among competing models?

94

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory, to analyze noisy signals
– characterizes the trade-off between positive hits and false alarms.

The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis).

The performance of each classifier is represented as a point on the ROC curve.

If the classifier returns a real-valued prediction, changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

95

ROC Curve

At threshold t: TP = 50, FN = 50, FP = 12, TN = 88.

(Confusion matrix as before, with a = TP, b = FN, c = FP, d = TN.)

TPR = TP/(TP + FN)
FPR = FP/(FP + TN)

96

ROC Curve

TPR = TP/(TP + FN), FPR = FP/(FP + TN), computed from the confusion matrix (a = TP, b = FN, c = FP, d = TN).

(TPR, FPR):
– (0,0): declare everything to be the negative class (TP = 0, FP = 0)
– (1,1): declare everything to be the positive class (FN = 0, TN = 0)
– (1,0): ideal (FN = 0, FP = 0)

97

ROC Curve

(TPR, FPR):
– (0,0): declare everything to be the negative class
– (1,1): declare everything to be the positive class
– (1,0): ideal

Diagonal line:
– random guessing
– below the diagonal line, the prediction is the opposite of the true class

98

How to Construct an ROC curve

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +

• Use a classifier that produces a posterior probability P(+|A) for each test instance A.
• Sort the instances according to P(+|A) in decreasing order.
• Apply a threshold at each unique value of P(+|A).
• Count the number of TP, FP, TN, FN at each threshold.
• TP rate, TPR = TP/(TP + FN)
• FP rate, FPR = FP/(FP + TN)

99

How to Construct an ROC curve

(Same table of instances as on the previous slide.)

• Use a classifier that produces a posterior probability P(+|A) for each test instance A.
• Sort the instances according to P(+|A) in decreasing order.
• Pick a threshold of 0.85:
  – p >= 0.85: predicted positive; p < 0.85: predicted negative
  – TP = 3, FP = 3, TN = 2, FN = 2
  – TP rate, TPR = 3/5 = 60%
  – FP rate, FPR = 3/5 = 60%

100

How to construct an ROC curve

Instances sorted by P(+|A), thresholding at each value ("Threshold >="):

Class         +     -     +     -     -     -     +     -     +     +
Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC curve: plot the resulting (FPR, TPR) pairs.
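The table above can be reproduced with a short sketch (ours):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) at each distinct threshold, as in the table above.
    scores: P(+|A) per instance; labels: 1 for '+', 0 for '-'."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = labels.sum()
    neg = len(labels) - pos
    pts = []
    for t in sorted(set(scores)) + [scores.max() + 1]:  # last threshold: all negative
        pred = scores >= t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
print(roc_points(scores, labels))   # matches the TPR/FPR rows of the table
```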

101

Using ROC for Model Comparison

No model consistently outperforms the other:
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC Curve (AUC):
– ideal: area = 1
– random guess: area = 0.5

102

Data normalization

Example-wise normalization
– Each example is normalized and mapped to the unit sphere.

Feature-wise normalization
– [0,1]-normalization: normalize each feature into the unit interval.
– Standard normalization: normalize each feature to have mean 0 and standard deviation 1.
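Minimal sketches of the three normalizations (ours):

```python
import numpy as np

def normalize_examples(X):
    """Example-wise: scale each row to unit Euclidean norm (the unit sphere)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def normalize_features_01(X):
    """Feature-wise [0,1]-normalization of each column."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def standardize_features(X):
    """Feature-wise standard normalization: mean 0, standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```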

103

Training data is given.
– Each object is associated with a class label Y ∈ {1, 2, …, K} and a feature vector of d measurements: X = (X1, …, Xd).

Build a model from the training data.

Unseen objects are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}.

Linear Discriminant Analysis / Fisher's linear discriminant

Classification

104

Two classes

(Figure: two classes plotted against Variable 1 and Variable 2, with class means μ1 and μ2 and the best projection axis shown.)

105

Three classes

106

Classifiers are built from a training set (learning set) L = (X1, Y1), …, (Xn, Yn).

A classifier C built from a learning set L is a mapping C: X → {1, 2, …, K}.

The Bayes classifier is based on the conditional densities p(Ck | X): C(X) = arg maxk p(Ck | X). This is a maximum a posteriori rule, and p(Ck | X) is the posterior density.

Classifiers

107

The Rules of Probability

Sum rule: $p(X) = \sum_Y p(X, Y)$

Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$

Bayes' rule: $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{p(X)}$

posterior ∝ likelihood × prior: $p(Y = C \mid X = data) \propto p(X \mid Y)\, p(Y)$, since the denominator p(X) is irrelevant to the choice of Y = C.

108

p(Ck | X) = p(X | Ck) p(Ck) / p(X)

Find a class label C(X) so that maxk p(Ck | X) = maxk p(X | Ck) p(Ck).

Naïve Bayes assumes independence among all features (last class):
– p(X | Ck) = p(x1 | Ck) p(x2 | Ck) ⋯ p(xd | Ck)

This is a very strong assumption.

Maximum a posteriori

109

Multivariate normal density for each class

Assume multivariate Gaussian (normal) class densities, X | Y = k ~ N(μk, Σk):
$$p(X \mid C_k) = \frac{1}{(2\pi)^{d/2}\, \det(\Sigma_k)^{1/2}} \exp\Bigl(-\tfrac{1}{2}(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\Bigr)$$

Maximizing the posterior is equivalent to maximizing p(X | Ck) p(Ck), and equivalently to maximizing the logarithm of p(X | Ck) p(Ck):
$$\log p(X \mid C_k)\, p(C_k) = -\tfrac{1}{2}(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) - \tfrac{1}{2}\log\bigl((2\pi)^d \det(\Sigma_k)\bigr) + \log p(C_k)$$

Hence
$$C(X) = \arg\min_k \bigl\{(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k| - 2\log p(C_k)\bigr\}$$

110

Two-class case

If p(X | C1) p(C1) ≥ p(X | C2) p(C2), then C(X) = C1; otherwise C(X) = C2.

Equivalently,
$$\frac{p(X \mid C_1)}{p(X \mid C_2)} \ge \frac{p(C_2)}{p(C_1)}, \qquad \log\frac{p(X \mid C_1)}{p(X \mid C_2)} \ge \log\frac{p(C_2)}{p(C_1)}.$$

Substituting the Gaussian class densities gives the quadratic rule
$$-(X - \mu_1)^T \Sigma_1^{-1} (X - \mu_1) + (X - \mu_2)^T \Sigma_2^{-1} (X - \mu_2) - \log|\Sigma_1| + \log|\Sigma_2| \;\ge\; 2\log\frac{p(C_2)}{p(C_1)}.$$

111

Gaussian discriminant rule

For multivariate Gaussian (normal) class densities X | Y = k ~ N(μk, Σk), the classification rule is
$$C(X) = \arg\min_k \bigl\{(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k|\bigr\}$$
(for equal priors).

In general, this is a quadratic rule (quadratic discriminant analysis, or QDA).

In practice, the population mean vectors μk and covariance matrices Σk are estimated by the corresponding sample quantities.

112

Sample mean and covariance

Class mean:
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

Class covariance:
$$\Sigma_i = \frac{1}{|C_i|} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$$

113

Example: three points from one class,
$$X_1 = \begin{pmatrix}1\\0\\1\end{pmatrix},\quad X_2 = \begin{pmatrix}1\\1\\2\end{pmatrix},\quad X_3 = \begin{pmatrix}1\\2\\0\end{pmatrix}.$$

Sample mean:
$$\mu = \tfrac{1}{3}(X_1 + X_2 + X_3) = \begin{pmatrix}1\\1\\1\end{pmatrix}$$

Sample covariance:
$$\Sigma = \tfrac{1}{3}\bigl[(X_1 - \mu)(X_1 - \mu)^T + (X_2 - \mu)(X_2 - \mu)^T + (X_3 - \mu)(X_3 - \mu)^T\bigr]$$
$$= \tfrac{1}{3}\left[\begin{pmatrix}0&0&0\\0&1&0\\0&0&0\end{pmatrix} + \begin{pmatrix}0&0&0\\0&0&0\\0&0&1\end{pmatrix} + \begin{pmatrix}0&0&0\\0&1&-1\\0&-1&1\end{pmatrix}\right] = \tfrac{1}{3}\begin{pmatrix}0&0&0\\0&2&-1\\0&-1&2\end{pmatrix}$$
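The arithmetic of this example can be checked with NumPy:

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 2],
              [1, 2, 0]], dtype=float)   # rows are X1, X2, X3
mu = X.mean(axis=0)                      # -> [1, 1, 1]
diff = X - mu
Sigma = diff.T @ diff / len(X)           # 1/n estimate, matching the slide
print(mu)
print(Sigma)                             # -> (1/3) * [[0,0,0],[0,2,-1],[0,-1,2]]
```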

114

Two-class case

If the two classes have the same covariance matrix, Σk = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2): the quadratic rule becomes
$$(\mu_1 - \mu_2)^T \Sigma^{-1} X \ge c, \qquad\text{i.e.,}\qquad w^T X \ge c \quad\text{where } w = \Sigma^{-1}(\mu_1 - \mu_2).$$

Usually the pooled covariance
$$\Sigma = \frac{n_1 \Sigma_1 + n_2 \Sigma_2}{n}$$
is used.

115

Illustration

(Figure: two classes with means μ1 and μ2 and the linear discriminant between them.)

116

Two-class case

Maximize the signal-to-noise ratio
$$\max_w \frac{w^T S_{between}\, w}{w^T S_{within}\, w}$$
where the between-class separation and within-class cohesion are
$$S_{between} = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T, \qquad S_{within} = \frac{n_1 \Sigma_1 + n_2 \Sigma_2}{n}.$$

The solution is
$$w = S_{within}^{-1}(\mu_1 - \mu_2).$$
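A sketch of this two-class solution (ours), estimating the class means and the pooled within-class covariance from data:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA direction: w = S_within^{-1} (mu1 - mu2).
    X1, X2: arrays of shape (n1, d) and (n2, d), one per class."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    S1 = (X1 - mu1).T @ (X1 - mu1) / n1
    S2 = (X2 - mu2).T @ (X2 - mu2) / n2
    S_within = (n1 * S1 + n2 * S2) / (n1 + n2)
    return np.linalg.solve(S_within, mu1 - mu2)
```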

117

Two-class case (illustration)

(Figure: LDA gives the yellow direction. Projected onto it, two classes that overlap in the original space become separated.)

118

Two-class case (illustration)

(Figure: class means μ1 and μ2, the difference μ2 − μ1, the LDA axis, and the best threshold along it.)

119

Multi-class case

Two approaches:
– Apply Fisher LDA to each "one-versus-rest" class.
– Find multiple directions jointly (the second approach, described on the next slide).

120

Multi-class case

Second approach: similarly, find multiple directions that together form a low-dimensional space.

Within-class matrix:
$$S_w = \frac{1}{n} \sum_{i=1}^{K} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$$

Between-class matrix:
$$S_b = \frac{1}{n} \sum_{k=1}^{K} n_k\, (\mu_k - \mu)(\mu_k - \mu)^T$$

The transformation matrix W that projects the data to be most separable is the matrix that maximizes
$$\max_W \frac{W^T S_b W}{W^T S_w W},$$
which for a matrix W is correctly written as
$$\max_W \operatorname{trace}\bigl((W^T S_w W)^{-1} (W^T S_b W)\bigr).$$

121

Intuition

The goal is to simultaneously maximize the between-class separation and minimize the within-class cohesion.

The solution to
$$\max_W \operatorname{trace}\bigl((W^T S_w W)^{-1} (W^T S_b W)\bigr)$$
is given by the generalized eigenvalue problem
$$S_b\, g = \lambda\, S_w\, g;$$
the generalized eigenvectors are obtained as the eigenvectors of $S_w^{-1} S_b$.

122

Graphic view of the transformation (projection)

(Figure: the training data matrix A of size n × d is multiplied by the transformation matrix W of size d × (K−1) to give the reduced training data L = AW of size n × (K−1).)

123

Graphical view of classification

(Figure: the training data A (n × d) is reduced to L = AG (n × (K−1)) via the transformation matrix G (d × (K−1)); a test data point h (1 × d) is reduced to hG (1 × (K−1)) and classified by finding the nearest neighbor or nearest centroid in the reduced space.)

124

Summary

First applied by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) provides:

Dimension reduction
– It finds linear combinations of the features X = X1, …, Xd with large ratios of between-groups to within-groups sums of squares (the discriminant variables).

Classification
– It predicts the class of an observation X as the class whose mean vector is closest to X in terms of the discriminant variables.