Transcript
Page 1: CSCE555 Bioinformatics

CSCE555 Bioinformatics
Lecture 15: Classification for Microarray Data

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering, 2008
www.cse.sc.edu

Page 2: CSCE555 Bioinformatics

Outline
•Classification problem in microarray data
•Classification concepts and algorithms
•Evaluation of classification algorithms
•Summary


Page 3: CSCE555 Bioinformatics

[Slide figure: breast cancer prognosis example]
Objects: arrays
Feature vectors: gene expression
Predefined classes: clinical outcome (bad prognosis: recurrence < 5 yrs; good prognosis: recurrence > 5 yrs)
A learning set of arrays with known outcomes is used to build a classification rule, which then assigns a new array to a class (e.g., good prognosis, metastasis > 5 yrs).
Reference: L. van’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

Page 4: CSCE555 Bioinformatics

[Slide figure: leukemia classification example]
Objects: arrays
Feature vectors: gene expression
Predefined classes: tumor type (B-ALL, T-ALL, AML)
A learning set of arrays with known tumor types is used to build a classification rule, which then assigns a new array to a class (e.g., T-ALL).
Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

Page 5: CSCE555 Bioinformatics

Classification/Discrimination

Each object (e.g., an array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).

Aim: predict Y_new from X_new.

         sample1  sample2  sample3  sample4  sample5   ...   new sample
gene 1     0.46     0.30     0.80     1.51     0.90    ...     0.34
gene 2    -0.10     0.49     0.24     0.06     0.46    ...     0.43
gene 3     0.15     0.74     0.04     0.10     0.20    ...    -0.23
gene 4    -0.45    -1.03    -0.79    -0.56    -0.32    ...    -0.91
gene 5    -0.06     1.06     1.35     1.09    -1.09    ...     1.23
Y         Normal   Normal   Normal   Cancer   Cancer         unknown = Y_new
(the known samples' columns form X; the new sample's column is X_new)

Page 6: CSCE555 Bioinformatics

Discrimination/Classification


Page 7: CSCE555 Bioinformatics


Basic principles of discrimination
[Slide figure: objects grouped into predefined classes {1, 2, …, K}]
•Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)

Aim: predict Y from X.

Example: X = feature vector {colour, shape}, Y = class label (e.g., Y = 2). Given a new object with X = {red, square}, what is Y? A classification rule supplies the answer.

Page 8: CSCE555 Bioinformatics


KNN: Nearest neighbor classifier
•Based on a measure of distance between observations (e.g., Euclidean distance or one minus correlation).
•The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
◦ find the k observations in the learning set closest to X
◦ predict the class of X by majority vote, i.e., choose the class that is most common among those k observations (a code sketch of the rule follows below).
•The number of neighbors k can be chosen by cross-validation (more on this later).
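
A minimal sketch of the k-NN rule in Python/NumPy. It assumes rows are samples (arrays) and columns are genes; all data values and names here are illustrative, not taken from the lecture.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new array to every training array
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]            # indices of the k closest observations
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]          # majority vote

# Illustrative data: 4 arrays x 3 genes with known clinical outcomes
X_train = np.array([[0.46, -0.10, 0.15],
                    [0.30,  0.49, 0.74],
                    [1.51,  0.06, 0.10],
                    [0.90,  0.46, 0.20]])
y_train = np.array(["Normal", "Normal", "Cancer", "Cancer"])
x_new   = np.array([0.34, 0.43, -0.23])
print(knn_predict(X_train, y_train, x_new, k=3))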

Page 9: CSCE555 Bioinformatics


3-Nearest Neighbors

[Slide figure: a query point q and its 3 nearest neighbors (2 x's, 1 o); the majority vote assigns q to class x.]

Page 10: CSCE555 Bioinformatics

Limitation of KNN: what is K?

Page 11: CSCE555 Bioinformatics

SVM: Support Vector Machines
•SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
•In order to discriminate between two classes, given a training dataset:
◦ Map the data to a higher-dimensional space (feature space)
◦ Separate the two classes using an optimal linear separator


Page 12: CSCE555 Bioinformatics


Key Ideas of SVM: Margins of Linear Separators

Maximum margin linear classifier

Page 13: CSCE555 Bioinformatics


Optimal hyperplane

[Slide figure: the optimal hyper-plane, the margin ρ, and the support vectors lying on the margin.]
Support vectors uniquely characterize the optimal hyper-plane.

Page 14: CSCE555 Bioinformatics

Finding the Support Vectors

Lagrangian multiplier method for constrained optimization.

Page 15: CSCE555 Bioinformatics


Key Ideas of SVM: Feature Space Mapping
•Map the original data to some higher-dimensional feature space where the training set is linearly separable:
Φ: x → φ(x)
e.g., (x1, x2) → (x1, x2, x1^2, x2^2, x1·x2, …)

Page 16: CSCE555 Bioinformatics

The “Kernel Trick”
•The linear classifier relies on inner product between vectors: K(xi, xj) = xi^T xj
•If every datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)^T φ(xj)
•A kernel function is some function that corresponds to an inner product in some expanded feature space.
•Example: 2-dimensional vectors x = [x1, x2]; let K(xi, xj) = (1 + xi^T xj)^2.
Need to show that K(xi, xj) = φ(xi)^T φ(xj):
K(xi, xj) = (1 + xi^T xj)^2
          = 1 + xi1^2 xj1^2 + 2 xi1 xj1 xi2 xj2 + xi2^2 xj2^2 + 2 xi1 xj1 + 2 xi2 xj2
          = [1, xi1^2, √2 xi1 xi2, xi2^2, √2 xi1, √2 xi2]^T [1, xj1^2, √2 xj1 xj2, xj2^2, √2 xj1, √2 xj2]
          = φ(xi)^T φ(xj), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2]
(This identity is checked numerically in the sketch below.)
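
A small numerical check of the identity above (NumPy; the two test vectors are arbitrary illustrative values):

import numpy as np

def phi(x):
    # phi(x) = [1, x1^2, sqrt(2)*x1*x2, x2^2, sqrt(2)*x1, sqrt(2)*x2]
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, x1**2, s*x1*x2, x2**2, s*x1, s*x2])

xi = np.array([0.5, -1.2])
xj = np.array([2.0, 0.3])
lhs = (1 + xi @ xj) ** 2      # kernel evaluated in the original 2-D space
rhs = phi(xi) @ phi(xj)       # explicit inner product in the 6-D feature space
print(np.isclose(lhs, rhs))   # True: K(xi, xj) = phi(xi)^T phi(xj)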


Page 17: CSCE555 Bioinformatics

Examples of Kernel Functions
•Linear: K(xi, xj) = xi^T xj
•Polynomial of power p: K(xi, xj) = (1 + xi^T xj)^p
•Gaussian (radial-basis function network): K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2))
•Sigmoid: K(xi, xj) = tanh(β0 xi^T xj + β1)
(One-line implementations of these kernels are sketched below.)
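
For concreteness, the four kernels written as small NumPy functions (the parameter defaults are illustrative, not prescribed by the lecture):

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                            # xi^T xj

def polynomial_kernel(xi, xj, p=3):
    return (1 + xi @ xj) ** p                                 # (1 + xi^T xj)^p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma**2))   # exp(-||xi - xj||^2 / (2 sigma^2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)                 # tanh(beta0 xi^T xj + beta1)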

Page 18: CSCE555 Bioinformatics


SVM
Advantages:
◦ maximize the margin between two classes in the feature space characterized by a kernel function
◦ are robust with respect to high input dimension
Disadvantages:
◦ difficult to incorporate background knowledge
◦ sensitive to outliers

Page 19: CSCE555 Bioinformatics


Variable/Feature Selection with SVMs
•Recursive Feature Elimination (sketched in code below):
◦ Train a linear SVM
◦ Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
◦ Retrain the SVM with the remaining variables and repeat until classification performance is reduced
•Very successful
•Other formulations exist where minimizing the number of variables is folded into the optimization problem
•Similar algorithms exist for non-linear SVMs
•Among the best and most efficient variable selection methods
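
A hedged sketch of recursive feature elimination with a linear SVM using scikit-learn (assuming that package is available; X_train is a hypothetical samples-by-genes matrix and y_train the class labels; the parameter values are illustrative):

from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

svm = LinearSVC(C=1.0, max_iter=10000)
# step=0.5: at each round, drop the 50% of remaining genes with the smallest |weight|
rfe = RFE(estimator=svm, n_features_to_select=20, step=0.5)
rfe.fit(X_train, y_train)
selected_genes = rfe.get_support(indices=True)   # indices of the genes kept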

Page 20: CSCE555 Bioinformatics


Software
•A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
•Some implementations (such as LIBSVM) can handle multi-class classification
•SVMLight and LibSVM are among the earliest implementations of SVM
•Several Matlab toolboxes for SVM are also available

Page 21: CSCE555 Bioinformatics

How to Use SVM to Classify Microarray Data
•Prepare the data format for LibSVM (a conversion sketch follows below):
<label> <index1>:<value1> <index2>:<value2> ...
(label = class label; index/value = index and value of each non-zero feature)
•Usage: svm-train [options] training_set_file [model_file]
Examples of options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
•Usage: svm-predict [options] test_file model_file output_file
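
A small Python sketch of writing an expression matrix in that sparse format (the matrix X_train, numeric labels y_train, and file names are hypothetical):

def write_libsvm(path, X, y):
    # One line per array: "<label> <index1>:<value1> <index2>:<value2> ..."
    with open(path, "w") as f:
        for label, row in zip(y, X):
            feats = " ".join(f"{j + 1}:{v}" for j, v in enumerate(row) if v != 0)
            f.write(f"{label} {feats}\n")

write_libsvm("train.txt", X_train, y_train)
# then, from the shell:  svm-train -s 0 -c 10 -t 1 train.txt model
#                        svm-predict test.txt model output.txt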

Page 22: CSCE555 Bioinformatics


Decision tree classifiers
[Slide figure: a two-level tree that first splits on Gene 1 (Mi1 < -0.67, yes/no) and then on Gene 2 (Mi2 > 0.18, yes/no), with leaves assigning classes 0, 1, and 2; below it, a small table of expression values (G1, G2, G3, …) with class labels 0, 1, 0.]
Advantage: transparent rules, easy to interpret (see the rule sketch below).
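
Read as a rule, such a tree is just a pair of nested thresholds. The sketch below uses the thresholds from the slide, but the assignment of classes to branches is an assumption made only for illustration:

def classify(mi1, mi2):
    # Split on Gene 1 first, then on Gene 2 (class-to-branch mapping is illustrative)
    if mi1 < -0.67:
        return 0
    elif mi2 > 0.18:
        return 2
    else:
        return 1

print(classify(0.1, 0.3))   # expression of Gene 1 and Gene 2 for one sample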

Page 23: CSCE555 Bioinformatics


Ensemble classifiers
[Slide figure: the training set X1, X2, …, X100 is resampled many times (Resample 1 through Resample 500); a classifier is built on each resample (Classifier 1 through Classifier 500) and the classifiers are combined into an aggregate classifier.]
Examples: Bagging, Boosting, Random Forest

Page 24: CSCE555 Bioinformatics


Aggregating classifiers: Bagging
[Slide figure: the training set of arrays X1, X2, …, X100 is resampled 500 times (each resample X*1, X*2, …, X*100); a tree is grown on each resample; a test sample is classified by every tree and the trees vote, e.g., 90% Class 1, 10% Class 2.]
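
A sketch of this bagging procedure (scikit-learn decision trees assumed; X_train, y_train, and x_new are hypothetical; 500 matches the number of resamples on the slide):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = len(X_train)
trees = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)              # bootstrap resample of the arrays
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(x_new.reshape(1, -1))[0] for t in trees])   # each tree votes
labels, counts = np.unique(votes, return_counts=True)
print(dict(zip(labels, counts / len(trees))))     # e.g. 90% Class 1, 10% Class 2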

Page 25: CSCE555 Bioinformatics

Weka Data Mining Toolbox

The Weka package (Java) includes:

◦All previous classifiers

◦Neural networks

◦Projection pursuit

◦Bayesian belief networks

◦And More


Page 26: CSCE555 Bioinformatics


Feature Selection in Classification
•What: select a subset of features
•Why:
◦Leads to better classification performance by removing variables that are noise with respect to the outcome
◦May provide useful insights into the biology
◦Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”)

Page 27: CSCE555 Bioinformatics

Classifier Performance Assessment
•Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the time of the initial classifier-building phase.
•One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
•Performance of the classifier can be assessed by:
◦ Cross-validation
◦ Test set
◦ Independent testing on a future dataset


Page 28: CSCE555 Bioinformatics

Diagram of performance assessment
[Slide figure: a training set yields a classifier; assessing the classifier on the same training set gives the resubstitution estimate, while assessing it on an independent test set gives the test-set estimate.]

Page 29: CSCE555 Bioinformatics

Diagram of performance assessment
[Slide figure: as above, plus cross-validation: the training set is repeatedly split into a (CV) learning set and a (CV) test set; a classifier built on each learning set is assessed on the corresponding CV test set, giving the cross-validation estimate alongside the resubstitution and test-set estimates.]

Page 30: CSCE555 Bioinformatics

Performance assessment
•V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers by leaving one subset out; compute the test error rate on the left-out subset; average over the V folds (sketched in code below).
◦ Bias-variance tradeoff: smaller V can give larger bias but smaller variance
◦ Computationally intensive
•Leave-one-out cross-validation (LOOCV): the special case V = n. Works well for stable classifiers (k-NN, LDA, SVM).
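
A sketch of V-fold CV error estimation (scikit-learn assumed; V, the classifier, and the data X, y are illustrative):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

V = 5
folds = KFold(n_splits=V, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in folds.split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))   # error on the left-out fold
print(np.mean(errors))   # averaged CV error rate; setting V = len(X) gives LOOCV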


Supplementary slide

Page 31: CSCE555 Bioinformatics

Which to use depends mostly on sample size
•If the sample is large enough, split it into test and training groups.
•If the sample is barely adequate for either testing or training, use leave-one-out.
•In between, consider V-fold. This method can give more accurate estimates than leave-one-out, but reduces the size of the training set.

Page 32: CSCE555 Bioinformatics

Summary
•Microarray classification task
•Classifiers and tools: KNN, SVM, Decision Tree, Weka, LibSVM
•Classifier evaluation, cross-validation

Page 33: CSCE555 Bioinformatics

Acknowledgement
Terry Speed
Jean Yee Hwa Yang
Jane Fridlyand

