CSCE555 Bioinformatics
Lecture 15: Classification for Microarray Data
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina, Department of Computer Science and Engineering
2008 www.cse.sc.edu
Outline
◦ Classification problem in microarray data
◦ Classification concepts and algorithms
◦ Evaluation of classification algorithms
◦ Summary
Lab 2.3 3
Example 1: Breast Cancer Prognosis
Bad prognosis: recurrence < 5 yrs. Good prognosis: recurrence > 5 yrs.
Reference: L. van't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Objects: arrays. Feature vectors: gene expression. Predefined classes: clinical outcome.
[Figure: a classification rule is learned from the learning set and applied to a new array, which is assigned a class, e.g., Good Prognosis]
Example 2: Leukemia Subtypes (B-ALL, T-ALL, AML)
Reference: Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Objects: arrays. Feature vectors: gene expression. Predefined classes: tumor type.
[Figure: a classification rule is learned from the learning set and applied to a new array, which is assigned a class, e.g., T-ALL]
Classification/Discrimination
Each object (e.g., arrays or columns) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
Aim: predict Y_new from X_new.

Gene   sample1  sample2  sample3  sample4  sample5  ...  new sample
1        0.46     0.30     0.80     1.51     0.90   ...    0.34
2       -0.10     0.49     0.24     0.06     0.46   ...    0.43
3        0.15     0.74     0.04     0.10     0.20   ...   -0.23
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...   -0.91
5       -0.06     1.06     1.35     1.09    -1.09   ...    1.23
Y      Normal   Normal   Normal   Cancer   Cancer   ...  unknown = Y_new

The labeled columns form X; the new sample's profile is X_new.
Discrimination/Classification
[Figure: objects partitioned into predefined classes 1, 2, …, K]
Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG)
• Aim: predict Y from X.
Toy example: X = feature vector {colour, shape}. For X = {red, square}, what is Y? The classification rule must output a class label, e.g., Y = 2.
KNN: Nearest Neighbor Classifier
Based on a measure of distance between observations (e.g., Euclidean distance or one minus correlation).
The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
◦ find the k observations in the learning set closest to X
◦ predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
The number of neighbors k can be chosen by cross-validation (more on this later).
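The rule above can be sketched in a few lines of plain Python (Euclidean distance, majority vote; the two-gene training profiles below are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training observations."""
    # Distance from x to every observation in the learning set
    dists = [(math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    # Majority vote over the k closest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-gene expression profiles with two classes
train_X = [(0.1, 0.2), (0.0, 0.3), (0.9, 1.0), (1.1, 0.8)]
train_y = ["Normal", "Normal", "Cancer", "Cancer"]
print(knn_predict(train_X, train_y, (1.0, 0.9), k=3))  # query near the Cancer cluster
```

Swapping `math.dist` for one minus correlation gives the other distance mentioned on the slide.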
3-Nearest Neighbors
[Figure: a query point q with its 3 nearest neighbors, 2 of class x and 1 of class o, so q is classified as x]
Limitation of KNN: what is k?
SVM: Support Vector Machines
SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
To discriminate between two classes, given a training dataset:
◦ Map the data to a higher-dimensional space (feature space)
◦ Separate the two classes using an optimal linear separator
Key Ideas of SVM: Margins of Linear Separators
Maximum margin linear classifier
Optimal Hyperplane
[Figure: the maximum-margin hyperplane with margin ρ; the training points lying on the margin boundaries are the support vectors]
Support vectors uniquely characterize the optimal hyperplane.
Finding the Support Vectors
Lagrangian multiplier method for constrained optimization.
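The constrained optimization the slide refers to can be stated explicitly (this is the standard hard-margin formulation, implied but not written on the slide):

```latex
% Maximize the margin rho = 2/||w||, i.e.
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{subject to}\quad y_i\,(w^\top x_i + b) \ \ge\ 1,\qquad i = 1,\dots,n
% Lagrangian:
% L(w,b,\alpha) \;=\; \tfrac{1}{2}\|w\|^2 \;-\; \sum_i \alpha_i\,\bigl[\,y_i(w^\top x_i + b) - 1\,\bigr],
% \qquad \alpha_i \ge 0
```

The training points with nonzero multipliers αi are exactly the support vectors, which is why they alone determine the optimal hyperplane.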
Key Ideas of SVM: Feature Space Mapping
Map the original data to some higher-dimensional feature space where the training set is linearly separable:
Φ: x → φ(x), e.g., (x1, x2) → (x1, x2, x1², x2², x1x2, …)
The "Kernel Trick"
The linear classifier relies on the inner product between vectors: K(xi, xj) = xiᵀxj.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)ᵀφ(xj).
A kernel function is a function that corresponds to an inner product in some expanded feature space.
Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
 = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
 = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
 = φ(xi)ᵀφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
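The identity above is easy to check numerically; a small sketch (the test points are made up):

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2 for 2-dimensional x and z."""
    return (1 + x[0]*z[0] + x[1]*z[1]) ** 2

def phi(x):
    """Explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2)
    return [1.0, x[0]**2, r2*x[0]*x[1], x[1]**2, r2*x[0], r2*x[1]]

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

# Two arbitrary points: the kernel value and the explicit inner product agree
x, z = (0.5, -1.0), (2.0, 0.25)
print(poly_kernel(x, z), dot(phi(x), phi(z)))
```

The point of the trick is that `poly_kernel` never builds the 6-dimensional φ(x), yet computes the same inner product.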
Examples of Kernel Functions
Linear: K(xi, xj) = xiᵀxj
Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)ᵖ
Gaussian (radial-basis function network): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
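The four kernels translate directly into code; a sketch in plain Python (the values of σ, β0, β1, and p below are arbitrary choices, not values from the slide):

```python
import math

def dot(u, v):
    return sum(a*b for a, b in zip(u, v))

def linear(x, z):
    return dot(x, z)

def polynomial(x, z, p=3):
    return (1 + dot(x, z)) ** p

def gaussian(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma**2))

def sigmoid(x, z, beta0=1.0, beta1=-1.0):
    return math.tanh(beta0 * dot(x, z) + beta1)

x, z = (1.0, 0.0), (0.0, 1.0)
print(linear(x, z), polynomial(x, z), gaussian(x, z), sigmoid(x, z))
```

In LibSVM these correspond to kernel types `-t 0` through `-t 3`.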
SVM
Advantages:
◦ maximizes the margin between two classes in the feature space characterized by a kernel function
◦ robust with respect to high input dimension
Disadvantages:
◦ difficult to incorporate background knowledge
◦ sensitive to outliers
Variable/Feature Selection with SVMs
Recursive Feature Elimination (RFE):
◦ Train a linear SVM
◦ Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
◦ Retrain the SVM with the remaining variables and repeat until classification performance starts to degrade
Very successful. Other formulations exist where minimizing the number of variables is folded into the optimization problem.
Similar algorithms exist for non-linear SVMs. These are among the best and most efficient variable selection methods.
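The RFE loop can be sketched as follows. Note the hedge: `feature_weights` below is a crude stand-in (absolute difference of per-class feature means) for the |w_j| of a fitted linear SVM, so the example stays self-contained; a real RFE would take the weights from the trained SVM. The toy data are made up.

```python
def feature_weights(X, y):
    """Stand-in for |w_j| of a trained linear SVM: absolute difference of
    per-class feature means (a crude linear discriminant)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)
    return [abs(mean(pos, j) - mean(neg, j)) for j in range(len(X[0]))]

def rfe(X, y, n_keep):
    """Recursive feature elimination: drop the lowest-weight ~50% of the
    surviving features each round, retraining on what remains."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        w = feature_weights(sub, y)
        ranked = sorted(range(len(active)), key=lambda i: w[i], reverse=True)
        keep = max(n_keep, len(active) // 2)   # remove the lowest ~50%
        active = sorted(active[i] for i in ranked[:keep])
    return active

# Toy data: only feature 0 separates the classes; features 1-3 are noise/constant
X = [[1.0, 0.1, 0.0, 5.0],
     [0.9, 0.2, 0.1, 5.0],
     [-1.0, 0.15, 0.05, 5.0],
     [-1.1, 0.1, 0.0, 5.0]]
y = [1, 1, -1, -1]
print(rfe(X, y, n_keep=1))  # the informative feature survives
```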
Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LibSVM) can handle multi-class classification.
SVMlight and LibSVM are among the earliest implementations of SVM.
Several Matlab toolboxes for SVM are also available.
How to Use SVM to Classify Microarray Data
Prepare the data in the LibSVM format: each line gives the label followed by index:value pairs for the non-zero features:
<label> <index1>:<value1> <index2>:<value2> ...
Usage: svm-train [options] training_set_file [model_file]
Example options: -s 0 -c 10 -t 1 -g 1 -r 1 -d 3
Usage: svm-predict [options] test_file model_file output_file
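A small helper can produce that sparse format from a label list and an expression matrix (a sketch; the two-sample matrix below is made up):

```python
def to_libsvm_lines(labels, X):
    """Format each sample as '<label> <index1>:<value1> ...'.
    Indices start at 1; zero-valued features are omitted, as the
    sparse LibSVM format allows."""
    lines = []
    for label, row in zip(labels, X):
        feats = " ".join(f"{j}:{v}" for j, v in enumerate(row, start=1) if v != 0)
        lines.append(f"{label} {feats}".rstrip())
    return lines

# Two samples (labels +1/-1), three genes each
for line in to_libsvm_lines([1, -1], [[0.46, 0.0, 0.8], [-0.1, 0.49, 0.0]]):
    print(line)
```

Writing the returned lines to a file gives a training_set_file that svm-train accepts directly.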
Decision tree classifiers
[Figure: a tree that first tests gene 1 (is Mi1 < −0.67?) and then gene 2 (is Mi2 > 0.18?), routing each sample down yes/no branches to leaf classes 0, 2, and 1]
Advantage: transparent rules, easy to interpret.

Training data layout:
G1     0.1   -0.2    0.3
G2     0.3    0.4    0.4
G3      …      …      …
Class   0      1      0
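One plausible reading of the tree in the figure, as code; the exact routing of yes/no branches to the leaf classes 0, 2, 1 is inferred from the slide and may not match the original figure:

```python
def classify(m1, m2):
    """Decision tree from the slide (structure assumed): split on gene 1's
    expression M1 at -0.67, then on gene 2's expression M2 at 0.18."""
    if m1 < -0.67:      # gene 1 split
        return 0
    if m2 > 0.18:       # gene 2 split
        return 2
    return 1

print(classify(-1.0, 0.0), classify(0.0, 0.5), classify(0.0, 0.0))
```

Each root-to-leaf path is a transparent rule, e.g., "M1 ≥ −0.67 and M2 > 0.18 → class 2", which is the interpretability advantage the slide mentions.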
Ensemble classifiers
Training set: X1, X2, …, X100
Resample 1 → Classifier 1
Resample 2 → Classifier 2
…
Resample 499 → Classifier 499
Resample 500 → Classifier 500
The individual classifiers are then combined into an aggregate classifier.
Examples: bagging, boosting, random forest
Aggregating classifiers: Bagging
Training set (arrays): X1, X2, …, X100
Each bootstrap resample X*1, X*2, …, X*100 is used to grow one tree (trees 1 through 500).
For a test sample, let the trees vote: e.g., tree 1 → class 1, tree 2 → class 2, …, tree 499 → class 1, tree 500 → class 1, giving 90% class 1 and 10% class 2.
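The diagram's scheme can be sketched in plain Python. A 1-nearest-neighbor rule stands in for the slide's trees, 25 resamples stand in for 500 (to keep the sketch fast), and the data are made up:

```python
import math
import random
from collections import Counter

def nn1_predict(train, x):
    """1-nearest-neighbor base classifier (standing in for the trees on the slide)."""
    return min(train, key=lambda t: math.dist(t[0], x))[1]

def bagging_predict(data, x, n_resamples=25, seed=0):
    """Bagging: train one classifier per bootstrap resample, then majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]  # draw n cases with replacement
        votes[nn1_predict(resample, x)] += 1
    return votes.most_common(1)[0][0]

# Toy two-class data: class 1 near the origin, class 2 near (1, 1)
data = [((0.1, 0.2), 1), ((0.2, 0.1), 1), ((1.0, 0.9), 2), ((0.9, 1.1), 2)]
print(bagging_predict(data, (1.0, 1.0)))  # test sample near the class-2 cluster
```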
Weka Data Mining Toolbox
The Weka package (Java) includes:
◦ All previous classifiers
◦ Neural networks
◦ Projection pursuit
◦ Bayesian belief networks
◦ And more
Feature Selection in Classification
What: select a subset of features.
Why:
◦ Leads to better classification performance by removing variables that are noise with respect to the outcome
◦ May provide useful insights into the biology
◦ Can eventually lead to diagnostic tests (e.g., a "breast cancer chip")
Classifier Performance Assessment
Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large independent population-based collection of samples is available at the time of the initial classifier-building phase.
One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
Performance can be assessed by:
◦ Cross-validation
◦ A test set
◦ Independent testing on a future dataset
Diagram of performance assessment
[Figure: a classifier built on the training set can be assessed on the training set itself (resubstitution estimation) or on an independent test set (test set estimation)]
Diagram of performance assessment (continued)
[Figure: in addition to resubstitution and test set estimation, the training set can be repeatedly split into a (CV) learning set and a (CV) test set, giving the cross-validation estimate]
Performance assessment
V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build classifiers by leaving one subset out at a time; compute the test error rate on the left-out subset, and average over the V subsets.
◦ Bias-variance tradeoff: smaller V can give larger bias but smaller variance
◦ Computationally intensive
Leave-one-out cross-validation (LOOCV) is the special case V = n. It works well for stable classifiers (k-NN, LDA, SVM).
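The V-fold procedure described above can be sketched with a 1-nearest-neighbor rule standing in for the classifier under evaluation (toy data, and a deterministic rather than random split, to keep the sketch simple):

```python
import math

def nn1_predict(train, x):
    """1-nearest-neighbor rule, standing in for any classifier."""
    return min(train, key=lambda t: math.dist(t[0], x))[1]

def cv_error(data, V):
    """V-fold CV: hold each fold out in turn, train on the rest,
    and average the test error over all cases. V = len(data) gives LOOCV."""
    folds = [data[i::V] for i in range(V)]  # deterministic split for this sketch
    errors = 0
    for i, test_fold in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        errors += sum(nn1_predict(train, x) != y for x, y in test_fold)
    return errors / len(data)

# Toy data: two well-separated classes, so every held-out case is predicted correctly
data = [((0.1, 0.2), 1), ((0.2, 0.1), 1), ((0.15, 0.25), 1),
        ((1.0, 0.9), 2), ((0.9, 1.1), 2), ((1.1, 1.0), 2)]
print(cv_error(data, V=3), cv_error(data, V=len(data)))
```

A production version would shuffle the cases before splitting, as the slide's "randomly divided" specifies.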
Supplementary slide
Which to use depends mostly on sample size
If the sample is large enough, split it into test and training groups.
If the sample is barely adequate for either testing or training, use leave-one-out.
In between, consider V-fold CV. This method can give more accurate estimates than leave-one-out, but reduces the size of the training set.
Summary
◦ Microarray classification task
◦ Classifiers: KNN, SVM, decision trees; tools: Weka, LibSVM
◦ Classifier evaluation and cross-validation
Acknowledgements
Terry Speed, Jean Yee Hwa Yang, Jane Fridlyand