Classification: Support Vector Machine
10/10/07
What hyperplane (line) can separate the two classes of data?
But there are many other choices!
Which one is the best?
[Figure: two classes, $y_i = +1$ and $y_i = -1$, separated by a hyperplane $x^T\beta + \beta_0 = 0$; M denotes the margin.]
Optimal separating hyperplane
The best hyperplane is the one that maximizes the margin, M.
[Figure: the maximum-margin hyperplane between the two classes $y_i = \pm 1$, with margin M.]
A hyperplane is the set $\{x : f(x) = x^T\beta + \beta_0 = 0\}$.
Computing the margin width
[Figure: the planes $x^T\beta + \beta_0 = 1$, $x^T\beta + \beta_0 = 0$, and $x^T\beta + \beta_0 = -1$, with a point $x^+$ on the plus plane ($y_i = +1$ side) and a point $x^-$ on the minus plane ($y_i = -1$ side).]
Find $x^+$ and $x^-$ on the “plus” and “minus” planes, so that $x^+ - x^-$ is perpendicular to the separating plane (i.e., parallel to $\beta$).
Then $M = \|x^+ - x^-\|$.
Since $x^{+T}\beta + \beta_0 = 1$ and $x^{-T}\beta + \beta_0 = -1$, subtracting gives $(x^+ - x^-)^T\beta = 2$.
$M = \|x^+ - x^-\| = 2/\|\beta\|$
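The step from $(x^+ - x^-)^T\beta = 2$ to this width can be made explicit; the short derivation below is a sketch in the notation above.

```latex
% x^+ - x^- is parallel to \beta (perpendicular to the planes),
% so x^+ - x^- = \lambda\beta for some scalar \lambda. Then
(x^+ - x^-)^T\beta = \lambda\,\beta^T\beta = \lambda\,\|\beta\|^2 = 2
\quad\Rightarrow\quad \lambda = 2/\|\beta\|^2 ,
\qquad
M = \|x^+ - x^-\| = |\lambda|\,\|\beta\| = 2/\|\beta\| .
```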
The hyperplane is separating if
$y_i(x_i^T\beta + \beta_0) > 0, \quad \forall i.$
The maximizing problem is
$\max_{\beta,\,\beta_0} \; \dfrac{2}{\|\beta\|}$
subject to
$y_i(x_i^T\beta + \beta_0) \geq 1, \quad \forall i.$
[Figure: the margin M; the points that lie exactly on the planes $x^T\beta + \beta_0 = \pm 1$ are called support vectors.]
Optimal separating hyperplane
Rewrite the problem as
$\min_{\beta,\,\beta_0} \; \frac{1}{2}\|\beta\|^2$
subject to
$y_i(x_i^T\beta + \beta_0) \geq 1, \quad \forall i.$
Lagrange function:
$L_P = \frac{1}{2}\|\beta\|^2 - \sum_i \alpha_i\big[y_i(x_i^T\beta + \beta_0) - 1\big]$
To minimize, set the partial derivatives to 0:
$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0.$
Can be solved by quadratic programming.
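To make the quadratic program concrete, here is a minimal sketch (not from the slides) that solves this primal problem directly with a general-purpose solver; the toy data and the use of SciPy's SLSQP method are my own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data: labels y_i in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

d = X.shape[1]

def objective(params):
    beta = params[:d]
    return 0.5 * beta @ beta            # (1/2) ||beta||^2

def margin_constraints(params):
    beta, beta0 = params[:d], params[d]
    return y * (X @ beta + beta0) - 1   # must be >= 0 for every i

res = minimize(objective,
               x0=np.zeros(d + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")

beta, beta0 = res.x[:d], res.x[d]
print("beta =", beta, "beta0 =", beta0, "margin =", 2 / np.linalg.norm(beta))
```

In practice, dedicated SVM solvers work with the dual form discussed later rather than this primal form.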
What is the best hyperplane?
When the two classes are non-separable
Idea: allow some points to lie on the wrong side, but not by much.
[Figure: overlapping classes; points on the wrong side of their margin have positive slacks $\xi_i$.]
Support vector machine
When the two classes are not separable, the problem is slightly modified:
Find
$\min_{\beta,\,\beta_0} \; \frac{1}{2}\|\beta\|^2$
subject to
$y_i(x_i^T\beta + \beta_0) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad \sum_i \xi_i \leq \text{constant}.$
Can be solved using quadratic programming.
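As an illustration (not part of the slides), this soft-margin problem is what standard library implementations of the linear SVM solve; in the common reformulation the cost parameter C plays the role of the bound on $\sum_i \xi_i$. A minimal sketch with scikit-learn, assuming it is available:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping two-class toy data, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(1.5, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C trades off a wide margin against the total slack:
# small C tolerates more points on the wrong side of their margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
print("beta =", beta, "beta0 =", beta0)
print("number of support vectors:", len(clf.support_))
```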
Convert a non-separable case into a separable one by a nonlinear transformation.
[Figure: data that are non-separable in 1D.]
[Figure: the same data after the transformation $h(x) = (x, f(x))$, now separable.]
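A tiny numerical illustration of this idea (my own example, not from the slides): 1-D points with one class in the middle cannot be split by a single threshold, but the explicit map $h(x) = (x, x^2)$ makes them linearly separable in 2-D.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 in the middle, class -1 on both sides.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# No single threshold on x separates the classes...
X1 = x.reshape(-1, 1)
print("1-D accuracy:", SVC(kernel="linear", C=10.0).fit(X1, y).score(X1, y))

# ...but the explicit feature map h(x) = (x, x^2) separates them in 2-D.
H = np.column_stack([x, x ** 2])
print("2-D accuracy:", SVC(kernel="linear", C=10.0).fit(H, y).score(H, y))
```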
Kernel function
• Introduce nonlinear basis functions $h(x)$ and work with the transformed features.
Then the separating function is
$\hat{y} = \mathrm{sign}\big(h(x)^T\beta + \beta_0\big).$
In fact, all you need is the kernel function
$K(x, x') = \langle h(x), h(x')\rangle.$
Common kernels: the $d$-th degree polynomial $K(x, x') = (1 + \langle x, x'\rangle)^d$ and the radial basis kernel $K(x, x') = \exp(-\|x - x'\|^2 / c)$.
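The sketch below (again my own illustration, assuming scikit-learn) shows the kernel trick on the 1-D example above: a degree-2 polynomial kernel separates the data without ever forming $h(x)$, and the same fit can be obtained by passing the kernel matrix directly.

```python
import numpy as np
from sklearn.svm import SVC

# Same 1-D data as above: not linearly separable in the original space.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]).reshape(-1, 1)
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# The kernel trick: work with K(x, x') = (1 + <x, x'>)^2 directly,
# never forming the transformed features h(x).
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(x, y)
print("accuracy with a degree-2 polynomial kernel:", clf.score(x, y))

# Equivalently, the kernel matrix can be supplied explicitly.
K = (1.0 + x @ x.T) ** 2
clf_pre = SVC(kernel="precomputed", C=10.0).fit(K, y)
print("accuracy with a precomputed kernel:", clf_pre.score(K, y))
```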
Applications
Prediction of central nervous system embryonic tumor outcome
• 42 patient samples
• 5 cancer types
• Array contains 6817 genes
• Question: are different tumors types distinguishable from gene expression pattern?
(Pomeroy et al. 2002)
[Figure: gene expressions within a cancer type cluster together (Pomeroy et al. 2002).]
[Figure: PCA based on all genes (Pomeroy et al. 2002).]
[Figure: PCA based on a subset of informational genes (Pomeroy et al. 2002).]
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks (Khan et al. 2001)
• Four different cancer types
• 88 samples
• 6567 genes
• Goal: to predict cancer types from gene expression data
Procedures
• Filter out genes that have low expression values (retain 2308 genes)
• Dimension reduction using PCA; select the top 10 principal components
• 3-fold cross-validation, repeated 1250 times
(Khan et al. 2001)
Artificial Neural Network
(Khan et al. 2001)
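A minimal sketch of this analysis pipeline (not the authors' code), using scikit-learn with synthetic data standing in for the microarray measurements; the filtering threshold and the small MLP used in place of the paper's artificial neural network are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix: 88 samples x 6567 genes,
# four tumor classes (the real data come from Khan et al. 2001).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(88, 6567))
y = rng.integers(0, 4, size=88)

# Step 1: filter out genes with low expression (threshold is illustrative).
keep = X.mean(axis=0) > np.percentile(X.mean(axis=0), 65)
X = X[:, keep]

# Steps 2-3: top-10 PCA, then a small neural network, inside 3-fold CV.
model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000))

scores = []
for train, test in StratifiedKFold(n_splits=3, shuffle=True).split(X, y):
    scores.append(model.fit(X[train], y[train]).score(X[test], y[test]))
print("3-fold CV accuracy:", np.mean(scores))
# The paper repeats the random 3-fold split 1250 times and averages the votes.
```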
Acknowledgement
• Sources of slides:
– Cheng Li
– http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf
– www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt
Aggregating predictors
• Sometimes aggregating several predictors performs better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which may be the same kind of predictor fitted to slightly perturbed training datasets.
• The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
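A minimal sketch of one such aggregation scheme, bagging of classification trees (my own illustration with scikit-learn and synthetic data, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; single trees are unstable, so averaging them helps.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(tree, n_estimators=100, random_state=0)

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```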
AdaBoost
• Step 1: Initialize the observation weights $w_i = 1/N$, $i = 1, \dots, N$.
• Step 2: For $m = 1$ to $M$:
– Fit a classifier $G_m(x)$ to the training data using weights $w_i$
– Compute $\mathrm{err}_m = \dfrac{\sum_{i=1}^N w_i\, I\big(y_i \neq G_m(x_i)\big)}{\sum_{i=1}^N w_i}$
– Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$
– Set $w_i \leftarrow w_i \exp\big[\alpha_m I\big(y_i \neq G_m(x_i)\big)\big]$, $i = 1, \dots, N$ (misclassified observations are given more weight)
• Step 3: Output $G(x) = \mathrm{sign}\big(\sum_{m=1}^M \alpha_m G_m(x)\big)$
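These steps translate directly into code; the following is a compact sketch (my own illustration, not the lecture's code) that uses decision stumps from scikit-learn as the weak classifiers $G_m$.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    # y must be in {-1, +1}.
    N = len(y)
    w = np.full(N, 1.0 / N)              # Step 1: uniform weights
    stumps, alphas = [], []
    for m in range(M):                   # Step 2
        G = DecisionTreeClassifier(max_depth=1)
        G.fit(X, y, sample_weight=w)     # fit weak classifier with weights w_i
        miss = (G.predict(X) != y).astype(float)
        err = np.clip(np.dot(w, miss) / w.sum(), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)     # misclassified obs get more weight
        stumps.append(G)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Step 3: sign of the weighted vote over the weak classifiers.
    agg = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
    return np.sign(agg)
```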
Optimal separating hyperplane
• Substituting, we get the Lagrange (Wolfe) dual function
$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
subject to
$\alpha_i \geq 0, \quad \forall i.$
To complete the steps, see Burges et al.
• If $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$.
These $x_i$'s are called the support vectors.
$\beta = \sum_i \alpha_i y_i x_i$ is determined only by the support vectors.
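As an illustration (not from the slides), this dual is a quadratic program that a generic QP solver can handle; the sketch below assumes the cvxopt package and a tiny separable data set, and reads off the support vectors as the points with $\alpha_i > 0$.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

# Small linearly separable data set, labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
N = len(y)

# Dual: maximize sum(a) - 1/2 a^T Q a with Q_ij = y_i y_j x_i^T x_j,
# i.e. minimize 1/2 a^T Q a - sum(a), subject to a >= 0 and sum(a_i y_i) = 0.
Q = (y[:, None] * X) @ (y[:, None] * X).T
P, q = matrix(Q), matrix(-np.ones(N))
G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # -a <= 0, i.e. a >= 0
A, b = matrix(y.reshape(1, -1)), matrix(0.0)     # sum(a_i y_i) = 0

alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

# Support vectors are the points with alpha_i > 0 (up to numerical tolerance).
sv = alpha > 1e-6
beta = ((alpha * y)[:, None] * X).sum(axis=0)
beta0 = np.mean(y[sv] - X[sv] @ beta)
print("support vectors:", np.where(sv)[0], "beta =", beta, "beta0 =", beta0)
```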
The Lagrange function is
$L_P = \frac{1}{2}\|\beta\|^2 - \sum_i \alpha_i\big[y_i(x_i^T\beta + \beta_0) - 1\big].$
Setting the partial derivatives to 0:
$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0.$
Substituting, we get
$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
subject to
$\alpha_i \geq 0, \qquad \sum_i \alpha_i y_i = 0.$