Get Rich and Cure Cancer with Support Vector Machines


+Get Rich and Cure Cancer with Support Vector Machines

(Your Summer Projects)

+Kernel Trick

https://www.youtube.com/watch?v=3liCbRZPrZA

+This is achieved with a polynomial kernel

Feature map:

Kernel:
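As an illustrative sketch (one standard choice, not necessarily the exact formulas used on the slide), a degree-2 polynomial kernel on $x, y \in \mathbb{R}^2$ can be written as

Feature map: $\phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$

Kernel: $K(x, y) = \langle \phi(x), \phi(y) \rangle = (x^\top y)^2$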

+Optimization of transformed problem: Only kernel matters

Dual Lagrangian for transformed problem:

Optimal weight vector:

Thus, optimal hyperplane:
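As a sketch, the standard hard-margin formulas in the transformed space (assuming labels $y_i \in \{\pm 1\}$ and Lagrange multipliers $\alpha_i$; the slide's own notation may differ):

Dual Lagrangian: $L_D(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$, maximized subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Optimal weight vector: $w = \sum_i \alpha_i y_i\, \phi(x_i)$

Optimal hyperplane: $f(x) = \operatorname{sign}\!\left(\sum_i \alpha_i y_i\, K(x_i, x) + b\right)$, which depends on $\phi$ only through the kernel $K$.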

+Kernel Trick

We can choose the kernel without first defining a feature map.

How to get a feature map from a kernel?

Define

i.e. map vectors in the original feature space to functions.

Inner product on transformed space:
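A sketch of the standard reproducing-kernel construction that matches this description (the slide's notation is assumed):

Define $\Phi(x) = K(\cdot, x)$, i.e. map each vector $x$ in the original feature space to the function $K(\cdot, x)$.

Inner product on the transformed space: $\langle \Phi(x), \Phi(y) \rangle = \langle K(\cdot, x), K(\cdot, y) \rangle = K(x, y)$.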

+Get rich off of support vectors

+Making 5-day forecasts of financial futures

Given data on the returns for 5 days

Predict the return on the next day

To achieve this, we need to figure out which 5-day stretches tend to predict good returns on the 6th day, and which predict not-so-good returns

A training data set is used for this purpose

+Making 5-day forecasts of financial futures

Day 1   Day 2   Day 3   Day 4   Day 5   Day 6
x11     x12     x13     x14     x15     y1
x21     x22     x23     x24     x25     y2
x31     x32     x33     x34     x35     y3
x41     x42     x43     x44     x45     y4
…       …       …       …       …       …

The five daily returns form a 5-dimensional feature space; the return on the 6th day is the class label for each data point.

The routine learns how to classify 5-day-return data points by working with a training data set covering 500 days. It constructs a dividing hypersurface and uses it to decide what the 6th-day return should be for new data points.
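As a minimal sketch of this setup (not the R implementation linked on the next slide), one could train scikit-learn's SVC on 5-day windows of synthetic returns; the data, labels, and parameters here are illustrative assumptions:

```python
# Sketch: classify 5-day return windows as predicting a good or bad 6th day.
# Synthetic data and all parameter choices are assumptions for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=505)          # ~500 days of daily returns

# Build (5-day window, 6th-day label) pairs.
X = np.array([returns[i:i + 5] for i in range(len(returns) - 5)])
y = (returns[5:] > 0).astype(int)                # 1 = "good" 6th-day return

clf = SVC(kernel="rbf", C=1.0, gamma="scale")    # kernelized SVM classifier
clf.fit(X[:-50], y[:-50])                        # train on all but the last 50 windows
print("held-out accuracy:", clf.score(X[-50:], y[-50:]))
```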

+Good results – you can try it yourself!

Complete with R code: http://www.r-bloggers.com/trading-with-support-vector-machines-svm/

+Another example: gene expression in normal and cancerous tissue

Gene = unit of heredity

Human genome contains about 21,000 genes

Public domain image from Wikipedia

+Another example: gene expression in normal and cancerous tissue

DNA is transcribed into RNA, which is translated into proteins

This is the process whereby the “genetic code” is made manifest as biological characteristics (genotype gives rise to phenotype)

Wikimedia Commons image by Madeleine Price Ball

+Big question: Which genes are responsible for which outcomes?

In various tissues (e.g. tumor versus normal), which genes are active, hyperactive, and silent?

Can use DNA microarrays to measure gene expression levels.

+DNA Microarray

https://www.youtube.com/watch?v=_6ZMEZK-alM

Source: National Human Genome Research Institute

+Using support vector machines to determine which genes are important for cancer classification

+Data

Data points: Patients

Features: Gene expression coefficients (activity level of a given gene)

Feature space will have a huge number of dimensions! Need a way to reduce.

Could examine all possible subspaces of the feature space, but if the dimension N of the feature space represents thousands of genes, the number of n-dimensional subspaces (one per subset of n genes) is $\binom{N}{n}$.

Too large for practical examination of each subspace: even for pairs of genes ($n = 2$, $N = 21{,}000$) there are already about $2.2 \times 10^8$ subspaces.

+Generate ranking of features

A ranking of features allows us to make a nested sequence of subspaces of the feature space F, e.g. $F_1 \subset F_2 \subset \cdots \subset F$,

and then determine the optimum subspace to work with

One possibility for ranking: work with each gene individually and compute the correlation coefficient of its expression level with the class label (i.e. the classification of the tissue into tumor vs. normal, or into two different types of cancer).

Note: ranking by correlation coefficient assumes all the features are independent of one another.
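A minimal sketch of this correlation-based ranking, assuming a synthetic patients-by-genes expression matrix and binary labels (names and data are illustrative):

```python
# Rank genes by the absolute correlation of their expression level with the label.
# Synthetic data; in practice X would be the patients-by-genes expression matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))                  # 60 patients, 1000 genes
y = rng.integers(0, 2, size=60)                  # 0 = normal, 1 = tumor (toy labels)

# Pearson correlation of each gene (column) with the labels.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-np.abs(corr))              # most correlated genes first
print("top 10 genes by |correlation|:", ranking[:10])
```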

+Generate ranking of features

Another possible way to generate a ranking of features: sensitivity analysis.

Have training data set, already classified into two classes (cancerous v. non, or cancer type 1 v. cancer type 2)

Construct a cost function to estimate error in classification

Sensitivity of cost function to removal of a feature measures the importance of that feature and allows the construction of a ranking.
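For example, with a linear SVM one common choice (as in the SVM-RFE method of Guyon et al.) is the cost $J = \tfrac{1}{2}\lVert w \rVert^2$; removing feature $i$ changes the cost by approximately $\Delta J(i) \approx \tfrac{1}{2} w_i^2$, so ranking by sensitivity reduces to ranking by $w_i^2$, which motivates the recursive procedure on the next slides.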

+Ranking by Support Vector Machines: Recursive Feature Elimination

Idea of how to use SVM to identify important features: Consider a cartoon scenario.

(Figure: two-dimensional data with axes x1 and x2, where the separating boundary does not depend on x1.)

Indicates that the x1 direction is completely superfluous for classification.

+Ranking by Support Vector Machines

This suggests the following recursive algorithm for ranking features:

Find weight vector, using all features

Identify the least important feature to be the one with the smallest (in absolute value) component of the weight vector

List that feature as least important and eliminate it from the data

Iterate the procedure, with the least important feature thrown out.

End result: Ranked list of features!
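A minimal sketch of this recursive ranking with scikit-learn's linear SVM on synthetic data; the data and parameter choices are assumptions, not the original routine (scikit-learn also ships a ready-made version as sklearn.feature_selection.RFE):

```python
# Sketch of SVM Recursive Feature Elimination (SVM-RFE) as described above.
# Synthetic "patients x genes" data; all names and parameters are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # 100 patients, 20 genes
y = (X[:, 3] - 2 * X[:, 7] > 0).astype(int)      # toy labels driven by genes 3 and 7

remaining = list(range(X.shape[1]))              # features still in play
eliminated = []                                  # filled from least to most important

while remaining:
    clf = LinearSVC(C=1.0, dual=False).fit(X[:, remaining], y)
    w = clf.coef_.ravel()                        # weight vector over remaining features
    least = int(np.argmin(np.abs(w)))            # smallest |w_i| = least important
    eliminated.append(remaining.pop(least))      # throw it out and record it

ranking = eliminated[::-1]                       # most important feature first
print("genes ranked by importance:", ranking)
```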

+Try this at home!

Data is available online!

http://www.broadinstitute.org/software/cprg/?q=node/55

Classify two types of leukemia.