Support Vector Machines for Classification of Flow Data
Funded by SBIR Grant # R43 RR024094-01A1
FlowCap 2010
John Quinn, Ph.D.
Our Objective
• Demonstrate that supervised training algorithms can effectively replicate user-created gates
– Very useful for high-throughput settings
– Can increase robustness
• We believe this will be the first application in which algorithmic gate placement becomes the norm.
Selected Algorithm
• Support Vector Machine (SVM)
– Radial kernel
• Supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate classes with the maximum distance between classes
– With non-linear mapping, data that is not linearly separable can be classified
SVM Operation
Optimization:
• Determine which elements of the training data mark the boundary of maximum distance between two classes
[Figure: Class 1 and Class 2, with the support vectors marking the margin and D the maximum separation]
SVM Operation
• Optimization problem
For data:
A hyperplane that separates any two classes can be defined as:
For c_i = 1
For c_i = -1
Knowing that the data points should be outside of the margin, we can impose the constraint:
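The formulas on this slide are not present in the text. A minimal reconstruction, assuming the standard hard-margin formulation with the sign convention w·x - b = 0 (an assumption; the slide's own convention is not recoverable), for n training points of dimension p:

```latex
\begin{align*}
\mathcal{D} &= \{(\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n} && \text{(data)} \\
\mathbf{w}\cdot\mathbf{x} - b &= 0 && \text{(separating hyperplane)} \\
\mathbf{w}\cdot\mathbf{x}_i - b &\ge 1 && \text{for } c_i = 1 \\
\mathbf{w}\cdot\mathbf{x}_i - b &\le -1 && \text{for } c_i = -1 \\
c_i(\mathbf{w}\cdot\mathbf{x}_i - b) &\ge 1 && \text{for all } i \text{ (combined constraint)}
\end{align*}
```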
SVM Operation
We know that the support vectors will have a perpendicular distance from the hyperplane of:
and
The distance between SVs can then be expressed as:
So optimization is the minimization of
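The expressions on this slide are missing from the text. As a sketch, assuming the standard hard-margin convention in which the support vectors lie on the two hyperplanes w·x - b = 1 and w·x - b = -1 (an assumption, not confirmed by the slide), the distances and the resulting objective can be written as:

```latex
\begin{align*}
\text{distance of a support vector from the hyperplane:} \quad & \frac{|\mathbf{w}\cdot\mathbf{x}_i - b|}{\lVert\mathbf{w}\rVert} = \frac{1}{\lVert\mathbf{w}\rVert} \\
\text{distance between the two margins:} \quad & D = \frac{2}{\lVert\mathbf{w}\rVert} \\
\text{maximizing } D \text{ is equivalent to:} \quad & \min_{\mathbf{w},\, b} \lVert\mathbf{w}\rVert
\quad \text{subject to } c_i(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1
\end{align*}
```

In practice the objective is often written as ½‖w‖², which has the same minimizer and is easier to differentiate.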
SVM Operation
We then use the inequality,
as a constraint to fix a critical point and use Lagrange multipliers α_i to express w as a linear combination of the training vectors:
The support vectors, N_SV in number, are then the x_i associated with non-zero Lagrange multipliers
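The equations referred to on this slide are missing from the text. A sketch of the usual Lagrangian step, assuming the constraint c_i(w·x_i - b) ≥ 1 and the objective ½‖w‖² (both are assumptions taken from the standard derivation):

```latex
\begin{align*}
L(\mathbf{w}, b, \alpha) &= \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ c_i(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \right], \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial \mathbf{w}} = 0 \ &\Rightarrow\ \mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i
\end{align*}
```

Every multiplier is non-negative by construction; only the training points with α_i > 0 contribute to w, and those are the support vectors.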
SVM Operation
Once w is known, and the support vectors have been identified, b can be solved as:
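The formula for b is missing from the text. Assuming the convention w·x - b = 0, each support vector satisfies w·x_i - b = c_i, and a common choice (an assumption here, not confirmed by the slide) is to average over all N_SV support vectors:

```latex
b = \frac{1}{N_{SV}} \sum_{i=1}^{N_{SV}} \left( \mathbf{w}\cdot\mathbf{x}_i - c_i \right)
```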
If there are more than two classes, the operation remains the same but the hyperplanes are determined either as one versus all or pairwise
• We chose a one versus all format
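The one versus all format can be illustrated with a short Python sketch. It assumes each class has its own binary classifier that returns a signed decision value, and that the predicted class is the one with the largest value; the scorer functions and cluster names below are hypothetical stand-ins, not the trained SVMs from this work.

```python
def one_vs_all_predict(x, scorers):
    """Return the label whose binary classifier gives the largest score.

    `scorers` maps a class label to a function returning a signed decision
    value (positive means "this class", negative means "any other class").
    """
    return max(scorers, key=lambda label: scorers[label](x))


# Hypothetical linear scorers standing in for trained binary SVMs.
scorers = {
    "cluster_1": lambda x: x[0] - 0.5,
    "cluster_2": lambda x: -x[0] - 0.5,
    "cluster_3": lambda x: x[1] - 0.8,
}

label = one_vs_all_predict([0.9, 0.1], scorers)
```

With k classes this needs k binary classifiers, whereas the pairwise alternative needs k(k-1)/2.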
SVM Operation
• Data not linearly separable? Map it to a space where it is!
– We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping
[Figure: Input Space and Mapped Space]
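A Gaussian mapping is usually applied through a Gaussian (radial basis function) kernel, which measures similarity between two events without computing the mapped coordinates explicitly. The sketch below assumes the common form exp(-gamma * ||x - y||^2); the `gamma` value is an illustrative assumption, since the kernel width used in this work is not stated.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (radial basis function) kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Events that are close in the input space give a kernel value near 1;
# distant events give a smaller value that decays toward 0.
near = rbf_kernel([0.1, 0.2], [0.1, 0.25])
far = rbf_kernel([0.1, 0.2], [0.9, -0.8])
```

A larger `gamma` makes the similarity fall off faster with distance, which produces tighter decision boundaries around the training events.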
Why use an SVM?
• SVMs are deterministic
• Find the global maximum and not a local maximum
– If the training data are representative of the real data, you cannot do better.
• SVMs are fast
– They solve a maximization problem, as opposed to doing an iterative fitting
Preprocessing
• To prepare the training data, we:
– Normalized the data to a range of -1 to 1
– Identified the training data set with the largest number of clusters and used this data set as the reference set
– Calculated the centroid of each cluster in the reference set
– In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned them cluster IDs matching the reference cluster with the smallest distance measure
– Took a sample of each training data set and combined them into one training vector to present to the SVM
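The normalization and cluster-matching steps above can be sketched in Python. This is an illustration under stated assumptions: normalization is taken to be a per-parameter min-max rescaling, each cluster is represented by its centroid when distances are computed, and the function names and sample values are hypothetical.

```python
import math


def normalize(values):
    """Linearly rescale one parameter's values to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map to the midpoint
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def centroid(points):
    """Mean position of a cluster given as a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def match_to_reference(cluster_centroid, reference_centroids):
    """Return the ID of the reference centroid nearest in Euclidean distance."""
    return min(
        reference_centroids,
        key=lambda cid: math.dist(cluster_centroid, reference_centroids[cid]),
    )


# Hypothetical reference centroids (cluster ID -> position) and one cluster
# from another training file that needs a matching ID.
reference = {1: [-0.5, -0.5], 2: [0.6, 0.4]}
new_cluster = [[0.5, 0.3], [0.7, 0.5], [0.6, 0.7]]
assigned_id = match_to_reference(centroid(new_cluster), reference)
```

Nearest-centroid matching lets two clusters in one file map to the same reference ID; whether that case was handled separately is not stated here.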
Algorithm choice
MATLAB has a free file share repository
Someone has already put almost any algorithm you can think of into code
I used the SVM coded by Junshui Ma and Yi Zhao of Ohio State University
It received 5 stars
Training Data
• Example training data
– Showing parameters 1 & 2, and 3 & 4 of the stem cell data set
Results
Speed:
Data set       Training time   Classification time
• CFSE         4 sec           2 min 48 sec (13 files)
• DLBCL        5 sec           67 sec (30 files)
• GvHD         5 sec           38 sec (12 files)
• NDD          11 sec          27 min 28 sec (30 files)
• Stem cell    4 sec           19 sec (30 files)
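Because the data sets contain different numbers of files, the classification times are easier to compare per file. The short Python calculation below uses only the totals and file counts from the speed table; the per-file averages are derived values, not separately measured ones.

```python
# Classification time in seconds and number of files, from the speed table.
results = {
    "CFSE": (2 * 60 + 48, 13),
    "DLBCL": (67, 30),
    "GvHD": (38, 12),
    "NDD": (27 * 60 + 28, 30),
    "Stem cell": (19, 30),
}

# Average seconds per file, rounded to one decimal place.
per_file = {
    name: round(total / files, 1) for name, (total, files) in results.items()
}

for name, seconds in per_file.items():
    print(f"{name}: {seconds} s per file")
```

On these numbers the per-file average ranges from under 1 s (stem cell) to roughly 55 s (NDD).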
Room for improvement…
• The SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
• We could experiment with a number of different transforms
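One way to experiment with different transforms is to treat the kernel as a swappable function. The Python sketch below defines three common candidates and builds the kernel (Gram) matrix an SVM solver would consume. The choice of candidates and all parameter values (`degree`, `coef0`, `gamma`) are assumptions for illustration, not transforms evaluated in this work.

```python
import math


def linear_kernel(x, y):
    """Plain dot product: no mapping at all."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial mapping: (x . y + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian mapping: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


CANDIDATE_KERNELS = {
    "linear": linear_kernel,
    "polynomial": polynomial_kernel,
    "gaussian": gaussian_kernel,
}


def gram_matrix(points, kernel):
    """Pairwise kernel values for a list of points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.5]]
matrices = {name: gram_matrix(points, k) for name, k in CANDIDATE_KERNELS.items()}
```

Each matrix could then be passed to the same solver and the resulting classifiers compared on held-out gated files.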
FlowCap Feedback
• What went well
– Data easily available
– Submission process easy
– Questions answered immediately!
• What could be improved
– Wider publicity, particularly out of our domain
Questions?