COT6930 Course Project
Outline
• Gene Selection
• Sequence Alignment
Why Gene Selection
• Identify marker genes that characterize different tumor status.
• Many genes are redundant and introduce noise that lowers performance.
• Can eventually lead to a diagnostic chip (“breast cancer chip”, “liver cancer chip”).
Gene Selection
• Methods fall into three categories:
  – Filter methods
  – Wrapper methods
  – Embedded methods
Filter methods are simplest and most frequently used in the literature
Wrapper methods are likely the most accurate ones
Filter Method
• Features (genes) are scored according to the evidence of predictive power and then are ranked.
• The top s genes with the highest scores are selected and used by the classifier.
  – Scores: t-statistics, F-statistics, signal-to-noise ratio, …
  – The number of features selected, s, is then determined by cross-validation.
• Advantage: Fast and easy to interpret.
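The scoring-and-ranking step can be sketched as follows. This is a minimal illustration, assuming a two-class problem and a Welch t-statistic as the score; the function names and the toy data are hypothetical and not from the slides, and s is fixed by hand here rather than chosen by cross-validation.

```python
import math

def t_statistic(values, labels):
    """Welch t-statistic of one gene between class 0 and class 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / denom if denom > 0 else 0.0

def select_top_genes(X, labels, s):
    """Score every gene independently, rank by |t|, return the top s indices."""
    n_genes = len(X[0])
    scores = [abs(t_statistic([row[g] for row in X], labels))
              for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:s]

# Toy data (assumed): 6 samples x 3 genes.
# Gene 0 separates the classes well, gene 2 weakly, gene 1 is noise.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 4.0, 2.5],
    [0.9, 6.0, 1.8],
    [3.0, 5.1, 2.6],
    [3.2, 4.2, 2.9],
    [2.9, 5.9, 2.4],
]
labels = [0, 0, 0, 1, 1, 1]
top = select_top_genes(X, labels, s=2)
```

Because each gene is scored on its own, the classifier never enters the loop, which is what makes this approach fast.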
Good versus bad features
Filter Method: Problem
• Genes are considered independently.
  – Redundant genes may be included.
  – Genes that are individually weak but jointly have strong discriminant power will be ignored.
• Good single features do not necessarily form a good feature set.
• The filtering procedure is independent of the classification method.
  – The selected features can be applied to all types of classification methods.
Wrapper Method
• Iterative search: many “feature subsets” are scored based on classification performance, and the best one is used.
  – Select a good subset of features.
• Subset selection: forward selection, backward selection, and their combinations.
  – Exhaustive search is infeasible.
  – Greedy algorithms are used instead.
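Greedy forward selection can be sketched as below. This is only an illustration under stated assumptions: the classifier is a nearest-centroid rule and subsets are scored by leave-one-out accuracy, both chosen here for brevity and not taken from the slides. All names and the toy data are hypothetical, and every class needs at least two samples for the leave-one-out step to work.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows)
                                for f in features]
        point = [X[i][f] for f in features]
        pred = min(centroids,
                   key=lambda c: sum((p - q) ** 2
                                     for p, q in zip(point, centroids[c])))
        correct += pred == y[i]
    return correct / len(X)

def forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the subset's score."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # The classifier is rebuilt and evaluated for every candidate subset.
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Toy data (assumed): feature 1 separates the classes, features 0 and 2 are noise.
X = [
    [0.5, 0.1, 0.3],
    [0.1, 0.2, 0.7],
    [0.9, 0.0, 0.5],
    [0.4, 1.0, 0.6],
    [0.2, 0.9, 0.4],
    [0.8, 1.1, 0.5],
]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, max_features=2)
```

The nested evaluation inside the loop is where the cost comes from: each round retrains the classifier once per remaining feature.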
Wrapper Method: Problem
• Computationally expensive
  – For each feature subset considered, the classifier is built and evaluated.
• Exhaustive search is infeasible
  – Greedy search only.
• Easy to overfit.
Embedded Method
• Attempt to jointly or simultaneously train both a classifier and a feature subset.
• Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features.
• Intuitively appealing
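One common shape for such an objective, given here only as an assumed example (the slides do not name a specific method), is a classification loss plus an L1 penalty on the weights. The symbols L (loss), f (classifier), w (feature weights), and λ (penalty strength) are introduced for illustration.

```latex
\min_{w} \; \sum_{i=1}^{n} L\bigl(y_i, f(x_i; w)\bigr) \;+\; \lambda \lVert w \rVert_1
```

The first term rewards classification accuracy; the second pushes weights of unneeded features toward zero, so a larger λ keeps fewer features.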
Relief-F
• Relief-F is a filter approach for feature selection.
  – Relief
Relief-F
• The original Relief can only handle binary classification problems.
• Relief-F extends it to handle multi-class problems.
Relief-F
• Categorical attributes
• Numerical attributes
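The per-attribute difference and the weight update can be sketched as follows. This is a simplified reading of Relief-F, not the slides' exact formulation: it uses one nearest hit and one nearest miss per other class (k = 1), visits every instance instead of sampling, and weights misses by class priors. Categorical attributes differ by 0 or 1; numerical attributes differ by the absolute gap divided by the attribute's range. All names and the toy data are hypothetical.

```python
def diff(a, u, v, kinds, ranges):
    """Difference of attribute a between instances u and v, in [0, 1]."""
    if kinds[a] == "categorical":
        return 0.0 if u[a] == v[a] else 1.0
    return abs(u[a] - v[a]) / ranges[a] if ranges[a] else 0.0

def relieff(X, y, kinds):
    """Simplified Relief-F weights (k = 1, all instances visited once)."""
    n, n_attr = len(X), len(X[0])
    ranges = [max(r[a] for r in X) - min(r[a] for r in X)
              if kinds[a] == "numerical" else None
              for a in range(n_attr)]
    priors = {c: y.count(c) / n for c in set(y)}

    def dist(u, v):
        return sum(diff(a, u, v, kinds, ranges) for a in range(n_attr))

    w = [0.0] * n_attr
    for i in range(n):
        # Nearest hit: closest instance of the same class.
        hit = min((j for j in range(n) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        for a in range(n_attr):
            w[a] -= diff(a, X[i], X[hit], kinds, ranges) / n
        # Nearest miss from each other class, weighted by class prior.
        for c in priors:
            if c == y[i]:
                continue
            miss = min((j for j in range(n) if y[j] == c),
                       key=lambda j: dist(X[i], X[j]))
            scale = priors[c] / (1 - priors[y[i]])
            for a in range(n_attr):
                w[a] += scale * diff(a, X[i], X[miss], kinds, ranges) / n
    return w

# Toy data (assumed): attribute 0 is categorical and tracks the class,
# attribute 1 is numerical noise.
X = [["A", 0.1], ["A", 0.9], ["A", 0.5], ["B", 0.2], ["B", 0.8], ["B", 0.4]]
y = [0, 0, 0, 1, 1, 1]
kinds = ["categorical", "numerical"]
weights = relieff(X, y, kinds)
```

An attribute gains weight when it differs on nearby instances of other classes and loses weight when it differs on nearby instances of the same class.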
Relief-F Problem
• Time Complexity
  – m × (m × a + c × m × a + a) = O(c·m²·a)
  – Assume m = 100, c = 3, a = 10,000
  – Time complexity: 300 × 10⁶
• Considers only one attribute at a time; cannot select a subset of “good” genes.
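The arithmetic behind the 300 × 10⁶ figure can be checked directly. The slide does not define the symbols; the sketch below assumes m is the number of instances, c the number of classes, and a the number of attributes (genes).

```python
# Assumed meanings: m = instances, c = classes, a = attributes (genes).
m, c, a = 100, 3, 10_000

# Full expression from the slide: m * (m*a + c*m*a + a)
exact = m * (m * a + c * m * a + a)

# Leading term used for the big-O estimate: c * m^2 * a
leading = c * m**2 * a

print(exact)    # full operation count
print(leading)  # 300 x 10^6, the figure quoted on the slide
```

The full expression evaluates to 401 × 10⁶; the quoted 300 × 10⁶ is the dominant c·m²·a term alone.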
Solution: Parallel Relief-F
• Version 1:
  – Clusters run ReliefF in parallel, and the updated weight values are collected at the master.
  – Theoretical time complexity: O(c·m²·a/p)
    • p is the # of clusters
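The Version 1 scheme can be sketched as follows: the instances are split into p chunks, each worker computes weight updates for its chunk, and the master sums the partial results. This is only an illustration under assumptions not in the slides: threads stand in for cluster nodes, the update is a basic two-class Relief rule on numerical attributes already on a comparable scale, and all names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_weights(X, y, indices):
    """Relief-style updates for the instances in `indices` only."""
    n_attr = len(X[0])
    w = [0.0] * n_attr

    def dist(u, v):
        return sum(abs(p - q) for p, q in zip(u, v))

    for i in indices:
        hit = min((j for j in range(len(X)) if j != i and y[j] == y[i]),
                  key=lambda j: dist(X[i], X[j]))
        miss = min((j for j in range(len(X)) if y[j] != y[i]),
                   key=lambda j: dist(X[i], X[j]))
        for a in range(n_attr):
            w[a] += abs(X[i][a] - X[miss][a]) - abs(X[i][a] - X[hit][a])
    return w

def parallel_relief(X, y, p):
    """Split instances over p workers, then sum partial weights at the master."""
    chunks = [list(range(k, len(X), p)) for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(lambda idx: partial_weights(X, y, idx), chunks))
    # Master step: element-wise sum of the workers' weight vectors.
    return [sum(col) / len(X) for col in zip(*partials)]

# Toy data (assumed): attribute 0 separates the classes, attribute 1 is noise.
X = [[0.0, 0.5], [0.1, 0.1], [0.2, 0.9], [1.0, 0.4], [0.9, 0.2], [1.1, 0.8]]
y = [0, 0, 0, 1, 1, 1]
serial = parallel_relief(X, y, p=1)
parallel = parallel_relief(X, y, p=3)
```

Each worker handles about m/p instances but still searches all m for neighbours, which is where the O(c·m²·a/p) estimate comes from. Since workers never read each other's weights, the result matches the serial run up to floating-point rounding.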
Parallel Relief-F
• Version 2:
  – Clusters run ReliefF in parallel, and each cluster directly updates the global weight values.
  – Each cluster also considers the current weight values when selecting nearest-neighbour instances.
  – Theoretical time complexity: O(c·m²·a/p)
    • p is the # of clusters
Parallel Relief-F
• Version 3:
  – Consider selecting a subset of important features.
  – Compare the difference between including and excluding a specific feature to understand the importance of a gene with respect to an existing subset of features.
  – Discussion in private!
Outline
• Gene Selection
• Sequence Alignment
  – Given a dataset D with N=1000 sequences (e.g., 1000 each)
  – Given an input x,
  – Do pair-wise global sequence alignment between x and all sequences in D
    • Dispatch jobs to clusters
    • And aggregate the results
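The dispatch-and-aggregate pattern for this task can be sketched as below. It is a minimal illustration, assuming Needleman-Wunsch scoring with match = 1, mismatch = -1, and a linear gap penalty of -1; these parameters, the function names, and the toy sequences are assumptions, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def global_alignment_score(x, s, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(s) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == s[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def align_against_dataset(x, D, workers=4):
    """Dispatch one alignment job per sequence, then aggregate by score."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda s: global_alignment_score(x, s), D))
    return sorted(zip(D, scores), key=lambda t: t[1], reverse=True)

# Toy dataset (assumed); the slides use N = 1000 sequences.
D = ["GATACA", "GATTACA", "CCCC"]
results = align_against_dataset("GATTACA", D)
```

The jobs are independent of each other, so they can be spread over any number of workers; only the final sort needs all results in one place.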