Computational Biology, Part 7: Supervised Machine Learning and Searching for Sequence Families
Robert F. Murphy
Copyright 2008-2009. All rights reserved.
www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
What is Machine Learning?
■ Fundamental Question of Computer Science: How can we build machines that solve problems, and which problems are inherently tractable/intractable?
■ Fundamental Question of Statistics: What can be inferred from data plus a set of modeling assumptions, with what reliability?
Tom Mitchell white paper
Fundamental Question of Machine Learning
■ How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?
◆ Tom Mitchell
Why Machine Learning?
■ Learn relationships from large sets of complex data: Data mining
◆ Predict clinical outcome from tests
◆ Decide whether someone is a good credit risk
■ Do tasks too complex to program by hand
◆ Autonomous driving
■ Customize programs to user needs
◆ Recommend book/movie based on previous likes
Why Machine Learning?
■ Economically efficient
■ Can consider larger data spaces and hypothesis spaces than people can
■ Can formalize learning problem to explicitly identify/describe goals and criteria
Successful Machine Learning Applications
■ Speech recognition
◆ Telephone menu navigation
■ Computer vision
◆ Mail sorting
■ Bio-surveillance
◆ Identifying disease outbreaks
■ Robot control
◆ Autonomous driving
■ Empirical science
Machine Learning Paradigms
■ Supervised Learning
◆ Classification
◆ Regression
■ Unsupervised Learning
◆ Clustering
■ Semi-supervised Learning
◆ Cotraining
◆ Active learning
Supervised Learning
■ Approaches
◆ Classification (discrete predictions)
◆ Regression (continuous predictions)
■ Common considerations
◆ Representation (Features)
◆ Feature Selection
◆ Functional form
◆ Evaluation of predictive power
Classification vs. Regression
■ If I want to predict whether a patient will die from a disease within six months, that is classification
■ If I want to predict how long the patient will live, that is regression
Representation
■ Definition of thing or things to be predicted
◆ Classification: classes
◆ Regression: regression variable
■ Definition of things (instances) to make predictions for
◆ Individuals
◆ Families
◆ Neighborhoods, etc.
■ Choice of descriptors (features) to describe different aspects of instances
Formal description
■ Defining X as a set of instances x described by features
■ Given training examples D from X
■ Given a target function c that maps X->{0,1}
■ Given a hypothesis space H
■ Determine a hypothesis h in H such that h(x)=c(x) for all x in D
Courtesy Tom Mitchell
Inductive learning hypothesis
■ Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function over other unobserved examples
Courtesy Tom Mitchell
Hypothesis space
■ The hypothesis space determines the functional form
■ It defines which rules/functions are allowable for classification
■ Each classification method uses a different hypothesis space
Simple two class problem
[Figure: a set of '+' and '-' example images and an unknown '?' image]
■ Describe each image by features
■ Train classifier
k-Nearest Neighbor (kNN)
■ In feature space, training examples are plotted as points
k-Nearest Neighbor (kNN)
■ We want to label '?'
[Figure: '+' and '-' training examples plotted by Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness), with an unlabeled '?' point]
k-Nearest Neighbor (kNN)
■ Find k nearest neighbors and vote
◆ For k=3, the majority of the nearest neighbors of '?' are '+', so we label it '+'
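The voting scheme above can be sketched in a few lines; the feature values below are invented for illustration:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    # Sort training examples by Euclidean distance to the query point
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (area, roundness) values for '+' and '-' training examples
train = [((60, 55), '+'), ((65, 60), '+'), ((70, 50), '+'),
         ((20, 15), '-'), ((25, 20), '-'), ((30, 10), '-')]
print(knn_classify(train, (62, 52), k=3))  # near the '+' cluster, so '+'
```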
Linear Discriminants
■ Fit multivariate Gaussian to each class
■ Measure distance from '?' to each Gaussian
[Figure: '+' and '-' examples plotted by area and brightness, with a '?' point to classify]
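A minimal two-dimensional sketch of this idea, with the 2x2 covariance inverse written out by hand; the feature values are invented:

```python
import math

def fit_gaussian(points):
    """Mean and 2x2 covariance of a set of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, mean, cov):
    """Distance from `point` to a fitted Gaussian (mean, covariance)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # quadratic form with the 2x2 covariance inverse expanded by hand
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def classify(point, classes):
    """Assign `point` to the class whose Gaussian it is closest to."""
    fits = {lab: fit_gaussian(pts) for lab, pts in classes.items()}
    return min(fits, key=lambda lab: mahalanobis(point, *fits[lab]))

classes = {'+': [(60, 55), (65, 60), (70, 50), (62, 58)],
           '-': [(20, 15), (25, 20), (30, 10), (22, 18)]}
```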
Decision trees
■ Again we want to label '?'
Slide courtesy of Christos Faloutsos
■ so we build a decision tree:
[Figure: feature space with split thresholds at area = 50 and roundness = 40]
Decision trees
■ Goal: split address space in (almost) homogeneous regions
[Figure: the resulting decision tree, splitting first on area < 50 and then on roundness < 40, together with the corresponding partition of the (area, roundness) feature space]
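A tiny greedy learner makes "split the space into homogeneous regions" concrete; Gini impurity is the split criterion (one common choice), and the feature values are invented:

```python
def gini(labels):
    """Impurity of a region: 0 when all labels agree."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(data):
    """Find the (feature, threshold) minimizing weighted impurity."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            left = [lab for xx, lab in data if xx[f] < x[f]]
            right = [lab for xx, lab in data if xx[f] >= x[f]]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
            if best is None or score < best[0]:
                best = (score, f, x[f])
    return best

def build_tree(data):
    labels = [lab for _, lab in data]
    if len(set(labels)) == 1:
        return labels[0]          # homogeneous region: stop splitting
    _, f, t = best_split(data)
    left = [d for d in data if d[0][f] < t]
    right = [d for d in data if d[0][f] >= t]
    return (f, t, build_tree(left), build_tree(right))

def predict(tree, x):
    while not isinstance(tree, str):
        f, t, left, right = tree
        tree = left if x[f] < t else right
    return tree

# Invented (area, roundness) training examples
train = [((60, 55), '+'), ((65, 60), '+'), ((70, 50), '+'),
         ((20, 15), '-'), ((25, 20), '-'), ((30, 10), '-')]
tree = build_tree(train)
```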
Support vector machines
■ Again we want to label '?'
[Figure: '+' and '-' training examples plotted by Feature #1 (e.g., 'area') and Feature #2 (e.g., roundness), with an unlabeled '?' point]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
■ Use single linear separator??
[Figure: a sequence of slides showing several candidate linear separators between the '+' and '-' classes in (area, roundness) space]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
■ We want to label '?' - which linear separator??
■ A: the one with the widest corridor!
[Figure: the maximum-margin separator, with its corridor touching the nearest '+' and '-' points]
Slide courtesy of Christos Faloutsos
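The "widest corridor" criterion can be sketched directly: among candidate separators that do separate the classes, keep the one whose nearest point is farthest away. The candidate lines and points here are hand-picked inventions; a real SVM finds the optimal separator by solving a quadratic program:

```python
import math

# Invented (area, roundness) training examples
points = [((60, 55), '+'), ((65, 60), '+'), ((70, 50), '+'),
          ((20, 15), '-'), ((25, 20), '-'), ((30, 10), '-')]

def separates(w, b):
    """True if every '+' lies on the positive side of w.x + b = 0
    and every '-' on the negative side."""
    return all(((w[0] * x + w[1] * y + b) > 0) == (lab == '+')
               for (x, y), lab in points)

def corridor_halfwidth(w, b):
    """Distance from the separating line to the nearest point of either class."""
    return min(abs(w[0] * x + w[1] * y + b) for (x, y), _ in points) / math.hypot(*w)

# Three hand-picked candidate separators, each as (w, b)
candidates = [((1, 0), -45), ((1, 1), -75), ((0, 1), -35)]
valid = [c for c in candidates if separates(*c)]
best = max(valid, key=lambda c: corridor_halfwidth(*c))
```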
Support Vector Machines (SVMs)
■ What if the points for each class are not readily separated by a straight line?
■ Use the "kernel trick": project the points into a higher dimensional space in which we hope that straight lines will separate the classes
■ "kernel" refers to the function used for this projection
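A one-dimensional toy example of this projection idea, with made-up points: no single threshold on x separates the classes, but after lifting x -> (x, x**2) a horizontal line does. (A real SVM kernel evaluates inner products in the lifted space rather than projecting explicitly.)

```python
def lift(x):
    """Project a 1-D point into 2-D feature space: x -> (x, x**2)."""
    return (x, x * x)

plus = [-3.0, -2.5, 2.5, 3.0]   # '+' points flank the '-' points on the line,
minus = [-0.5, 0.0, 0.5]        # so no single 1-D threshold separates them

# In the lifted space, the horizontal line y = 3 separates the classes:
assert all(lift(x)[1] > 3 for x in plus)
assert all(lift(x)[1] < 3 for x in minus)
```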
Support Vector Machines (SVMs)
■ Definition of SVMs explicitly considers only two classes
■ What if we have more than two classes?
■ Train multiple SVMs
■ Two basic approaches
◆ One against all (one SVM for each class)
◆ Pairwise SVMs (one for each pair of classes)
• Various ways of implementing this
Cross-Validation
■ If we train a classifier to minimize error on a set of data, we have no way to estimate the (generalization) error that will be seen on a new dataset
■ To calculate generalizable accuracy, we use n-fold cross-validation
■ Divide images into n sets, train using n-1 of them and test on the remaining set
■ Repeat until each set has been used as the test set, and average results across all trials
■ A variation on this is called leave-one-out
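The n-fold procedure can be sketched as follows (the round-robin assignment of items to folds is an arbitrary choice):

```python
def n_fold_splits(items, n):
    """Divide `items` into n sets; yield (train, test) pairs in which
    each set is used exactly once as the test set."""
    folds = [items[i::n] for i in range(n)]
    for i in range(n):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

data = list(range(10))
for train, test in n_fold_splits(data, 5):
    assert not set(train) & set(test)   # train and test never overlap
```

Leave-one-out is the special case n = len(items), so each test set holds a single item.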
Describing classifier errors
■ For binary classifiers (positive or negative), define
◆ TP = true positives, FP = false positives
◆ TN = true negatives, FN = false negatives
◆ Recall = TP / (TP + FN)
◆ Precision = TP / (TP + FP)
◆ F-measure = 2*Recall*Precision/(Recall + Precision)
Confusion matrix - binary

True \ Predicted | Positive       | Negative
Positive         | True Positive  | False Negative
Negative         | False Positive | True Negative
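The definitions above in code, with invented counts:

```python
def precision_recall_f(tp, fp, tn, fn):
    """Binary-classifier summary statistics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # fraction of actual positives found
    precision = tp / (tp + fp)       # fraction of predicted positives correct
    f = 2 * recall * precision / (recall + precision)
    return precision, recall, f

# e.g. 8 TP, 2 FP, 90 TN, 4 FN:
p, r, f = precision_recall_f(8, 2, 90, 4)
print(p, r, f)  # precision 0.8, recall 2/3, F-measure about 0.727
```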
Precision-recall analysis
■ Vary a classifier parameter (e.g., a confidence threshold) to "loosen" the classifier and trade precision against recall
[Figure: precision-recall curves, with ideal performance marked]
Describing classifier errors
■ For multi-class classifiers, typically report
◆ Accuracy = (# test images correctly classified) / (# test images)
◆ Confusion matrix = table showing all possible combinations of true class and predicted class
Confusion matrix - multi-class
Overall accuracy = 98%
(rows: True Class; columns: Output of the Classifier)
DNA ER Gia Gpp Lam Mit Nuc Act TfR Tub
DNA 98 2 0 0 0 0 0 0 0 0
ER 0 100 0 0 0 0 0 0 0 0
Gia 0 0 100 0 0 0 0 0 0 0
Gpp 0 0 0 96 4 0 0 0 0 0
Lam 0 0 0 4 95 0 0 0 0 2
Mit 0 0 2 0 0 96 0 2 0 0
Nuc 0 0 0 0 0 0 100 0 0 0
Act 0 0 0 0 0 0 0 100 0 0
TfR 0 0 0 0 2 0 0 0 96 2
Tub 0 2 0 0 0 0 0 0 0 98
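Overall accuracy is the diagonal of the confusion matrix divided by its total; a small invented 3-class matrix for illustration:

```python
def overall_accuracy(matrix):
    """Fraction of test instances on the diagonal, where row i holds
    the counts for true class i across all predicted classes."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

matrix = [[98, 2, 0],
          [0, 100, 0],
          [4, 0, 96]]
print(overall_accuracy(matrix))  # 0.98
```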
Ground truth
■ What is the source and confidence of a class label?
■ Most common: Human assignment, unknown confidence
■ Preferred: Assignment by experimental design, confidence ~100%
Stating Goals vs. Approaches
■ The temptation when first considering a machine learning approach to a biological problem is to describe the problem as automating the approach that you would use to solve the problem
■ "I need a program to predict how much a gene is expressed by measuring how well its promoter matches a template"
Stating Goals vs. Approaches
■ "I need a program that given a gene sequence predicts how much that gene is expressed by measuring how well its promoter matches a template"
■ "I need a program that given a gene sequence predicts how much that gene is expressed by learning from sequences of genes whose expression is known"
Resources
■ Association for the Advancement of Artificial Intelligence
◆ http://www.aaai.org/AITopics/pmwiki/pmwiki.php/AITopics/MachineLearning
■ Machine Learning - Mitchell, Carnegie Mellon
◆ http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
■ Practical Machine Learning - Jordan, UC Berkeley
◆ http://www.cs.berkeley.edu/~asimma/294-fall06/
■ Learning and Empirical Inference - Rish, Tesauro, Jebara, Vapnik - Columbia
◆ http://www1.cs.columbia.edu/~jebara/6998/
Goals for sequence families
■ Find previously unrecognized members of a family
■ Develop a model of a family
Possible Approaches
■ Model-based
◆ Motif-based (MEME/MAST)
◆ Hidden Markov model-based (HMMER)
◆ Cobbling
◆ Cobbled FPS
■ Non-model-based
◆ Family Pairwise Search (FPS)
PSSMs
■ Motifs can be summarized and searched for using Position-Specific Scoring Matrices
■ Calculated from a multiple alignment of a conserved region for members of a family
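A minimal sketch of building and using a PSSM from a gapless alignment; real PSSMs use log-odds scores against a background model, and the sequences here are invented:

```python
from collections import Counter

def build_pssm(aligned):
    """One residue-frequency table per column of a gapless multiple alignment."""
    n = len(aligned)
    return [{c: count / n for c, count in Counter(col).items()}
            for col in zip(*aligned)]

def score(pssm, window):
    """Sum of per-column frequencies for the residues in `window`."""
    return sum(pssm[i].get(c, 0.0) for i, c in enumerate(window))

def best_site(pssm, seq):
    """Highest-scoring window of PSSM width in `seq`."""
    w = len(pssm)
    return max((seq[i:i + w] for i in range(len(seq) - w + 1)),
               key=lambda win: score(pssm, win))

aligned = ["ACGT", "ACGA", "ACGT"]   # invented aligned motif instances
pssm = build_pssm(aligned)
print(best_site(pssm, "TTACGTTT"))   # ACGT
```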
Learning PSSMs
■ However, unsupervised learning methods can be used to find motifs in unaligned sequences
■ The best characterized algorithm is MEME
◆ T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83
■ The program for searching with a MEME-generated PSSM is called MAST
Position Specific Iterated BLAST (PSI-BLAST)
■ Use PSSMs, instead of a similarity matrix, to score matches between query and database
■ Iterate: use a multiple alignment of high scoring sequences found in each search round to generate a new PSSM for the next round of searching
■ Search until no new sequences are found, or the user specified maximum number of iterations is reached, whichever comes first
■ Very similar to MEME/MAST
Problems with PSSMs
■ Some families are characterized by two or more "sub"-motifs with variable spacing between them
■ Deciding upon motif boundaries is difficult
■ Possible information in intervening sequences is lost if only motifs are used
Cobbling
■ Pick "most representative" protein sequence from a family
■ Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix
Cobbling
■ For each recognized "motif" in the family, replace the corresponding section of the profile with the profile of the motif
Cobbling
■ Advantage: At least some sequence information between motifs is retained.
■ S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705
Cobbler Illustration
[Figure: a profile built from similarity scores for the sequence of the "most representative" family member, with scores from profiles of conserved motifs spliced in at the motif positions]
Family Pairwise Search
■ For all known members of family, calculate (pairwise) homology to each sequence in database (using BLAST) and sum those scores
Family Pairwise Search
■ Does not generate a model of the motif
■ Analogous to k nearest neighbor classification
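The whole method is just a summed pairwise score. In this sketch a shared k-mer count stands in for the real BLAST score, and the sequences are invented:

```python
def shared_kmers(a, b, k=3):
    """Crude stand-in for a BLAST score: number of k-mers two sequences share."""
    kmers = lambda s: {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))

def fps_score(candidate, family, pairwise=shared_kmers):
    """Family Pairwise Search score: sum of the candidate's pairwise
    scores against every known family member."""
    return sum(pairwise(candidate, member) for member in family)

family = ["ACDEFGH", "ACDEFGK"]       # known family members
database = ["WYWYWYW", "ACDEFGL"]     # sequences to rank
ranked = sorted(database, key=lambda s: fps_score(s, family), reverse=True)
print(ranked[0])  # ACDEFGL
```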
Which method is best?
■ Compare
◆ BLAST using randomly chosen family member
◆ BLAST FPS
◆ MAST (uses MEME to build PSSM)
◆ HMMER
■ W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492
Comparison Protocol
■ For each method
◆ For each known protein family
✦ Train with family members
✦ Search database for matches
✦ Rank by score from search
✦ Determine how many known family members are ranked highly
Evaluation
[Figure: evaluation pipeline - training sequences are used to build a model; the model is used to search the testing sequences plus all other sequences, producing a ranked list of matches that is scored against the known family members]
Evaluation metric - ROC
■ Define ROCn
✦ ROCn is the fraction of true positives detected at a threshold giving n false positives
Example of Evaluation for ROC2
■ Assume 8 Known Family Members
■ If the Ranked List of Matches is:
◆ Known Family Member 1
◆ Known Family Member 7
◆ Known Family Member 4
◆ Known Family Member 2
◆ Other Sequence 1
◆ Known Family Member 3
◆ Known Family Member 8
◆ Other Sequence 2
◆ Known Family Member 5
◆ Known Family Member 6
■ The ROC2 score is 0.75 (six of the eight members scored higher than the second highest scoring non-member)
◆ (ROC2 considers the ranked sequences up to 2 non-family members)
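The slide's worked example in code, counting true positives ranked above the n-th false positive:

```python
def roc_n(ranked, known, n):
    """Fraction of true positives detected at a threshold giving n false positives."""
    tp = fp = 0
    for item in ranked:
        if item in known:
            tp += 1
        else:
            fp += 1
            if fp == n:
                break
    return tp / len(known)

known = {f"m{i}" for i in range(1, 9)}               # 8 known family members
ranked = ["m1", "m7", "m4", "m2", "other1",
          "m3", "m8", "other2", "m5", "m6"]          # ranked list of matches
print(roc_n(ranked, known, 2))  # 0.75, as in the slide's example
```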
Protocol for Comparison of Methods
■ Calculate ROC50 for each family
■ Average over all families
■ Bigger is better!
Results
[Figure: average ROC50 for each method (BLAST FPS, HMMER, MAST, single-sequence BLAST); BLAST FPS performs best]
Conclusion
■ FPS better than single sequence BLAST
■ FPS better than model-based methods
Comparison Protocol
■ Caution!
◆ True positive defined as being listed as a member of the family in the PROSITE compilation
◆ Some false positives could be actual family members that were missed during PROSITE compilation!
◆ (Should be minor effect)
Which is best (part 2)?
■ Compare
◆ BLAST
◆ BLAST FPS
◆ cobbled BLAST
◆ cobbled BLAST FPS
■ W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470
Comparison Protocol
■ Evaluation metric
◆ rank sum
✦ calculate difference in ROC50 for two methods for a given family
✦ sort by absolute value of difference
✦ sum ranks of families for which one method is better than the other
◆ Bigger is better!
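A sketch of the rank-sum comparison with invented per-family ROC50 values (ties in |difference| would normally receive averaged ranks, which this sketch ignores):

```python
def rank_sum(roc_a, roc_b):
    """Compare two methods from their per-family ROC50 scores.

    Families are ranked by |difference|; returns the sum of ranks of
    families where method A wins and where method B wins."""
    diffs = sorted((a - b for a, b in zip(roc_a, roc_b) if a != b), key=abs)
    sum_a = sum(rank for rank, d in enumerate(diffs, start=1) if d > 0)
    sum_b = sum(rank for rank, d in enumerate(diffs, start=1) if d < 0)
    return sum_a, sum_b

# Invented ROC50 scores for three families under methods A and B
print(rank_sum([0.9, 0.5, 0.7], [0.8, 0.65, 0.4]))  # (4, 2): method A wins
```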
Results
Conclusions
■ For the task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is the best of the methods tested!
■ As the number of known sequences in a family grows, HMM methods may do better