Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics
Karsten Borgwardt
March 1 to March 12, 2010
Machine Learning & Computational Biology Research Group
MPIs Tübingen
Why compare sequences?
Protein sequences
- Proteins are chains of amino acids.
- 20 different types of amino acids can be found in protein sequences.
- Protein sequences change over time through mutations, deletions, and insertions.
- Different protein sequences may diverge from one common ancestor.
- Their sequences may differ slightly, yet their function is often conserved.
Why compare sequences?
Biological Question:
- Biologists are interested in the reverse direction:
- Given two protein sequences, is it likely that they originate from the same common ancestor?
Computational Challenge:
- How to measure similarity between two protein sequences, or equivalently:
- How to measure similarity between two strings?
Kernel Challenge:
- How to measure similarity between two strings via a kernel function?
In short: How to define a string kernel?
History of sequence comparison
First phase
- Smith-Waterman
- BLAST
Second phase
- Profiles
- Hidden Markov Models
Third phase
- PSI-BLAST
- SAM-T98
Fourth phase
- Kernels
Sequence comparison: Phase 1
Idea
- Measure pairwise similarities between sequences with gaps
Methods
- Smith-Waterman
  - dynamic programming
  - high accuracy
  - slow (O(n^2))
- BLAST
  - faster heuristic alternative with sufficient accuracy
  - searches for common substrings of fixed length
  - extends these in both directions
  - performs gapped alignment
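The dynamic-programming idea behind Smith-Waterman can be sketched as follows. This is a minimal illustration, assuming a simple match/mismatch score and a linear gap penalty; the parameter values are illustrative assumptions, whereas real protein alignment would use a substitution matrix and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman("TTACGTT", "GGACGGG"))
```

The two nested loops over both sequences are what make the method quadratic in the sequence length.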
Sequence comparison: Phase 2
Idea
- Collect aggregate statistics from a family of sequences
- Compare these statistics to a single unlabeled protein
Methods
- Hidden Markov Models (HMMs)
  - Markov process with hidden and observable parameters
  - The forward algorithm determines the probability that a given sequence is the output of a particular HMM
- Profiles
  - Profiles of sequence families are derived by multiple sequence alignment
  - A given sequence is compared to this profile
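As a rough illustration of the forward algorithm mentioned above, the sketch below computes the probability that a small HMM emits a given sequence. The two-state model, its state names, and all probabilities are hypothetical toy values chosen only for the example.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Probability that the HMM emits the observation sequence obs."""
    if not obs:
        return 1.0
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy HMM over the alphabet {A, B}
STATES = ("H", "L")
START = {"H": 0.6, "L": 0.4}
TRANS = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {"H": {"A": 0.2, "B": 0.8}, "L": {"A": 0.5, "B": 0.5}}

print(forward_probability("ABBA", STATES, START, TRANS, EMIT))
```

Summing over hidden states at each step avoids enumerating all hidden paths, so the cost grows linearly with sequence length.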
Sequence comparison: Phase 3
Idea
- Create single models from database collections of homologous sequences
Methods
- PSI-BLAST
  - Position-specific iterative BLAST
  - Profile from highest-scoring hits in initial BLAST runs
  - Position weighting according to degree of conservation
  - Iteration of these steps
- SAM-T98, now SAM-T02
  - database search with HMM from multiple sequence alignment
Phase 4: Kernels and SVMs
General idea
- Model differences between classes of sequences
- Use SVM classifier to distinguish classes
- Use kernel to measure similarity between strings
Kernels for Protein Sequences
- SVM-Fisher kernel
- Composition kernel
- Motif kernel
- String kernel
SVM-Fisher method
General idea
- Combine HMMs and SVMs for sequence classification
- Won best-paper award at ISMB 1999
Sequence representation
- fixed-length vector
- components are transition and emission probabilities
- transformation into Fisher score
SVM-Fisher method
Algorithm
- Model protein family F as an HMM
- Transform query protein X into a fixed-length vector via the HMM
- Compute kernel between X and positive and negative examples of the protein family
Advantages
- allows prior knowledge to be incorporated
- can deal with missing data
- is interpretable
- outperforms competing methods
Composition kernels
General idea
- Model sequence by amino acid content
- Bin amino acids w.r.t. physico-chemical properties
Sequence representation
- feature vector of amino acid frequencies
- physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
- useful database: AAindex
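A composition-style representation can be sketched as below. This minimal version uses raw frequencies of the 20 standard amino acids and a plain inner product. Binning by physico-chemical property is omitted here; it would amount to mapping each residue to a group label before counting. The function names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def composition_vector(seq):
    """Relative frequency of each amino acid in seq (all zeros for an empty sequence)."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n if n else 0.0 for a in AMINO_ACIDS]


def composition_kernel(x, y):
    """Inner product of the two frequency vectors."""
    return sum(p * q for p, q in zip(composition_vector(x), composition_vector(y)))


print(composition_kernel("ACCA", "AACC"))
```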
Motif kernels
General idea
- Conserved motifs in amino acid sequences indicate structural and functional relationships
- Model sequence s as a feature vector f representing motifs
- i-th component of f is 1 ⇔ s contains the i-th motif
Motif databases
- PROSITE
- eMOTIFs
- BLOCKS+ combines several databases
Generated by
- manual construction
- multiple sequence alignment
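The binary motif representation can be sketched as follows. The three regular-expression patterns are toy motifs invented for this example, not entries from PROSITE, eMOTIF, or BLOCKS+. A real pipeline would load motifs from one of those databases.

```python
import re

# Hypothetical toy motifs written as regular expressions
MOTIFS = [r"C..C", r"N[^P][ST]", r"GKS"]


def motif_vector(seq, motifs=MOTIFS):
    """i-th component is 1 if seq contains the i-th motif, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(x, y, motifs=MOTIFS):
    """Number of motifs present in both sequences."""
    return sum(a * b for a, b in zip(motif_vector(x, motifs), motif_vector(y, motifs)))


print(motif_vector("ACAACGKS"), motif_vector("MNASGKS"))
print(motif_kernel("ACAACGKS", "MNASGKS"))
```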
Pairwise comparison kernels
General idea
- Employ empirical kernel map on Smith-Waterman/BLAST scores
Advantage
- Utilizes decades of practical experience with BLAST
Disadvantage
- High computational cost (O(m^3))
Alleviation
- Employ BLAST instead of Smith-Waterman
- Use vectorization set for empirical map only
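The empirical kernel map can be illustrated with the sketch below: each sequence is represented by its vector of pairwise scores against a fixed vectorization set, and the kernel is the inner product of two such vectors. The scoring function `toy_score` (number of shared 2-mers) is a hypothetical stand-in for a Smith-Waterman or BLAST score, used only to keep the example self-contained.

```python
def toy_score(a, b, k=2):
    """Hypothetical stand-in for an alignment score: number of distinct shared k-mers."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def empirical_map(x, vectorization_set, score=toy_score):
    """Represent x by its scores against every sequence in the vectorization set."""
    return [score(x, r) for r in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score=toy_score):
    """Inner product of the two empirical feature vectors."""
    fx = empirical_map(x, vectorization_set, score)
    fy = empirical_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))


reference = ["ACGT", "TTTT"]
print(empirical_map("ACGA", reference), empirical_map("GTTT", reference))
print(pairwise_kernel("ACGA", "GTTT", reference))
```

Restricting the vectorization set to a subset of the data reduces the number of pairwise scores that have to be computed.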
Phase 4: String Kernels
General idea
- Count common substrings in two strings
- A substring of length k is a k-mer
Variations
- Assign weights to k-mers
- Allow for mismatches
- Allow for gaps
- Include substitutions
- Include wildcards
Spectrum Kernel
General idea
- For each l-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ will be the number of times $\alpha$ occurs in sequence $x$.
- Then the l-spectrum feature map is
$$\Phi^{\mathrm{Spectrum}}_{l}(x) = (\phi_\alpha(x))_{\alpha \in \Sigma^l}$$
- Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
- The spectrum kernel is now the inner product in the feature space defined by this map:
$$k^{\mathrm{Spectrum}}(x, x') = \langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \rangle$$
- Sequences are deemed the more similar, the more common substrings they contain.
Spectrum Kernel
Principle
- Spectrum kernel: Count exactly matching common k-mers
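A direct way to compute the spectrum kernel is to count l-mers in each sequence and take the inner product of the sparse count vectors. The sketch below does this with a `Counter`. It is a straightforward illustration, not the suffix-tree implementation used for large-scale experiments.

```python
from collections import Counter


def spectrum(x, l):
    """Sparse l-spectrum feature map: l-mer -> number of occurrences in x."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectrum feature vectors of x and y."""
    fx, fy = spectrum(x, l), spectrum(y, l)
    return sum(count * fy[kmer] for kmer, count in fx.items())


print(spectrum("ABAB", 2))
print(spectrum_kernel("ABAB", "BABA", 2))
```

Only l-mers that actually occur are stored, so the $|\Sigma|^l$-dimensional feature space is never built explicitly.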
Mismatch Kernel
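General idea
- Do not enforce strictly exact matches
- Define mismatch neighborhood of an l-mer $\alpha$ with up to $m$ mismatches:
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha)$$
- The mismatch kernel is now the inner product in feature space defined by:
$$k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \rangle$$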
Mismatch Kernel
Principle
- Mismatch kernel: Count common k-mers with max. m mismatches
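The mismatch feature map can be sketched by brute force for a small alphabet: every l-mer of the sequence adds 1 to each l-mer within Hamming distance m. The sketch assumes the coordinate $\phi_\beta(\alpha)$ is 1 inside the mismatch neighborhood and 0 outside. Enumerating $\Sigma^l$ is only feasible for toy sizes.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c != d for c, d in zip(a, b))


def mismatch_features(x, l, m, alphabet):
    """Each l-mer of x contributes 1 to every l-mer within m mismatches of it."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for beta in product(alphabet, repeat=l):
            if hamming(alpha, beta) <= m:
                feats["".join(beta)] += 1
    return feats


def mismatch_kernel(x, y, l, m, alphabet):
    """Inner product of the mismatch feature vectors of x and y."""
    fx = mismatch_features(x, l, m, alphabet)
    fy = mismatch_features(y, l, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())


print(mismatch_kernel("AB", "BA", l=2, m=1, alphabet="AB"))
```

With m = 0 the neighborhood contains only the l-mer itself, and the function reduces to the spectrum kernel.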
Gappy Kernel
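General idea
- Allow for gaps in common substrings → "subsequences"
- A g-mer then contributes to all its l-mer subsequences
$$\Phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^l}$$
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Gap}}_{(g,l)}(\alpha)$$
- The gappy kernel is now the inner product in feature space defined by:
$$k^{\mathrm{Gap}}_{(g,l)}(x, x') = \langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \rangle$$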
Gappy Kernel
Principle
- Gappy kernel: Count common l-subsequences of g-mers
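A small sketch of the gappy feature map follows. It assumes that a g-mer contributes 1 to each distinct length-l subsequence it contains (repeated subsequences within one g-mer are counted once). That counting convention is an assumption of this example.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x contributes 1 to every distinct l-subsequence it contains."""
    feats = Counter()
    for i in range(len(x) - g + 1):
        alpha = x[i:i + g]
        # combinations() keeps characters in their original order
        subsequences = {"".join(c) for c in combinations(alpha, l)}
        feats.update(subsequences)
    return feats


def gappy_kernel(x, y, g, l):
    """Inner product of the gappy feature vectors of x and y."""
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(count * fy[beta] for beta, count in fx.items())


print(gappy_features("ABC", g=3, l=2))
print(gappy_kernel("ABC", "ACB", g=3, l=2))
```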
Substitution Kernel
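General idea
- mismatch neighborhood → substitution neighborhood
- An l-mer then contributes to all l-mers in its substitution neighborhood
$$M_{(l,\sigma)}(\alpha) = \{\beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma\}$$
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$$
- The substitution kernel is now:
$$k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \rangle$$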
Substitution Kernel
Principle
- Substitution kernel: Count common l-mers in substitution neighborhood
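The substitution neighborhood can be illustrated with a brute-force sketch over a two-letter alphabet. The table `TOY_P` of substitution probabilities $P(a \mid b)$ is a hypothetical toy example. In practice these probabilities would be derived from an amino acid substitution matrix.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy probabilities: TOY_P[(a, b)] = P(a | b)
TOY_P = {("A", "A"): 0.9, ("B", "A"): 0.1, ("A", "B"): 0.2, ("B", "B"): 0.8}


def substitution_neighbourhood(alpha, alphabet, P, sigma):
    """All l-mers beta with -sum_i log P(a_i | b_i) < sigma."""
    return [
        "".join(beta)
        for beta in product(alphabet, repeat=len(alpha))
        if -sum(math.log(P[(a, b)]) for a, b in zip(alpha, beta)) < sigma
    ]


def substitution_features(x, l, sigma, alphabet, P):
    """Each l-mer of x contributes 1 to every l-mer in its substitution neighbourhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(substitution_neighbourhood(x[i:i + l], alphabet, P, sigma))
    return feats


def substitution_kernel(x, y, l, sigma, alphabet="AB", P=TOY_P):
    """Inner product of the substitution feature vectors of x and y."""
    fx = substitution_features(x, l, sigma, alphabet, P)
    fy = substitution_features(y, l, sigma, alphabet, P)
    return sum(count * fy[beta] for beta, count in fx.items())


print(substitution_neighbourhood("AA", "AB", TOY_P, sigma=2.0))
print(substitution_kernel("AA", "AB", l=2, sigma=2.0))
```

Unlike the mismatch neighborhood, membership here depends on which residues are exchanged, not only on how many positions differ.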
Wildcard Kernels
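General idea
- augment alphabet $\Sigma$ by a wildcard character $*$ → $\Sigma \cup \{*\}$
- given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
- l-mer $\alpha$ contributes to l-mer $\beta$ if their non-wildcard characters match
- For a sequence $x$ of any length, the map is then given by
$$\Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in W}$$
where $W$ is the set of such patterns $\beta$, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 \le \lambda \le 1$.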
Wildcard Kernel
Principle
- Wildcard kernel: Count l-mers that match except for wildcards
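The wildcard feature map can be sketched by generating, for each l-mer, every pattern obtained by replacing up to m positions with `*`, weighted by $\lambda^j$ for j wildcards. Function names and the example parameter values are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Each l-mer of x contributes lam**j to every matching pattern with j <= m wildcards."""
    feats = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(m + 1):
            for positions in combinations(range(l), j):
                beta = "".join("*" if p in positions else c for p, c in enumerate(alpha))
                feats[beta] += lam ** j
    return dict(feats)


def wildcard_kernel(x, y, l, m, lam):
    """Inner product of the wildcard feature vectors of x and y."""
    fx, fy = wildcard_features(x, l, m, lam), wildcard_features(y, l, m, lam)
    return sum(value * fy.get(beta, 0.0) for beta, value in fx.items())


print(wildcard_features("AB", l=2, m=1, lam=0.5))
print(wildcard_kernel("AB", "AC", l=2, m=1, lam=0.5))
```

In the usage example, the two sequences share no exact 2-mer, but both match the pattern `A*`, so the kernel value is nonzero.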
References and further reading
References
[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564–575, 2002.
[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS 2002. MIT Press.
[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.
[4] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, Chapters 3 and 4. MIT Press, Cambridge, MA, 2004.