Profile-based String Kernels for Remote Homology Detection and
Motif Extraction
Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi,
Yoav Freund and Christina Leslie.
Department of Computer Science, Columbia University
Agenda
• Remote Protein Homology Detection
• Classification of SCOP Superfamilies
• SVM and Kernels
• Profile Kernel and its Family Tree
• Motif Extraction with Profile Kernel
• Conclusion and Future Work
Remote Protein Homology Detection
• Protein represented by sequence of amino acids
• Easy to sequence proteins, difficult to obtain structure
• Remote homologs: remote evolutionary relationship — conserved structure/function, low sequence similarity
Classification of SCOP Superfamilies
[Figure: SCOP hierarchy (Fold → Superfamily → Family), with the positive training and test sets drawn from the target superfamily and the negative training and test sets drawn from outside it]
• Remote homologs: sequences that belong to the same superfamily but not the same family
• Discriminative framework: use positive (+1) and negative (-1) training sequences to learn classifier
Support Vector Machine (SVM) Classifiers
• Training examples mapped to (usually high-dimensional) feature space by a feature map Φ(x) = (φ1(x), … , φN(x))
• Learn linear classifier in feature space f(x) = 〈 w, Φ(x) 〉 + b by solving an optimization problem: trade-off between maximizing geometric margin and minimizing margin violations
• Large margin: good generalization performance even in high dimensions
[Figure: maximum-margin hyperplane with normal vector w and offset b separating positive (+) and negative (−) examples]
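The kernelized decision function f(x) = Σi αi yi K(xi, x) can be illustrated with a toy learner. The sketch below is illustrative only — a kernel perceptron on a made-up Gram matrix K and labels y, not the authors' SVM solver — but it shows the key point: the classifier is learned from kernel values alone, without ever forming Φ(x).

```python
# Minimal sketch (illustrative, not the SVM from the slides): a kernel
# perceptron learns f(x) = sum_i alpha_i * y_i * K(x_i, x) using only
# kernel values -- the same trick an SVM exploits, minus the
# margin-maximization objective.
def kernel_perceptron(K, y, epochs=20):
    n = len(y)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            f = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
            if y[i] * f <= 0:          # misclassified: boost this example
                alpha[i] += 1.0
    return alpha

# Hypothetical Gram matrix for 4 examples (2 positive, 2 negative).
K = [[4, 3, 1, 0],
     [3, 4, 0, 1],
     [1, 0, 4, 3],
     [0, 1, 3, 4]]
y = [1, 1, -1, -1]
alpha = kernel_perceptron(K, y)
preds = [1 if sum(alpha[j] * y[j] * K[j][i] for j in range(4)) > 0 else -1
         for i in range(4)]
print(preds)  # -> [1, 1, -1, -1] on this separable toy data
```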
Kernels for Discrete Objects
• Kernel trick: to train an SVM, can use a kernel rather than an explicit feature map
• Can define kernels for sequences, graphs, and other discrete objects: Φ : { sequences } → R^N
• Kernel value is an inner product in feature space: K(x, y) = 〈 Φ(x), Φ(y) 〉
• Original string kernels (Watkins; Haussler; Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y)
• We introduce fast novel string kernels computed with a trie data structure
Profile Kernel and its Family Tree
• Three generations:
– Spectrum Kernel
– Mismatch Kernel
– Profile Kernel
• Effective: one of the best performing methods
• Fast: computation scales linearly with sequence length
Spectrum Kernel (Leslie, Eskin and Noble, PSB 2002)
• Feature map indexed by all possible k-length subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20
Q1: AKQDYYYYE → 3-mers: AKQ, KQD, QDY, DYY, YYY, YYY, YYE
Q2: DYYEIAKQY → 3-mers: DYY, YYE, YEI, EIA, IAK, AKQ, KQY

Feature space (AAA–YYY):
Feature   Q1   Q2
AKQ       1    1
DYY       1    1
EIA       0    1
IAK       0    1
KQD       1    0
KQY       0    1
QDY       1    0
YEI       0    1
YYE       1    1
YYY       2    0

K(Q1, Q2) = 1·1 + 1·1 + 1·1 = 3
Problem: k-mers capture some position-independent local similarity, but they do not model mutations
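The spectrum-kernel computation above fits in a few lines: count each sequence's k-mers, then take the inner product of the count vectors. A minimal sketch, using the slide's sequences:

```python
from collections import Counter

# k-spectrum kernel: inner product of k-mer count vectors.
def spectrum_kernel(x, y, k=3):
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[s] * cy[s] for s in cx)

# Slide example: shared 3-mers are AKQ, DYY, YYE, each once in each sequence.
print(spectrum_kernel("AKQDYYYYE", "DYYEIAKQY"))  # -> 3
```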
Mismatch Kernel (Leslie, Eskin, Weston and Noble, NIPS 2002)
• For k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches from s
• Size of the mismatch neighborhood is O(|Σ|^m k^m)
• E.g. the (3,1)-neighborhood of AKQ includes AAQ, AKY, CKQ, DKQ, …; an occurrence of AKQ contributes to each of these coordinates:
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )  with 1s at AAQ, AKY, CKQ, DKQ, …
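The neighborhood N(k,m)(s) can be enumerated directly. A short sketch (brute-force enumeration, fine for small m; practical implementations use the trie instead):

```python
from itertools import combinations, product

SIGMA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

# Enumerate the (k, m)-mismatch neighborhood of k-mer s: all k-mers
# within m substitutions. Size grows as O(|Sigma|^m * k^m).
def mismatch_neighborhood(s, m):
    hood = set()
    for positions in combinations(range(len(s)), m):
        for letters in product(SIGMA, repeat=m):
            t = list(s)
            for p, a in zip(positions, letters):
                t[p] = a
            hood.add("".join(t))
    return hood

hood = mismatch_neighborhood("AKQ", 1)
print(len(hood))  # -> 58: AKQ itself plus 3 positions * 19 substitutions
print({"AAQ", "AKY", "CKQ", "DKQ"} <= hood)  # -> True (neighbors from the slide)
```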
Computing the Mismatch Kernel
• Use a mismatch tree (trie) to organize a lexical traversal of all instances of k-mers (with mismatches) in the training set
• Traversal of the trie for k = 3, m = 1:
S1: EADLALGKAVF
S2: ADLALGADQVFNG
[Figure: traversal down the trie path A → D → L; at the leaf, update the kernel value K(S1, S2) by adding the contribution for feature ADL]
• Problem: arbitrary mismatches do not model the mutation probability between amino acids
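The trie traversal can be sketched recursively: at each depth extend the prefix by one residue, keep only the k-mer instances whose mismatch budget survives, and at each leaf add the product of surviving instance counts to the kernel. This is an illustrative reimplementation of the idea (naive recursion, not the authors' optimized code):

```python
from collections import defaultdict

SIGMA = "ACDEFGHIKLMNPQRSTVWY"

# (k, m)-mismatch kernel via depth-first trie traversal. "alive" holds,
# per sequence, the k-mer instances (start, mismatches used) that still
# match the current trie prefix within budget m.
def mismatch_kernel(seqs, k=3, m=1):
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]

    def recurse(depth, alive):
        if depth == k:                        # leaf: one k-mer feature
            counts = defaultdict(int)
            for i, _, _ in alive:
                counts[i] += 1
            for i in counts:
                for j in counts:
                    K[i][j] += counts[i] * counts[j]
            return
        for a in SIGMA:                       # descend into child 'a'
            nxt = []
            for i, start, mm in alive:
                mm2 = mm + (seqs[i][start + depth] != a)
                if mm2 <= m:
                    nxt.append((i, start, mm2))
            if nxt:                           # prune dead branches
                recurse(depth + 1, nxt)

    alive0 = [(i, s, 0) for i, seq in enumerate(seqs)
              for s in range(len(seq) - k + 1)]
    recurse(0, alive0)
    return K

# Sanity check: with m = 0 this reduces to the spectrum kernel.
K0 = mismatch_kernel(["AKQDYYYYE", "DYYEIAKQY"], k=3, m=0)
print(K0[0][1])  # -> 3.0, matching the spectrum-kernel slide

# Slide sequences share (3,1)-mismatch features, e.g. ADL.
K1 = mismatch_kernel(["EADLALGKAVF", "ADLALGADQVFNG"], k=3, m=1)
print(K1[0][1] > 0)  # -> True
```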
Profile Kernel
• Profile kernel: specialized to protein sequences, uses probabilistic profiles to capture homology information
• Semi-supervised approach: profiles are estimated using unlabeled data (sequences are available for about 1 million proteins)
• E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to the query sequence
[Figure: query sequence and its profile — a matrix of scores with one column per query position and one row per amino acid]
Profile-based k-mer Map
• Use the profile to define position-dependent mutation neighborhoods
• The profile gives negative log probabilities at each position of x:
P(x) = { p_j(b) : b ∈ Σ, j = 1, …, |x| }
• E.g. k = 3, σ = 5: the mutated k-mer YKQ lies in the neighborhood of AKQ when its summed profile scores satisfy 2 + 1 + 1 < σ
Efficient Computing with Trie
• Use a trie data structure to organize a lexical traversal of all instances of k-mers in the training profiles
• Scales linearly with sequence length: O(k^(m_max+1) |Σ|^m_max (|x| + |y|)), where m_max is the maximum number of mismatches that occur in any mutation neighborhood
• E.g. k = 3, σ = 5, with query profiles x and y: at the trie node for prefix AQ, both children C and D survive for x (1 + 1 + 1 < σ in each case); for y, child C survives (.5 + .6 + 2 < σ) but child D does not (.5 + .6 + 4 > σ)
• Update K(x, y) by adding a contribution for feature AQC but not AQD
[Figure: profile matrices for queries x and y and the trie paths A → Q → C and A → Q → D]
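The neighborhood test at each trie node is just a thresholded sum of profile scores. A minimal sketch — the profile values below are hypothetical, chosen to echo the slide's query-y example (AQC sums to .5 + .6 + 2, AQD to .5 + .6 + 4):

```python
# A k-mer b1..bk is in the position-dependent mutation neighborhood at
# position j when its summed negative log probabilities from the profile
# fall below the threshold sigma.
def in_neighborhood(profile, j, kmer, sigma):
    """profile: list of dicts mapping residue -> -log p_j(residue)."""
    return sum(profile[j + i][b] for i, b in enumerate(kmer)) < sigma

# Hypothetical profile for query y over three positions (k = 3),
# restricted to a few residues for readability.
profile_y = [
    {"A": 0.5, "C": 2.0, "D": 2.0, "Q": 2.0, "Y": 3.0},
    {"A": 2.0, "C": 1.0, "D": 1.0, "Q": 0.6, "Y": 3.0},
    {"A": 1.0, "C": 2.0, "D": 4.0, "Q": 2.0, "Y": 3.0},
]
sigma = 5.0
print(in_neighborhood(profile_y, 0, "AQC", sigma))  # .5+.6+2 < 5 -> True
print(in_neighborhood(profile_y, 0, "AQD", sigma))  # .5+.6+4 > 5 -> False
```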
Experiments
• SCOP benchmark with 54 experiments
• Train PSI-BLAST profiles on the NR database
• Comparison against recent SVM methods:
– PSI-BLAST rank: use training sequence as query and rank testing sequences with PSI-BLAST e-value
– EMotif Kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using trie
– SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores)
– Cluster Kernel (Weston et al., 2003): Implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of input sequence
Results
Performance Comparison

Kernel                        ROC    ROC50
PSI-BLAST                     0.743  0.293
EMOTIF                        0.711  0.247
Mismatch(5,1)                 0.875  0.416
SVM-Pairwise                  0.866  0.533
Cluster                       0.923  0.699
Profile(5,7.5), 2 iterations  0.973  0.821
Profile(5,7.5), 5 iterations  0.984  0.874
Extracting Discriminative Motif Regions
• SVM training determines support vector sequence profiles and their weights: (P(xi), αi)
• SVM decision hyperplane normal vector: w = Σi yi αi Φ(P(xi))
• Positional contribution to classification score:
S( x[j+1 : j+k] ) = 〈 Φ( P( x[j+1 : j+k] ) ), w 〉
• Averaged positional score for positive sequences:
Savg( x[j] ) = Σ_{q=1..k} max( S( x[j−k+q : j−1+q] ), 0 )
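The averaged positional score sums the positive contributions of the k windows covering each residue. A sketch of that computation, with hypothetical per-window scores standing in for S( x[j+1 : j+k] ) = 〈 Φ(P(x[j+1 : j+k])), w 〉:

```python
# Savg(x[j]) = sum over q = 1..k of max(S(x[j-k+q : j-1+q]), 0):
# each residue j accumulates the positive scores of every k-mer window
# that covers it (windows starting at j-k+1 .. j).
def positional_scores(window_scores, k, length):
    """window_scores[s] = S of the k-mer starting at position s."""
    savg = [0.0] * length
    for j in range(length):
        for start in range(j - k + 1, j + 1):
            if 0 <= start < len(window_scores):
                savg[j] += max(window_scores[start], 0.0)
    return savg

# Hypothetical scores for the 5 windows of width k = 3 over a
# length-7 sequence; negative windows contribute nothing.
S = [1.0, -2.0, 0.5, 3.0, -1.0]
print(positional_scores(S, k=3, length=7))
# -> [1.0, 1.0, 1.5, 3.5, 3.5, 3.0, 0.0]
```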
Extracting Discriminative Motif Regions
• Sort positional scores: about 40%–50% of positions in positive training sequences contribute 90% of the classification score
• Peaks in the positional score plots indicate discriminative motifs
Mapping Discriminative Regions to Structure
• In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily
• Example: Homeodomain-like protein superfamily.
E. coli MarA protein (PDB 1bl0)
Conclusions and Future Work
• Conclusions:
– Profile string kernels exploit a compact representation of homology information
– Interpretation of the profile-SVM classifier via discriminative motif regions: conserved structural components
• Future work:
– Use secondary structure information in the profile kernel
– Extend the profile kernel to the multi-class protein homology detection problem
Acknowledgements
• Asa Ben-Hur, University of Washington
• Chris Bystroff, Rensselaer Polytechnic Institute
• Lan Xu, The Scripps Research Institute
• Hairuo Liu, Columbia University
• Eleazar Eskin, University of California, San Diego