Date post: 10-Feb-2021
Profile-based String Kernels for Remote Homology Detection and Motif Extraction Ray Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund and Christina Leslie. Department of Computer Science Columbia University
  • Profile-based String Kernels for Remote Homology Detection and Motif Extraction

    Ray Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund and Christina Leslie

    Department of Computer Science, Columbia University

  • Agenda

    • Remote Protein Homology Detection
    • Classification of SCOP Superfamilies
    • SVM and Kernels
    • Profile Kernel and its Family Tree
    • Motif Extraction with Profile Kernel
    • Conclusion and Future Work

  • Remote Protein Homology Detection

    • Proteins are represented by their sequence of amino acids
    • Proteins are easy to sequence, but their structure is difficult to obtain
    • Remote homologs: remote evolutionary relationship, with conserved structure/function but low sequence similarity

  • Classification of SCOP Superfamilies

    [Figure: SCOP hierarchy (fold, superfamily, family) marking the positive training set, positive test set, negative training set and negative test set]

    • Remote homologs: sequences that belong to the same superfamily but not the same family

    • Discriminative framework: use positive (+1) and negative (-1) training sequences to learn classifier
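
    The labeling described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Domain` record, the `label_domains` helper and the SCOP-style identifiers are hypothetical, and drawing negatives only from outside the target fold is an assumption, since the slide does not spell it out. Splitting the negatives into training and test sets is left open for the same reason.

```python
from typing import NamedTuple

class Domain(NamedTuple):
    name: str
    fold: str
    superfamily: str
    family: str

def label_domains(domains, target_family):
    """Return {name: (label, split)} for one held-out-family experiment."""
    target = next(d for d in domains if d.family == target_family)
    labels = {}
    for d in domains:
        if d.family == target_family:
            # Held-out family: remote homologs of the training positives.
            labels[d.name] = (+1, "test")
        elif d.superfamily == target.superfamily:
            # Same superfamily, different family.
            labels[d.name] = (+1, "train")
        elif d.fold != target.fold:
            # Assumption: negatives come from outside the target fold.
            labels[d.name] = (-1, "unassigned")
        # Same fold but different superfamily: left out (assumption).
    return labels
```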

  • Support Vector Machine (SVM) Classifiers

    • Training examples mapped to (usually high-dimensional) feature space by a feature map Φ(x) = (φ1(x), … , φN(x))

    • Learn linear classifier in feature space, f(x) = 〈 w, Φ(x) 〉 + b, by solving optimization problem: trade-off between maximizing geometric margin and minimizing margin violations

    • Large margin: good generalization performance even in high dimensions

    [Figure: positive (+) and negative (_) training points separated by a hyperplane with normal vector w and offset b]
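
    As a rough illustration of the classifier f(x) = 〈 w, Φ(x) 〉 + b and of the margin trade-off, the sketch below evaluates the decision function and a soft-margin objective on explicit feature vectors. The hinge-loss form 0.5·‖w‖² + C·Σ max(0, 1 − y·f(x)) is the common textbook formulation and is an assumption here; the slides do not state the exact objective, and the function names and the parameter `C` are illustrative.

```python
def decision(w, b, phi_x):
    """f(x) = <w, Phi(x)> + b for an explicit feature vector phi_x."""
    return sum(wi * xi for wi, xi in zip(w, phi_x)) + b

def soft_margin_objective(w, b, examples, C=1.0):
    """0.5*||w||^2 + C * (sum of hinge losses).

    examples: list of (phi_x, y) with y in {+1, -1}.
    A small ||w|| means a large geometric margin; each hinge term
    penalizes an example that falls inside or across the margin.
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    violations = sum(max(0.0, 1.0 - y * decision(w, b, phi))
                     for phi, y in examples)
    return margin_term + C * violations
```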

  • Kernels for Discrete Objects

    • Kernel trick: to train an SVM, can use a kernel rather than an explicit feature map

    • Can define kernels for sequences, graphs, other discrete objects: Φ: { sequences } → R^N

    • Kernel value is inner product in feature space: K(x, y) = 〈 Φ(x), Φ(y) 〉

    • Original string kernels (Watkins, Haussler, Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y)

    • We introduce fast novel string kernels computed with a trie data structure
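
    To make the kernel trick concrete, the sketch below evaluates an SVM decision function using only kernel values, without ever building Φ(x). It relies on the standard dual form f(x) = Σi αi yi K(xi, x) + b; the names `dual_decision` and `linear_kernel` and the `(x_i, y_i, alpha_i)` tuple layout are illustrative choices, not taken from the slides.

```python
def dual_decision(kernel, support, b, x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    kernel:  any function K(u, v) that is an inner product in some
             feature space (a string kernel works just as well).
    support: list of (x_i, y_i, alpha_i) support vectors.
    """
    return sum(alpha * y * kernel(xi, x) for xi, y, alpha in support) + b

def linear_kernel(u, v):
    """Plain dot product, used here only to keep the example checkable."""
    return sum(a * c for a, c in zip(u, v))
```

    Swapping `linear_kernel` for a sequence kernel changes nothing else in the classifier, which is why kernels on discrete objects plug directly into SVM training.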

  • Profile Kernel and its Family Tree

    • Three generations
      – Spectrum Kernel
      – Mismatch Kernel
      – Profile Kernel

    • Effective: one of the best performing methods
    • Fast: computation scales linearly with sequence length

  • Spectrum Kernel (Leslie, Eskin and Noble, PSB 2002)

    • Feature map indexed by all possible k-length subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20

    Example with 3-mers:

    Q1: AKQDYYYYE  →  AKQ KQD QDY DYY YYY YYY YYE
    Q2: DYYEIAKQY  →  DYY YYE YEI EIA IAK AKQ KQY

    Feature space (AAA-YYY), k-mer counts:

    k-mer   Q1   Q2
    AKQ      1    1
    DYY      1    1
    EIA      0    1
    IAK      0    1
    KQD      1    0
    KQY      0    1
    QDY      1    0
    YEI      0    1
    YYE      1    1
    YYY      2    0

    K(Q1, Q2) = 3

    Problem: k-mers capture some position-independent local similarity, but they do not model mutations
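
    The worked example above can be reproduced with a few lines of Python. This is a direct counting sketch, assuming the feature value is the number of occurrences of each k-mer; it is not the trie-based implementation the talk describes, and the function names are illustrative.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Count every contiguous k-mer in seq (the non-zero feature coordinates)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the two k-mer count vectors."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())

print(spectrum_kernel("AKQDYYYYE", "DYYEIAKQY", 3))  # 3
```

    Only AKQ, DYY and YYE occur in both sequences, each once per sequence, which gives the value 3.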

  • Mismatch Kernel (Leslie, Eskin, Weston and Noble, NIPS 2002)

    • For k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches from s

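    • Size of mismatch neighborhood is O(|Σ|^m k^m)

    [Figure: mismatch neighborhood of AKQ, containing CKQ, DKQ, AAQ, …, AKY, …; in the feature vector ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) the coordinates AAQ, AKY, CKQ and DKQ are set to 1]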
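
    A brute-force enumeration makes the neighborhood N(k,m)(s) and its size concrete. The sketch below assumes the standard 20-letter amino acid alphabet; `mismatch_neighborhood` is an illustrative name, and explicit enumeration is only practical for small k and m.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # |Σ| = 20

def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """All k-mers that differ from kmer in at most m positions."""
    neighbors = {kmer}
    for r in range(1, m + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                # Substituting a letter by itself just reproduces a closer
                # neighbor; the set removes such duplicates.
                neighbors.add("".join(candidate))
    return neighbors

print(len(mismatch_neighborhood("AKQ", 1)))  # 58 = 1 + 3 * 19
```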

  • Computing the Mismatch Kernel

    • Use mismatch tree (trie) to organize lexical traversal of all instances of k-mers (with mismatches) in training set

    • Traversal of trie for k=3, m=1

    S1: EADLALGKAVF
    S2: ADLALGADQVFNG

    [Figure: trie path A → D → L]

    • Update kernel value for K(s1,s2) by adding contribution for feature ADL

    Problem: Arbitrary mismatch does not model the mutation probability between amino acids
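
    The traversal can be sketched as a depth-first search in which each trie node keeps the k-mer instances that are still within m mismatches of the path so far. This is a simplified two-sequence illustration under the assumption that a feature's value is the number of k-mer instances whose neighborhood contains it; the trie is implicit in the recursion rather than stored, and the names are illustrative rather than the authors' implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mismatch_kernel(x, y, k=3, m=1, alphabet=AMINO_ACIDS):
    """K(x, y) for the (k, m)-mismatch kernel via implicit trie traversal."""
    seqs = (x, y)
    # One instance per k-mer occurrence: (sequence id, start, mismatches so far).
    instances = [(sid, i, 0)
                 for sid, s in enumerate(seqs)
                 for i in range(len(s) - k + 1)]

    def descend(depth, alive):
        if not alive:
            return 0                      # prune: no instance survives here
        if depth == k:
            # Leaf = one k-mer feature; add its contribution to K(x, y).
            in_x = sum(1 for sid, _, _ in alive if sid == 0)
            return in_x * (len(alive) - in_x)
        total = 0
        for letter in alphabet:           # lexical order over child nodes
            survivors = []
            for sid, start, mis in alive:
                mis2 = mis + (seqs[sid][start + depth] != letter)
                if mis2 <= m:
                    survivors.append((sid, start, mis2))
            total += descend(depth + 1, survivors)
        return total

    return descend(0, instances)

print(mismatch_kernel("EADLALGKAVF", "ADLALGADQVFNG", k=3, m=1))
```

    With m=0 the search reduces to the spectrum kernel, which gives a convenient sanity check.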

  • Profile Kernel

    • Profile kernel: specialized to protein sequences, probabilistic profiles to capture homology information

    • Semi-supervised approach: profiles are estimated using unlabeled data (sequences available for about 1 million proteins)

    • E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to query sequence

    Profile (rows: amino acids; columns: positions of the query):

          A    Q    K    …
    A     3   -2    1    …
    C    -1    0    2    …
    D    -1    0    0    …
    …     …    …    …    …
    Y     2   -3   -3    …

  • Profile-based k-mer Map

    • Use profile to define position-dependent mutation neighborhoods, where the profile of sequence x is

      $P(x) = \{\, p_j(b),\ b \in \Sigma,\ j = 1 \ldots |x| \,\}$

    • E.g. k=3, σ=5 and a profile of negative log probabilities

    [Figure: k-mer AKQ and neighbor YKQ, (2+1+1 …]
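
    A position-dependent neighborhood can be illustrated by summing profile entries along a candidate k-mer and keeping it when the total stays below σ. The sketch assumes that reading of the threshold (the later trie example uses sums compared against σ in the same way). The toy four-letter alphabet and all profile values are hypothetical, chosen only so that YKQ scores 2+1+1 as in the figure fragment.

```python
from itertools import product

def profile_neighborhood(window, sigma):
    """k-mers whose summed negative log probabilities stay below sigma.

    window: list of k dicts, one per profile position, each mapping a
            residue b to -log p_j(b).
    """
    alphabet = sorted(window[0])
    hits = []
    for kmer in product(alphabet, repeat=len(window)):
        cost = sum(column[b] for column, b in zip(window, kmer))
        if cost < sigma:
            hits.append("".join(kmer))
    return hits

# Hypothetical profile window for the query k-mer AKQ (toy alphabet AKQY).
window = [
    {"A": 1, "K": 4, "Q": 4, "Y": 2},
    {"A": 4, "K": 1, "Q": 3, "Y": 4},
    {"A": 4, "K": 3, "Q": 1, "Y": 4},
]
print(profile_neighborhood(window, sigma=5))  # ['AKQ', 'YKQ']
```

    Unlike the mismatch neighborhood, which substitutions are admitted depends on the position, because each profile column has its own costs.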

  • Efficient Computing with Trie

    • Use trie data structure to organize lexical traversal of all instances of k-mers in the training profiles

    • Scales linearly with length, O(k^(m_max+1) |Σ|^(m_max) (|x|+|y|)), where m_max is maximum number of mismatches that occur in any mutation neighborhood

    • E.g. k=3, σ=5

    Profile of query x:

          A    Q    K    …
    A     1    3    2    …
    C     3    2    1    …
    D     3    2    1    …
    …     …    …    …    …
    Q     3    1    2    …
    …     …    …    …    …
    Y     2    1    3    …

    Profile of query y:

          …    A    Q    Y    …
    A     …   .5    2    1    …
    C     …    2    1    2    …
    D     …    2    1    4    …
    …     …    …    …    …    …
    Q     …    2   .6    2    …
    …     …    …    …    …    …
    Y     …    3    3    3    …

    Trie path A → Q, then:

    C:  x: 1+1+1 < σ    y: .5+.6+2 < σ
    D:  x: 1+1+1 < σ    y: .5+.6+4 > σ

    Update K(x, y) by adding contribution for feature AQC but not AQD
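
    The example can be replayed with a small depth-first traversal over two profile windows. This is a sketch under stated assumptions: the alphabet is cut down to the five rows visible on the slide (A, C, D, Q, Y), entries are treated as non-negative costs so that a partial sum reaching σ can be pruned, and `profile_kernel` with its list-of-dicts profile layout is an illustrative construction rather than the authors' code.

```python
# Profile windows from the slide, restricted to the rows that are shown.
PX = [{"A": 1, "C": 3, "D": 3, "Q": 3, "Y": 2},
      {"A": 3, "C": 2, "D": 2, "Q": 1, "Y": 1},
      {"A": 2, "C": 1, "D": 1, "Q": 2, "Y": 3}]
PY = [{"A": 0.5, "C": 2, "D": 2, "Q": 2, "Y": 3},
      {"A": 2, "C": 1, "D": 1, "Q": 0.6, "Y": 3},
      {"A": 1, "C": 2, "D": 4, "Q": 2, "Y": 3}]

def profile_kernel(px, py, k, sigma):
    """Return (K, shared), where shared maps each common feature to its contribution."""
    profiles = (px, py)
    alphabet = sorted(px[0])
    # One instance per length-k window: (profile id, start, cost so far).
    start = [(pid, j, 0.0)
             for pid, p in enumerate(profiles)
             for j in range(len(p) - k + 1)]
    shared = {}

    def descend(prefix, alive):
        if not alive:
            return
        depth = len(prefix)
        if depth == k:
            in_x = sum(1 for pid, _, _ in alive if pid == 0)
            in_y = len(alive) - in_x
            if in_x and in_y:
                shared[prefix] = in_x * in_y
            return
        for b in alphabet:
            survivors = []
            for pid, j, cost in alive:
                cost2 = cost + profiles[pid][j + depth][b]
                if cost2 < sigma:         # valid pruning for non-negative costs
                    survivors.append((pid, j, cost2))
            descend(prefix + b, survivors)

    descend("", start)
    return sum(shared.values()), shared

value, shared = profile_kernel(PX, PY, k=3, sigma=5)
print("AQC" in shared, "AQD" in shared)  # True False
```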

  • Experiments

    • SCOP benchmark with 54 experiments
    • Train PSI-BLAST profiles on NR database
    • Comparison against newer SVM methods:

    – PSI-BLAST rank: use training sequence as query and rank testing sequences with PSI-BLAST e-value

    – EMotif Kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using trie

    – SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores)

    – Cluster Kernel (Weston et al., 2003): Implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of input sequence

  • Results

  • Performance Comparison

    Kernels                       ROC     ROC50
    PSI-BLAST                     0.743   0.293
    EMOTIF                        0.711   0.247
    Mismatch(5,1)                 0.875   0.416
    SVM-Pairwise                  0.866   0.533
    Cluster                       0.923   0.699
    Profile(5,7.5)-2 Iteration    0.973   0.821
    Profile(5,7.5)-5 Iteration    0.984   0.874
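
    For readers unfamiliar with the ROC50 column: it is the area under the ROC curve up to the first 50 false positives, normalized to lie between 0 and 1. The sketch below uses the usual ROC-n definition, which is an assumption about how the scores in the table were computed. Ties between scores are ignored, and when fewer than n negatives exist the normalization falls back to the number of negatives seen, which is a choice made here for the example.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the n-th false positive.

    scores: higher means more confidently positive.
    labels: +1 for positives, anything else for negatives.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(1 for _, y in ranked if y == 1)
    true_pos = false_pos = area = 0
    for _, y in ranked:
        if y == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos          # positives ranked above this false positive
            if false_pos == n:
                break
    if positives == 0 or false_pos == 0:
        return 0.0
    return area / (false_pos * positives)
```

    With n at least as large as the number of negatives, the value coincides with the ordinary ROC score.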

  • Extracting Discriminative Motif Regions

    • SVM training determines support vector sequence profiles and their weights: (P(xi), αi)

    • SVM decision hyperplane normal vector: $w = \sum_i y_i \alpha_i \Phi(P(x_i))$

    • Positional contribution to classification score:

      $S(x[j+1 : j+k]) = \langle \Phi(P(x[j+1 : j+k])),\ w \rangle$

    • Averaged positional score for positive sequences:

      $S_{avg}(x[j]) = \sum_{q=1}^{k} \max\big( S(x[j-k+q : j-1+q]),\ 0 \big)$
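
    The positional score sums, for each residue, the positive part of the scores of every length-k window that covers it. The sketch below assumes the window scores S have already been computed (the values in the usage line are made up), uses 0-based positions, and simply skips windows that would run past either end of the sequence, which the slide does not address.

```python
def positional_scores(window_scores, k):
    """Per-position score from per-window scores.

    window_scores[s] is S for the window covering positions s .. s+k-1
    (0-based), so a sequence of length n has n - k + 1 windows.
    """
    n = len(window_scores) + k - 1
    per_position = [0.0] * n
    for start, score in enumerate(window_scores):
        contribution = max(score, 0.0)    # negative windows contribute nothing
        for j in range(start, start + k):
            per_position[j] += contribution
    return per_position

print(positional_scores([1.0, -2.0, 0.5], k=2))  # [1.0, 1.0, 0.5, 0.5]
```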

  • Extracting Discriminative Motif Regions

    • Sort positional scores: about 40%-50% of positions in positive training sequences contribute 90% of classification score

    • Peaky positional plots indicate discriminative motifs
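
    The "40%-50% of positions contribute 90% of the score" statistic can be computed by sorting positional scores and accumulating them until the requested share is reached. The sketch assumes non-negative positional scores; the function name and the sample values are illustrative and are not data from the talk.

```python
def fraction_for_share(position_scores, share=0.9):
    """Smallest fraction of positions whose scores sum to `share` of the total."""
    total = sum(position_scores)
    if total <= 0:
        return 0.0
    running = 0.0
    ranked = sorted(position_scores, reverse=True)
    for used, score in enumerate(ranked, start=1):
        running += score
        if running >= share * total:
            return used / len(position_scores)
    return 1.0

# Made-up scores: two of ten positions carry more than 90% of the total.
print(fraction_for_share([6.0, 3.5, 0.5, 0, 0, 0, 0, 0, 0, 0]))  # 0.2
```

    A small fraction corresponds to a peaky positional plot, with the score concentrated in a few motif regions.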

  • Mapping Discriminative Regions to Structure

    • In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily

    • Example: Homeodomain-like protein superfamily.

    [Figure: E. coli MarA protein (1bl0)]

  • Conclusions and Future Work

    • Conclusions:

    – Profile string kernels exploit compact representation of homology information

    – Interpretation of profile-SVM classifier by discriminative motif regions: conserved structural components

    • Future work:

    – Use secondary structure information in profile kernel

    – Extend profile kernel for multi-class protein homology detection problem

  • Acknowledgements

    • Asa Ben-Hur, University of Washington
    • Chris Bystroff, Rensselaer Polytechnic Institute
    • Lan Xu, The Scripps Research Institute
    • Hairuo Liu, Columbia University
    • Eleazar Eskin, University of California, San Diego
