Profile-based String Kernels for Remote Homology Detection and
Motif Extraction
Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi,
Yoav Freund and Christina Leslie.
Department of Computer Science, Columbia University
Agenda
• Remote Protein Homology Detection
• Classification of SCOP Superfamilies
• SVM and Kernels
• Profile Kernel and its Family Tree
• Motif Extraction with Profile Kernel
• Conclusion and Future Work
Remote Protein Homology Detection
• Protein represented by sequence of amino acids
• Easy to sequence proteins, difficult to obtain structure
• Remote homologs: remote evolutionary relationship — conserved structure/function, low sequence similarity
Classification of SCOP Superfamilies
[Figure: SCOP hierarchy (Fold → Superfamily → Family), with the positive training and test sets drawn from the target superfamily and the negative training and test sets drawn from outside it]
• Remote homologs: sequences that belong to the same superfamily but not the same family
• Discriminative framework: use positive (+1) and negative (-1) training sequences to learn classifier
Support Vector Machine (SVM) Classifiers
• Training examples mapped to (usually high-dimensional) feature space by a feature map Φ(x) = (φ1(x), … , φN(x))
• Learn linear classifier in feature space f(x) = 〈 w, Φ(x) 〉 + b by solving an optimization problem: trade-off between maximizing geometric margin and minimizing margin violations
• Large margin: good generalization performance even in high dimensions
[Figure: maximum-margin hyperplane with normal vector w and offset b separating positive (+) and negative (−) examples]
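The kernelized decision function f(x) = Σi αi yi K(xi, x) can be illustrated with a toy learner. The sketch below is illustrative only — a kernel perceptron on a made-up Gram matrix K and labels y, not the authors' SVM solver — but it shows the key point: the classifier is learned from kernel values alone, without ever forming Φ(x).

```python
# Minimal sketch (illustrative, not the SVM from the slides): a kernel
# perceptron learns f(x) = sum_i alpha_i * y_i * K(x_i, x) using only
# kernel values -- the same trick an SVM exploits, minus the
# margin-maximization objective.
def kernel_perceptron(K, y, epochs=20):
    n = len(y)
    alpha = [0.0] * n
    for _ in range(epochs):
        for i in range(n):
            f = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
            if y[i] * f <= 0:          # misclassified: boost this example
                alpha[i] += 1.0
    return alpha

# Hypothetical Gram matrix for 4 examples (2 positive, 2 negative).
K = [[4, 3, 1, 0],
     [3, 4, 0, 1],
     [1, 0, 4, 3],
     [0, 1, 3, 4]]
y = [1, 1, -1, -1]
alpha = kernel_perceptron(K, y)
preds = [1 if sum(alpha[j] * y[j] * K[j][i] for j in range(4)) > 0 else -1
         for i in range(4)]
print(preds)  # -> [1, 1, -1, -1] on this separable toy data
```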
Kernels for Discrete Objects
• Kernel trick: to train an SVM, can use a kernel rather than an explicit feature map
• Can define kernels for sequences, graphs, and other discrete objects: Φ : { sequences } → R^N
• Kernel value is an inner product in feature space: K(x, y) = 〈 Φ(x), Φ(y) 〉
• Original string kernels (Watkins; Haussler; Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y)
• We introduce fast novel string kernels computed with a trie data structure
Profile Kernel and its Family Tree
• Three generations:
– Spectrum Kernel
– Mismatch Kernel
– Profile Kernel
• Effective: one of the best performing methods
• Fast: computation scales linearly with sequence length
Spectrum Kernel (Leslie, Eskin and Noble, PSB 2002)
• Feature map indexed by all possible k-length subsequences (“k-mers”) from alphabet Σ of amino acids, |Σ| = 20
Q1: AKQDYYYYE → 3-mers: AKQ, KQD, QDY, DYY, YYY, YYY, YYE
Q2: DYYEIAKQY → 3-mers: DYY, YYE, YEI, EIA, IAK, AKQ, KQY

Feature space (AAA–YYY):
Feature   Q1   Q2
AKQ       1    1
DYY       1    1
EIA       0    1
IAK       0    1
KQD       1    0
KQY       0    1
QDY       1    0
YEI       0    1
YYE       1    1
YYY       2    0

K(Q1, Q2) = 1·1 + 1·1 + 1·1 = 3
Problem: k-mers capture some position-independent local similarity, but they do not model mutations
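The spectrum-kernel computation above fits in a few lines: count each sequence's k-mers, then take the inner product of the count vectors. A minimal sketch, using the slide's sequences:

```python
from collections import Counter

# k-spectrum kernel: inner product of k-mer count vectors.
def spectrum_kernel(x, y, k=3):
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[s] * cy[s] for s in cx)

# Slide example: shared 3-mers are AKQ, DYY, YYE, each once in each sequence.
print(spectrum_kernel("AKQDYYYYE", "DYYEIAKQY"))  # -> 3
```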
Mismatch Kernel (Leslie, Eskin, Weston and Noble, NIPS 2002)
• For k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches from s
• Size of the mismatch neighborhood is O(|Σ|^m k^m)
• E.g. the (3,1)-neighborhood of AKQ includes AAQ, AKY, CKQ, DKQ, …; an occurrence of AKQ contributes to each of these coordinates:
( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 )  with 1s at AAQ, AKY, CKQ, DKQ, …
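The neighborhood N(k,m)(s) can be enumerated directly. A short sketch (brute-force enumeration, fine for small m; practical implementations use the trie instead):

```python
from itertools import combinations, product

SIGMA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

# Enumerate the (k, m)-mismatch neighborhood of k-mer s: all k-mers
# within m substitutions. Size grows as O(|Sigma|^m * k^m).
def mismatch_neighborhood(s, m):
    hood = set()
    for positions in combinations(range(len(s)), m):
        for letters in product(SIGMA, repeat=m):
            t = list(s)
            for p, a in zip(positions, letters):
                t[p] = a
            hood.add("".join(t))
    return hood

hood = mismatch_neighborhood("AKQ", 1)
print(len(hood))  # -> 58: AKQ itself plus 3 positions * 19 substitutions
print({"AAQ", "AKY", "CKQ", "DKQ"} <= hood)  # -> True (neighbors from the slide)
```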
Computing the Mismatch Kernel
• Use a mismatch tree (trie) to organize a lexical traversal of all instances of k-mers (with mismatches) in the training set
• Traversal of the trie for k = 3, m = 1:
S1: EADLALGKAVF
S2: ADLALGADQVFNG
[Figure: traversal down the trie path A → D → L; at the leaf, update the kernel value K(S1, S2) by adding the contribution for feature ADL]
• Problem: arbitrary mismatches do not model the mutation probability between amino acids
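The trie traversal can be sketched recursively: at each depth extend the prefix by one residue, keep only the k-mer instances whose mismatch budget survives, and at each leaf add the product of surviving instance counts to the kernel. This is an illustrative reimplementation of the idea (naive recursion, not the authors' optimized code):

```python
from collections import defaultdict

SIGMA = "ACDEFGHIKLMNPQRSTVWY"

# (k, m)-mismatch kernel via depth-first trie traversal. "alive" holds,
# per sequence, the k-mer instances (start, mismatches used) that still
# match the current trie prefix within budget m.
def mismatch_kernel(seqs, k=3, m=1):
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]

    def recurse(depth, alive):
        if depth == k:                        # leaf: one k-mer feature
            counts = defaultdict(int)
            for i, _, _ in alive:
                counts[i] += 1
            for i in counts:
                for j in counts:
                    K[i][j] += counts[i] * counts[j]
            return
        for a in SIGMA:                       # descend into child 'a'
            nxt = []
            for i, start, mm in alive:
                mm2 = mm + (seqs[i][start + depth] != a)
                if mm2 <= m:
                    nxt.append((i, start, mm2))
            if nxt:                           # prune dead branches
                recurse(depth + 1, nxt)

    alive0 = [(i, s, 0) for i, seq in enumerate(seqs)
              for s in range(len(seq) - k + 1)]
    recurse(0, alive0)
    return K

# Sanity check: with m = 0 this reduces to the spectrum kernel.
K0 = mismatch_kernel(["AKQDYYYYE", "DYYEIAKQY"], k=3, m=0)
print(K0[0][1])  # -> 3.0, matching the spectrum-kernel slide

# Slide sequences share (3,1)-mismatch features, e.g. ADL.
K1 = mismatch_kernel(["EADLALGKAVF", "ADLALGADQVFNG"], k=3, m=1)
print(K1[0][1] > 0)  # -> True
```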
Profile Kernel
• Profile kernel: specialized to protein sequences, uses probabilistic profiles to capture homology information
• Semi-supervised approach: profiles are estimated using unlabeled data (sequences are available for about 1 million proteins)
• E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to the query sequence
[Figure: query sequence and its profile — a matrix of scores with one column per query position and one row per amino acid]
Profile-based k-mer Map
• Use the profile to define position-dependent mutation neighborhoods
• The profile gives negative log probabilities at each position of x:
P(x) = { p_j(b) : b ∈ Σ, j = 1, …, |x| }
• E.g. k = 3, σ = 5: the mutated k-mer YKQ lies in the neighborhood of AKQ when its summed profile scores satisfy 2 + 1 + 1 < σ
Efficient Computing with Trie
• Use a trie data structure to organize a lexical traversal of all instances of k-mers in the training profiles
• Scales linearly with sequence length: O(k^(m_max+1) |Σ|^m_max (|x| + |y|)), where m_max is the maximum number of mismatches that occur in any mutation neighborhood
• E.g. k = 3, σ = 5, with query profiles x and y: at the trie node for prefix AQ, both children C and D survive for x (1 + 1 + 1 < σ in each case); for y, child C survives (.5 + .6 + 2 < σ) but child D does not (.5 + .6 + 4 > σ)
• Update K(x, y) by adding a contribution for feature AQC but not AQD
[Figure: profile matrices for queries x and y and the trie paths A → Q → C and A → Q → D]
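The neighborhood test at each trie node is just a thresholded sum of profile scores. A minimal sketch — the profile values below are hypothetical, chosen to echo the slide's query-y example (AQC sums to .5 + .6 + 2, AQD to .5 + .6 + 4):

```python
# A k-mer b1..bk is in the position-dependent mutation neighborhood at
# position j when its summed negative log probabilities from the profile
# fall below the threshold sigma.
def in_neighborhood(profile, j, kmer, sigma):
    """profile: list of dicts mapping residue -> -log p_j(residue)."""
    return sum(profile[j + i][b] for i, b in enumerate(kmer)) < sigma

# Hypothetical profile for query y over three positions (k = 3),
# restricted to a few residues for readability.
profile_y = [
    {"A": 0.5, "C": 2.0, "D": 2.0, "Q": 2.0, "Y": 3.0},
    {"A": 2.0, "C": 1.0, "D": 1.0, "Q": 0.6, "Y": 3.0},
    {"A": 1.0, "C": 2.0, "D": 4.0, "Q": 2.0, "Y": 3.0},
]
sigma = 5.0
print(in_neighborhood(profile_y, 0, "AQC", sigma))  # .5+.6+2 < 5 -> True
print(in_neighborhood(profile_y, 0, "AQD", sigma))  # .5+.6+4 > 5 -> False
```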
Experiments
• SCOP benchmark with 54 experiments
• Train PSI-BLAST profiles on the NR database
• Comparison against recent SVM methods:
– PSI-BLAST rank: use training sequence as query and rank testing sequences with PSI-BLAST e-value
– EMotif Kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using trie
– SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores)
– Cluster Kernel (Weston et al., 2003): Implicitly average the feature vectors for sequences in the PSI-BLAST neighborhood of input sequence
Results
Performance Comparison

Kernel                        ROC    ROC50
PSI-BLAST                     0.743  0.293
EMOTIF                        0.711  0.247
Mismatch(5,1)                 0.875  0.416
SVM-Pairwise                  0.866  0.533
Cluster                       0.923  0.699
Profile(5,7.5), 2 iterations  0.973  0.821
Profile(5,7.5), 5 iterations  0.984  0.874
Extracting Discriminative Motif Regions
• SVM training determines support vector sequence profiles and their weights: (P(xi), αi)
• SVM decision hyperplane normal vector: w = Σi yi αi Φ(P(xi))
• Positional contribution to classification score:
S( x[j+1 : j+k] ) = 〈 Φ( P( x[j+1 : j+k] ) ), w 〉
• Averaged positional score for positive sequences:
Savg( x[j] ) = Σ_{q=1..k} max( S( x[j−k+q : j−1+q] ), 0 )
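The averaged positional score sums the positive contributions of the k windows covering each residue. A sketch of that computation, with hypothetical per-window scores standing in for S( x[j+1 : j+k] ) = 〈 Φ(P(x[j+1 : j+k])), w 〉:

```python
# Savg(x[j]) = sum over q = 1..k of max(S(x[j-k+q : j-1+q]), 0):
# each residue j accumulates the positive scores of every k-mer window
# that covers it (windows starting at j-k+1 .. j).
def positional_scores(window_scores, k, length):
    """window_scores[s] = S of the k-mer starting at position s."""
    savg = [0.0] * length
    for j in range(length):
        for start in range(j - k + 1, j + 1):
            if 0 <= start < len(window_scores):
                savg[j] += max(window_scores[start], 0.0)
    return savg

# Hypothetical scores for the 5 windows of width k = 3 over a
# length-7 sequence; negative windows contribute nothing.
S = [1.0, -2.0, 0.5, 3.0, -1.0]
print(positional_scores(S, k=3, length=7))
# -> [1.0, 1.0, 1.5, 3.5, 3.5, 3.0, 0.0]
```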
Extracting Discriminative Motif Regions
• Sort positional scores: about 40%–50% of positions in positive training sequences contribute 90% of the classification score
• Peaks in the positional score plots indicate discriminative motifs
Mapping Discriminative Regions to Structure
• In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily
• Example: Homeodomain-like protein superfamily.
E. coli MarA protein (PDB 1bl0)
Conclusions and Future Work
• Conclusions:
– Profile string kernels exploit a compact representation of homology information
– Interpretation of the profile-SVM classifier via discriminative motif regions: conserved structural components
• Future work:
– Use secondary structure information in the profile kernel
– Extend the profile kernel to the multi-class protein homology detection problem
Acknowledgements
• Asa Ben-Hur, University of Washington
• Chris Bystroff, Rensselaer Polytechnic Institute
• Lan Xu, The Scripps Research Institute
• Hairuo Liu, Columbia University
• Eleazar Eskin, University of California, San Diego