Date post: 29-Dec-2015

Directions in Protein Contact Map Mining

Mohammed J. Zaki, Computer Science Dept.

Joint work with Jingjing Hu & Xiaolan Shen, CS Dept., and Yu Shao & Prof. Chris Bystroff, Biology Dept.

Rensselaer Polytechnic Institute, Troy NY

Protein Structures

Primary structure
    Un-branched polymer
    20 side chains (residues or amino acids)
    PDB file 2IGD: MTPAVTTYSLVINGLTLSGU…..

Higher order structures
    Secondary: local (consecutive) in sequence
    Tertiary: 3D fold of one polypeptide chain
    Quaternary: chains packing together

PDB protein 2IGD

[Figure: structure of PDB protein 2IGD, with labels: Anti-parallel Beta Sheets, Parallel Beta Sheets, Alpha Helix]

The Protein Folding Problem

Contact Map

Amino acids Ai and Aj are in contact if their 3D distance is less than a contact threshold (e.g., 7 Angstroms)

Sequence separation is given as |i-j|

Contact map C is a symmetric N x N matrix with
    C(i,j) = 1 if Ai and Aj are in contact
    C(i,j) = 0 otherwise

Consider all pairs with |i-j| >= 4
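
The definition above can be sketched in a few lines of Python. This is only an illustration: the slides do not say which atom represents a residue, so the sketch assumes one (x, y, z) point per residue, and the names `contact_map`, `threshold`, and `min_sep` are hypothetical.

```python
import math

def contact_map(coords, threshold=7.0, min_sep=4):
    """Return a symmetric N x N 0/1 contact map as a list of lists.

    coords: one (x, y, z) point per residue (an assumed representation).
    Only pairs with |i - j| >= min_sep are considered, as on the slide.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            # In contact if the 3D distance is below the threshold.
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy chain (not real protein data): six residues on a line, 1.5 Angstroms apart.
coords = [(1.5 * k, 0.0, 0.0) for k in range(6)]
C = contact_map(coords)
```

In this toy chain, residues 0 and 4 are 6.0 Angstroms apart and count as a contact, while residues 0 and 5 are 7.5 Angstroms apart and do not.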

Contact Map (2IGD)

[Figure: contact map of 2IGD; axes: Amino Acid Ai and Amino Acid Aj; labeled regions: Anti-parallel Beta Sheets, Alpha Helix, Parallel Beta Sheets]

Characterizing Physical, Protein-like Contact Maps

A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)

A contact map must:
    Satisfy geometric constraints
    Represent a low-energy structure

Characterizing Physical Contact Maps in Proteins

What are the typical non-local interactions?
    Frequent dense 0/1 sub-matrices in contact maps

3-step approach
    Dense pattern mining
    Pruning mined patterns
    Clustering dense patterns (non-local pattern signatures)

Dense Pattern Mining

Frequent 2D Pattern Mining
    Use a WxW sliding window; W is the window size
    Measure density under each window
    About (N-W)^2 / 2 possible windows for a protein of length N
    Look for “minimum density” (number of 1’s); scale away from diagonal

Try different window sizes
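
A minimal sketch of the sliding-window scan follows. It assumes a fixed minimum density and scans only windows whose top-left corner lies above the main diagonal; the slide's scaling of the density with distance from the diagonal is not specified, so it is left out. The name `dense_windows` and the toy map are hypothetical.

```python
def dense_windows(cmap, w, min_density):
    """Return (x, y, window) for every w x w window with enough 1's.

    Only windows with top-left corner (x, y), y > x, are scanned, since
    the contact map is symmetric (an assumption about the scan order).
    """
    n = len(cmap)
    hits = []
    for x in range(n - w + 1):
        for y in range(x + 1, n - w + 1):
            window = [cmap[x + r][y:y + w] for r in range(w)]
            density = sum(sum(row) for row in window)  # number of 1's
            if density >= min_density:
                hits.append((x, y, window))
    return hits

# Toy 8 x 8 symmetric map with three contacts.
toy = [[0] * 8 for _ in range(8)]
for i, j in [(0, 5), (1, 5), (1, 6)]:
    toy[i][j] = toy[j][i] = 1

hits = dense_windows(toy, w=3, min_density=3)
positions = [(x, y) for x, y, _ in hits]
```

Trying different window sizes amounts to calling the function again with another `w`.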

Counting Dense Patterns

Naïve Approach:
    For W=5, N=60 there are 1485 windows per protein
    28 million possible windows for 18,544 proteins (in PDB)

Test if two sub-matrices are equal
    Linear search: O(P x W^2) with P current dense patterns
    Hash based: O(W^2)

Our Approach: 2-level Hashing, O(W) time

Pattern (WxW Sub-matrix) Encoding

Encode sub-matrix as string (W ints)

Sub-matrix    Integer Value
00000         0
01100         12
01000         8
01000         8
00000         0

Concatenated String: 0.12.8.8.0

Two-level Hashing

String-ID(M) = $v_1 v_2 \ldots v_W$

Level1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$

Level2 (exact): h2(M) = String-ID(M)
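
The encoding and the two hash levels can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the class name `PatternTable` is hypothetical, and support is counted once per window occurrence, which the slides do not confirm.

```python
from collections import defaultdict

def row_values(window):
    """Read each row of 0/1 entries as a binary integer (v_1 ... v_W)."""
    return [int("".join(str(bit) for bit in row), 2) for row in window]

def string_id(window):
    """Level-2 (exact) key: the row integers joined with dots."""
    return ".".join(str(v) for v in row_values(window))

def h1(window):
    """Level-1 (approximate) key: the sum of the row integers."""
    return sum(row_values(window))

class PatternTable:
    """Two-level table: h1 bucket first, then exact String-ID inside it."""

    def __init__(self):
        self.buckets = defaultdict(dict)

    def add(self, window):
        bucket = self.buckets[h1(window)]
        sid = string_id(window)
        bucket[sid] = bucket.get(sid, 0) + 1  # assumed: one count per occurrence
        return bucket[sid]

    def support(self, window):
        return self.buckets.get(h1(window), {}).get(string_id(window), 0)

# The sub-matrix from the encoding slide.
m = [[0, 0, 0, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0]]
table = PatternTable()
table.add(m)
table.add(m)
```

For the slide's sub-matrix the exact key is 0.12.8.8.0 and the approximate key is 28; a different sub-matrix with the same row sum lands in the same bucket but keeps its own String-ID.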

Binding Patterns to Protein Sequence and Structure

StringID: 0.12.8.8.0, Support = 170 (window size W=5)

00000
01100
01000
01000
00000

Occurrences:

pdb-name   (X,Y)   X_sequence   Y_sequence   Interaction
1070.0     52,30   ILLKN        TFVRI        alpha::beta
1145.0     51,13   VFALH        GFHIA        alpha::strand
1251.2     42,6    EVCLR        GSKFG        alpha::strand
1312.0     54,11   HGYDE        ATFAK        alpha::beta
1732.0     49,6    HRFAK        KELAG        alpha::beta
2895.0     49,7    SRCLD        DTIYY        alpha::beta
...

Frequent Dense Local Patterns

Submatrix

0 0 0 0 0 0 0 1      0 0 0 0 0 0 0 0      1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0      0 0 0 0 0 0 0 0      1 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0      0 0 0 0 0 0 0 0      1 1 1 0 0 0 0 0
0 0 0 0 1 0 0 0      0 0 0 0 0 0 0 0      0 1 1 1 0 0 0 0
0 0 0 1 0 0 0 0      1 0 0 0 0 0 0 0      0 0 1 1 1 0 0 0
0 0 1 0 0 0 0 0      1 1 0 0 0 0 0 0      0 0 0 1 1 1 0 0
0 1 0 0 0 0 0 0      1 1 1 0 0 0 0 0      0 0 0 0 1 1 1 0
1 0 0 0 0 0 0 0      0 1 1 1 0 0 0 0      0 0 0 0 0 1 1 1

Frequency

2.0%                 2.0%                 2.2%

Physical Phenomenon

Parallel beta sheet  Anti-parallel beta sheet  Anti-parallel beta sheet

Pruning Patterns

0000001000010000100000000

0000000100001000010000000

0000000010000100001000000

Same pattern (shifted to right) but different String-IDs

Merge horizontally or vertically shifted patterns

Prune away the local patterns (alpha/beta)
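
One possible way to recognize shifted copies of the same pattern is to slide every pattern to the top-left corner of its window before comparing. The slides do not describe how the merge is actually done, so the sketch below is an assumption, and `canonical_shift` is a hypothetical name.

```python
def canonical_shift(flat, w):
    """Shift the 1's of a flattened w x w pattern to the top-left corner.

    flat: string of w*w characters, each "0" or "1", in row-major order.
    Patterns that differ only by a horizontal or vertical shift map to
    the same canonical string.
    """
    ones = [(k // w, k % w) for k, ch in enumerate(flat) if ch == "1"]
    if not ones:
        return flat
    top = min(r for r, _ in ones)
    left = min(c for _, c in ones)
    grid = [["0"] * w for _ in range(w)]
    for r, c in ones:
        grid[r - top][c - left] = "1"
    return "".join("".join(row) for row in grid)

# The three shifted patterns from the slide.
shifted = [
    "0000001000010000100000000",
    "0000000100001000010000000",
    "0000000010000100001000000",
]
canonical = {canonical_shift(p, 5) for p in shifted}
```

All three patterns from the slide collapse to a single canonical string, so their supports could then be merged under one key.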

Dense Pattern Mining Results

2702 non-redundant proteins from PDB

Min-Support = 1 (exhaustive patterns)

Window size = 5, Min-Density = 5

Contact Threshold    Number of Patterns
5 Angstroms          2508
6 Angstroms          9929
7 Angstroms          21231

Frequent Dense Non-Local Patterns

[Figures: Alpha – Alpha; Alpha – Beta Sheet; Alpha – Beta Turn; Beta Sheet – Beta Turn]

Clustering Dense Patterns

Distance: Mi, Mj are dense sub-matrices

$$d(M_i, M_j) = \sum_{k=1}^{W^2} \left| M_i[k] - M_j[k] \right|$$

Use agglomerative hierarchical clustering

Find each cluster’s (c) representative (n patterns)
    Conceptually the super-imposition of n sub-matrices
    Compute contact probability at each position

$$p_c[k] = \frac{\sum_{i=1}^{n} M_i[k]}{n}$$

    Note a 1 whenever contact probability is more than a probability threshold
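
The distance and the cluster representative can be sketched as follows, using the four member patterns of the mined cluster shown in Example 1. The probability threshold is not stated on the slides; the value 0.6 used here is an assumption, picked because the Cluster Representative slide maps 0.59 to 0 and 0.68 to 1. Function names are hypothetical, and the clustering step itself is not shown.

```python
def distance(m_i, m_j):
    """Number of positions where two flattened 0/1 patterns differ."""
    return sum(a != b for a, b in zip(m_i, m_j))

def contact_probabilities(patterns):
    """p_c[k]: fraction of cluster members with a contact at position k."""
    n = len(patterns)
    return [sum(int(p[k]) for p in patterns) / n
            for k in range(len(patterns[0]))]

def representative(patterns, threshold=0.6):
    """Put a 1 wherever the contact probability exceeds the threshold.

    threshold=0.6 is an assumed value, not taken from the slides.
    """
    return "".join("1" if p > threshold else "0"
                   for p in contact_probabilities(patterns))

# Members #1355, #3496, #6282, #7980 from Example 1.
members = [
    "0001100011011111100010000",
    "0000100101111111100010000",
    "0001000000110001000010000",
    "0001100101111001000000000",
]
rep = representative(members)
```

With this threshold the sketch reproduces the representative listed for Example 1, 0001100001111001000010000.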

Cluster Representative

Contact Probabilities:

 0: 0.05    1: 0.05    2: 0.68    3: 0.85    4: 0.71
 5: 0.03    6: 0.02    7: 0.14    8: 0.07    9: 0.09
10: 0.05   11: 0.05   12: 0.12   13: 0.09   14: 0.03
15: 0.03   16: 0.05   17: 0.15   18: 0.27   19: 0.85
20: 0.25   21: 0.10   22: 0.59   23: 0.92   24: 0.83

Representative contact pattern: 00111 00000 00000 00001 00011

Clustering Quality

High and low values of pc[k] are good (most cluster members agree on k)

For a cluster c, define quality Qc:

$$S_c^1 = \sum_{k=1}^{W^2} p_c[k], \quad (p_c[k] \ge 0.5)$$

$$S_c^0 = \sum_{k=1}^{W^2} \left(1 - p_c[k]\right), \quad (p_c[k] < 0.5)$$

$$Q_c = S_c^1 + S_c^0$$

Overall clustering quality (0.5 <= Q <= 1)

$$Q = \frac{\sum_{i=1}^{NC} |c_i| \, Q_{c_i}}{NP}$$

NC = Number of Clusters
NP = Number of Patterns
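
A small sketch of the quality measure follows. One assumption is made loudly here: each cluster's score is divided by the number of positions (W^2) so that it falls in the stated 0.5 to 1 range; the slides do not show this normalization explicitly. Function names are hypothetical.

```python
def cluster_quality(p_c):
    """Quality of one cluster from its contact probabilities p_c[k].

    Positions with p >= 0.5 contribute p; the rest contribute 1 - p.
    Dividing by len(p_c) (that is, W^2) is an assumption made so the
    result lies between 0.5 and 1.
    """
    s1 = sum(p for p in p_c if p >= 0.5)
    s0 = sum(1.0 - p for p in p_c if p < 0.5)
    return (s1 + s0) / len(p_c)

def overall_quality(clusters):
    """Size-weighted average of cluster qualities.

    clusters: list of (cluster_size, p_c) pairs; the sizes sum to NP.
    """
    num_patterns = sum(size for size, _ in clusters)
    return sum(size * cluster_quality(p_c)
               for size, p_c in clusters) / num_patterns

# Toy clusters: one in full agreement, one maximally undecided.
q_agree = cluster_quality([1.0, 0.0, 1.0, 0.0])
q_split = cluster_quality([0.5, 0.5, 0.5, 0.5])
q_all = overall_quality([(3, [1.0, 0.0]), (1, [0.5, 0.5])])
```

A cluster whose members all agree scores 1, a cluster split evenly at every position scores 0.5, and the overall value weights each cluster by its size.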

Example 1: Mined Cluster

Cluster patterns (beta-beta strand)

#1355             0001100011011111100010000
#3496             0000100101111111100010000
#6282             0001000000110001000010000
#7980             0001100101111001000000000
representative    0001100001111001000010000

Example 2: Mined Cluster

Cluster patterns (beta-beta turn)

#196              1101001111010000100011000
#503              0100001110010000100011000
#2834             1100001100011100100001000
#8697             1101001110011000110001000
representative    1100001110010000100001000

Clustering Results

Contact Threshold    Number of Patterns    Number of Clusters    Cluster Quality
5 Angstroms          2508                  83                    0.89
6 Angstroms          9929                  99                    0.86
7 Angstroms          21231                 367                   0.84

Future Work

Comprehensive list of non-local motifs
    I-sites library (by Prof. Bystroff) catalogs local motifs

Future Directions
    Improving prediction of contact maps
    Mining heuristic rules for “physicality”
    Protein folding pathways

Improving Contact Map Prediction

[Figure: contact map with two regions labeled "Physically Impossible"]

Mining Physicality Rules

Mining heuristic rules for “physicality”
    Based on simple geometric constraints
    Rules governing contacts and non-contacts

Parallel Beta Sheets:
    If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Anti-parallel Beta Sheets:
    If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

Alpha Helices:
    If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
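
The three rules translate directly into a checker that scans a contact map and reports where a rule is broken. This is a sketch, not the authors' method: the function name, the `min_sep` restriction on the beta-sheet rules, and the toy map are all assumptions.

```python
def physicality_violations(cmap, min_sep=4):
    """Return (rule, i, j) for each place a heuristic rule is violated.

    cmap: symmetric 0/1 contact map as a list of lists.
    min_sep: assumed minimum |i - j| for the two beta-sheet rules.
    """
    n = len(cmap)
    found = []
    for i in range(n - 2):
        for j in range(i + min_sep, n - 2):
            # Parallel: C(i,j) and C(i+2,j+2) forbid C(i,j+2) and C(i+2,j).
            if cmap[i][j] and cmap[i + 2][j + 2] and (
                    cmap[i][j + 2] or cmap[i + 2][j]):
                found.append(("parallel", i, j))
            # Anti-parallel: C(i,j+2) and C(i+2,j) forbid C(i,j) and C(i+2,j+2).
            if cmap[i][j + 2] and cmap[i + 2][j] and (
                    cmap[i][j] or cmap[i + 2][j + 2]):
                found.append(("anti-parallel", i, j))
    for i in range(n - 4):
        if not cmap[i][i + 4]:
            continue
        for j in range(n):
            # Alpha helix: C(i,i+4), C(i,j), C(i+4,j) forbid C(i+2,j).
            if cmap[i][j] and cmap[i + 4][j] and cmap[i + 2][j]:
                found.append(("alpha", i, j))
    return found

def make_map(n, contacts):
    """Build a symmetric toy map from a list of (i, j) contacts."""
    cmap = [[0] * n for _ in range(n)]
    for i, j in contacts:
        cmap[i][j] = cmap[j][i] = 1
    return cmap

bad = physicality_violations(make_map(12, [(1, 8), (3, 10), (1, 10)]))
ok = physicality_violations(make_map(12, [(1, 8), (3, 10)]))
```

In the first toy map the extra contact (1,10) breaks the parallel rule at i=1, j=8; without it the map passes all three checks.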

Heuristic Rules of Physicality

[Figure: anti-parallel beta sheets, with residues i, i+2, j, j+2 marked]

Anti-parallel Beta Sheets
    If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

[Figure: parallel beta sheets, with residues i, i+2, j, j+2 marked]

Parallel Beta Sheets
    If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

[Figure: alpha helix, with residues i, i+2, i+4, j marked]

Alpha Helix
    If C(i,j) = 1 and C(i+4,j) = 1 and C(i,i+4) = 1, then C(i+2,j) = 0

Protein Folding Pathways

Rules for Pathways in Contact Map Space
    Pathway is a time-ordered sequence of contacts
    Consider only native contacts (those that are present in the true map)
    Condensation rule: New contacts within Smax
        U(i,j) <= Smax; U(i,j) = unfolded residues from i to j

Pathway prediction is complementary to structure prediction

Contact Map Folding Pathways

