Finding similarities in protein structures: a string approach IFBM 2004 Atelier de Bio-Informatique (ABI) - Université Paris VI
Page 1

Finding similarities in protein structures:

a string approach

IFBM 2004

• Atelier de Bio-Informatique (ABI) - Université Paris VI

Page 2

Searching for strictly repeated patterns: KMR (1972)

- Occurrences of strictly repeated 2k-length patterns are built from intersections of sets of strictly repeated k-length patterns which lie side-by-side

This leads to an O(n.log(kmax)) algorithm for finding kmax-length repeated patterns

[Figure: two adjacent k-length patterns combine into one 2k-length pattern]
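The doubling idea above can be sketched in a few lines of Python. This is a minimal illustration, not the original KMR implementation: it uses dictionary-based relabelling rather than stacks, positions are 0-based, and the function name `kmr_repeats` is an assumption made for this example.

```python
from collections import defaultdict

def kmr_repeats(s, kmax):
    """Return {pattern: [0-based positions]} for kmax-length patterns that repeat."""
    n = len(s)
    ids = {}
    # k = 1: two positions get the same label iff they carry the same symbol.
    labels = [ids.setdefault(c, len(ids)) for c in s]
    k = 1
    while k < kmax:
        # Double the length, or take a shorter (overlapping) step to land on kmax.
        step = min(k, kmax - k)
        ids = {}
        # A (k + step)-pattern at i is identified by the k-patterns at i and i + step.
        labels = [ids.setdefault((labels[i], labels[i + step]), len(ids))
                  for i in range(n - (k + step) + 1)]
        k += step
    groups = defaultdict(list)
    for i, lab in enumerate(labels):
        groups[lab].append(i)
    return {s[pos[0]:pos[0] + kmax]: pos
            for pos in groups.values() if len(pos) > 1}

print(kmr_repeats("abcabcab", 3))
```

Each pass over the string costs expected linear time with hashing, and about log2(kmax) passes are needed, which is where the O(n.log(kmax)) bound comes from.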

Page 3

Flexible patterns : KMRC

• A pattern is no longer defined as a succession of symbols; instead it is a succession of cliques of symbols

S = caaaabaaacb

Here m = c1-c1-c2 is a repeated flexible pattern of length 3, occurring at positions 4 and 8

Note: several patterns may exist at the same position: here c1-c1-c1 also has an occurrence at position 4

[Figure: alphabet {a, b, c} with cliques c1 = {a, b} and c2 = {b, c}; clique string sC = 2 1 1 1 1 (1,2) 1 1 1 2 (1,2)]

Page 4

Flexible patterns : KMRC

• As in KMR, the 2k-length patterns are built from k-length patterns.

• Here, a pattern is a clique of (similar) patterns, and at one position in the string there may exist several cliques of patterns.

The algorithm is now O(n.log(k).g^k) (g being the mean degeneracy, i.e. the mean number of cliques a symbol belongs to)

In biology, identity is trivial, similarity is interesting…
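As a rough sketch of how clique patterns can be combined side by side, the Python below reruns the S = caaaabaaacb example. The clique contents c1 = {a, b} and c2 = {b, c} are read off the example (they are the sets consistent with the stated occurrences), and the helper names `seed_patterns` and `combine` are assumptions for illustration, not the KMRC code itself.

```python
from collections import defaultdict

def seed_patterns(s, cliques):
    """Length-1 flexible patterns: (clique name,) -> set of 1-based positions."""
    occ = defaultdict(set)
    for i, sym in enumerate(s, start=1):
        for name, members in cliques.items():
            if sym in members:          # a symbol may belong to several cliques
                occ[(name,)].add(i)
    return dict(occ)

def combine(left, right, k_left, quorum=2):
    """Join patterns of length k_left with patterns that start right after them."""
    out = {}
    for p, pos_p in left.items():
        for q, pos_q in right.items():
            pos = {i for i in pos_p if i + k_left in pos_q}
            if len(pos) >= quorum:      # keep only repeated patterns
                out[p + q] = pos
    return out

CLIQUES = {"c1": {"a", "b"}, "c2": {"b", "c"}}   # assumed from the example
S = "caaaabaaacb"
p1 = seed_patterns(S, CLIQUES)
p2 = combine(p1, p1, 1)      # length 2 from two length-1 patterns
p3 = combine(p2, p1, 2)      # length 3 from length 2 + length 1
print(sorted(p3[("c1", "c1", "c2")]))
```

The nested loop over pattern pairs is where the degeneracy g enters the cost: the more cliques a symbol belongs to, the more clique patterns coexist at each position.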

Page 5

Flexible patterns in sequences (KMRC)

• This algorithm may be used to find flexible patterns in several protein sequences (multiple alignment by blocks)

• Cliques of “symbols” define similarity, e.g.:
- different overlapping sets of amino acids clustered by their properties (e.g. hydrophobic, hydrophilic, small, large, polar, charged, etc.)
- different overlapping sets of amino acids clustered by setting a threshold value on their score in a similarity matrix (e.g. BLOSUM or PAM)

[Figure: blocks p1 p2 p3 p4 p5 along the sequences]

Page 6

PAM250

Threshold -> cliques

Similarity is not transitive!
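To make the "threshold -> cliques" step concrete, here is a small sketch that thresholds a score table and enumerates maximal cliques with the basic Bron-Kerbosch recursion. The scores in `TOY_SCORES` are invented for illustration and are not PAM250 values; the function names are likewise assumptions.

```python
def similarity_graph(symbols, score, threshold):
    """Link two distinct symbols when their score reaches the threshold."""
    adj = {a: set() for a in symbols}
    for a in symbols:
        for b in symbols:
            if a != b and score(a, b) >= threshold:
                adj[a].add(b)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques (Bron-Kerbosch without pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))   # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

# Hypothetical scores: a~b and b~c are similar, a~c is not (non-transitive).
TOY_SCORES = {frozenset("ab"): 2, frozenset("bc"): 1, frozenset("ac"): -3}
adj = similarity_graph("abc", lambda a, b: TOY_SCORES[frozenset((a, b))], 1)
print(maximal_cliques(adj))
```

Because a is similar to b and b to c, but a is not similar to c, the symbol b ends up in two cliques. This is exactly the overlap that an equivalence-class alphabet could not express.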

Page 7

Flexible patterns in structures (KMRC)

• Finding 3D structural patterns in several protein structures

• Structures must be described as strings of symbols, and similar structures must be composed of similar symbols

• -> use of discretized internal coordinates (angles) as an alphabet: [symbol lost] or [symbol lost] angles

Page 8

Internal coordinates

Page 9

Internal coordinates to symbols

[Figure: internal coordinates along the chain are discretized into a string of symbols]

Similarity (KMRC) is absolutely needed, not identity!

Page 10

Flexible patterns : KMRC

Finding flexible patterns of angle symbols:

Here, cliques of “symbols” are overlapping sets of angles

Similarity is a critical point (identity would miss structural features).

[Figure: angle axis from -180° through 0° to 180°]
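A minimal sketch of the discretization and of overlapping angle cliques follows. The bin width (10°) and the clique span (3 consecutive bins) are assumptions chosen for illustration; the slides do not give the values actually used.

```python
def angle_to_symbol(angle_deg, bin_width=10):
    """Map an angle in degrees to a bin index, with -180° falling in bin 0."""
    return int(((angle_deg + 180.0) % 360.0) // bin_width)

def angle_cliques(bin_width=10, span=3):
    """Overlapping cliques: `span` consecutive bins, wrapping around at ±180°."""
    n = 360 // bin_width
    return [frozenset((start + j) % n for j in range(span)) for start in range(n)]

def similar(sym_a, sym_b, cliques):
    """Two symbols are similar when at least one clique contains both."""
    return any(sym_a in c and sym_b in c for c in cliques)

CLIQUES = angle_cliques()
a, b = angle_to_symbol(9.9), angle_to_symbol(10.1)
print(a, b, similar(a, b, CLIQUES))
```

The two angles differ by 0.2° yet fall into different bins, so an identity-based search would treat them as unrelated. The overlapping cliques recover the match, including across the -180°/180° boundary.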

Page 11

Flexible patterns : KMRC

3 cytochromes P450: TERP, BM3, CAM

PMWIATKHADVMQIGVTRYLSSQRLIKEACGHWIATRGQLIREAY

PTHTAYRGLTLNWFQPASIRKLEENIRRIAQASVQRKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKPEQRQFRALANQVVGMPVVDKLENRIQELACSLIES

CDFMTDCALYYPLHVVMTALGVPIEVPEDMTRLTLDTIGLCGFNYRCNFTEDYAEPFPIRIFMLLAGLP

EDDEPLMLKLTQDFITSMVRALDEAMNKEDIPHLKYLTDQMT

FHETIATFYDYFNGFTVDRRSFQEDIKVMNDLVDKIIADRKAFAEAKEALYDYLIPIIEQRRQ

CPKDDVMSLLANEQSDDLLTHMLNKPGTDAISIVAN

Page 12

Bach BWV846

Similarity : nature of elements OR relations between them

Page 13

Bach BWV846

Similarity: series of notes with the same pitch? -> rather, transposed series…

Page 14

Bach BWV846

-> Similarity of relations between elements
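The musical analogy can be illustrated with a few lines of Python: two series of notes share the same relations when their successive pitch differences agree, even if no individual pitch matches. The note numbers below are illustrative MIDI-style integers, not a transcription of the score, and `intervals` is a name chosen for this sketch.

```python
def intervals(pitches):
    """Relations between neighbouring elements: successive pitch differences."""
    return tuple(b - a for a, b in zip(pitches, pitches[1:]))

motif = [60, 64, 67, 72, 76]              # illustrative note numbers
transposed = [p + 7 for p in motif]       # same shape, shifted upwards

print(motif == transposed)                         # compared as symbols
print(intervals(motif) == intervals(transposed))   # compared as relations
```

Comparing elements finds nothing in common between the two series, whereas comparing relations identifies them as the same pattern.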

Page 15

Relational patterns (KMRR)

• Now, a similar pattern is not defined by a succession of similar symbols, but by a succession of elements that share the same relationships between them.

[Figure: pattern m = three elements with pairwise relations r12, r23 and r13; every occurrence carries the same three relations]

Examples of relations: “to be higher than”, “to be lower than”, “to be equal to”, …

Page 16

One step further: flexible relational patterns (KMRR)

• The relations between elements do not need to be the same, they just need to be similar

[Figure: pattern m with relations ra, rb, rc; its occurrences carry relations drawn from the relational cliques CR1 and CR2]

Relational cliques are cliques of relations, e.g. {“to be higher”, “to be equal”}, {“to be lower”, “to be equal”}, …
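A minimal sketch of flexible relational matching, using the two relational cliques quoted above. It checks every pair of elements independently, which is a simplification assumed for this example and not necessarily how KMRR builds its patterns; all function names are hypothetical.

```python
from itertools import combinations

def relation(x, y):
    """Relation of the later element y with respect to the earlier element x."""
    if y > x:
        return "higher"
    if y < x:
        return "lower"
    return "equal"

# Relational cliques taken from the slide.
RELATION_CLIQUES = [{"higher", "equal"}, {"lower", "equal"}]

def similar_relation(r1, r2, cliques=RELATION_CLIQUES):
    """Two relations are similar when one clique contains both."""
    return any(r1 in c and r2 in c for c in cliques)

def same_flexible_pattern(xs, ys, cliques=RELATION_CLIQUES):
    """True when every pair (i, j) carries similar relations in xs and in ys."""
    if len(xs) != len(ys):
        return False
    return all(
        similar_relation(relation(xs[i], xs[j]), relation(ys[i], ys[j]), cliques)
        for i, j in combinations(range(len(xs)), 2)
    )

print(same_flexible_pattern([1, 3, 5], [2, 2, 9]))
print(same_flexible_pattern([1, 3, 5], [5, 3, 1]))
```

The first pair matches because "equal" shares a clique with "higher"; the second fails because "higher" and "lower" never share one.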

Page 17

An example: application to 3D structures

Relations are defined on discrete internal distances between points:
r(i,j) = r_k if and only if r_k ≤ dist(i,j) < r_k + ∆

The relations r(i,j) and r(i’,j’) are considered similar if they belong to the same subset {r_k, r_k+1, r_k+2}, i.e. if
| r(i,j) - r(i’,j’) | ≤ 2

This implies for Euclidean distances:
| dist(i,j) - dist(i’,j’) | < 3∆

[Figure: distance axis divided into bins r_1, r_2, r_3, …, r_k, r_k+1, r_k+2, with d(i,j) and d(i’,j’) falling in nearby bins]
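The 3∆ bound follows in two lines once the bins are written out. The derivation below assumes that the bin boundaries are regularly spaced, r_k = k∆, which the slide suggests but does not state, and writes d = dist(i,j), d' = dist(i',j').

```latex
\begin{align*}
r(i,j) = r_k,\quad r(i',j') = r_{k'}
  &\;\Longrightarrow\; k\Delta \le d < (k+1)\Delta,
  \qquad k'\Delta \le d' < (k'+1)\Delta \\
d - d' &< (k+1)\Delta - k'\Delta = (k - k' + 1)\,\Delta \le 3\Delta
  \qquad \text{since } |k - k'| \le 2 \\
d' - d &< (k'+1)\Delta - k\Delta = (k' - k + 1)\,\Delta \le 3\Delta \\
\text{hence}\quad |d - d'| &< 3\Delta .
\end{align*}
```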

Page 18

Application to 3D structures:

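Relational cliques (defined on distances), over the discrete values 1 2 3 4 5 6 7 8 9 10 …

[Figure: points p1 … p9 joined by edges labelled with their discretized distances]

First occurrence (p1 … p4):
r(p1,p2)=3 r(p1,p3)=4 r(p1,p4)=5 r(p2,p3)=4 r(p2,p4)=3 r(p3,p4)=3

Second occurrence (p6 … p9):
r(p6,p7)=4 r(p6,p8)=3 r(p6,p9)=6 r(p7,p8)=4 r(p7,p9)=4 r(p8,p9)=2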
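The two occurrences listed above can be checked mechanically. The sketch below assumes regularly spaced bins (relation index k for distances in [k∆, (k+1)∆)) and compares the six relations pair by pair; the function names and the pair ordering are choices made for this example.

```python
import math

def relation(p, q, delta):
    """Index k of the distance bin [k*delta, (k+1)*delta) containing dist(p, q)."""
    return int(math.dist(p, q) // delta)

def similar_relations(r1, r2, width=2):
    """Relations are similar when they fit in one subset {r_k, r_k+1, r_k+2}."""
    return abs(r1 - r2) <= width

def same_relational_pattern(rel_a, rel_b, width=2):
    """Compare two occurrences given as relation lists in the same pair order."""
    return len(rel_a) == len(rel_b) and all(
        similar_relations(a, b, width) for a, b in zip(rel_a, rel_b)
    )

# Values from the slide, pair order (1,2) (1,3) (1,4) (2,3) (2,4) (3,4).
FIRST = [3, 4, 5, 4, 3, 3]     # p1 … p4
SECOND = [4, 3, 6, 4, 4, 2]    # p6 … p9
print(same_relational_pattern(FIRST, SECOND))
```

Every pair of corresponding relations differs by at most 1 here, so the two sets of four points count as occurrences of the same flexible relational pattern.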

Page 19

Application to 3D structures:

Ex: multiple structural alignment of 4 pectate and pectin lyases: 1PCL, 1IDJ, 2BSP, 1PLU

1PCL: AEWDAAVIDNSTNVWVDHVT
1IDJ: WGGDAITLDDCDLVWIDHVT
2BSP: SQYDNITINGGTHIWIDHCT
1PLU: KDGDMIRVDDSPNVWVDHNE

1PCL: LRVTFHNNVFDRVTERAPRV
1IDJ: DLVTMKGNYIYHTSGRSPKV
2BSP: LKITLHHNRYKNIVQKAPRV
1PLU: RNITYHHNYYNDVNARLPLQ

1PCL: TERAPRVRFGSIHAYNNVYL
1IDJ: GRSPKVQDNTLLHCVNNYFY
2BSP: VQKAPRVRFGQVHVYNNYYE
1PLU: NARLPLQRGGLVHAYNNLYT

1PCL: AQTMTSSLATSINNNAGYGK
1IDJ: SASAYTSVASRVVANAGQGN
2BSP: SIDASANVKSNVINQAGAGK
1PLU: SPVSAQCVKDKLPGYAGVGK

Page 20

Structural similarity search in databases: YAKUSA

[Figure: query -> fast structural scanning -> database (PDB)]

Page 21

Structural similarity search in databases: YAKUSA

Structures encoded with angles

For each database protein:
1 - find structurally similar seeds with the query
2 - extend seeds to longer structural matches

Then rank the structural hits

Page 22

Seeds of length k

All overlapping k-patterns of the query -> automaton

[Figure: dictionary/automaton whose branches are labelled with the symbols 1 and 2; leaf -> one seed]

Advantage: no moving backward in the database string; less moving backward in the patterns

Page 23

Query seeds automaton

[Figure: the dictionary/automaton of query seeds (branches labelled 1, 2) compared with symbols ’1, ’2, drawn twice: once with degeneracy, once with similar patterns. The condition shown is dc(i, ’i) < … (the threshold did not survive extraction)]
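The seed dictionary can be sketched as a trie over the query's overlapping k-patterns, walked with a tolerance on each symbol. This is a simplified stand-in: it restarts the walk at every database position instead of using the backtrack-free automaton described on the slides, and the circular bin distance `circ_diff`, the threshold `eps` and the bin count `n_bins` are assumptions for illustration.

```python
def build_seed_trie(query, k):
    """Trie of all overlapping k-patterns; each leaf lists query start positions."""
    root = {}
    for start in range(len(query) - k + 1):
        node = root
        for sym in query[start:start + k]:
            node = node.setdefault(sym, {})
        node.setdefault("starts", []).append(start)
    return root

def circ_diff(a, b, n_bins):
    """Distance between two angle bins on a circle of n_bins bins."""
    d = abs(a - b) % n_bins
    return min(d, n_bins - d)

def find_seeds(trie, database, k, eps, n_bins):
    """Return sorted (query_start, db_start) pairs whose k symbols differ by < eps."""
    hits = []
    for db_start in range(len(database) - k + 1):
        frontier = [trie]
        for sym in database[db_start:db_start + k]:
            # Follow every branch whose label is close enough to the symbol.
            frontier = [child for node in frontier
                        for key, child in node.items()
                        if key != "starts" and circ_diff(key, sym, n_bins) < eps]
            if not frontier:
                break
        else:
            for leaf in frontier:
                hits.extend((q, db_start) for q in leaf["starts"])
    return sorted(hits)

trie = build_seed_trie([1, 2, 2, 1], k=2)
print(find_seeds(trie, [5, 1, 3, 2, 9], k=2, eps=2, n_bins=36))
```

With `eps=1` the walk degenerates to exact matching; larger values let one database window reach several leaves, which is the degeneracy pictured on the slide.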

Page 24

Matching seeds to SHSP

[Figure: query versus database plot showing a seed and the SHSP built around it]

- many seeds, giving only several SHSP

- SHSP = maximal scoring region around the seed, the scores being based upon the angular differences

- probability associated with SHSP (MTD: pair approximation of Markov model)
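Extending a seed to the maximal scoring region around it can be sketched as an ungapped extension in both directions, keeping the best running total on each side. The scoring function used here (a bonus of 20 minus the circular angular difference in degrees) is purely hypothetical; the slides only say that scores are based on angular differences.

```python
def angular_score(a, b, bonus=20):
    """Hypothetical score: positive for close angles, negative for distant ones."""
    d = abs(a - b) % 360
    return bonus - min(d, 360 - d)

def best_extension(pairs, score):
    """Best cumulative gain over prefixes of `pairs`, and the prefix length."""
    best, best_len, total = 0.0, 0, 0.0
    for n, (a, b) in enumerate(pairs, start=1):
        total += score(a, b)
        if total > best:
            best, best_len = total, n
    return best, best_len

def extend_seed(query, database, q_start, d_start, k, score=angular_score):
    """Return (score, q_begin, q_end) of the best ungapped region holding the seed."""
    seed = sum(score(query[q_start + i], database[d_start + i]) for i in range(k))
    right = zip(query[q_start + k:], database[d_start + k:])
    left = zip(reversed(query[:q_start]), reversed(database[:d_start]))
    r_gain, r_len = best_extension(right, score)
    l_gain, l_len = best_extension(left, score)
    return seed + l_gain + r_gain, q_start - l_len, q_start + k + r_len

query = [10, 60, 70, 80, -100]
database = [150, 62, 70, 81, 100]
print(extend_seed(query, database, q_start=2, d_start=2, k=1))
```

The region stops growing on each side as soon as further pairs would only lower the total, so many overlapping seeds on the same diagonal collapse into the same region.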

Page 25

SHSPs to results…

Database structures found, ranked by score

http://bioserv.rpbs.jussieu.fr/Yakusa

40 seconds for scanning 11000 PDB structures

Page 26

Example of output:

[Figure: SHSPs found between the query (Deoxyribonuclease I) and a database structure (Heat-Shock protein 70)]

http://bioserv.rpbs.jussieu.fr/Yakusa

Page 27

Multiple structural alignment: « m-diagonals »

First step: finding pair “diagonals” (at the angle level)

Selection of best 2-diagonals

[Figure: dot plot; axes: index in first protein, index in second protein]
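Finding pair diagonals amounts to scanning each diagonal of the comparison matrix for runs of similar symbols. The sketch below is an assumption-laden illustration: the similarity test is passed in as a function, and "best" is reduced to a minimum run length, whereas the slides do not say how the best 2-diagonals are chosen.

```python
def pair_diagonals(a, b, similar, min_len):
    """Return (i, j, length) for maximal runs with similar(a[i+t], b[j+t])."""
    runs = []
    for offset in range(-(len(b) - 1), len(a)):
        # Start of the diagonal i - j = offset.
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < len(a) and j < len(b):
            if similar(a[i], b[j]):
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:          # run reaching the end of the diagonal
            runs.append((i - run, j - run, run))
    return runs

first = [1, 2, 3, 4, 9]
second = [7, 2, 3, 4]
print(pair_diagonals(first, second, lambda x, y: x == y, min_len=3))
```

In practice `similar` would be a clique-based test on angle symbols rather than equality.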

Page 28

Multiple structural alignment: « m-diagonals »

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

[Figure: Protein 1, Protein 2 and Protein 3 on three axes]

m-diagonal in m dimensions: here, the 3-diagonal is the combination of three 2-diagonals of dimension 2.

Page 29

Multiple structural alignment: « m-diagonals »

[Figure: the three-protein construction (Protein 1, Protein 2, Protein 3) redrawn with the labels 2, 3 and 5 added]

Page 30

Multiple structural alignment: « m-diagonals »

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

“Column” graphs:
- a residue is a node
- if two residues are in a 2-diagonal, they are connected by a link.

Selection of best m-diagonals (most connected ones)

[Figure: residues of Protein 1, Protein 2 and Protein 3 joined by links; labels 2, 3 and 5]
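The column graph can be sketched as follows: every residue is a node `(protein, index)`, every aligned pair inside a 2-diagonal adds a link, and a column is a set of one residue per protein. As a simplification assumed for this example, only fully connected columns are kept, whereas the slide speaks more loosely of the "most connected" ones; the data layout and function names are hypothetical.

```python
from collections import defaultdict

def column_graph(diagonals):
    """diagonals: {(p, q): [(i, j, length), ...]} -> adjacency between residues."""
    links = defaultdict(set)
    for (p, q), runs in diagonals.items():
        for i, j, length in runs:
            for t in range(length):
                links[(p, i + t)].add((q, j + t))
                links[(q, j + t)].add((p, i + t))
    return links

def full_columns(links, proteins):
    """Columns holding one residue per protein, all pairwise linked."""
    columns = []

    def grow(column, remaining):
        if not remaining:
            columns.append(tuple(idx for _, idx in column))
            return
        candidates = {n for n in links[column[0]] if n[0] == remaining[0]}
        for node in sorted(candidates):
            if all(node in links[other] for other in column):
                grow(column + [node], remaining[1:])

    for node in sorted(n for n in list(links) if n[0] == proteins[0]):
        grow([node], list(proteins[1:]))
    return columns

# Three hypothetical 2-diagonals between proteins 0, 1 and 2.
diagonals = {(0, 1): [(2, 5, 3)], (1, 2): [(5, 1, 3)], (0, 2): [(3, 2, 2)]}
links = column_graph(diagonals)
print(full_columns(links, [0, 1, 2]))
```

Consecutive full columns, here (3, 6, 2) followed by (4, 7, 3), are what a 3-diagonal looks like in this representation.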

Page 31

Example: 13 cytochromes P450

- In blue: non-aligned parts

- Other colors: m-diagonals

Page 32

Other methods: « Gibbs sampling », « Taboo search »

Page 33

Alain Viari
Henry Soldano
Mathilde Carpentier
Marie-France Sagot
Pascale Jean
Sophie Brouillet
Nadia Pisanti

Topics: Kmrc+gok, Cytochromes P450, Yakusa, m-diagonals, Gibbs-taboo, relational patterns

Near future: classification of structural cores in PDB…

Page 34

[Figure: KMR stack construction on the string a a b a a b a (positions 1 to 7)]

k=1, k-length stacks (P1):
a: 7 5 4 2 1
b: 6 3

Look at +k: for each position in a stack, look at the symbol k positions further (V1):
_a: 1 4 ; 6 3
_b: 2 5 ;

Put position in set stack.

k=2, 2k-length stacks (P2):
aa: 4 1
ab: 5 2
ba: 6 3
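The stack picture can be replayed with a short sketch: each k-length pattern owns a stack of positions, and one doubling step looks at position + k to decide which 2k-length stack a position goes to. Positions are 1-based as on the slide; the function names and the use of Python lists in place of real stacks are assumptions for illustration.

```python
def initial_stacks(s):
    """k = 1: one stack of 1-based positions per symbol."""
    stacks = {}
    for pos, sym in enumerate(s, start=1):
        stacks.setdefault(sym, []).append(pos)
    return stacks

def double_stacks(stacks, k):
    """Build the 2k-length stacks from the k-length stacks."""
    # Which k-length pattern starts at each position?
    owner = {pos: pat for pat, positions in stacks.items() for pos in positions}
    doubled = {}
    for pat, positions in stacks.items():
        for pos in positions:
            nxt = owner.get(pos + k)          # "look at +k"
            if nxt is not None:
                doubled.setdefault(pat + nxt, []).append(pos)
    # Only repeated patterns survive to the next round.
    return {p: v for p, v in doubled.items() if len(v) > 1}

p1 = initial_stacks("aabaaba")
p2 = double_stacks(p1, 1)
p4 = double_stacks(p2, 2)
print(p1, p2, p4)
```

Dropping non-repeated patterns after each step is safe because a pattern that occurs once cannot be part of a longer repeated pattern.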

