Finding similarities in protein structures:
a string approach
IFBM 2004
• Atelier de Bio-Informatique (ABI) - Université Paris VI
Searching for strict repeated patterns : KMR (1972)
- Occurrences of strictly repeated 2k-length patterns are built from intersections of sets of strictly repeated k-length patterns which lie side-by-side
This leads to an O(n.log(kmax)) algorithm for finding kmax-length repeated patterns
[Figure: two k-length patterns lying side by side are combined into one 2k-length pattern]
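The doubling idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the original KMR implementation: the function names `relabel`, `kmr_labels` and `kmr_repeats` are made up here, positions are 0-based, and hashing of label pairs stands in for the stack manipulation of the original algorithm. When 2k would overshoot kmax, the two k-length halves are simply overlapped.

```python
from collections import defaultdict

def relabel(keys):
    """Give equal keys the same integer label (labels assigned in order of first appearance)."""
    ids = {}
    return [ids.setdefault(key, len(ids)) for key in keys]

def kmr_labels(s, kmax):
    """labels[i] identifies the kmax-length pattern starting at 0-based position i."""
    n = len(s)
    labels = relabel(s)          # k = 1: one label per symbol
    k = 1
    while k < kmax:
        step = min(k, kmax - k)  # overlap the two halves if 2k would exceed kmax
        new_k = k + step
        # a new_k-length pattern at i is determined by the k-length patterns at i and i + step
        pairs = [(labels[i], labels[i + step]) for i in range(n - new_k + 1)]
        labels = relabel(pairs)
        k = new_k
    return labels

def kmr_repeats(s, kmax):
    """Map each kmax-length pattern occurring at least twice to its 0-based positions."""
    groups = defaultdict(list)
    for i, label in enumerate(kmr_labels(s, kmax)):
        groups[label].append(i)
    return {s[pos[0]:pos[0] + kmax]: pos for pos in groups.values() if len(pos) > 1}

print(kmr_repeats("aabaaba", 2))  # {'aa': [0, 3], 'ab': [1, 4], 'ba': [2, 5]}
```

Only about log2(kmax) relabelling rounds are needed, each linear in n with hashing, which is where the O(n.log(kmax)) behaviour comes from.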
Flexible patterns : KMRC
• A pattern is no longer defined as a succession of symbols; instead it is a succession of cliques of symbols
Cliques over the alphabet {a, b, c}: c1 = {a, b}, c2 = {b, c}
S  = c a a a a b a a a c b
sC = 2 1 1 1 1 1 1 1 1 2 1 (b also belongs to c2: positions 6 and 11)
Here m = c1-c1-c2 is a repeated flexible pattern of length 3, occurring at positions 4 and 8
Note: several patterns may exist at the same position: here c1-c1-c1 also has an occurrence at position 4
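The example can be checked with a naive scan. This is not the KMRC algorithm itself, only a direct test of where a given flexible pattern occurs. It assumes the figure is read as c1 = {a, b} and c2 = {b, c}, which is consistent with the positions quoted on the slide; the function name `occurrences` is illustrative.

```python
# Assumed reading of the figure: two overlapping cliques sharing the symbol b.
CLIQUES = {"c1": {"a", "b"}, "c2": {"b", "c"}}

def occurrences(s, pattern, cliques=CLIQUES):
    """1-based positions where each symbol of s belongs to the matching clique of the pattern."""
    k = len(pattern)
    return [i + 1
            for i in range(len(s) - k + 1)
            if all(s[i + j] in cliques[name] for j, name in enumerate(pattern))]

S = "caaaabaaacb"
print(occurrences(S, ["c1", "c1", "c2"]))  # [4, 8]
print(occurrences(S, ["c1", "c1", "c1"]))  # [2, 3, 4, 5, 6, 7]
```

Position 4 appears in both results because b belongs to both cliques.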
Flexible patterns : KMRC
• As in KMR, the 2k-length patterns are built from k-length patterns.
• Here, a pattern is a clique of (similar) patterns, and at one position in the string there may exist several cliques of patterns.
The algorithm is now O(n.log(k).g^k), g being the mean degeneracy, i.e. the mean number of cliques a symbol belongs to
In biology, identity is trivial, similarity is interesting…
Flexible patterns in sequences (KMRC)
• This algorithm may be used to find flexible patterns in several protein sequences (multiple alignment by blocks):
• Cliques of “symbols” define similarity, e.g.:
- different overlapping sets of amino acids clustered by their properties (e.g. hydrophobic, hydrophilic, small, large, polar, charged, etc.)
- different overlapping sets of amino acids clustered by setting a threshold value on their score in a similarity matrix (e.g. BLOSUM or PAM)
[Figure: PAM250 similarity matrix -> threshold -> cliques; sequence positions p1 to p5]
Similarity is not transitive!
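The step "threshold -> cliques" can be sketched as follows. The scores below are made-up toy values over three abstract symbols, not PAM250 entries, and the function names are illustrative. Maximal cliques of the thresholded similarity graph are enumerated with the plain Bron-Kerbosch recursion.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; adj maps node -> set of neighbours."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found

def cliques_from_scores(scores, threshold):
    """Link two symbols when their score reaches the threshold, then list maximal cliques."""
    symbols = sorted({sym for pair in scores for sym in pair})
    adj = {sym: set() for sym in symbols}
    for (a, b), score in scores.items():
        if a != b and score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return maximal_cliques(adj)

# Toy scores (hypothetical): A~B and B~C are similar, A~C is not.
TOY_SCORES = {("A", "B"): 3, ("B", "C"): 3, ("A", "C"): -1}
print(set(cliques_from_scores(TOY_SCORES, threshold=2)))
```

With these toy values the result is the two overlapping cliques {A, B} and {B, C}: B is similar to both A and C although A and C are not similar to each other, which is exactly the non-transitivity that a partition into classes could not express.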
Flexible patterns in structures (KMRC)
• Finding 3D structural patterns in several protein structures
• Structures must be described as strings of symbols, and similar structures must be composed of similar symbols
• -> use of discretized internal coordinates
(angles) as an alphabet : or angles
Internal coordinates
Internal coordinates to symbols
[Figure: the series of internal coordinates along the chain is turned into a string of symbols by discretization]
Absolute need of similarity (KMRC), not identity !
Flexible patterns : KMRC
Finding flexible patterns of -symbols :
Here, cliques of “symbols” are overlapping sets of angles
Similarity is a critical point (identity would miss structural features).
[Figure: angle axis from -180° through 0° to 180°, covered by overlapping sets]
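A minimal sketch of such a discretization, under stated assumptions: the 10° bin width and the rule "two symbols are similar when their bins are at most one apart on the circle" are illustrative choices, not values taken from the slides, and the function names are hypothetical.

```python
def angle_symbol(angle_deg, bin_width=10.0):
    """Map an angle in degrees to a bin index in 0 .. n_bins - 1 (bin 0 starts at -180)."""
    n_bins = int(360 / bin_width)
    return int(((angle_deg + 180.0) % 360.0) // bin_width) % n_bins

def circular_gap(i, j, n_bins):
    """Number of bins between two symbols, going the short way around the circle."""
    d = abs(i - j) % n_bins
    return min(d, n_bins - d)

def similar(a_deg, b_deg, bin_width=10.0, tol=1):
    """Two angles are similar if their symbols are at most tol bins apart."""
    n_bins = int(360 / bin_width)
    return circular_gap(angle_symbol(a_deg, bin_width),
                        angle_symbol(b_deg, bin_width), n_bins) <= tol

print(angle_symbol(179.0), angle_symbol(-179.0))  # 35 0
print(similar(179.0, -179.0))                     # True
print(similar(0.0, 50.0))                         # False
```

The angles 179° and -179° are only 2° apart but receive different symbols (35 and 0); a comparison based on identity would miss them, while the overlapping sets keep them together.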
Flexible patterns : KMRC
3 cytochromes P450: TERP, BM3, CAM
PMWIATKHADVMQIGVTRYLSSQRLIKEACGHWIATRGQLIREAY
PTHTAYRGLTLNWFQPASIRKLEENIRRIAQASVQRKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKPEQRQFRALANQVVGMPVVDKLENRIQELACSLIES
CDFMTDCALYYPLHVVMTALGVPIEVPEDMTRLTLDTIGLCGFNYRCNFTEDYAEPFPIRIFMLLAGLP
EDDEPLMLKLTQDFITSMVRALDEAMNKEDIPHLKYLTDQMT
FHETIATFYDYFNGFTVDRRSFQEDIKVMNDLVDKIIADRKAFAEAKEALYDYLIPIIEQRRQ
CPKDDVMSLLANEQSDDLLTHMLNKPGTDAISIVAN
Bach BWV846
Similarity : nature of elements OR relations between them
Bach BWV846
Similarity: series of notes with the same pitch? -> rather, transposed series…
Bach BWV846
-> Similarity of relations between elements
Relational patterns (KMRR)
• Now, a similar pattern is not defined by a succession of similar symbols, but by a succession of elements that share the same relationships between them.
[Figure: two occurrences of a pattern m of three elements, each carrying the same relations r12, r23 and r13 between its elements]
Examples of relations: “to be higher than”, “to be lower than”, “to be equal to”, …
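A small sketch of the idea, using a naive window scan rather than the KMRR algorithm. The pitch numbers below are a made-up motif and its transposition, not a transcription of BWV 846, and the function names are illustrative. Two relations are tried: the exact interval between two notes, and the coarser higher/lower/equal relation.

```python
from collections import defaultdict

def interval(x, y):
    """Relation between two notes given as pitch numbers: their signed difference."""
    return y - x

def order(x, y):
    """Coarser relation: is the second note higher, lower or equal?"""
    return "higher" if y > x else "lower" if y < x else "equal"

def signature(window, relation):
    """Relations r(i, j) for all pairs i < j of the window."""
    k = len(window)
    return tuple(relation(window[i], window[j])
                 for i in range(k) for j in range(i + 1, k))

def repeated_relational_patterns(seq, k, relation):
    """Group the 0-based starts of k-windows that share the same set of relations."""
    groups = defaultdict(list)
    for i in range(len(seq) - k + 1):
        groups[signature(seq[i:i + k], relation)].append(i)
    return {sig: pos for sig, pos in groups.items() if len(pos) > 1}

notes = [60, 64, 67, 50, 62, 66, 69]   # toy motif, a filler note, the motif transposed by +2
print(repeated_relational_patterns(notes, 3, interval))  # {(4, 7, 3): [0, 4]}
```

No two windows share the same notes, yet windows 0 and 4 share all their intervals: the repeat only becomes visible at the level of relations.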
One step further: flexible relational patterns (KMRR)
• The relations between elements do not need to be the same, they just need to be similar
[Figure: two occurrences of a pattern m whose relations ra, rb, rc are grouped into the relational cliques CR1 and CR2]
Relational cliques = cliques of relations, e.g. {“to be higher”, “to be equal”}, {“to be lower”, “to be equal”}, …
An example: application to 3D structures
Relations are defined on discrete internal distances between points:
r(i,j) = rk if and only if rk ≤ dist(i,j) < rk + ∆
The relations r(i,j) and r(i',j') are considered similar if they belong to the same subset {rk, rk+1, rk+2}, i.e. if their indices satisfy
| r(i,j) - r(i',j') | ≤ 2
For Euclidean distances this implies:
| dist(i,j) - dist(i',j') | < 3∆
[Figure: distance axis cut into intervals r1, r2, r3, …, rk, rk+1, rk+2, with d(i,j) and d(i',j') falling in nearby intervals]
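These definitions translate directly into code. In this sketch the relation is represented by the interval index k, with intervals assumed to start at distance 0 (so rk = k∆); that convention, the function names and the sample points are assumptions made for illustration.

```python
import math

def relation_index(p, q, delta):
    """Index k of the interval [k*delta, (k+1)*delta) that contains dist(p, q)."""
    return int(math.dist(p, q) // delta)

def similar_relations(r1, r2, width=2):
    """Relations are similar when their indices differ by at most `width`."""
    return abs(r1 - r2) <= width

delta = 2.0
origin = (0.0, 0.0, 0.0)
r_a = relation_index(origin, (3.0, 4.0, 0.0), delta)   # dist = 5  -> index 2
r_b = relation_index(origin, (0.0, 0.0, 9.0), delta)   # dist = 9  -> index 4
r_c = relation_index(origin, (0.0, 0.0, 11.0), delta)  # dist = 11 -> index 5
print(similar_relations(r_a, r_b))  # True: |5 - 9| = 4 < 3 * delta = 6
print(similar_relations(r_a, r_c))  # False
```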
Application to 3D structures:
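Relational cliques: 1 2 3 4 5 6 7 8 9 10 … (defined on distances)
[Figure: points p1 to p9 in space, with the discretized distance written on each link inside the two sets p1..p4 and p6..p9]
First set (p1..p4):
r(p1,p2)=3 r(p2,p3)=4 r(p3,p4)=3
r(p1,p3)=4 r(p2,p4)=3
r(p1,p4)=5
Second set (p6..p9):
r(p6,p7)=4 r(p7,p8)=4 r(p8,p9)=2
r(p6,p8)=3 r(p7,p9)=4
r(p6,p9)=6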
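The values listed above can be compared pair by pair. The sketch assumes that the figure intends p1, p2, p3, p4 to be matched in order with p6, p7, p8, p9, and reuses the similarity rule "indices differ by at most 2"; the function name is illustrative.

```python
from itertools import combinations

# Discretized distances as listed on the slide.
R_FIRST = {("p1", "p2"): 3, ("p1", "p3"): 4, ("p1", "p4"): 5,
           ("p2", "p3"): 4, ("p2", "p4"): 3, ("p3", "p4"): 3}
R_SECOND = {("p6", "p7"): 4, ("p6", "p8"): 3, ("p6", "p9"): 6,
            ("p7", "p8"): 4, ("p7", "p9"): 4, ("p8", "p9"): 2}

def relationally_similar(points_a, rel_a, points_b, rel_b, width=2):
    """True if every pair of matched points carries relations differing by at most `width`."""
    return all(
        abs(rel_a[(points_a[i], points_a[j])] - rel_b[(points_b[i], points_b[j])]) <= width
        for i, j in combinations(range(len(points_a)), 2)
    )

first = ["p1", "p2", "p3", "p4"]
second = ["p6", "p7", "p8", "p9"]
print(relationally_similar(first, R_FIRST, second, R_SECOND))           # True
print(relationally_similar(first, R_FIRST, second, R_SECOND, width=0))  # False
```

With strict identity of relations (width 0) the two sets of points would not be recognized as occurrences of the same pattern.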
Application to 3D structures:
1PCL: AEWDAAVIDNSTNVWVDHVT
1IDJ: WGGDAITLDDCDLVWIDHVT
2BSP: SQYDNITINGGTHIWIDHCT
1PLU: KDGDMIRVDDSPNVWVDHNE

1PCL: LRVTFHNNVFDRVTERAPRV
1IDJ: DLVTMKGNYIYHTSGRSPKV
2BSP: LKITLHHNRYKNIVQKAPRV
1PLU: RNITYHHNYYNDVNARLPLQ

1PCL: TERAPRVRFGSIHAYNNVYL
1IDJ: GRSPKVQDNTLLHCVNNYFY
2BSP: VQKAPRVRFGQVHVYNNYYE
1PLU: NARLPLQRGGLVHAYNNLYT

1PCL: AQTMTSSLATSINNNAGYGK
1IDJ: SASAYTSVASRVVANAGQGN
2BSP: SIDASANVKSNVINQAGAGK
1PLU: SPVSAQCVKDKLPGYAGVGK
Ex: multiple structural alignment of 4 pectate and pectin lyases: 1PCL, 1IDJ, 2BSP, 1PLU
Structural similarity search in databases: YAKUSA
[Figure: fast structural scanning of a query against the database (PDB)]
Structural similarity search in databases: YAKUSA
Structures encoded with angles
For each database protein:
1 - find structurally similar seeds with the query
2 - extend seeds to longer structural matches
Then rank the structural hits
Seeds of length k
All overlapping k-patterns of the query -> automaton
[Figure: dictionary/automaton built over the symbols 1 and 2; each leaf corresponds to one seed]
Advantage: no moving backward in the database string; less moving backward in the patterns
Query seeds automaton
[Figure: the dictionary/automaton of the query seeds over the symbols 1 and 2, then the same automaton where each symbol i also accepts similar symbols i' (1', 2')]
dc(i, i') <
with degeneracy, with similar patterns
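A simplified sketch of the seed dictionary follows. It is not the Yakusa implementation: the trie is walked again from each database position, so it does not reproduce the "no moving backward" property of a true automaton. It also assumes that dc is a circular distance between angle symbols and that the bound `tol` is a free parameter; both are assumptions, as are all function names.

```python
def build_trie(query, k):
    """Trie of all overlapping k-patterns of the query; each leaf lists the query starts."""
    root = {}
    for start in range(len(query) - k + 1):
        node = root
        for sym in query[start:start + k]:
            node = node.setdefault(sym, {})
        node.setdefault("seeds", []).append(start)
    return root

def circ_dist(i, j, n_symbols):
    """Assumed meaning of dc: distance between two symbols on a circular alphabet."""
    d = abs(i - j) % n_symbols
    return min(d, n_symbols - d)

def find_seeds(trie, database, k, n_symbols, tol):
    """Pairs (query_pos, db_pos) whose k symbols match one by one with circ_dist < tol."""
    hits = []
    for db_pos in range(len(database) - k + 1):
        frontier = [trie]
        for sym in database[db_pos:db_pos + k]:
            # with degeneracy, several branches may stay alive for one database symbol
            frontier = [child
                        for node in frontier
                        for key, child in node.items()
                        if key != "seeds" and circ_dist(key, sym, n_symbols) < tol]
            if not frontier:
                break
        else:
            for leaf in frontier:
                hits.extend((q_pos, db_pos) for q_pos in leaf["seeds"])
    return hits

trie = build_trie([1, 2, 2, 1], k=2)
print(find_seeds(trie, [5, 1, 2, 9], k=2, n_symbols=36, tol=1))          # [(0, 1)]
print(sorted(find_seeds(trie, [5, 1, 3, 9], k=2, n_symbols=36, tol=2)))  # [(0, 1), (1, 1)]
```

With `tol=1` only identical symbols match; with `tol=2` the database window (1, 3) matches both query patterns (1, 2) and (2, 2).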
Matching seeds to SHSP
[Figure: a seed shared by the query and a database structure is extended on both sides into an SHSP]
- many seeds, giving only several SHSPs
- SHSP = maximal scoring region around the seed, the scores being based upon the angular differences
- probability associated with SHSP (MTD: pair approximation of Markov model)
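The extension of a seed into a maximal scoring region can be sketched as below. The scoring function is hypothetical (linear in the angular difference, zero at a 30° cutoff); the slides only say that scores are based on angular differences. The probability attached to an SHSP is not modelled here, and all names are illustrative.

```python
def angular_diff(a, b):
    """Smallest difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pair_score(a, b, cutoff=30.0):
    """Hypothetical score: +1 for identical angles, 0 at `cutoff` degrees, negative beyond."""
    return 1.0 - angular_diff(a, b) / cutoff

def best_extension(scores):
    """Length and total of the best-scoring prefix (0, 0.0 when extending never helps)."""
    best_len, best, total = 0, 0.0, 0.0
    for n, s in enumerate(scores, start=1):
        total += s
        if total > best:
            best_len, best = n, total
    return best_len, best

def extend_seed(query, db, q_start, d_start, k):
    """Return (query start, database start, length, score) of the region grown around a seed."""
    seed = sum(pair_score(query[q_start + i], db[d_start + i]) for i in range(k))
    right = [pair_score(a, b) for a, b in zip(query[q_start + k:], db[d_start + k:])]
    left = [pair_score(a, b)
            for a, b in zip(reversed(query[:q_start]), reversed(db[:d_start]))]
    r_len, r_score = best_extension(right)
    l_len, l_score = best_extension(left)
    return q_start - l_len, d_start - l_len, l_len + k + r_len, seed + l_score + r_score

query = [100, 50, 60, 70, 80, -120]
db = [10, 52, 60, 70, 85, 90]
start_q, start_d, length, score = extend_seed(query, db, q_start=2, d_start=2, k=2)
print(start_q, start_d, length)  # 1 1 4
```

Because the left and right flanks contribute independently to the total, maximizing each side separately yields the best region containing the seed on that diagonal.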
SHSPs to results…
Database structures found, ranked by score
http://bioserv.rpbs.jussieu.fr/Yakusa
40 seconds for scanning 11000 PDB structures
query (Deoxyribonuclease I)
database (Heat-Shock protein 70)
http://bioserv.rpbs.jussieu.fr/Yakusa
SHSPs
Example of output:
selection of best 2-diagonals
Multiple structural alignment: « m-diagonals »
First step: finding pair “diagonals” (at the angle level)
[Figure: dot plot of the pair diagonals; axes: index in first protein, index in second protein]
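The first step can be sketched as a scan of every diagonal of the comparison matrix between two angle strings. The tolerance of 20° and the minimal length are illustrative parameters, not values from the slides, and the function names are hypothetical.

```python
def angular_diff(a, b):
    """Smallest difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pair_diagonals(a, b, min_len, tol=20.0):
    """Maximal runs (i, j, length) such that a[i + t] and b[j + t] differ by at most tol."""
    diagonals = []
    for offset in range(-(len(a) - 1), len(b)):   # offset = j - i
        i, j = max(0, -offset), max(0, offset)
        run = 0
        while i < len(a) and j < len(b):
            if angular_diff(a[i], b[j]) <= tol:
                run += 1
            else:
                if run >= min_len:
                    diagonals.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:                        # run reaching the border of the matrix
            diagonals.append((i - run, j - run, run))
    return diagonals

first = [10, 20, 30, 40, 200]
second = [300, 12, 22, 28, 41]
print(pair_diagonals(first, second, min_len=3, tol=5.0))  # [(0, 1, 4)]
```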
Second step: combination of 2-diagonals into m-diagonals (in m dimensions).
[Figure: three comparison planes spanned by Protein 1, Protein 2 and Protein 3]
m-diagonal in m dimensions: here, the 3-diagonal is the combination of three 2-diagonals of dimension 2.
Multiple structural alignment: « m-diagonals »
“Column” graphs:
- a residue is a node
- if two residues are in a 2-diagonal, they are connected by a link.
[Figure: column graph linking residues of Protein 1, Protein 2 and Protein 3]
Selection of best m-diagonals (most connected ones)
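The column graph can be sketched as follows. The input format for 2-diagonals, (protein A, start in A, protein B, start in B, length), is an assumption, as are the function names. The sketch keeps only columns in which every pair of residues is linked, which is the strictest reading of "most connected"; the actual selection criterion may be more tolerant.

```python
from collections import defaultdict

def residue_links(diagonals):
    """Expand 2-diagonals into links between residues, each residue being (protein, index)."""
    neighbours = defaultdict(set)
    for prot_a, start_a, prot_b, start_b, length in diagonals:
        for t in range(length):
            u, v = (prot_a, start_a + t), (prot_b, start_b + t)
            neighbours[u].add(v)
            neighbours[v].add(u)
    return neighbours

def full_columns(diagonals, proteins):
    """Columns (one residue per protein) in which all residues are pairwise linked."""
    neighbours = residue_links(diagonals)
    columns = []
    for node in sorted(n for n in neighbours if n[0] == proteins[0]):
        partial = [[node]]
        for prot in proteins[1:]:
            partial = [col + [cand]
                       for col in partial
                       for cand in sorted(neighbours[col[0]])
                       if cand[0] == prot and all(cand in neighbours[c] for c in col)]
        columns.extend(tuple(col) for col in partial)
    return columns

two_diagonals = [("P1", 0, "P2", 5, 3), ("P2", 5, "P3", 2, 3), ("P1", 1, "P3", 3, 2)]
for column in full_columns(two_diagonals, ["P1", "P2", "P3"]):
    print(column)
# (('P1', 1), ('P2', 6), ('P3', 3))
# (('P1', 2), ('P2', 7), ('P3', 4))
```

The two consecutive full columns form a 3-diagonal of length 2; residue 0 of P1 is left out because no 2-diagonal links it to P3.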
Multiple structural alignment: « m-diagonals »
13 cytochromes P450
- In blue : non aligned parts
- Other colors : m-diagonals
Example:
Other method : « Gibbs sampling »
« Taboo search »
Alain Viari
Henry Soldano
Mathilde Carpentier
Marie-France Sagot
Pascale Jean
Sophie Brouillet
Nadia Pisanti
Kmrc+gok
Cytochromes P450
Yakusa
m-diagonals
Gibbs-taboo
relational patterns
Near future: classification of structural cores in PDB…
[Figure: KMR with stacks on the string a a b a a b a (positions 1 to 7); stack labels in the figure: P1, V1, P2, V2, Q2]
k=1 (k-length stacks): a: 7 5 4 2 1 ; b: 6 3
Look at +k, put position in set stack: next symbol a: 1 4 ; 6 3, next symbol b: 2 5
k=2 (2k-length stacks): aa: 4 1 ; ab: 5 2 ; ba: 6 3