Fast Approximate Database Searching of Polypeptide Structures

Fast Approximate Database Searching of Polypeptide Structures. Hanjo Taeubig, Arno Buchner, Jan Griebsch. Efficient Algorithms Group, Prof. Ernst W. Mayr, Technical University of Munich. German Conference on Bioinformatics, October 4th, 2004.
Transcript
Page 1: Fast Approximate Database Searching of Polypeptide Structures

Fast Approximate Database Searching of Polypeptide Structures

Hanjo Taeubig Arno Buchner Jan Griebsch

Efficient Algorithms Group

Prof. Ernst W. Mayr

Technical University of Munich

German Conference on Bioinformatics

October 4th, 2004

Page 2: Fast Approximate Database Searching of Polypeptide Structures

www14.informatik.tu-münchen.de/PAST {taeubig|buchner|griebsch}@in.tum.de

Structure

I. motivation & problem definition

II. suffix trees

III. polypeptide angles suffix trees

IV. application & future work

Page 3: Fast Approximate Database Searching of Polypeptide Structures

I. Motivation

• the function of a protein is largely determined by its structure and geometric shape

• how can similar structures be found in a database?

• related work

– DALI, VAST, CE

– TopScan, ProtDex2

• existing methods are mostly based on the principle “filter heuristics + exhaustive search/pairwise comparison” and scale at least linearly with the database size

Page 4: Fast Approximate Database Searching of Polypeptide Structures

I. Motivation

• PDB – Protein Data Bank

– ca. 3.5 GB compressed, 14 GB decompressed

– > 23,000 entries

– 90% proteins, 5% nucleotide sequences, 4% nucleotide-protein complexes

– 85% X-ray crystallography, 15% NMR

• protein structure databases grow almost exponentially

• search methods with time complexity at most O(n) are required

Page 5: Fast Approximate Database Searching of Polypeptide Structures

I. Problem Definition

• search for a given polypeptide structure in a protein database

• find the longest common substructure in the database

• identify frequent substructures (motifs) in the database

Page 6: Fast Approximate Database Searching of Polypeptide Structures

II. Suffix Trees

Tries

• tree with a root node

• every edge is labeled with a letter

• labels of all edges to the child nodes of one node are pairwise distinct
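
The three properties above can be sketched with nested dictionaries. This is only an illustration, assuming that every node is a Python dict whose keys are the edge letters, so the labels below one node are distinct by construction. The function names `trie_insert` and `trie_has_path` are hypothetical and do not come from the slides.

```python
def trie_insert(root, word):
    """Insert `word` letter by letter; each dict key is one edge label."""
    node = root
    for letter in word:
        # setdefault reuses an existing edge, so labels below a node stay distinct
        node = node.setdefault(letter, {})
    return root


def trie_has_path(root, prefix):
    """Return True if `prefix` labels a path that starts at the root."""
    node = root
    for letter in prefix:
        if letter not in node:
            return False
        node = node[letter]
    return True


root = {}  # the root node
for word in ("ab", "abb", "ba"):
    trie_insert(root, word)

print(trie_has_path(root, "abb"))  # True
print(trie_has_path(root, "bb"))   # False
```

Following a path costs one dictionary lookup per letter, which is the property the later slides rely on.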

Page 7: Fast Approximate Database Searching of Polypeptide Structures

II. Suffix Trees

Suffix Tries

• a suffix trie stores all suffixes of a string

• the sentinel $ ensures that every suffix is represented by a leaf

Figure: suffix tree for the word aaabbb$
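
The effect of the sentinel can be checked with a naive construction. The sketch below assumes a nested-dict trie and inserts every suffix one after the other, which takes quadratic time and space and is meant for illustration only. The names `build_suffix_trie` and `count_leaves` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie: insert every suffix of `text` into a nested dict."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
    return root


def count_leaves(node):
    """A leaf is a node without outgoing edges."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("aaabbb$")))  # 7: one leaf per suffix
print(count_leaves(build_suffix_trie("aaabbb")))   # 4: "bb" and "b" end inside the path of "bbb"
```

Without the sentinel, a suffix that is a prefix of another suffix ends at an inner node and has no leaf of its own.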

Page 8: Fast Approximate Database Searching of Polypeptide Structures

II. Suffix Trees

Compressed Suffix Tries

• collapse linear paths in the tree

• store only the start and end index of each edge label

• linear number of inner nodes
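
A small sketch of the compressed form follows. It assumes a naive quadratic-time construction that inserts the suffixes one by one and splits an edge where a mismatch occurs; it is not the linear-time construction mentioned later in the slides. Each edge keeps only the index pair `(start, end)` into the text. The class `Node` and the functions `build_compressed` and `count_nodes` are illustrative names.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start    # the incoming edge is labeled text[start:end]
        self.end = end
        self.children = {}    # first letter of a child edge -> Node


def build_compressed(text):
    """Insert all suffixes; edges hold index pairs instead of substrings."""
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while pos < len(text):
            child = node.children.get(text[pos])
            if child is None:
                node.children[text[pos]] = Node(pos, len(text))  # new leaf
                break
            k = child.start
            while k < child.end and pos < len(text) and text[k] == text[pos]:
                k += 1
                pos += 1
            if k < child.end:
                # mismatch inside the edge: split it at index k
                mid = Node(child.start, k)
                node.children[text[child.start]] = mid
                child.start = k
                mid.children[text[k]] = child
                child = mid
            node = child
    return root


def count_nodes(node):
    """Return (inner nodes including the root, leaves) below `node`."""
    if not node.children:
        return 0, 1
    inner, leaves = 1, 0
    for child in node.children.values():
        i, l = count_nodes(child)
        inner, leaves = inner + i, leaves + l
    return inner, leaves


tree = build_compressed("aaabbb$")
print(count_nodes(tree))  # (5, 7)
```

For the example word there are 7 leaves and 5 inner nodes: every inner node other than the root has at least two children, which is what keeps their number linear.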

Page 9: Fast Approximate Database Searching of Polypeptide Structures

II. Suffix Trees

Further Extensions

• generalized suffix trees

– store the suffixes of multiple strings in one tree

• online linear time construction

Time Complexity

• Finding an occurrence of a search pattern takes time linear in the pattern length m and does not depend on the size of the searched database

• Finding all k occurrences of a pattern takes time proportional to m+k
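
The m+k behaviour can be illustrated with a simplified generalized suffix trie. The sketch assumes an uncompressed trie in which every node stores the `(string_id, start)` pairs of all suffixes passing through it. That costs quadratic space and is not how a real suffix tree stores occurrences, but it makes the two phases visible: m steps down the tree, then k reported hits. All names are hypothetical.

```python
def build_generalized_suffix_trie(strings):
    """Naive generalized suffix trie over several strings."""
    root = {"children": {}, "occ": []}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for letter in s[start:]:
                node = node["children"].setdefault(
                    letter, {"children": {}, "occ": []})
                # remember which suffix of which string reaches this node
                node["occ"].append((sid, start))
    return root


def find_all(root, pattern):
    """Walk m = len(pattern) edges, then return the k stored occurrences."""
    node = root
    for letter in pattern:
        node = node["children"].get(letter)
        if node is None:
            return []
    return node["occ"]


tree = build_generalized_suffix_trie(["abba", "babb"])
print(find_all(tree, "bb"))  # [(0, 1), (1, 2)]
```

The lookup loop never touches the indexed strings themselves, only one edge per pattern letter.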

Page 10: Fast Approximate Database Searching of Polypeptide Structures

III. Polypeptide Angles Suffix Tree

Idea

I. encode the geometry of the database proteins in a translation- and rotation-invariant linear description (“structural text”)

– torsion angle encoding of the protein backbone

II. adapt efficient text mining methods to the error-tolerant substructure searching problem

– generalized suffix trees with fault-tolerant search strategies
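
A torsion (dihedral) angle is defined by four consecutive points, and it does not change when all four are translated or rotated together. The sketch below uses the standard atan2 formulation on plain coordinate tuples. Which backbone atoms the authors feed into the calculation is not stated on the slide, so the function `torsion_angle` and the sample points are illustrative assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)   # normals of the two planes
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


points = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)]
shifted = [(x + 5, y - 3, z + 2) for x, y, z in points]   # translation
rotated = [(-y, x, z) for x, y, z in points]              # 90° turn about z

print(torsion_angle(*points))
print(torsion_angle(*shifted))
print(torsion_angle(*rotated))
```

All three calls describe the same geometry, so they yield the same angle; this is the invariance that makes the angle sequence usable as a "structural text".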

Page 11: Fast Approximate Database Searching of Polypeptide Structures

III. Polypeptide Angles Suffix Tree

Figure (PDB entry 1a1f): the torsion angle pairs … (22,93), (112, 4) … are mapped by discretization to the structural text … a b b a …
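
The discretization step can be sketched as a mapping from angles to interval letters. The sketch assumes equal-width intervals of 15° (the setting named in the application example of these slides, 24 intervals) and letters `a` to `x` for the interval indexes. The interval boundaries behind the schematic letters in the figure are not given, so the output here uses different letters. The function name `discretize` is hypothetical.

```python
def discretize(angles_deg, width=15):
    """Map each angle (in degrees) to the letter of its equal-width interval."""
    n_intervals = 360 // width
    if 360 % width or n_intervals > 26:
        raise ValueError("width must divide 360 and give at most 26 intervals")
    letters = []
    for angle in angles_deg:
        # normalise to [0, 360) so that -170 and 190 share an interval
        index = int((angle % 360) // width) % n_intervals
        letters.append(chr(ord("a") + index))
    return "".join(letters)


pairs = [(22, 93), (112, 4)]
flat = [angle for pair in pairs for angle in pair]
print(discretize(flat))  # bgha
```

Each angle pair contributes two letters, so a chain of residues becomes a plain string that a suffix tree can index.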

Page 13: Fast Approximate Database Searching of Polypeptide Structures

III. Polypeptide Angles Suffix Tree

Fault Tolerant Searching

• accept a “neighborhood range” of $\delta$ intervals left and right

• worst case time complexity: exponential (!)

• average: $O\big((2\delta+1)^{\log_{|\Sigma|} n}\big)$

figure: branching with $\delta = 1$
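
The branching behind these bounds can be made concrete with a small sketch. It assumes a naive suffix trie over interval indexes, a search that accepts every symbol within a given radius of the query symbol, and wrap-around at the ends of the angle range (an assumption, since angles are circular). The names `build_index` and `tolerant_search` are hypothetical.

```python
def build_index(text):
    """Naive suffix trie over a sequence of interval indexes."""
    root = {"next": {}, "occ": []}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node["next"].setdefault(symbol, {"next": {}, "occ": []})
            node["occ"].append(start)
    return root


def tolerant_search(root, pattern, radius, n_intervals=24):
    """Start positions where every symbol lies within +-radius intervals."""
    frontier = [root]
    for symbol in pattern:
        # at most 2 * radius + 1 accepted symbols per pattern position
        accepted = {(symbol + off) % n_intervals
                    for off in range(-radius, radius + 1)}
        next_frontier = []
        for node in frontier:
            for neighbor in accepted:
                child = node["next"].get(neighbor)
                if child is not None:
                    next_frontier.append(child)
        frontier = next_frontier
    return sorted(start for node in frontier for start in node["occ"])


index = build_index([1, 6, 7, 0, 2, 6, 8, 23])
print(tolerant_search(index, [1, 6, 7], radius=0))  # [0]
print(tolerant_search(index, [1, 6, 7], radius=1))  # [0, 4]
```

With radius 0 the search follows a single path; every increase of the radius lets each visited node branch into more children, which is where the exponential worst case comes from.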

Page 14: Fast Approximate Database Searching of Polypeptide Structures

IV. Application

Example

• search for occurrences of the C2H2 zinc finger in the complete PDB

• discretization: 24 intervals of 15°

• compare with SCOP classification, sequence-based search, SPASM

Page 15: Fast Approximate Database Searching of Polypeptide Structures

IV. Application

Sequences producing significant alignments:    Score (bits)    E value

gi|37926551|pdb|1LLM|C Chain C, Crystal Structure Of A Zif2... 47 6e-07

gi|15988358|pdb|1F2I|G Chain G, Cocrystal Structure Of Sele... 42 2e-05

gi|3319019|pdb|1A1H|A Chain A, Qgsr (Zif268 Variant) Zinc F... 42 3e-05

gi|3319013|pdb|1A1F|A Chain A, Dsnr (Zif268 Variant) Zinc F... 41 3e-05

gi|3319022|pdb|1A1I|A Chain A, Radr (Zif268 Variant) Zinc F... 41 3e-05

gi|16975178|pdb|1JK1|A Chain A, Zif268 D20a Mutant Bound To... 41 3e-05

gi|2098365|pdb|1AAY|A Chain A, Zif268 Zinc Finger-Dna Compl... 41 4e-05

gi|33357855|pdb|1P47|A Chain A, Crystal Structure Of Tandem... 41 5e-05

gi|443340|pdb|1ZAA|C Chain C, Zif268 Immediate Early Gene (... 40 8e-05

gi|15988466|pdb|1G2F|C Chain C, Structure Of A Cys2his2 Zin... 33 0.015

gi|15988460|pdb|1G2D|C Chain C, Structure Of A Cys2his2 Zin... 32 0.025

gi|1941952|pdb|1MEY|C Chain C, Crystal Structure Of A Desig... 28 0.44

gi|40889293|pdb|1P7A|A Chain A, Solution Stucture Of The Th... 27 0.64

gi|3318788|pdb|2ADR| Adr1 Dna-Binding Domain From Saccharo... 27 0.78

gi|2094895|pdb|1SP1| Nmr Structure Of A Zinc Finger Domain... 26 1.4

gi|1420993|pdb|1ARD| Yeast Transcription Factor Adr1 (Resi... 23 9.7

. . .

Page 16: Fast Approximate Database Searching of Polypeptide Structures

IV. Application

Figure: Searching PDB entry 1a1f with different neighborhood settings

Search range in 15°       ±0   ±1   ±2   ±3   ±4   ±5   ±6   ±7   ±8
1a1f  True positives       1   11   26   40   55   61   62   64   65
      False positives     13333 254 (column assignment not recoverable)
      Time [s]            <1   <1    1    2    3    4    5    8   12
1mfs  True positives       1    1    3    3    6    9   14   15   18
      False positives     49 9 (column assignment not recoverable)
      Time [s]            <1   <1   <1   <1    1    1    2    3    6
1a3n  True positives       1   17   87  120  132  135  138  144  146
      False positives     0
      Time [s]            <1   <1    1    2    3    4    6    8   12

Table 1: The number of true and false positives for the structure searches.

Page 17: Fast Approximate Database Searching of Polypeptide Structures

IV. Application

Figure: minimum RMSD superpositions

– 1a1f vs. 1f2i

– 1a1f vs. 6 other true positives

– “false” positives: 1a1f vs. 1vl2

Page 18: Fast Approximate Database Searching of Polypeptide Structures

IV. Application

Run Time

Pre-processing

• decompression of the packed PDB files: 25 min

• parsing of the PDB files and calculating the torsion angles: 55 min

• discretization and building the PAST: 2 min

Searching

• searching a structure: seconds

Page 19: Fast Approximate Database Searching of Polypeptide Structures

Summary

• suffix-tree-based protein (sub-)structure database search method

• preprocessing required

• fast search

• does not rely on heuristics or SSE (secondary structure element) recognition

• adaptable sensitivity and error models

• until gapped matching is modeled: applicable to shorter peptide chains and motifs

• surprisingly simple

Page 20: Fast Approximate Database Searching of Polypeptide Structures

Future Work

• model matching with insertions & deletions

• consensus search pattern

• implementation and practical testing of further error models and angle encodings

• identification of new motifs

• testing, testing, testing: evaluating the method further with real-life problems from pharmaceutical researchers, biologists, patent offices, …

Page 21: Fast Approximate Database Searching of Polypeptide Structures

Acknowledgements

• Hanjo Taeubig, Arno Buchner

• Volker Heun, Moritz Maass

• BFAM/BMBF

• ALTANA

