Page 1: A Study of Computational Methods for Storing and Sequencing Genetic Databases CSC 545 – Advanced Database Systems By: Nnamdi Ihuegbu 12/2/03.

A Study of Computational Methods for Storing and Sequencing Genetic Databases

CSC 545 – Advanced Database Systems

By: Nnamdi Ihuegbu
12/2/03

Page 2

Abstract

• Scope of Study (i.e. aspect of Genetic Databases)
• Types of Genetic Databases
• Storage/organization/access/manipulation techniques
• Sequencing (querying) of data in Genetic Databases
• Logical Layout of Genetic Databases

Page 3

Brief Introduction

• Human Genome Project (and others) -> Vast amount of biological data
• Venture: Computer Science and Biology (BCB) -> Genetic Databases (map, genomic, proteomic)
• Expected date of completed map of human genome: end of 2003
• Next stage: sequence comparison and sequence-to-protein function.
• Useful to pharmaceutical companies (computer-aided drug design, CADD – e.g. SKB’s Relenza).

Page 4

Results - Sequence

Current Sequence Generation Technologies
• Maxam-Gilbert (uses chemicals to cleave DNA at a specific base/length)
• Sanger (uses enzymatic procedures to produce DNA fragments that end at a specific base, i.e. have a specific length)

Page 5

Derivation of nucleotide sequence from human chromosome

Page 6

Results - Sequence

Types of Sequence Comparisons/Alignments
• Global (“How similar are these two sequences?”)
  • To find the best overall alignment between two sequences
  • 1970: Needleman and Wunsch (global, dynamic)
  • Shortcomings: can miss small regions of similarity within subsequences of the two sequences
• Local (“What sequences in a database are most similar to this sequence?”)
  • To find the best subsequence match between two sequences
  • 1981: Smith and Waterman (local, dynamic)
  • Shortcomings: not computationally efficient, slow

Page 7

Results - Sequence

Figure 3: Illustrating the differences between global and local sequence alignment (panels: Global alignment, Local alignment)

Page 8

Results - Sequence

Heuristic Search (Quick, Approximate)
• Quickly search for “words” that match the sequence, then recursively perform a local search on each matched word until there are no further matches
• FASTA (1988), BLAST (1990)
• Shortcomings: results are approximate, not exact; significance is judged by E-value (significant if < 0.05)
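The word-matching idea on this slide can be sketched in a few lines. This is a simplified, ungapped seed-and-extend illustration, not the actual FASTA or BLAST algorithm: the function names, the word length `k`, the scoring values, and the `drop` cutoff are all assumptions chosen for the example.

```python
from collections import defaultdict


def index_words(subject, k):
    """Map every length-k word of `subject` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend(query, subject, q, s, step, match, mismatch, drop):
    """Extend without gaps from (q, s) in direction `step` (+1 or -1).

    Returns (best score gained, number of characters kept).
    Stops once the running score falls `drop` below the best seen so far.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=2):
    """Return (score, query_start, subject_start, length) hits, best first."""
    index = index_words(subject, k)
    hits = set()
    for qi in range(len(query) - k + 1):
        # Every exact word match is a seed that gets extended in both directions.
        for si in index.get(query[qi:qi + k], ()):
            right, right_len = extend(query, subject, qi + k, si + k, 1,
                                      match, mismatch, drop)
            left, left_len = extend(query, subject, qi - 1, si - 1, -1,
                                    match, mismatch, drop)
            hits.add((k * match + left + right,
                      qi - left_len, si - left_len,
                      left_len + k + right_len))
    return sorted(hits, reverse=True)
```

Because only positions that share an exact word are examined, most of the search space is skipped; that is where the speed comes from, and also why a weak match with no shared word can be missed.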

Page 9

Results – Sequence (CSC Implementation)

• Sequence alignment can be represented as matrices and graphs (using rules and costs)
• When converted into a directed acyclic graph, the solution of the sequence alignment is the longest path (max. path problem).
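A minimal sketch of the graph view, assuming simple match/mismatch/gap weights that are not given in the slides. Nodes are grid positions `(i, j)`, each edge consumes a character from one or both strings, and the best alignment score is the longest path from `(0, 0)` to `(len(a), len(b))`. The function names are hypothetical.

```python
def alignment_graph(a, b, match=1, mismatch=-1, gap=-1):
    """Yield weighted edges (u, v, w) of the alignment grid, sources in row-major order."""
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i < len(a):
                yield (i, j), (i + 1, j), gap      # down: gap in string 2
            if j < len(b):
                yield (i, j), (i, j + 1), gap      # across: gap in string 1
            if i < len(a) and j < len(b):
                w = match if a[i] == b[j] else mismatch
                yield (i, j), (i + 1, j + 1), w    # diagonal: aligned pair


def longest_path_score(a, b, **weights):
    """Longest-path score through the alignment DAG.

    Row-major order is a topological order here, because every edge points
    down, right, or diagonally, so one relaxation pass over the edges suffices.
    """
    best = {(0, 0): 0}
    for u, v, w in alignment_graph(a, b, **weights):
        candidate = best[u] + w
        if candidate > best.get(v, float("-inf")):
            best[v] = candidate
    return best[(len(a), len(b))]
```

The graph is acyclic because no edge ever decreases `i` or `j`, which is what makes the longest-path problem tractable in this setting.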

Page 10

Results – Sequence (CSC Implementation)

Diag. edge = character matches; down edge = gap in string 2; across edge = gap in string 1

• Can be solved dynamically as a ‘running max score’ (RMS).
• For each D(i,j), best RMS = max(west+gap1, north+gap2, NW+current_score)
• Replace D(i,j) with max
• Needleman-Wunsch Dynamic Program
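The recurrence on this slide can be written out as a short program. This is a sketch under assumed scores (match +1, mismatch -1, and a single linear gap penalty used for both gap1 and gap2); the slides do not state the values actually used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        D[i][0] = i * gap
    for j in range(1, cols):
        D[0][j] = j * gap
    # Fill: D(i,j) = max(west + gap, north + gap, NW + current_score).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i][j - 1] + gap, D[i - 1][j] + gap, D[i - 1][j - 1] + s)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1   # gap in string 2
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1   # gap in string 1
    return D[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

Under these default scores, `needleman_wunsch("ACGT", "AGT")` returns `(2, "ACGT", "A-GT")`. Time and memory are both proportional to `len(a) * len(b)`.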

Page 11

Results – Sequence (CSC Implementation)

• Similar to Smith-Waterman
• Differences:
  • restricts RMS: discontinues if <0 after several iterations
  • For each iteration, saves max for each cell separately rather than replace -> Trace back through max. scores for best local alignment
• BLAST Implementation (http://www.ebi.ac.uk/blast2/#)
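For reference, here is a sketch of standard Smith-Waterman local alignment, which this slide uses as its point of comparison. It is not the BLAST implementation linked above, and the scores (match +2, mismatch -1, gap -1) are assumptions. Two things distinguish it from the global program: a cell's score is never allowed to fall below zero, and the traceback starts from the highest-scoring cell rather than the bottom-right corner.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment: returns (score, aligned_a, aligned_b) for the best local match."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment restart instead of going negative.
            H[i][j] = max(0, H[i][j - 1] + gap, H[i - 1][j] + gap, H[i - 1][j - 1] + s)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the maximum cell until the score reaches zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

With these scores, `smith_waterman("TTTACGTAAA", "GGACGTCC")` returns `(8, "ACGT", "ACGT")`: the shared core is found even though the flanking regions do not align.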

Page 12

Results - Storage

• EMBL Nucleotide Sequence Database (on Oracle)
• Scale: over 130 tables, 140 relationships (80 GB of data)
• Object-oriented organization with 5 related packages.
• Operations that return attribute type -> supports on-demand object creation
• ‘Live object cache’ – copying the most accessed instances of the DB into a cache by primary key and performing queries on this cache.
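The live object cache can be pictured as a map from primary key to an already-built object, filled on demand. The sketch below is a generic illustration, not EMBL's implementation: the class name, the `loader` callback, and the least-recently-used eviction policy are all assumptions.

```python
from collections import OrderedDict


class LiveObjectCache:
    """Keeps recently used objects in memory, keyed by primary key (hypothetical design)."""

    def __init__(self, loader, capacity=1024):
        self._loader = loader          # called with a primary key on a cache miss
        self._capacity = capacity
        self._objects = OrderedDict()  # insertion order doubles as recency order
        self.misses = 0

    def get(self, primary_key):
        if primary_key in self._objects:
            # Hit: mark as most recently used and skip the database entirely.
            self._objects.move_to_end(primary_key)
            return self._objects[primary_key]
        # Miss: create the object on demand, then remember it.
        self.misses += 1
        obj = self._loader(primary_key)
        self._objects[primary_key] = obj
        if len(self._objects) > self._capacity:
            self._objects.popitem(last=False)  # evict the least recently used entry
        return obj
```

In use, `loader` would wrap whatever query builds an object from its database row, so repeated lookups of a popular entry reach the database only once while the entry stays cached.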

Page 13

Results - Storage

5 EMBL Packages:
• Sequence Info – general information on the biological sequence.
• Feature Info – sequence annotation/comment
• Reference Info – bibliographic references on the sequence.
• Taxonomy Info – taxonomy of the sequence’s organism (i.e. kingdom, phylum, family, genus, species, etc.)
• Location Info – location of the sequence on DNA/RNA
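One way to picture how the five packages relate is as a small set of linked record types. The sketch below is purely illustrative: every class and field name is hypothetical, the coordinates are assumed to be 1-based and inclusive, and none of it is taken from the actual EMBL schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Taxonomy:          # Taxonomy Info
    kingdom: str
    genus: str
    species: str


@dataclass
class Reference:         # Reference Info
    citation: str


@dataclass
class Location:          # Location Info (assumed 1-based, inclusive)
    start: int
    end: int


@dataclass
class Feature:           # Feature Info: an annotation tied to a location
    annotation: str
    location: Location


@dataclass
class SequenceEntry:     # Sequence Info: ties the other four together
    accession: str
    residues: str
    taxonomy: Taxonomy
    features: List[Feature] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)

    def feature_residues(self, feature):
        """Return the stretch of sequence that a feature annotates."""
        loc = feature.location
        return self.residues[loc.start - 1:loc.end]
```

For example, an entry with residues `"ACGTACGT"` and a feature located at 2..4 yields `"CGT"` from `feature_residues`.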

Page 14

Results – Storage (General Relation Between the 5 Packages)

Page 15

Results – Storage (Sequence Info)

Page 16

Results – Storage (Feature Info)

Page 17

Results – Storage (Reference Info)

Page 18

Results – Storage (Taxonomy Info)

Page 19

Results – Storage (Location Info)

Page 20

Conclusion

• Genetic Databases (3 main types) are essential to store, manage, and query the massive bio-data from studies like HGP.
• Object Oriented Design and data organization
• Sequence Analysis: Global (N-W), Local (S-W), Heuristic (FASTA, BLAST)

Page 21

Conclusion - Future Enhancements

• Storage/Management: highly dependent on hardware industry progress
• Sequence Analysis:
  • Use of parallel programming for faster analysis of 2 sequences (BLAZE-Stanford)
  • Faster means of comparing and aligning multiple sequences simultaneously (e.g. comparing a novel protein sequence to a family).

Page 22

Any Questions?
