August 2007 Slide 1 SERC Research Symposium

Database Engine Design, a.k.a. Research @ DSL

Jayant Haritsa

August 2007 Slide 2 SERC Research Symposium

Database Management Systems (DBMS)

• Efficient and convenient mechanisms for storing, querying and maintaining enterprise data

• Cornerstone of computer industry
– Uses more than 80 percent of computers worldwide
– Employs more than 70 percent of computer professionals
– Largest monetary sector of computer business

August 2007 Slide 3 SERC Research Symposium

DBMS FEATURES

• Handle data of arbitrary size
– Income-Tax records are in Petabytes (10^15 bytes)

• Self-contained
– contains both data and meta-data

• Program-Data insulation
– application s/w not affected by storage changes

[Figure: the same record stored with two different column orders]
SR No | Name | Address | Hostel | GPA
SR No | Name | Address | GPA | Hostel
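Program-data insulation can be illustrated with a small sketch using Python's built-in sqlite3 module. The table names, column names and the sample record are made up for illustration; only the two column orders come from the slide. The same query is run against both layouts and the application code does not change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical layouts of the same student record; only the column order differs.
conn.execute("CREATE TABLE student_v1 (sr_no TEXT, name TEXT, address TEXT, hostel TEXT, gpa REAL)")
conn.execute("CREATE TABLE student_v2 (sr_no TEXT, name TEXT, address TEXT, gpa REAL, hostel TEXT)")

# Made-up sample record, inserted by column name into both layouts.
for table in ("student_v1", "student_v2"):
    conn.execute(
        f"INSERT INTO {table} (sr_no, name, address, hostel, gpa) VALUES (?, ?, ?, ?, ?)",
        ("001", "Asha", "Bangalore", "A-Block", 7.5),
    )

def lookup(table):
    # The application names the columns it wants; it never relies on their stored order.
    cur = conn.execute(f"SELECT name, gpa FROM {table} WHERE sr_no = ?", ("001",))
    return cur.fetchone()

result_v1 = lookup("student_v1")
result_v2 = lookup("student_v2")
print(result_v1, result_v2)
```

Both lookups return the same row, even though GPA and Hostel are stored in a different order in the two tables.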

August 2007 Slide 4 SERC Research Symposium

DBMS FEATURES (contd)

• Declarative Access
– state what you want, not how to get it

• On-the-Fly Questions
– ask new questions without writing new programs

• PEACE OF MIND
– changes to the database are guaranteed to be immune to subsequent system failures
– Sri Sri Ravishankar of the Information World
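A rough sketch of the difference between procedural and declarative access, again with Python's sqlite3 module. The table, its columns and the rows are hypothetical. The first answer is computed by hand-written application code; the second states the same question in SQL and leaves the "how" to the engine. The last query is a new question asked without writing a new program.

```python
import sqlite3

# Made-up sample data: (sr_no, name, hostel, gpa)
rows = [
    ("001", "Asha", "A-Block", 7.5),
    ("002", "Ravi", "B-Block", 6.8),
    ("003", "Meena", "A-Block", 8.1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sr_no TEXT, name TEXT, hostel TEXT, gpa REAL)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", rows)

# Procedural: the application spells out how to scan, filter and sort.
procedural = sorted(
    name for (_, name, hostel, gpa) in rows if hostel == "A-Block" and gpa > 7.0
)

# Declarative: state what is wanted; the engine decides how to get it.
declarative = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM student WHERE hostel = 'A-Block' AND gpa > 7.0 ORDER BY name"
    )
]

# On-the-fly question: average GPA per hostel, with no new application logic.
avg_by_hostel = dict(conn.execute("SELECT hostel, AVG(gpa) FROM student GROUP BY hostel"))

print(procedural, declarative, avg_by_hostel)
```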

August 2007 Slide 5 SERC Research Symposium

Current Database Systems

• Commercial
– IBM DB2 / Oracle / Microsoft SQL Server / Sybase

• Public-domain
– PostgreSQL / MySQL / Berkeley DB

August 2007 Slide 6 SERC Research Symposium

DBMS Myths

• Databases? Isn’t that the boring part of accounting?

• Hazaar dumb Cobol programming!

• Maha-bore - almost as dull as watching Rahul Dravid bat!

• High-tech name for data entry!

• Will only get job with TCS!

• ...

August 2007 Slide 7 SERC Research Symposium

DBMS Realities

• Design of database engines has lots of really, really interesting intellectual problems with practical impact
– theory, algorithms, data structures, experiments, prototypes

• Turing awards
– 1981: Edgar Codd (relational data model)
– 1998: Jim Gray (transaction model)

• Ullman, Silberschatz, Papadimitriou, …
• Rajaraman, Patnaik, Balakrishnan, Jacob/Govindarajan …

August 2007 Slide 8 SERC Research Symposium

Database Systems Lab

(DSL)

Established 1995

August 2007 Slide 9 SERC Research Symposium

Research Topics

– Real-Time Database Systems
– Distributed Transaction Management
– OODBMS
– Web Databases
– Data Mining
– XML Databases
– Biological Databases
– Query Optimization
– Multilingual Databases
– Music Databases

[Timeline labels on the slide: 1995-2000, 2000-2005, Last few years]

August 2007 Slide 10 SERC Research Symposium

Research Trajectory

Mining XML

MIDDLEWARE

OO Models

CORE DB TECHNOLOGY

Access Methods

Transaction Processing

Query Processing

August 2007 Slide 11 SERC Research Symposium

Research Techniques

• Theory
– real-time, data mining, query optimization

• Simulation studies
– real-time, distributed, web dbms

• Empirical evaluation
– data mining, biological, multilingual dbms, query optimization

• Prototype development
– OODBMS (Flexible Manufacturing [MIDAS], VLSI [DIAS], Bio-diversity [Oshadhi, Bodhi])
– XML (Storage [LegoDB], Compression [XGrind])
– Query Optimization (Clustering [Plastic], Visualization [Picasso])
– Multilingual Databases (Cross-lingual SQL [Mira])

August 2007 Slide 12 SERC Research Symposium

SPINE: Putting Backbone into Genomic Sequence Indexing

August 2007 Slide 13 SERC Research Symposium

Standard Genomic Index: Suffix Tree [Weiner 1973]

• Vertically-compressed trie of suffixes augmented with links
– Tree Edges
– Suffix Links (xW → W)

Data = ‘GTTAATTACT$’ (character positions 0 to 9, followed by the terminator $)

Search for Query = ‘TTA’

[Figure: suffix tree for the data string, with edge labels such as A, T, CT$ and ATTACT$, and numbered leaves]
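To make the search on this slide concrete, here is a minimal sketch that answers the query ‘TTA’ over the same data string. It is an assumption-laden simplification: it builds an uncompressed suffix trie with nested dictionaries, which takes quadratic space and has no suffix links, so it is not the linear-time suffix tree of [Weiner 1973]. The function names `build_suffix_trie` and `find_all` are made up for illustration.

```python
def build_suffix_trie(data):
    """Insert every suffix of `data` into a nested-dict trie.

    Each node keeps the start positions of all suffixes that pass through it,
    so a successful walk-down directly yields every occurrence.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(data)):
        node = root
        for ch in data[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def find_all(trie, query):
    """Walk down the trie along `query`; return all start positions, or []."""
    node = trie
    for ch in query:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]


trie = build_suffix_trie("GTTAATTACT$")
hits = find_all(trie, "TTA")
print(hits)  # start positions of 'TTA' in the data string
```

A real suffix tree compresses each chain of single-child nodes into one labelled edge, which is what brings the node count down to linear in the data length.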

August 2007 Slide 14 SERC Research Symposium

Locate all Maximal Matching Substrings [Chang & Lawler 1990]

• For each position in query sequence Q, locate all longest matching substrings of length ≥ λ (the minimum match length) in the indexed data sequence D

– Example: D = ‘GTTAATTACT$’, Q = ‘CTAATGA’ and λ = 3

Result: { TAAT:<2,1> AAT:<3,2> }, where each pair is <position in D, position in Q>
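The example can be checked with a brute-force sketch. This is not the suffix-tree algorithm of the talk: it simply compares every query position against every data position, so it runs in quadratic time or worse. The function name `maximal_matches` and the parameter `min_len` (the minimum match length) are hypothetical.

```python
def maximal_matches(data, query, min_len):
    """For each query position, report the longest match(es) in `data`
    that start there, provided they are at least `min_len` long.

    Returns a list of (substring, position_in_data, position_in_query).
    """
    results = []
    for q in range(len(query)):
        best_len = 0
        best_starts = []
        for d in range(len(data)):
            # Extend the match starting at data[d] and query[q] as far as possible.
            k = 0
            while (d + k < len(data) and q + k < len(query)
                   and data[d + k] == query[q + k]):
                k += 1
            if k > best_len:
                best_len, best_starts = k, [d]
            elif k == best_len and k > 0:
                best_starts.append(d)
        if best_len >= min_len:
            results.extend((query[q:q + best_len], d, q) for d in best_starts)
    return results


matches = maximal_matches("GTTAATTACT$", "CTAATGA", 3)
print(matches)
```

On the slide's example this yields TAAT at data position 2 / query position 1, and AAT at data position 3 / query position 2, matching the result shown above.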

August 2007 Slide 15 SERC Research Symposium

Maximal Substring Search with Suffix Tree Index

D = ‘GTTAATTACT$’ Q = ‘CTAATGA’ minimum match length λ = 3

[Figure: suffix tree for D, with the search for Q marked on it]

August 2007 Slide 16 SERC Research Symposium

Features of Suffix Tree Index

• Accurate retrieval
– no false negatives (unlike BLAST)

• Linear Time Complexity for both Construction and Search!
– because of Suffix-links

• Widely used
– More than 40-50 applications over biological sequences [Gusfield, 2002]
– MUMmer [Celera Genomics], AVID, …

August 2007 Slide 17 SERC Research Symposium

Crippling Limitation

• Viable only for sequences that are short enough for their associated suffix tree to fit completely in main memory … [Baeza-Yates and Navarro, 2000]

• Best that has been built so far is for sequences of ~ 10 Mbp (Human Genome is 300 times longer!)

August 2007 Slide 18 SERC Research Symposium

Difficulties in Supporting Suffix Trees on Long Sequences - 1

Space overheads are enormous
– Order(s) of magnitude larger than data!
– Human Genome can be easily stored in main memory (~1 GB) but the index could be of the order of 10-100 GB

Hence: disk-resident suffix trees for long sequences
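A back-of-the-envelope check of these figures, as a sketch. The genome length is derived from the slides (about 10 Mbp times 300). The 2-bits-per-base packing is an assumption made here, not something the slides state; it gives a data size somewhat below the slide's rounded ~1 GB.

```python
# Figures taken from the slides
LARGEST_INDEXED_BASES = 10_000_000                    # ~10 Mbp
HUMAN_GENOME_BASES = 300 * LARGEST_INDEXED_BASES      # "300 times longer" -> ~3 Gbp
INDEX_SIZES_GB = (10, 100)                            # index "of the order of 10-100 GB"

# Assumption (not from the slides): A/C/G/T packed into 2 bits per base
BITS_PER_BASE = 2

data_gb = HUMAN_GENOME_BASES * BITS_PER_BASE / 8 / 1e9
index_bytes_per_base = [gb * 1e9 / HUMAN_GENOME_BASES for gb in INDEX_SIZES_GB]
overhead_factor = [gb / data_gb for gb in INDEX_SIZES_GB]

print(f"packed data size: {data_gb:.2f} GB")
for gb, per_base, factor in zip(INDEX_SIZES_GB, index_bytes_per_base, overhead_factor):
    print(f"{gb} GB index: {per_base:.1f} bytes per base, {factor:.0f}x the packed data")
```

Under these assumptions the index costs roughly 3 to 33 bytes per base, which is one to two orders of magnitude more than the packed data.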

August 2007 Slide 19 SERC Research Symposium

Difficulties in Supporting Suffix Trees on Long Sequences - 2

Tree Construction on Disk is Very Slow
– Due to disk thrashing from random seeks

“The active suffix creeps through the text like a caterpillar … corresponding active node swings through the tree like a butterfly” [Giegerich and Kurtz, 1995]

August 2007 Slide 20 SERC Research Symposium

Difficulties in Supporting Suffix Trees on Long Sequences - 3

Searching on Disk is Very Slow
– Unbalanced Tree Structure
• Shape of tree depends on the stochastic properties of the sequence
– “Multi-directional” traversals cause disk thrashing
• Tree-Edge “Vertical Walk-Down”
• Suffix-Link “Horizontal Jump-Across”

Suffix Tree Search

• Edge + Link mesh
• Two phase Search
– Locate
– Report

Combination of Batman and Spiderman!

August 2007 Slide 22 SERC Research Symposium

The SPINE* Index: A Horizontally-Compacted Trie Index

[*Sequence Processing INdexing Engine]

August 2007 Slide 23 SERC Research Symposium

SPINE Index Structure

• Nodes
• Forward Edges
– Vertebras (Backbone)
– Ribs / Ext-Ribs
• Backward Edges
– Links

D = ‘ACCACAC’

[Figure: SPINE index for D, drawn as a single chain of numbered nodes, with labelled parts: Root node, Vertebra, Rib, Extension rib, Link]

Complete horizontal compaction into single linear chain!!

August 2007 Slide 24 SERC Research Symposium

Structural Advantages of SPINE w.r.t. Suffix Trees

1) Number of nodes is equal to the length of the string, whereas a suffix tree can have up to twice as many.

2) Entire data sequence is explicitly embedded in the index, so the data can be thrown away!

3) On-line incremental algorithm (by definition)
– do not need to possess the entire data sequence in advance

4) Node creation order and logical order are the same, so the index is prefix-partitionable

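[Figure: SPINE index for D = ‘ACCA’: nodes 0 to 4 in a chain, with vertebras labelled A, C, C, A and further edges labelled C (0) and A (1)]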

August 2007 Slide 25 SERC Research Symposium

Advantages of SPINE (contd)

5) Each node represents a set of suffixes, whereas in a suffix tree each node represents only a single suffix
– Number of suffixes processed for construction and searching is smaller

6) Easy to develop buffering strategies for persistent implementations


August 2007 Slide 26 SERC Research Symposium

SPINE Performance Summary

Data Sets
– Ecoli: 3.5 Mbp
– Celegans: 15.5 Mbp
– HC21: 28.5 Mbp
– HC19: 57.5 Mbp

Compared against: Suffix Tree (MUMmer, Celera Genomics)

• SPINE Space
– ~2/3 of Suffix Tree

• SPINE Time
– Construction: ~1/2 of Suffix Tree
– Searching: ~1/2 of Suffix Tree

August 2007 Slide 27 SERC Research Symposium

SPINE Summary

• First index based on horizontal (inter-path) compaction of the trie
– Collapses into a single linear structure

• Improved features and performance w.r.t. suffix trees, the classical index
– Prefix-partitionable (first index to have this property)
– Easily amenable to persistent disk implementation
– Retains linear time/space complexity
– Better construction speed and capacity
– Better search response times

August 2007 Slide 28 SERC Research Symposium

Full details at http://dsl.serc.iisc.ernet.in

Questions?

August 2007 Slide 29 SERC Research Symposium

END PRESENTATION

