The Mimicking Octopus - appspot.comjindal-web.appspot.com/slides/VLDB2010_workshop.pdf · Cheap...

Information Systems Group

The Mimicking Octopus: Towards a one-size-fits-all Architecture for Database Systems

Alekh Jindal
Supervisor: Prof. Dr. Jens Dittrich

VLDB PhD Workshop September 13, 2010

September 13, 2010 Towards a one-size-fits-all Database Architecture - Alekh Jindal

Database Landscape

• OLTP

• OLAP

• Streaming System

• Archival System

• Search Engine

Company Information System: Airline Company

Several Applications
Evolving Applications

ETL style data pipelines
Eventual Integration

Licensing Cost

DBA Cost
Maintenance Cost

Engineering Cost

Integration Cost

Hard-coded optimizations
Hard-coded data layouts

Reporting

Cheap Fares

Ticket Booking

Booking Archives

Flight Search


Problem Statement

• Single database system

• Automatic adaptation

• Improved performance

• Lower cost

• Better maintainability


OctopusDB Overview

• One-size-fits-all architecture

• Abstract storage concept: Storage Views (SV)

• Single optimization problem: SV Selection

• Holistic SV optimizer


System Architecture

• No hard-coded store

• All operations recorded as logical log entries in a primary log on stable storage using WAL
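The idea of recording every operation as a logical log entry can be sketched as follows. This is only an illustration, not OctopusDB code: the class name `PrimaryLog`, the JSON-lines file format, and the `(bag, key, value)` entry shape (borrowed from the flight booking example) are assumptions made for the sketch.

```python
import json
import os
import tempfile


class PrimaryLog:
    """Append-only logical log; every operation becomes one (bag, key, value) entry."""

    def __init__(self, path):
        self.path = path

    def append(self, bag, key, value):
        # Write-ahead: the entry is forced to stable storage before we return.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"bag": bag, "key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def scan(self):
        # Replay all entries in the order they were logged.
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


def demo_primary_log():
    # Hypothetical usage: two operations on two different bags.
    with tempfile.TemporaryDirectory() as tmp:
        log = PrimaryLog(os.path.join(tmp, "primary.log"))
        log.append("customers", 1, ["tom", 25])
        log.append("tickets", 301, ["paris", "rome", "E"])
        return log.scan()
```

Because there is no hard-coded store, this log is the only mandatory representation; any other layout would be derived from a scan of it.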

OctopusDB

Storage View Store

Primary Log Store

Log SV

Storage View Catalog

API

Purging & Checkpointing

Recovery Manager

Holistic SV Optimizer

Transaction Manager

Result

Query Catalog





Storage Views

• Arbitrary physical representations of data

• Different layouts under a single umbrella

Primary: Log SV, Row SV, Column SV, Index SV

Secondary: Partial Index SV, Bag-partitioned SV, Key-consolidated SV, Vertically/Horizontally Partitioned SV

... any hybrid combination of the above


Use-case Scenario*

• Flight booking system

• Tables: Tickets, Customers

• Tickets: several attributes, frequently updated

• Customers: fewer attributes

• Queries:
  SELECT C.* FROM Tickets T, Customers C
  WHERE T.customer_id=C.id AND T.a1=x1 AND T.a2=x2 ... AND T.an=xn

* Inspired by Unterbrunner et al., PVLDB 2009.


Flight Booking System

SELECT C.* FROM Tickets T, Customers C
WHERE T.customer_id=C.id AND T.a1=x1 .... AND T.an=xn

[Figure: query plan over the Log SV. Tickets and Customers are both read from the Log SV; the plan combines the selection σ_{a1=x1 .... an=xn}, the join tickets.customer_id = customers.id, and the projection π_{customers.*} into the Result.]

Log SV:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
tickets, 303, <tokyo, beijing, E,...>
customers, 03, <felix, 20, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
tickets, 303, <tokyo, beijing, B,..>
.....

Customers entries in the log:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
.....

Tickets entries in the log:
tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
tickets, 303, <tokyo, beijing, E,...>
tickets, 303, <tokyo, beijing, B,..>
.....


Bag-partitioning

[Figure: the Log SV is partitioned by bag into two Log SVs: customers log = σ_{bag=customers}(Log SV) and tickets log = σ_{bag=tickets}(Log SV). The query plan (selection σ_{a1=x1 ... an=xn}, join tickets.customer_id = customers.id, projection π_{customers.*}) reads from these two logs to produce the Result.]

Log SV:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
tickets, 303, <tokyo, beijing, E,...>
customers, 03, <felix, 20, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
tickets, 303, <tokyo, beijing, B,..>
.....

customers log:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
.....
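A minimal sketch of bag-partitioning, assuming log entries are `(bag, key, value)` tuples as in the example rows on the slide. The function name `bag_partition` and the shortened sample values (e-mail fields dropped) are assumptions for illustration only.

```python
from collections import defaultdict

# Sample entries modeled on the slide; values are shortened for the sketch.
LOG_SV = [
    ("customers", 1, ("tom", 25)),
    ("customers", 2, ("marc", 23)),
    ("tickets", 301, ("paris", "rome", "E")),
    ("tickets", 302, ("moscow", "berlin", "B")),
    ("tickets", 303, ("tokyo", "beijing", "E")),
    ("customers", 3, ("felix", 20)),
    ("customers", 3, ("felix", 20)),
    ("tickets", 303, ("tokyo", "beijing", "B")),
]


def bag_partition(log):
    """Split one log into one log per bag, i.e. sigma_{bag=b}(Log SV) for every b."""
    partitions = defaultdict(list)
    for bag, key, value in log:
        # Entries keep their original log order inside each partition.
        partitions[bag].append((bag, key, value))
    return dict(partitions)
```

With this data, `bag_partition(LOG_SV)["customers"]` holds the four customers entries and `bag_partition(LOG_SV)["tickets"]` the four tickets entries, so a query on one table no longer has to scan entries of the other.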


Key-consolidation

[Figure: each bag-partitioned log is consolidated by key so that only the most recent entry per (bag, key) remains: Customers = Γ_{bag,key} γ_{recent}(σ_{bag=customers}(Log SV)) and Tickets = Γ_{bag,key} γ_{recent}(σ_{bag=tickets}(Log SV)). The join on tickets.customer_id and the projection π_{customers.*} read from the consolidated views to produce the Result.]

Customers:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
.....

Tickets:
tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
tickets, 303, <tokyo, beijing, B,..>
.....
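Key-consolidation can be sketched as a grouping on `(bag, key)` that keeps the most recent entry of each group. The function name `key_consolidate` and the tuple layout are assumptions; "most recent" is taken here to mean "last in log order".

```python
def key_consolidate(log):
    """Keep only the most recent entry per (bag, key); later log entries win."""
    recent = {}
    for bag, key, value in log:
        # Overwriting implements the "recent" aggregate over each (bag, key) group.
        recent[(bag, key)] = value
    return [(bag, key, value) for (bag, key), value in recent.items()]


# Hypothetical tickets log with two versions of ticket 303.
TICKETS_LOG = [
    ("tickets", 301, ("paris", "rome", "E")),
    ("tickets", 302, ("moscow", "berlin", "B")),
    ("tickets", 303, ("tokyo", "beijing", "E")),
    ("tickets", 303, ("tokyo", "beijing", "B")),
]
```

Applied to `TICKETS_LOG`, the two entries for ticket 303 collapse into the later one (class `B`), which matches the Tickets rows listed on the slide.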


Storage View Transformation

[Figure: the key-consolidated Storage Views, Γ_{bag,key} γ_{recent}(σ_{bag=customers}(Log SV)) and Γ_{bag,key} γ_{recent}(σ_{bag=tickets}(Log SV)), are transformed into a Row SV and a Col SV. The join on tickets.customer_id and the projection π_{customers.*} read from these SVs to produce the Result.]

tickets:
tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
tickets, 303, <tokyo, beijing, B,..>
.....

customers:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
.....
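To make the difference between the two layouts concrete, here is a small sketch of a Row SV to Col SV transformation and back. The function names and the attribute names `src`, `dst`, and `cls` are hypothetical; the slides do not name the ticket attributes.

```python
def row_to_col(rows, attributes):
    """Row layout (list of tuples) -> column layout (one list per attribute)."""
    return {attr: [row[i] for row in rows] for i, attr in enumerate(attributes)}


def col_to_row(cols, attributes):
    """Column layout -> row layout; tuples are re-assembled by position."""
    return list(zip(*(cols[attr] for attr in attributes)))


# Hypothetical attribute names for the ticket tuples shown on the slide.
TICKET_ATTRS = ["src", "dst", "cls"]
TICKET_ROWS = [
    ("paris", "rome", "E"),
    ("moscow", "berlin", "B"),
    ("tokyo", "beijing", "B"),
]
```

A query that touches only `cls` can read a single list from the column layout, whereas the row layout hands back whole tuples; that trade-off is what the transformation is meant to exploit.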


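Hot-Cold Storage Views

[Figure: the key-consolidated tickets view, Γ_{bag,key} γ_{recent}(σ_{bag=tickets}(Log SV)), is split by time into ticketsHot = σ_{time>=now-7days} and ticketsCold = σ_{time<now-7days}. The plan now contains two Col SVs and one Row SV besides the Log SV; customers remains Γ_{bag,key} γ_{recent}(σ_{bag=customers}(Log SV)). The join on tickets.customer_id and the projection π_{customers.*} produce the Result.]

tickets entries after the split (two partitions):
tickets, 303, <tokyo, beijing, B,..>
.....

tickets, 301, <paris, rome, E,...>
tickets, 302, <moscow, berlin, B,...>
.....

customers:
customers, 01, <tom, 25, [email protected], ...>
customers, 02, <marc, 23, [email protected], ...>
customers, 03, <felix, 20, [email protected], ...>
.....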
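The hot-cold split on the predicate `time >= now - 7 days` can be sketched as below. The function name, the dictionary-based entry shape, and the `time` field are assumptions for the sketch; only the seven-day boundary comes from the slide.

```python
from datetime import datetime, timedelta


def split_hot_cold(entries, now, window=timedelta(days=7)):
    """Partition entries into hot (time >= now - window) and cold (time < now - window)."""
    cutoff = now - window
    hot = [e for e in entries if e["time"] >= cutoff]
    cold = [e for e in entries if e["time"] < cutoff]
    return hot, cold


# Hypothetical timestamps; the slide does not show them.
NOW = datetime(2010, 9, 13)
TICKETS = [
    {"key": 301, "time": datetime(2010, 8, 1)},
    {"key": 302, "time": datetime(2010, 9, 1)},
    {"key": 303, "time": datetime(2010, 9, 10)},
]
```

Since the predicate depends on `now`, entries migrate from hot to cold as time passes, so such a split would need to be re-evaluated periodically.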



Index Storage Views

[Figure: the hot-cold plan extended with two Index SVs, ticketsHotIndex and customersIndex, built from the projections π_{id,rid} and π_{price,rid}. The rest of the plan is unchanged: Log SV, the key-consolidated customers and tickets views, the split into σ_{time>=now-7days} and σ_{time<now-7days}, one Row SV and two Col SVs, and the join on tickets.customer_id with projection π_{customers.*} producing the Result.]








Isn’t this the same as Materialized Views?




NO!






Materialized View knows what to materialize



Storage View also knows how to materialize




A Materialized View still needs a Storage View


Storage View Selection

[Figure: the complete plan from the previous slides: Log SV; the key-consolidated customers and tickets views; ticketsHot and ticketsCold split at time>=now-7days; one Row SV and two Col SVs; the Index SVs ticketsHotIndex and customersIndex on π_{id,rid} and π_{price,rid}. Several Storage Views are annotated with a Result, and an Optimizer sits on top of the plan.]




Pick the right Storage Views to create, update, query, and drop


Single Optimization Problem: “Storage View Selection”


Holistic Storage View Optimizer

• Storage is fully dynamic: any subset of data in any storage structure

• Storage View selection

• Storage View update maintenance

• Pick physical execution plan

• Combine results spanning several Storage Views



Research Challenges

• Single umbrella for different storage layouts
  - storage layer abstraction
  - still layout-specific specialization

• Automatic adaptive bifurcation
  - monolithic system
  - right online algorithms

• Simplicity vs Optimization
  - only as complex as required
  - mimic several specialized systems


Related Work

• Materialized Views [Chirkova et al. VLDBJ 2002]
  - as pointed out before, different from storage views

• Dynamic materialized views [Zhou et al. ICDE 2007]
  - horizontal dynamism, storage view still open

• View matching, query containment [A. Y. Halevy VLDBJ 2001]
  - again operate on a higher level

• Cracked databases [Idreos et al. CIDR 2007]
  - logical partitioning of data, only horizontal

• Rodent store [Cudre-Mauroux et al. CIDR 2009]
  - still assumes a store

• GMAP [Tsatalos et al. VLDB 1994]
  - does not adapt the stores


Optimizer Cost Model

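Symbol | Meaning | Model

$C^{log}_{scan}(N)$ | Log SV scan cost | $\left\lceil \frac{\sum_{i=1}^{N} colsize(log_i)}{m} \right\rceil \cdot C_{random} + \left\lceil \frac{\sum_{i=1}^{N} colsize(log_i)}{pageSize} \right\rceil / BW$

$C^{row}_{scan}(N)$ | Row SV scan cost | $\left\lceil \frac{N \cdot \sum_{A_i \in A} colsize(A_i)}{m} \right\rceil \cdot C_{random} + \left\lceil \frac{N \cdot \sum_{A_i \in A} colsize(A_i)}{pageSize} \right\rceil / BW$

$C^{col}_{scan}(N, S)$ | Col SV scan cost | $\sum_{A_i \in S} \left( \left\lceil \frac{N \cdot \sum_{A_j \in S} colsize(A_j)}{m} \right\rceil \cdot C_{random} + \left\lceil \frac{N \cdot colsize(A_i)}{pageSize} \right\rceil / BW \right)$

$C^{index}_{lookup}(N)$ | Index lookup cost | $C_{random} \cdot \left\lceil \log_F \left( N \cdot (colsize(key) + pointerSize) / pageSize \right) \right\rceil$

$C^{row\ cl.\ index}_{scan}(N, sel)$ | Clustered Indexed Row SV scan cost | $C^{index}_{lookup}(N) + C^{row}_{scan}(\lceil sel \cdot N \rceil)$

$C^{col.\ cl.\ index}_{scan}(N, S, sel)$ | Clustered Indexed Col SV scan cost | $C^{index}_{lookup}(N) + C^{col}_{scan}(\lceil sel \cdot N \rceil, S)$

$C^{row\ uncl.\ index}_{scan}(N, sel)$ | Unclustered Indexed Row SV scan cost | $C^{index}_{lookup}(N) + \lceil sel \cdot N \rceil \cdot (C_{random} + pageSize/BW)$

$C^{col.\ uncl.\ index}_{scan}(N, S, sel)$ | Unclustered Indexed Col SV scan cost | $C^{index}_{lookup}(N) + \lceil sel \cdot N \rceil \cdot \lvert S \rvert \cdot (C_{random} + pageSize/BW)$

Table 1: SV Query Cost model

Symbol | Meaning | Model

$C^{log}_{update}(N_u)$ | Log SV update cost | $C^{log}_{scan}(N_u)$

$C^{row}_{update}(N, N_u)$ | Row SV update cost | $\min \left( C_{random} + \left\lceil \frac{N}{N_c} \right\rceil \cdot C^{row}_{scan}(2 \cdot N_c),\ \left\lceil \frac{N}{N_c} \right\rceil \cdot C^{row}_{scan}(N_c) + N_u \cdot (C_{random} + pageSize/BW) \right)$

$C^{col}_{update}(N, N_u, S)$ | Col SV update cost | $\min \left( C_{random} + \left\lceil \frac{N}{N_c} \right\rceil \cdot C^{col}_{scan}(2 \cdot N_c, S),\ \left\lceil \frac{N}{N_c} \right\rceil \cdot C^{col}_{scan}(N_c, S) + N_u \cdot \lvert S \rvert \cdot (C_{random} + pageSize/BW) \right)$

$C^{index}_{split}(d)$ | Index split cost | $\left( \sum_{i=1}^{d} (p_{split})^i \right) \cdot C_{random}$

$C^{row\ cl.\ index}_{update}(N, N_u, d)$ | Cl. Index Row SV update cost | $C^{index}_{lookup}(N) + 2 \cdot C^{row}_{scan}(N_u) + C^{index}_{split}(d)$

$C^{col.\ cl.\ index}_{update}(N, N_u, S, d)$ | Cl. Index Col SV update cost | $C^{index}_{lookup}(N) + 2 \cdot C^{col}_{scan}(N_u, S) + C^{index}_{split}(d)$

$C^{row\ uncl.\ index}_{update}(N, N_u, d)$ | Uncl. Index Row SV update cost | $C^{index}_{lookup}(N) + N_u \cdot (C_{random} + pageSize/BW) + C^{index}_{split}(d)$

$C^{col.\ uncl.\ index}_{update}(N, N_u, S, d)$ | Uncl. Index Col SV update cost | $C^{index}_{lookup}(N) + N_u \cdot \lvert S \rvert \cdot (C_{random} + pageSize/BW) + C^{index}_{split}(d)$

Table 2: SV Update Cost model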

transformation below. We consider Log, Row, Col, and Index SV. Table 3 describes the symbols used in our cost models.

Query Cost. Table 1 shows the query cost models for Log, Row, Col, and Index SVs. We express each of the cost functions as a summation of random and sequential I/O costs. We consider the scan operation to be I/O-bound and hence neglect CPU costs. Notice that the scan operations for Row and Col SVs are buffered reads, i.e. OctopusDB reads as many tuples from a SV as can fit in the memory assigned to it. We need buffered reading for Col SV, because we need to join individual attributes to re-construct the tuple; for Row SV we also consider the additional random I/O costs when reading multiple relations competing for the same hard disk, e.g. for join processing.
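The Row SV and Col SV scan costs from Table 1 can be evaluated with a few lines of code. The sketch below follows our reading of the extracted table (one random I/O per buffer refill plus sequential page reads); the function and parameter names, and all numeric values used with it, are assumptions chosen for illustration.

```python
import math


def row_scan_cost(n, colsizes, m, page_size, bw, c_random):
    """Row SV scan: all attributes of all n rows are read."""
    total_bytes = n * sum(colsizes.values())
    random_part = math.ceil(total_bytes / m) * c_random       # one seek per buffer refill
    sequential_part = math.ceil(total_bytes / page_size) / bw  # pages / (pages per second)
    return random_part + sequential_part


def col_scan_cost(n, colsizes, selected, m, page_size, bw, c_random):
    """Col SV scan: only the selected attributes are read, one column at a time."""
    selected_bytes = n * sum(colsizes[a] for a in selected)
    refills = math.ceil(selected_bytes / m)
    return sum(
        refills * c_random + math.ceil(n * colsizes[a] / page_size) / bw
        for a in selected
    )
```

For example, with 100,000 rows of 20 eight-byte attributes, `m = 4_000_000`, `page_size = 4096`, `bw = 10_000` pages/sec and `c_random = 0.005` sec, scanning 4 of the 20 columns in a Col SV is cheaper than a Row SV scan, while scanning all 20 columns in a Col SV is more expensive than the Row SV scan.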

Symbol | Meaning | Unit

$N$ | number of rows in table |

$N_c$ | number of rows in a table chunk |

$BW$ | sequential bandwidth of hard disk | Pages/sec

$C_{random}$ | costs for a random access | sec

$pageSize$ | size of a page | Byte

$pointerSize$ | size of a pointer in an index | Byte

$F$ | fan-out of index node, $F = 1 + \left\lfloor \frac{pageSize - 2 \cdot pointerSize}{2 \cdot colsize(key)} \right\rfloor$ |

$d$ | depth of an index tree, $d = \left\lceil \log_F \left( \left\lceil \frac{N \cdot (colsize(key) + pointerSize)}{pageSize} \right\rceil \right) \right\rceil$ |

$sel$ | query selectivity |

$m$ | available main memory | Byte

$A$ | set of all attributes in table |

$key$ | indexed attribute |

$colsize(A_i)$ | size of a value of attribute $i$ | Byte

$S$ | set of selected attributes |

$p_{split}$ | probability of a leaf split |

Table 3: Symbols used in cost models
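As a worked example of the two derived symbols, the sketch below computes the fan-out $F$ and the tree depth $d$. It assumes the formulas read from the extracted table, $F = 1 + \lfloor (pageSize - 2 \cdot pointerSize) / (2 \cdot colsize(key)) \rfloor$ and $d = \lceil \log_F \lceil N \cdot (colsize(key) + pointerSize) / pageSize \rceil \rceil$; the function names and the parameter values are hypothetical. The logarithm is computed with integers to avoid floating-point rounding at exact powers of $F$.

```python
import math


def fan_out(page_size, pointer_size, key_size):
    """F = 1 + floor((pageSize - 2*pointerSize) / (2*colsize(key)))."""
    return 1 + (page_size - 2 * pointer_size) // (2 * key_size)


def ceil_log(x, base):
    """Smallest integer d with base**d >= x (requires base >= 2, x >= 1)."""
    d, reach = 0, 1
    while reach < x:
        reach *= base
        d += 1
    return d


def index_depth(n, page_size, pointer_size, key_size):
    """d = ceil(log_F(ceil(N * (colsize(key) + pointerSize) / pageSize)))."""
    f = fan_out(page_size, pointer_size, key_size)
    leaf_pages = math.ceil(n * (key_size + pointer_size) / page_size)
    return ceil_log(leaf_pages, f)
```

With 4096-byte pages and 8-byte keys and pointers, the fan-out is 256, and an index over 100,000 rows needs 391 leaf pages and therefore has depth 2.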

SV Transformation | Cost

Log SV → Row SV | $C^{log}_{scan}(N) + C^{row}_{scan}(N)$

Log SV → Col SV | $C^{log}_{scan}(N) + C^{col}_{scan}(N, A)$

Row SV ↔ Col SV | $C^{row}_{scan}(N) + C^{col}_{scan}(N, A)$

Row SV → Index SV | $C^{row}_{scan}(N) + \left( \frac{F^{d+1} - 1}{F - 1} \right) \cdot C_{random}$

Col SV → Index SV | $C^{col}_{scan}(N, \{key, rowID\}) + \left( \frac{F^{d+1} - 1}{F - 1} \right) \cdot C_{random}$

Table 4: SV Transformation Cost model

Update Cost. All updates to OctopusDB are done in the Primary Log and OctopusDB later propagates them recursively to the subsequent SVs using any appropriate maintenance algorithm. Therefore, update costs are a crucial factor when determining which SVs to keep. Table 2 shows our update cost model for Log, Row, Col, and Index SVs. For Row and Col SVs, we assume that either the tuples are scanned and updated in chunks of $N_c$; or each update triggers a random-I/O. We take the minimum cost among these two options as the update cost (as done by a cost-based optimizer). For updates in Index SVs, we also consider the costs to split leaves or nodes in the index structure ($C^{index}_{split}$). We model the probability of having a node/leaf split at a level as exponentially proportional to the depth of the level in the index tree.
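The index split cost, $C^{index}_{split}(d) = \left( \sum_{i=1}^{d} (p_{split})^i \right) \cdot C_{random}$, is a short geometric sum. The sketch below evaluates it directly; the function name and the example values are assumptions for illustration.

```python
def index_split_cost(d, p_split, c_random):
    """Expected split cost: a split at level i has probability p_split**i
    and costs one random I/O."""
    expected_splits = sum(p_split ** i for i in range(1, d + 1))
    return expected_splits * c_random
```

For a tree of depth 2 with `p_split = 0.5` and `c_random = 0.01` sec, the expected number of splits is 0.5 + 0.25 = 0.75, so the split cost is 0.0075 sec. Because the terms shrink geometrically, the cost stays below `p_split / (1 - p_split) * c_random` however deep the tree is.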

For each call to registerSV or registerQuery, OctopusDB stores the reference to the output SV or Query in the SV or Query Catalog respectively. Again, the holistic SV optimizer is responsible for propagating updates from the primary log to all SVs recursively. There are several ways, e.g. lazy updates, to do such SV maintenance. We believe that existing works from materialized views could be adapted in OctopusDB for SV maintenance. However, OctopusDB poses several new challenges, e.g. how to compute the optimal number of stores. This also has to consider the amount of overlap among stores, i.e. to avoid extensive update costs for redundant data representations.

Transformation Cost. Finally, we also model the costs to transform one type of SV to another in Table 4. We consider transformation as a query scan on the input SV followed by an update scan on the output SV. For Index SV, only the index attributes and rowID need to be read; the index tree needs to be built on those attributes only. The transformation cost model can be used by the holistic SV optimizer when considering whether to transform one SV to another, e.g. whether to transform a Row SV into a Col SV. Transformation cost is the price that OctopusDB has to pay, while the benefit could be the reduced scan costs, i.e. the difference between the iteration costs of the old and new SVs, or reduced update cost. As SVs are fully optional, OctopusDB may balance the two cost factors based on a given workload.

The three cost models discussed above form the backbone of the holistic SV optimizer. Based on these cost models the holistic SV optimizer can create, maintain, scan, transform, or delete any SV.
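The trade-off between transformation cost and reduced scan cost can be illustrated with a simple break-even calculation. This is a simplification made for the sketch, not the optimizer described in the paper: it ignores update costs and assumes every query scans the SV once. The function name and the example figures are hypothetical.

```python
import math


def break_even_queries(transform_cost, old_scan_cost, new_scan_cost):
    """Number of scans after which a transformation has paid for itself.

    Returns None when the new SV is not cheaper to scan, i.e. the
    transformation never amortizes under this simplified model.
    """
    saving = old_scan_cost - new_scan_cost
    if saving <= 0:
        return None
    return math.ceil(transform_cost / saving)
```

For instance, if a Row SV to Col SV transformation costs 1.2027 sec, a Row SV scan costs 0.4107 sec, and the corresponding Col SV scan costs 0.0984 sec, the transformation pays off after 4 scans.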



Comparing Different Stores

[Figure: bar chart of workload time [seconds] (y-axis from 0 to 1), broken down into Query Costs and Update Costs, for Row Store, Column Store, Indexed Row Store, Indexed Column Store, Fractured Mirrors, Indexed Fractured Mirrors, and OctopusDB.]

 | Tickets | Customers
Tuples | 100,000 | 20,000
Selectivity | 0.9 | 0.1
Attributes Referenced | 4/20 | 20/20


Next Steps

1. Automatically picking the right layout
   - row, column, partitioned, cracked, more?

2. Storage View compression
   - adaptive compression

3. Storage View maintenance
   - maintaining heterogeneous SVs

4. OctopusDB benchmarking and evaluation
   - one-size-fits-all benchmark


Summary

[Figure: recap collage of earlier slides: the OctopusDB system architecture, the Storage View plan for the flight booking example (Log SV, Row SV, Col SVs, Index SVs, ticketsCold), the store comparison chart (Query Costs and Update Costs), and the Motivation slide.]

Thanks!

Friday, September 10, 2010
