indexing & sortingprof. Stratos Idreos
HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/
class 8
http://daslab.seas.harvard.edu/classes/cs165/
CS165, Fall 2016 Stratos Idreos /362
first part done: basic concepts in modern systems
coming up: indexing and fast scans
CS165, Fall 2016 Stratos Idreos /363
column-storage row-storage
A B C D A B C D
it all starts with how we layout the data (bits)
CS165, Fall 2016 Stratos Idreos /364
essential column-stores featuresvirtual ids late tuple reconstruction (if ever) vectorized execution compression fixed-width columns
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
Column"Store" Row"Store"
Run$
me'(sec)'
Performance'of'Column3Oriented'Op$miza$ons'
–Late"Materializa:on"
–Compression"
–Join"Op:miza:on"
–Tuple@at@a@:me"
Baseline"
Column-stores vs. row-stores: how different are they really? D. Abadi, S. Madden, and N. Hachem
ACM SIGMOD Conference on Management of Data, 2008
CS165, Fall 2016 Stratos Idreos /365
registers
on chip cache2x
on board cache10x
memory100x
disk100Kx
Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations award
Pluto2 years
New York1.5 hours
this building10 min
this room1 min
my head~0
CS165, Fall 2016 Stratos Idreos /366
random access & page-based access
…
need to only read x… but have to read all of page 1
page1 page2 page3
data value x
registers
on chip cache
on board cache
memory
disk
CPU
data
mov
e
CS165, Fall 2016 Stratos Idreos /367
and then from ideas to practice: implementation, tuning, analysis
CS165, Fall 2016 Stratos Idreos /368
as you are starting your projects, remembercome to office hours - read/post in piazza
distributed API, DSL, code is supposed to help you start fast diverging is perfectly ok when there is a good reason to do so
functionality goal: select max(R.a), min(S.a) from R, S where R.j=S.j and R.b10 and S.d
CS165, Fall 2016 Stratos Idreos /369
vectorwised processing: how to
select max(A) from R where B
CS165, Fall 2016 Stratos Idreos /3610
midtermshow to prepare
open book, notes, no laptop/discussion
material from lectures only
check all quizzes and questions
quiz-like questions - no exact answer
expectations: describe the design space - chose what you think is the best approach (>1 if we ask for it) and then analyze in detail all requests - if you made the wrong choice in the begging it is ok - but say so if you find out in the end and explain as much as possible
explain all steps and tradeoffs
Sunday before midterm: Stratos office hours 2pm+
10/12
CS165, Fall 2016 Stratos Idreos /3611
today+3data access made better
CS165, Fall 2016 Stratos Idreos /3612
select
data data data
join
aggr
selectselect
join
it all starts with the select operator
it touches all the data
CS165, Fall 2016 Stratos Idreos /3613
filtering data: point/range queriesindex
data
index knows structure of the data
an alternative data representation (data structure) of all or part of the data
CS165, Fall 2016 Stratos Idreos /3614
why bother with creating/maintaining another data structure?
but wait, why not just sort the data (array) +
binary search?
CS165, Fall 2016 Stratos Idreos /3615
ok let’s go with sorting for a while
A B C
initial state columns in
insertion order
sorted A B C
select B+C from R where A
CS165, Fall 2016 Stratos Idreos /3616
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4 c5
A B Ca5 a3 a2 a1 a4
Ab1 b2 b3 b4 b5
c1 c2 c3 c4 c5
B Cvalues are out of order
CS165, Fall 2016 Stratos Idreos /3616
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4 c5
A B Ca5 a3 a2 a1 a4
Ab1 b2 b3 b4 b5
c1 c2 c3 c4 c5
B Cvalues are out of order
5 3 2 1 4
CS165, Fall 2016 Stratos Idreos /3616
a1 a2 a3 a4 a5
b1 b2 b3 b4 b5
c1 c2 c3 c4 c5
A B Ca5 a3 a2 a1 a4
Ab1 b2 b3 b4 b5
c1 c2 c3 c4 c5
B Cvalues are out of order
5 3 2 1 4
a5 a3 a2 a1 a4
A5 3 2 1 4
select2 1 4
b1 b2 b3 b4 b5
B
intermediate out of order
CS165, Fall 2016 Stratos Idreos /36
sort or cluster
17
CS165, Fall 2016 Stratos Idreos /3618
database kernel
data data data
algo
rithm
s/op
erat
ors
applications
sql
disk
memory
cpu
CS165, Fall 2016 Stratos Idreos /3619
A B C
initial state columns in
insertion order
sorted A B C
sorted A B C
propagate order of A
CS165, Fall 2016 Stratos Idreos /3620
pos1 pos2
1 0 1 0
A
sort
edselect max(D),min(E) from R where (A>10 and A20 and B20 and B
CS165, Fall 2016 Stratos Idreos /3621
A
sort
edselect max(D),min(E) from R where (A>10 and A20 and B20 and B
CS165, Fall 2016 Stratos Idreos /3622
A B C
base data
A B C
sort
ed
B A C
sort
ed
queries that filter on A benefit
…C-Store: A Column-oriented DBMSMichael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, Stanley B. Zdonik In Proc. of the Very Large Databases Conference (VLDB), 2005
queries that filter on B benefit
CS165, Fall 2016 Stratos Idreos /36
initial state columns in
insertion order
23
A B C
base data
A B C
sort
ed
B A C
sort
ed
…
space overhead - update overhead - which ones to build?
CS165, Fall 2016 Stratos Idreos /3624
declarative interface ask what you want
db system
DBAindexes/views/tuning knobs
CS165, Fall 2016 Stratos Idreos /36
online
25
initial state columns in
insertion order
A B C
base datastorage budget
CS165, Fall 2016 Stratos Idreos /3626
level 1
level 2
CPU
(assume simplified memory hierarchy)
cost to sort array Cs?cost to find a value once sorted Ca?optimized algorithm to minimize Cs & Ca
data does not fit in level 1 memory CPU can read/write directly from/to level 1 only
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
quicksort in place
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
quicksort in place
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
quicksort in place
initial state: 8 unordered pages
CS165, Fall 2016 Stratos Idreos /3627
memory level L
memory level L-1
(size=3 pages)
initial state: 8 unordered pages
each page is now sorted we read and wrote every page once
data movement cost is 2N pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
merge to new page
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
merge to new page
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
merge to new page
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
merge to new page
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
merge to new page
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
initial state: 8 sorted pages
CS165, Fall 2016 Stratos Idreos /3628
memory level L
memory level L-1
(size=3 pages)
initial state: 8 sorted pages
each pair of pages is now sorted we read and wrote every page once
data movement cost is 2N pages (total 2N+2N)
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
log2(N)
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
log2(N)+1
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
2N(log2(N)+1)
CS165, Fall 2016 Stratos Idreos /3629
1 pass to sort each page (2N pages)
1 pass to merge into 2 sorted pages (2N pages)
1 pass to merge into 4 sorted pages (2N pages)
1 pass to merge into 8 sorted pages (2N pages)
2N(log2(N)+1) x bytesPerPage
CS165, Fall 2016 Stratos Idreos /3630
in general, we have M pages in memory not just 3 so
in our first pass we can immediately sort groups of M pages
&
CS165, Fall 2016 Stratos Idreos /3630
in general, we have M pages in memory not just 3 so
2N(log2(N)+1) -> 2N(logM-1(N)+1)
in our first pass we can immediately sort groups of M pages
2N(logM-1(N)+1) -> 2N(logM-1(N/M)+1)
&
CS165, Fall 2016 Stratos Idreos /3631
data size: N pages memory size: M pages
how much memory M do we need to sort N data in p passes only?
or
how much data can we sort in p passes if we have M memory?
logM-1(N/M)+1
CS165, Fall 2016 Stratos Idreos /36
previous discussion holds for all levels of memory hierarchy
32
CS165, Fall 2016 Stratos Idreos /3633
other usage of sorting, e.g.,:order by group by sort merge join remove duplicates sort positions when unordered to avoid random access
CS165, Fall 2016 Stratos Idreos /3634
indexing helps navigate data faster than scan
indexing is (some times) just another way to organize data
we need to consider all levels of memory hierarchy
when we design our algorithms
and to optimally use all available bytes
Notes to remember
CS165, Fall 2016 Stratos Idreos /3635
textbook: Chapter 13
Self-organizing tuple reconstruction in column-storesStratos Idreos, Martin Kersten, Stefan Manegold In Proc. of the ACM SIGMOD Inter. Conference on Management of Data, 2009
DATA SYSTEMSprof. Stratos Idreos
class 8
indexing & sorting