C-Store: Column Stores over Solid State Drives
Jianlin Feng, School of Software, Sun Yat-sen University, Jun 19, 2009
Solid State Drives (SSDs) vs. Hard Disk Drives (HDDs)
HDDs (traditional magnetic hard drives) perform sequential reads much faster than random reads. The traditional wisdom is to avoid random I/O as much as possible.
SSDs perform random reads more than 100x faster than HDDs, and offer comparable sequential read and write performance.
SSDs’ random write performance is much worse than random read performance.
Characteristics of HDD and SSD (NAND Flash)
How to Leverage the Fast Random Reads of SSDs?
Avoid reading unnecessary attributes during selections and projections: this is the idea of a column store.
Reduce I/O requirements during joins by minimizing passes over the related tables.
Minimize the I/O needed to fetch attribute values by late materialization.
Page Layouts: NSM vs PAX
NSM: the traditional row-store layout.
PAX: a hybrid of row-store and column-store layouts.
• Each page is divided into n minipages.
• Each minipage stores the values of one column contiguously.
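As a rough illustration of the PAX idea, the following Python sketch groups a page's rows into per-column minipages (the `PaxPage` class and its in-memory layout are hypothetical, for illustration only — the real layout packs minipages into a fixed-size disk page):

```python
# Illustrative PAX page: the page holds the same rows as an NSM page,
# but values are grouped column-by-column into minipages.

class PaxPage:
    def __init__(self, rows):
        # rows: list of tuples; build one minipage per attribute
        n_cols = len(rows[0])
        self.minipages = [[row[c] for row in rows] for c in range(n_cols)]

    def minipage(self, col):
        # All values of one attribute are contiguous within the page.
        return self.minipages[col]

page = PaxPage([(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0)])
print(page.minipage(0))  # [1, 2, 3]
```

Because each minipage is one contiguous run of same-attribute values, a scan that touches only a few attributes reads only those runs and stays cache-friendly.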
FlashScan Operator
FlashScan is a scan operator that leverages the PAX layout to improve selections and projections on flash SSDs.
Basic idea: once a page is brought into main memory, read only the minipages of the attributes that are needed.
The goal is to reduce memory bandwidth usage. The cache line is 128 bytes long, suggesting that ideally a minipage should be the same size.
An Example Run of FlashScan
Consider a scan that simply projects the 1st and 3rd columns of the relation in Figure 3. For each page, FlashScan first reads the minipage of the 1st column, then “seeks” to the start of the 3rd minipage and reads it. It then “seeks” again to the first minipage of the next page. This procedure continues over the entire relation, resulting in a random access pattern.
Processing Random Reads in Batch Mode
In general, every “seek” results in a random read. FlashScan coalesces these reads and performs one random SSD access for each set of contiguous minipages.
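The coalescing step can be sketched as follows; `coalesce_reads` is a hypothetical helper (not from the paper's code) that groups contiguous minipage indices into runs, each run costing one random SSD access:

```python
def coalesce_reads(needed_cols):
    """Group a sorted, non-empty list of minipage indices into contiguous
    runs, so each run can be fetched with a single random read."""
    runs, run = [], [needed_cols[0]]
    for c in needed_cols[1:]:
        if c == run[-1] + 1:
            run.append(c)          # extends the current contiguous run
        else:
            runs.append((run[0], run[-1]))
            run = [c]              # start a new run
    runs.append((run[0], run[-1]))
    return runs

# Projecting columns 1, 3 and 4 of a page: columns 3 and 4 are adjacent,
# so two random reads suffice instead of three.
print(coalesce_reads([1, 3, 4]))  # [(1, 1), (3, 4)]
```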
Implementing FlashScan in Postgres
Divide every page into densely packed minipages.
Modify the buffer manager and bulk loader to work with the PAX page layout. A page in the buffer pool may be partially full, containing only the minipages transferred by FlashScan.
Make FlashScan output tuples in row format, i.e., perform tuple reconstruction.
Optimization for Selection Predicates
The technique: read only the minipages that contain values satisfying the selection conditions.
This technique is beneficial for highly selective conditions and for selection conditions applied to sorted or partially sorted attributes.
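A minimal sketch of this predicate-first idea, with a hypothetical page representation (one list of values per minipage): the predicate column's minipage is read first, and the remaining minipages are touched only when some value on the page qualifies.

```python
def scan_with_predicate(pages, pred_col, pred, out_cols):
    """pages: list of pages, each a list of minipages (one list per column).
    Evaluate pred on the predicate minipage first; fetch the other
    minipages only for pages that contain at least one qualifying value."""
    results = []
    for page in pages:
        hits = [i for i, v in enumerate(page[pred_col]) if pred(v)]
        if not hits:
            continue  # skip the remaining minipages of this page entirely
        for i in hits:
            results.append(tuple(page[c][i] for c in out_cols))
    return results

pages = [[[1, 2], ["a", "b"]],   # page 1: both rows qualify
         [[9, 9], ["c", "d"]]]   # page 2: no row qualifies, never fetched
print(scan_with_predicate(pages, 0, lambda v: v < 5, [1]))  # [('a',), ('b',)]
```

The more selective the predicate, the more non-predicate minipages are skipped, which is why the benefit is largest for highly selective conditions.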
FlashScan and Column Stores
FlashScan needs to “seek” between minipages; column stores need to “seek” between columns.
Column stores on HDDs read a large portion of a single column at a time to amortize the “seek” overhead.
If a column store is built on SSDs, it should behave similarly to FlashScan. This assertion needs experimental validation.
Moreover, a PAX-based system can be easily integrated with a row-store DBMS. So do we even need to build a column store on SSDs?
Overview of FlashJoin
FlashJoin is a multi-way equi-join algorithm.
It is implemented as a pipeline of binary joins.
Each binary join in the pipeline consists of two separate operators: a join kernel and a fetch kernel.
An Example of FlashJoin Using Late Materialization
Join Kernel
The join kernel leverages FlashScan to fetch only the join attributes needed from the base relations, i.e., FlashJoin uses late materialization. Hence the join kernel needs less memory, which may lead to fewer passes for computing the join.
The join kernel computes the join and outputs a join index. For example, Join 2 on the previous slide produces a join index containing three RIDs (id1, id2, id3) pointing to rows of R1, R2, and R3.
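The late-materialization behavior of a binary join kernel can be sketched as a hash join over the join attributes alone; the function below is a hypothetical illustration (not the paper's implementation) that emits RID pairs rather than full tuples:

```python
def join_kernel(left_keys, right_keys):
    """left_keys / right_keys: lists of join-attribute values, where the
    list position serves as the RID. Returns a join index of RID pairs."""
    # Build a hash table on the right input's join attribute only.
    index = {}
    for rid, key in enumerate(right_keys):
        index.setdefault(key, []).append(rid)
    # Probe with the left input; output carries no payload columns.
    join_index = []
    for lrid, key in enumerate(left_keys):
        for rrid in index.get(key, []):
            join_index.append((lrid, rrid))
    return join_index

print(join_kernel([5, 7, 5], [5, 9, 7]))  # [(0, 0), (1, 2), (2, 0)]
```

Because only the join attributes and RIDs flow through the pipeline, the hash table stays small, which is the source of the memory savings described above.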
Fetch Kernel
The fetch kernel uses the join index to do tuple reconstruction, i.e., to retrieve the values of the projected attributes for the tuples in the join result.
A naïve strategy is to do tuple reconstruction in a tuple-at-a-time fashion. If several tuples belong to the same page, that page may be read several times. This is not a problem if there is enough memory.
An Optimization in the Fetch Kernel
Make multiple passes over the join index to fetch attributes in row order from one relation at a time. In each pass, the join index is sorted on the RIDs of the current relation R to be scanned. Then the needed attributes are retrieved from R for each tuple, and the join index is augmented with those attributes.
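One pass of this optimization can be sketched as follows (names and the relation representation are illustrative): the join-index entries are visited in sorted RID order for the current relation, while the fetched values are written back into the entries' original positions so the join index keeps its order.

```python
def fetch_pass(join_index, rid_pos, relation):
    """Augment each join-index entry with the attribute of one relation.
    join_index: list of RID tuples; rid_pos: which slot holds this
    relation's RID; relation: attribute values indexed by RID."""
    order = sorted(range(len(join_index)),
                   key=lambda i: join_index[i][rid_pos])
    fetched = [None] * len(join_index)
    for i in order:  # RIDs visited in sorted (row) order
        rid = join_index[i][rid_pos]
        fetched[i] = join_index[i] + (relation[rid],)
    return fetched

jix = [(2, 0), (0, 1)]          # (rid_R1, rid_R2) pairs from the join kernel
r1 = ["x", "y", "z"]            # R1's projected attribute, indexed by RID
print(fetch_pass(jix, 0, r1))   # [(2, 0, 'z'), (0, 1, 'x')]
```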
Why Fetch in Row Order?
Sorting ensures that once a minipage from a relation has been accessed, it will not need to be accessed again, placing minimal demands on the buffer pool.
However, sorting does not ensure sequential access to the underlying relation, because the pages corresponding to the sorted RIDs can be far apart. Hence this optimization pays off better on SSDs than on HDDs.
Conclusion
SSDs constitute a significant shift in hardware characteristics, comparable to large CPU caches and many-core processors.
SSDs can improve performance for read-mostly applications.
A column-based page layout is shown to be a natural choice for speeding up selections and projections on SSDs.
References
Dimitris Tsirogiannis, Stavros Harizopoulos, Mehul A. Shah, Janet L. Wiener, and Goetz Graefe. Query Processing Techniques for Solid State Drives. In SIGMOD, 2009.