Date post: 16-Mar-2020
Query processing on raw files Vítor Uwe Reus
Page 1: files Query processing on raw - - TU Kaiserslautern · Invisible loading: access-driven data transfer from raw files into database systems. Proceedings of the 16th International Conference

Query processing on raw files

Vítor Uwe Reus


Outline

1. Introduction

2. Adaptive Indexing

3. Hybrid MapReduce

4. NoDB

5. Summary


Raw Files

Information Storing

Sometimes human-readable, open format

Not physically optimized for querying

Might be useful in some cases


Big Data

Traditional DBMS may not be a good option

Internet-scale business

Scientific data


The fourth paradigm

For scientific discovery

Experimental

Theoretical

Computational (simulations)

Data driven


Interoperability


Interoperability

Information interoperability

Application interoperability


Human-sourced Information


How to query raw files?


State of the art

Raw file as storage

A-priori loading


Raw file parsing

AWK

Oracle external table

MySQL CSV engine

MapReduce

Reads the entire data set every time

No indexing features
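A minimal sketch of why plain raw file parsing is costly: every query re-reads and re-splits the whole file, and nothing is remembered between queries. The sample data, the semicolon delimiter, and the helper name `scan_query` are assumptions made for illustration only.

```python
import io

RAW = "Year;Make;Model;Liters\n1997;BMW;E89;2,34\n2011;Mercedes;SLS;2\n"

def scan_query(raw_text, column, predicate):
    """Answer a selection by scanning every line: no index, no state kept."""
    lines_read = 0
    matches = []
    reader = io.StringIO(raw_text)
    header = reader.readline().rstrip("\n").split(";")
    col = header.index(column)
    for line in reader:
        lines_read += 1
        fields = line.rstrip("\n").split(";")
        if predicate(fields[col]):
            matches.append(fields)
    return matches, lines_read

# Two queries over the same file each pay for a full scan.
first, cost1 = scan_query(RAW, "Year", lambda y: int(y) > 2000)
second, cost2 = scan_query(RAW, "Make", lambda m: m == "BMW")
```

Both calls touch every data line, regardless of how selective the predicate is.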


A-priori loading

Load into a DBMS and then query

Benefit from indexes

Time

Labor intensive

Loading scripts, schemas

Data duplication

Big data

Versioning
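For contrast, a-priori loading could look roughly like the sketch below: the schema, the loading script, and the index are all paid for before the first query runs, and the data now exists twice. SQLite stands in for "a DBMS" here; the table name `cars` and the function `load_all` are hypothetical.

```python
import sqlite3

RAW = "Year;Make;Model;Liters\n1997;BMW;E89;2,34\n2011;Mercedes;SLS;2\n"

def load_all(raw_text):
    """Pay the full loading cost up front: schema, parse, insert, index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars (year INTEGER, make TEXT, model TEXT, liters TEXT)"
    )
    rows = [line.split(";") for line in raw_text.splitlines()[1:]]
    conn.executemany("INSERT INTO cars VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX cars_year ON cars (year)")
    conn.commit()
    return conn

conn = load_all(RAW)  # the raw text and the table now hold the same data
result = conn.execute("SELECT make FROM cars WHERE year > 2000").fetchall()
```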


Workload behavior


Load time vs Query time


Hybrid querying techniques


Outline

1. Introduction

2. Adaptive Indexing

3. Hybrid MapReduce

4. NoDB

5. Summary


Adaptive indexing

Automatic tuning based on workload

Keep an auxiliary structure

Can benefit raw file parsing

Database Cracking

Adaptive Merging


Database cracking

Physical reorganization of columns

Implemented on MonetDB

A column store, but can be generalized (raw)


Database Cracking

Cracking a column


Database Cracking

Column A → Copy to cracker column A_CRK

AVL tree indexing

Refinement
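The three steps above (copy, index, refine) can be sketched as follows. This is an illustration under assumptions, not the MonetDB implementation: a sorted Python list searched with `bisect` stands in for the AVL tree, partitioning builds new lists instead of swapping in place, and the class name `CrackerColumn` is hypothetical.

```python
import bisect

class CrackerColumn:
    def __init__(self, column):
        self.data = list(column)  # copy of column A; the original stays untouched
        self.pivots = []          # sorted pivot values (stand-in for the AVL tree)
        self.starts = []          # starts[i]: first position holding values >= pivots[i]

    def crack(self, pivot):
        """Partition the piece containing `pivot`; return the first position >= pivot."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.starts[i]  # this boundary was already created by an earlier query
        lo = self.starts[i - 1] if i > 0 else 0
        hi = self.starts[i] if i < len(self.starts) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        pos = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.starts.insert(i, pos)
        return pos

    def range_query(self, low, high):
        """Return all values v with low <= v < high, refining the column as a side effect."""
        start = self.crack(low)
        end = self.crack(high)
        return self.data[start:end]

column = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]
crk = CrackerColumn(column)
first = crk.range_query(10, 14)   # cracks the whole column into three pieces
second = crk.range_query(7, 16)   # only refines the two outer pieces
```

Each query leaves the column a little more ordered, so later queries over nearby ranges touch smaller pieces.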


Tuple reconstruction

Fast if columns are in same order

Cracking compromises original positions

Cracker columns: Value selection

Original columns: Tuple reconstruction
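One way to picture the split of roles, as a sketch only: the cracker column carries (value, row id) pairs and is reordered freely, while the original columns keep their order so a row id still addresses the whole tuple. The column contents and the function name are made up for this example.

```python
# Original columns stay in insertion order, so a row id indexes all of them.
col_a = [13, 16, 4, 9, 2]
col_b = ["m", "p", "d", "i", "b"]

# The cracker column carries (value, row id) pairs and may be freely reordered.
cracker_a = [(v, rid) for rid, v in enumerate(col_a)]

def select_b_where_a_below(pivot):
    """Select on the cracker column, then fetch B through the original positions."""
    smaller = [p for p in cracker_a if p[0] < pivot]
    larger = [p for p in cracker_a if p[0] >= pivot]
    cracker_a[:] = smaller + larger  # physical reorganization (one crack)
    return [col_b[rid] for _, rid in smaller]

picked = select_b_where_a_below(10)
```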


Adaptive merging

Incremental index creation as in cracking

Partitioned B-trees

Focus on merging instead of partitioning
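A rough sketch of the merging idea, with sorted Python lists standing in for the partitions of a partitioned B-tree (an assumption for brevity; `AdaptiveMerger` and `run_size` are hypothetical names). Loading only builds small sorted runs; each query then moves the keys it asks for out of the runs and into one fully merged partition.

```python
import bisect
from heapq import merge

class AdaptiveMerger:
    def __init__(self, keys, run_size):
        # Cheap initial step: cut the input into small sorted runs.
        self.runs = [sorted(keys[i:i + run_size])
                     for i in range(0, len(keys), run_size)]
        self.final = []          # fully merged, sorted keys
        self.merged_ranges = []  # (low, high) ranges already moved into `final`

    def range_query(self, low, high):
        """Return sorted keys k with low <= k < high, merging them out of the runs."""
        if not any(l <= low and high <= h for l, h in self.merged_ranges):
            pieces = []
            for run in self.runs:
                a = bisect.bisect_left(run, low)
                b = bisect.bisect_left(run, high)
                pieces.append(run[a:b])
                del run[a:b]     # the keys leave the run for good
            self.final = list(merge(self.final, *pieces))
            self.merged_ranges.append((low, high))
        a = bisect.bisect_left(self.final, low)
        b = bisect.bisect_left(self.final, high)
        return self.final[a:b]

keys = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]
index = AdaptiveMerger(keys, run_size=4)
hot = index.range_query(5, 12)    # first access pays for the merge
again = index.range_query(5, 12)  # repeat access reads the merged partition only
```

Unlike cracking, the queried key range is fully sorted after a single query.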


Merging vs cracking

Typical result of merging compared to cracking

* In this case, all queries focus on the same 106 keys in the center of the domain


Merging vs cracking

            Cracking      Merging
Converge    Stable        Faster
Storage     AVL           B-Tree
Data is     Partitioned   ...and Sorted
as in..     Quick Sort    Merge Sort


Outline

1. Introduction

2. Adaptive Indexing

3. Hybrid MapReduce

4. NoDB

5. Summary


Hybrid MapReduce

What is needed


HadoopDB

MapReduce using a DBMS instead of HDFS


SMS Planner

SQL → MapReduce → SQL


Hive query processor

1. Convert HiveQL query to AST

2. Get schema from catalog

3. Create a Query Plan

4. Optimize

5. Convert plan to one or more MR jobs


SMS Planner

1. Convert HiveQL query to AST

Update Catalog with DB information

2. Get schema from catalog

3. Create a Query Plan

4. Optimize

Reconstruct some SQL to push it to the DB

5. Convert plan to one or more MR jobs
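The effect of pushing reconstructed SQL into the node databases can be sketched as below. This is a toy illustration, not HadoopDB code: two in-memory SQLite databases play the worker nodes, the table `visits` and the query are invented, and plain Python functions stand in for the map and reduce phases.

```python
import sqlite3
from collections import Counter

def make_node(rows):
    """One worker node: a local single-node DBMS holding a partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn

nodes = [
    make_node([("a.html", 3), ("b.html", 1)]),
    make_node([("a.html", 2), ("c.html", 5)]),
]

# SQL fragment that the planner pushes into every node's database.
PUSHED_SQL = "SELECT url, SUM(hits) FROM visits GROUP BY url"

def map_phase(node):
    """Each map task runs the pushed SQL locally and emits partial aggregates."""
    return node.execute(PUSHED_SQL).fetchall()

def reduce_phase(partials):
    """The reduce side only has to combine the per-node partial sums."""
    totals = Counter()
    for rows in partials:
        for url, hits in rows:
            totals[url] += hits
    return dict(totals)

totals = reduce_phase(map_phase(n) for n in nodes)
```

The more of the plan the local databases can execute, the less data the MapReduce layer has to move.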


SMS Planner


HadoopDB Performance

Group By: 2,500,000 unique groups over 20 GB of data

Join: 134,000 joined records over 20 GB of data


HadoopDB loading times


HadoopDB

☺ Good performance

☺ Scalable

☺ Fault tolerant

☺ Heterogeneous node compatible

☺ Make any DBMS a distributed system

☹ Data Loader: All a-priori loading problems


Invisible loading

Load DBMS with data from Hadoop at run-time

Invisibility objective

Minimal human effort

Minimal increase in response time

Use a DBMS as a cache for the raw data


Invisible loading

Use code for tuple parsing and extraction to invisibly load the parsed tuples into a DBMS

Read Write


Invisible loading

On next data access, it can be read from DBMS

Read


Invisible loading: Parser

Parser extends inputFormat

getAttribute(int index)

Code for tuple parsing and extraction

Map takes a Parser as input
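A simplified sketch of the parser idea, in Python rather than Hadoop's Java API: the map function only calls the parser, and the parser writes each parsed tuple into a DBMS as a side effect. The class layout, the table `loaded`, and the method name `get_attribute` (mirroring `getAttribute(int index)` from the slide) are assumptions; loading everything in one pass is also a simplification.

```python
import sqlite3

class Parser:
    """Hypothetical parser: extracts attributes and loads them into a DBMS on the side."""

    def __init__(self, db):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS loaded "
            "(line_no INTEGER PRIMARY KEY, a0 TEXT, a1 TEXT)"
        )
        self.fields = []

    def parse(self, line_no, line):
        self.fields = line.split(";")
        # Side effect of normal parsing: the tuple is also written to the database.
        self.db.execute(
            "INSERT OR IGNORE INTO loaded VALUES (?, ?, ?)",
            (line_no, self.fields[0], self.fields[1]),
        )

    def get_attribute(self, index):
        return self.fields[index]

def map_job(lines, parser):
    """A map function that only sees the parser, never the loading logic."""
    out = []
    for line_no, line in enumerate(lines):
        parser.parse(line_no, line)
        out.append(parser.get_attribute(1))
    return out

db = sqlite3.connect(":memory:")
first_run = map_job(["1997;BMW", "2011;Mercedes"], Parser(db))
# A later access can be answered from the DBMS instead of the raw lines.
second_run = [row[0] for row in db.execute("SELECT a1 FROM loaded ORDER BY line_no")]
```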


Invisible loading

☺ Incremental data reorganization

☺ Almost no overhead on MR Jobs

☺ Optimizes future access speeds

☹ Data duplication (No GC)


Outline

1. Introduction

2. Adaptive Indexing

3. Hybrid MapReduce

4. NoDB

5. Summary


NoDB

New DBMS paradigm

Does not require data loading

Maintains feature set of modern DBMS

Replaces physical storage with raw files


PostgresRaw

NoDB Implementation

Replaces TableScan Operator

CSV Files

Optimizations


PostgresRaw Optimizations

Selective...

a. Tokenizing

b. Parsing

c. Tuple formation

Indexing

Auto Tuning

Caching

Statistics


a. Selective tokenizing

111;222;"third";garbage;...

Supposing we want attributes 1 and 3

We can stop tokenizing at the third

Saves CPU time
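A small sketch of selective tokenizing for the example line above. It assumes a plain semicolon delimiter with no quoted separators; `selective_tokenize` is a hypothetical helper, not PostgresRaw code.

```python
def selective_tokenize(line, wanted, sep=";"):
    """Return ({attribute number: raw text}, characters scanned) for the wanted
    1-based attributes, scanning no further than the last wanted attribute."""
    last = max(wanted)
    tokens = {}
    start = 0
    for attr in range(1, last + 1):
        end = line.find(sep, start)
        if end == -1:
            end = len(line)
        if attr in wanted:
            tokens[attr] = line[start:end]
        start = end + 1
    return tokens, min(start, len(line))

line = '111;222;"third";garbage;more'
tokens, scanned = selective_tokenize(line, {1, 3})
```

Everything after the third delimiter is never inspected, which is where the CPU saving comes from.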


b. Selective parsing

111;222;"third";garbage;...

In memory:

111       6F               Parsed to int
222       32 32 32         Keep as string
"third"   74 68 69 72 64

Also: delayed parsing
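Selective parsing can be sketched as converting only the attributes a query needs in binary form and leaving the rest as raw text. The helper name and the `converters` mapping are assumptions for illustration.

```python
def selective_parse(tokens, converters):
    """Convert only attributes that have a converter; leave the rest as raw text."""
    return {
        attr: converters[attr](text) if attr in converters else text
        for attr, text in tokens.items()
    }

tokens = {1: "111", 2: "222", 3: '"third"'}
parsed = selective_parse(tokens, {1: int})  # the query needs attribute 1 as a number
```

Here attribute 1 becomes the integer 111 (hex 6F), while attribute 2 stays as the three characters "222" (hex 32 32 32).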


c. Selective tuple formation

111;222;"third";garbage;...

(111, "third")

Final tuple containing only attributes 1 and 3

CPU bound
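As a sketch, tuple formation for the example above only materializes the requested attributes. `form_tuple` and its parameters are hypothetical; stripping the quotes from the third attribute is an assumption about the target type.

```python
def form_tuple(line, wanted, converters, sep=";"):
    """Build a tuple holding only the wanted 1-based attributes, in the given order."""
    fields = line.split(sep, max(wanted))  # stop splitting after the last wanted attribute
    return tuple(converters.get(a, str)(fields[a - 1]) for a in wanted)

row = form_tuple(
    '111;222;"third";garbage;...',
    wanted=(1, 3),
    converters={1: int, 3: lambda s: s.strip('"')},
)
```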


Indexing

Year; Make; Model; Liters

1997; BMW; E89; 2,34

2011; Mercedes; SLS; 2

Looks nice :)


Indexing

NOT :(

Year;Make;Model;Liters¶1997;BMW;E89;2,34¶2011;Mercedes;SLS;2

Sequentially reading each time is not an option

Solution

Keep an index of the already used attributes

Skip file reading to these positions


Indexing

Positional Map

Dynamically created according to queries

Year;Make;Model;Liters¶1997;BMW;E89;2,34¶2011;Mercedes;SLS;2

Tuple 1                     Tuple 2                     Tuple 3
Attribute 1   Attribute 3   Attribute 1   Attribute 3   Attribute 1   Attribute 3
0             10            23            32            41            55
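The offsets on this slide can be reproduced with a short sketch. It treats the header line as tuple 1, as the slide does, and uses a single newline character where the slide shows ¶; the function names are hypothetical.

```python
RAW = "Year;Make;Model;Liters\n1997;BMW;E89;2,34\n2011;Mercedes;SLS;2"

def build_positional_map(raw, wanted, sep=";"):
    """Record the character offset where each wanted attribute starts, per tuple."""
    pmap = []
    offset = 0
    for line in raw.split("\n"):
        starts = [offset]
        for i, ch in enumerate(line):
            if ch == sep:
                starts.append(offset + i + 1)
        pmap.append({a: starts[a - 1] for a in wanted})
        offset += len(line) + 1  # +1 for the newline
    return pmap

def read_attribute(raw, pmap, tuple_no, attr, sep=";"):
    """Jump straight to a recorded offset instead of re-tokenizing the file."""
    start = pmap[tuple_no][attr]
    end = start
    while end < len(raw) and raw[end] not in (sep, "\n"):
        end += 1
    return raw[start:end]

pmap = build_positional_map(RAW, wanted=(1, 3))
model = read_attribute(RAW, pmap, tuple_no=2, attr=3)
```

Only attributes that queries have actually asked for get an entry, so the map grows with the workload rather than with the schema.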


Updates

First case, no positions change

Year;Make;Model;Liters¶1989;BBB;CCC;4,44¶2011;Mercedes;SLS;2

Tuple 1                     Tuple 2                     Tuple 3
Attribute 1   Attribute 3   Attribute 1   Attribute 3   Attribute 1   Attribute 3
0             10            23            32            41            55


Updates

Second case, positions change. First option, update index.

Year;Make;Model;Liters¶1989;B;C;4,44¶2011;Mercedes;SLS;2

Tuple 1                     Tuple 2                     Tuple 3
Attribute 1   Attribute 3   Attribute 1   Attribute 3   Attribute 1   Attribute 3
0             10            23            32 → 30 (-2)  41 → 37 (-4)  55 → 51 (-4)
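The offset arithmetic on this slide can be checked with a small sketch: every recorded offset that lies after an edit moves by the change in length. `shift_offsets` is a hypothetical helper and the flat list is a simplification of the positional map.

```python
def shift_offsets(offsets, edit_pos, delta):
    """Shift every recorded offset that lies after an in-place edit by `delta` characters."""
    return [o + delta if o > edit_pos else o for o in offsets]

offsets = [0, 10, 23, 32, 41, 55]         # positional map before the update
offsets = shift_offsets(offsets, 28, -2)  # second attribute (offset 28) shrinks by 2
offsets = shift_offsets(offsets, 30, -2)  # third attribute (now at 30) shrinks by 2
```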


Updates

Second case, positions change. Second option, throw it partially (or fully) away.

Year;Make;Model;Liters¶1989;B;C;4,44¶2011;Mercedes;SLS;2

Index will automatically reconstruct itself

Tuple 1                     Tuple 2
Attribute 1   Attribute 3   Attribute 1
0             10            23
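The partial discard can be sketched in the same simplified flat-list form: entries at or before the edit position are still valid, and everything after it is dropped and rebuilt lazily by later queries. `discard_after` is a hypothetical helper.

```python
def discard_after(offsets, edit_pos):
    """Keep only offsets at or before the edit; later entries are rebuilt by future queries."""
    return [o for o in offsets if o <= edit_pos]

offsets = discard_after([0, 10, 23, 32, 41, 55], edit_pos=28)
```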


Traditional optimizations

Caching

Statistics


NoDB Performance Compared


NoDB

☺ Great DBMS + Raw hybrid

☺ Competitive performance with traditional DBs

☺ Eliminates loading times

☺ Queries get faster with time

☹ Updates


Outline

1. Introduction

2. Adaptive Indexing

3. Hybrid MapReduce

4. NoDB

5. Summary


Summary

Mature solutions: high load or query time

No index → High query time

Load all data → High delay (load time)

Hybrid solutions

Bring indexes to in-situ processing

Adaptive indexing

HadoopDB

NoDB


Remember..


Conclusions


References

1. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment,

2. Azza Abouzied, Daniel J. Abadi, and Avi Silberschatz. Invisible loading: access-driven data transfer from raw files into database systems. Proceedings of the 16th International Conference on Extending Database Technology, pages 1–10, 2013.

3. Renata Borovica, Stratos Idreos, and Anastasia Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. pages 241–252.

4. Goetz Graefe and Harumi Kuno. Adaptive indexing for relational keys. 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pages 69–74, 2010.

5. Felix Halim, S Idreos, P Karras, and RHC Yap. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment (PVLDB),

6. Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.

7. Stratos Idreos, Ioannis Alagiannis, Ryan Johnson, and Anastasia Ailamaki. Here are my Data Files. Here are my Queries. Where are my Results? Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, pages 57–68, 2011.

8. Christopher Olston, Benjamin Reed, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing.

9. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB


Questions?


Thank you!


MapReduce

Can be classified as distributed raw file parsing


Adaptive merging


Database Cracking

