Date post: | 09-Jul-2015 |
Category: |
Data & Analytics |
Upload: | manos-karpathiotakis |
View: | 97 times |
Download: | 0 times |
Adaptive Query Processing on RAW Data
Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, Anastasia Ailamaki
Many “data problems” lack solutions
• Petabytes of data
• No a priori knowledge about data
• Multiple file formats
2
Many “data problems” lack solutions
• Petabytes of data
• No a priori knowledge about data
• Multiple file formats
3Need declarative, interactive access to arbitrary datasets
CSV JSONXML
DBMSTr
ansf
orm
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
4
Traditional DBMS: Data adapts to query engine
CSV JSONXML
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
5
Traditional DBMS: Data adapts to query engine
CSV JSONXML
QueryRAW: A query engine adapting to data
Raw files as native storage format
RAW: Generate plug-in per file
6
High-performance querying… while keeping data formats, files, and scripts
Traditional DBMS: Data adapts to query engine
How RAW adapts to data
CSVROOT
Vectorized Query Execution
CSVROOT
join
scanplaceholder
scanplaceholder
filter
How RAW adapts to data
scanplaceh.
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND …
8
Code Generate the Access Paths
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
9
Code Generate the Access Paths
Build Position and Data Caches
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
10
Code Generate the Access Paths
Build Position and Data Caches
Vectorized Query Execution
CSVROOT
join
filter
How RAW adapts to data
scancsv
scanroot
SELECT event.jet…FROM csv, rootWHERE csv.RunNumber = root.RunNumberAND root. EF_2mu13 == TRUE AND … scan
csv
11
Adapt to format, file instance and queryjust-in-time
Adapting to schema & query
[CSV input]
∀col:
if col needed:
if col isInt
readInt();
if col isFloat
readFloat();
if ...
else:
skipField();
GENERAL-PURPOSE
readInt();
readInt();
skipField();
readFloat();
skipRestLine();
JUST-IN-TIME
Remove overhead of generic operators
12
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
13
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
14
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
- fieldLength:10- tupleLength:100- Need fields 2 & 5
of 2nd row
moveTo(110);readInt();moveTo(140);readFloat();
15
Adapting to format
• Unroll Columns
• Free navigation in files
• Embedded indexes/existing APIs
∀col:if col needed:
if col isInt...
readInt();skipField();readFloat();skipRest();
- fieldLength:10- tupleLength:100- Need fields 2 & 5
of 2nd row
moveTo(110);readInt();moveTo(140);readFloat();
- Bitmaps, R-Trees etc.- readNextField() vs. readField(filename,id)
16
Ad-Hoc Operators for Raw Data Fine-grained, raw-data-aware decisions
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Fine-grained, raw-data-aware decisions
18
JIT – OPTION 2
Col1
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
19
JIT – OPTION 2
Filter
Col1
Col1
543
12
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
20
JIT – OPTION 2
Filter
Tuple Construction
Col1 Col9
Col1
Col9
Col1
543
12
Ad-Hoc Operators for Raw Data
Col9
Scan CSVColumns 1,9
Filter
Tuple Construction
Col1
Col1
JIT – OPTION 1
Col1 Col9
CSV file:SELECT col9 WHERE col1 < [X]
Scan CSVColumn 1
Fine-grained, raw-data-aware decisions
Scan CSVColumn 9
21
Processes 3/5 raw col9 fields
Electron
eventID INT
eta FLOAT
pt FLOAT
Jet
eventID INT
eta FLOAT
pt FLOATEvent
eventID INT
runNumber INT
Muon
eventID INT
eta FLOAT
pt FLOAT
ROOT - C++ RAWclass Event {
class Muon {float pt, eta;…
}class Electron {
float pt, eta;…
}class Jet {
float pt, eta;…
}int runNumber;vector<Muon> muons;vector<Electron> electrons;vector<Jet> jets; }
Finding the Higgs Boson: Data
22
Finding the Higgs Boson: Queries“Identify events of interest → Filter out background events
→ Plot aggregated results in a histogram”
SELECT eventFROM root:/data1/ATLAS/*.root ,
csv:/data1/ATLAS/events.csvWHERE( csv.id = event.id AND
event.EF_e24vhi_medium1 OR event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13) ANDevent.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt ANDevent.muon.mu_pt > 20000. ANDABS(event.muon.mu_eta) < 2.4 AND
…..
1000+ lines of C++
for (unsigned int imuon = 0 ; imuon<((*curr_entries)[jentry].mu_pt)->size(); imuon++) {if (((*curr_entries)[jentry].
mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) && ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&…
}...
ROOT - C++ RAW
23
0.1
1
10
100
1000
10000
Query 1(Cold)
Query 2 Query 3 Query 4 Query 5 Query 6
Exe
cuti
on
Tim
e (
sec)
RAW ROOT
RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]
ROOT: 900 GB in 127 files
CSV: 1 “table” of IDs
24
0.1
1
10
100
1000
10000
Query 1(Cold)
Query 2 Query 3 Query 4 Query 5 Query 6
Exe
cuti
on
Tim
e (
sec)
RAW ROOT
RAW vs. the ROOT framework[Xeon CPU E7-28867 @ 2.13GHz1TB HDD - 7200RPM,192GB RAM]
ROOT: 900 GB in 127 files
CSV: 1 “table” of IDs
25Declarative access and up to 90x improvement
Accessing raw data is DBMS business
• Adapt to data and queries
• CERN: 90x
• Application in Banking / Healthcare / Astronomy
http://dias.epfl.ch/RAW
26
Image Sources
http://www.nasa.gov
http://home.web.cern.ch/
http://www.alacergroup.com/
http://www.humanbrainproject.eu
28