Posted on 13-Feb-2016
Ohio State University Department of Computer Science and Engineering
Automatic Data Virtualization: Supporting XML-based Abstractions on HDF5 Datasets

Swarup Kumar Sahoo
Gagan Agrawal
Roadmap
• Motivation
• Introduction
• System Overview
• XQuery, Low- and High-Level Schemas, and HDF5 Storage
• Compiler Analysis and Algorithm
• Experiments
• Summary and Future Work
Motivation
• Emergence of grid-based data repositories
  – Can enable sharing of data
• Emergence of applications that process large datasets
  – Complicated by complex and specialized storage formats
• Need for easily portable applications
  – Compatibility with web/grid services
Data Virtualization

As defined by the Global Grid Forum's DAIS working group:
• A Data Virtualization describes an abstract view of data.
• A Data Service implements the mechanism to access and process data through the Data Virtualization.

[Figure: a Data Service exposes a Data Virtualization — an abstract view — over the underlying dataset]
Introduction: Automatic Data Virtualization
• Goal: enable automatic creation of efficient data services
  – Support a high-level or abstract view of data
  – Data is stored in a low-level format
• Application development: assumes the high-level or virtual view
• Application execution: on the actual low-level layout
Overview of Our Automatic Data Virtualization Work
• Previous work on XML-based virtualization
  – Techniques for XQuery compilation (Li and Agrawal, ICS 2003, DBPL 2003)
  – Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004)
• Relational Table/SQL-based implementation
  – Supporting SQL Select and Where (HPDC 2004)
  – Supporting SQL-3 Aggregations (LCPC 2004)
XML-based Virtualization

[Figure: XQuery queries run against an XML virtual view, which is mapped onto low-level storage formats such as TEXT, NetCDF, RDBMS, HDF5, …]
Challenges and Contributions
• Challenges
  – The compiler must generate efficient data-processing code, using information about the low-level layout and the mapping between the virtual and low-level layouts
  – Translating from the high-level to the low-level view while ensuring high locality in the processing of large datasets
• Contributions of this paper
  – An improved data-centric transformation algorithm
  – An implementation specific to HDF5 as the low-level format
System Overview

[Figure: the high-level XML schema, low-level XML schema, and mapping schema, together with the XQuery source code, are fed to the compiler; the generated code runs through the HDF5 library on the processor and disk]
XQuery and HDF5
• High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages
• HDF5
  – Hierarchical Data Format, widely used in scientific communities
  – A case study with a format that has optimized access libraries
Use of XML Schemas
• High-level schema: XML is used to provide a virtual view of the dataset
• Low-level schema: reflects the actual physical layout in HDF5
• Mapping schema: describes the mapping between each element of the high-level schema and the low-level schema
Oil Reservoir Simulation
• Supports cost-effective oil production
• Simulations on a 3-D grid
• 17 variables and cell locations in the 3-D grid at each time step
• Computation of bypassed regions
  – An expression determines whether a cell is bypassed for a time step
  – Within a spatial region and a range of time steps
  – Report grid cells that are bypassed for every time step in the range

[Figure: oil reservoir management]
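The bypassed-region computation above can be sketched in plain Python. The slide does not give the bypass expression itself, so the velocity-threshold test below is a hypothetical stand-in, and `bypassed_cells` is an illustrative name rather than anything from the paper:

```python
def bypassed_cells(velocity, threshold=0.1):
    """Return the set of (x, y, z) cells whose velocity stays below
    `threshold` (a hypothetical bypass predicate) at *every* time step.

    velocity: dict mapping (t, x, y, z) -> float.
    """
    cells = {(x, y, z) for (_, x, y, z) in velocity}
    times = {t for (t, _, _, _) in velocity}
    # A cell is reported only if it is bypassed for every step in the range.
    return {c for c in cells
            if all(velocity[(t,) + c] < threshold for t in times)}

# Tiny example: 2 time steps on a 2x1x1 grid.
v = {(0, 0, 0, 0): 0.05, (0, 1, 0, 0): 0.5,
     (1, 0, 0, 0): 0.03, (1, 1, 0, 0): 0.6}
mask = bypassed_cells(v)  # only cell (0, 0, 0) stays below 0.1 at both steps
```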
High-Level Schema

<xs:element name="data" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="z" type="xs:integer"/>
      <xs:element name="time" type="xs:integer"/>
      <xs:element name="velocity" type="xs:float"/>
      <xs:element name="mom" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
High-Level XQuery Code for Oil Reservoir Management

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
  where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$i, $j, $k} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)
Low-Level Schema

<file name="info">
  <sequence>
    <group name="data">
      <attribute name="time">
        <datatype> integer </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [1] </dimension>
        </dataspace>
      </attribute>
      <dataset name="velocity">
        <datatype> float </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [x] </dimension>
        </dataspace>
      </dataset>
      ..............
    </group>
  </sequence>
</file>
Mapping Schema

//high/data/velocity → //low/info/data/velocity
//high/data/time     → //low/info/data/time
//high/data/mom      → //low/info/data/mom [index(//low/info/data/velocity, 1)]
//high/data/x        → //low/coord/x [index(//low/info/data/velocity, 1)]
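A mapping schema of this shape can be modeled as a simple lookup table. The paths and `index(...)` annotations below mirror the slide, but the table layout and the `resolve` helper are an illustrative sketch, not the paper's implementation:

```python
# Each high-level path maps to a low-level path plus an optional index
# annotation recording which low-level dataset's index it is aligned with.
MAPPING = {
    "//high/data/velocity": ("//low/info/data/velocity", None),
    "//high/data/time":     ("//low/info/data/time", None),
    "//high/data/mom":      ("//low/info/data/mom",
                             ("//low/info/data/velocity", 1)),
    "//high/data/x":        ("//low/coord/x",
                             ("//low/info/data/velocity", 1)),
}

def resolve(high_path):
    """Translate a high-level path into its low-level location."""
    return MAPPING[high_path]

low, idx = resolve("//high/data/x")
```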
Compiler Analysis
• Problem with direct translation
  – Each let expression involves a complete scan over the dataset
  – So the final code would need several passes over the data
• Solution
  – Apply data-centric transformations so that each portion of the HDF5 dataset is read only once
Naïve Strategy

[Figure: each query scans the dataset separately to produce the output — requires 3 scans]
Data-Centric Strategy

[Figure: all queries are evaluated during a single pass over the datasets — requires just one scan]
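The contrast between the two strategies can be illustrated with a scan counter. The three predicate queries below stand in for three let expressions over the same dataset; this is a sketch of the idea, not the compiler's generated code:

```python
def naive(dataset, queries):
    """Naive translation: each query scans the whole dataset independently."""
    scans, results = 0, []
    for q in queries:
        scans += 1                      # one full pass per query
        results.append([x for x in dataset if q(x)])
    return results, scans

def data_centric(dataset, queries):
    """Data-centric translation: read each element once, feed every query."""
    results = [[] for _ in queries]
    for x in dataset:                   # single pass over the data
        for i, q in enumerate(queries):
            if q(x):
                results[i].append(x)
    return results, 1

data = list(range(10))
qs = [lambda x: x % 2 == 0, lambda x: x > 5, lambda x: x < 3]
r1, s1 = naive(data, qs)
r2, s2 = data_centric(data, qs)  # same answers, one scan instead of three
```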
Data-Centric Transformation
• Overall idea
  – Iterate over each data element in actual storage
  – Find the iterations of the original loop in which it is accessed
  – Execute the computation corresponding to those iterations
• Previous work
  – Pingali et al.: blocking
  – Ferreira and Agrawal: data-parallel Java on disk-resident datasets
  – Li and Agrawal: XQuery, inverting getData functions
• Our contribution
  – Use the low-level and mapping schemas
  – Extend the idea to the case where multiple datasets must be accessed
Data-Centric Transformation
• Mapping function T : iteration space → high-level data
• Mapping function C : high-level data → low-level data
• Their composition M = C · T : iteration space → low-level data
• Our goal is to compute M⁻¹
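With small finite maps, the composition M = C · T and its inverse can be spelled out directly. The three-element iteration space below is purely illustrative:

```python
# T: iteration space -> high-level data; C: high-level -> low-level data.
T = {0: "h0", 1: "h1", 2: "h2"}
C = {"h0": "l0", "h1": "l1", "h2": "l2"}

# M = C . T : iteration space -> low-level data.
M = {i: C[T[i]] for i in T}

# The compiler's goal, M^-1, maps a low-level element back to the
# iterations that touch it, so code can iterate in storage order.
M_inv = {}
for i, low in M.items():
    M_inv.setdefault(low, []).append(i)
```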
Data-Centric Transformation
• Choose one of the n datasets to be accessed as the base dataset S1
• Apply M1⁻¹ to compute the set of iterations
• The expression Mi · M1⁻¹ gives the portion of dataset Si that needs to be accessed along with S1
• The choice of base dataset can affect data locality
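The role of Mi · M1⁻¹ can be sketched the same way: given the chunk of the base dataset S1 currently being processed, it yields the matching portion of every other dataset. The finite maps below are hypothetical:

```python
# M1: iterations -> elements of the base dataset S1.
# M2: iterations -> elements of a second dataset S2.
M1 = {0: "a0", 1: "a1", 2: "a2", 3: "a3"}
M2 = {0: "b0", 1: "b0", 2: "b1", 3: "b1"}

M1_inv = {v: k for k, v in M1.items()}

def portion_of_S2(s1_chunk):
    """Elements of S2 needed alongside a chunk of S1,
    i.e. M2 applied to M1^-1 of the chunk."""
    return {M2[M1_inv[e]] for e in s1_chunk}

need = portion_of_S2(["a0", "a1"])  # both map to the same S2 element
```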
Choice of Base Dataset
• Min-IO-Volume strategy: minimize repeated access to any dataset
• Min-Seek-Time strategy: minimize discontinuity in access
Template for Generated Code

Generated_Query {
  Create an abstract iteration space from the source code.
  Allocate and initialize an array of output elements corresponding to
    the iteration space.
  For k = 1, …, NO_OF_CHUNKS {
    Read the kth chunk of dataset S1 using HDF5 functions and the structural tree.
    Foreach of the other datasets S2, …, Sn
      access the required chunk of the dataset.
    Foreach data element in the chunks of data {
      Compute the iteration instance.
      Apply the reduction computation and update the output.
    }
  }
}
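The template can be mocked up in plain Python. Chunked reads through the HDF5 library are replaced here by list slicing, and the reduction is a simple grouped sum; all names are illustrative:

```python
def generated_query(s1, chunk_size, iteration_of, reduce_into):
    """Sketch of the generated-code template: process the base dataset S1
    chunk by chunk, map each element back to its iteration instance, and
    fold it into the output."""
    output = {}
    n_chunks = (len(s1) + chunk_size - 1) // chunk_size
    for k in range(n_chunks):
        chunk = s1[k * chunk_size:(k + 1) * chunk_size]  # "read kth chunk"
        for element in chunk:
            i = iteration_of(element)        # compute the iteration instance
            reduce_into(output, i, element)  # apply reduction, update output
    return output

# Example: sum elements grouped by the parity of their iteration index.
data = [10, 20, 30, 40, 50]
out = generated_query(
    data, chunk_size=2,
    iteration_of=lambda e: (e // 10) % 2,
    reduce_into=lambda o, i, e: o.__setitem__(i, o.get(i, 0) + e),
)
```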
Experiment: Impact of Strategy and Chunk Size, Dataset 1

[Figure: execution time (sec) vs. read chunk size (×1000 elements: 1, 5, 15, 31, 62, 125) for the Min-Seek-Time and Min-IO-Volume strategies]

200×200×200 grid with 10 time steps (1.28 GB); 50×50×50 storage chunk size
Experiment: Impact of Strategy and Chunk Size, Dataset 2

[Figure: execution time (sec) vs. read chunk size (×1000 elements: 1, 5, 15, 31, 62) for the Min-Seek-Time and Min-IO-Volume strategies]

50×50×50 grid with 200 time steps (400 MB); 25×25×25 storage chunk size
Key Observations
• Overall minimum execution time: the Min-IO-Volume strategy, when the read chunk size matches the storage chunk size
• Execution time is
  – Very sensitive to read chunk size in the Min-IO-Volume strategy
  – Not sensitive to read chunk size in the Min-Seek-Time strategy, due to buffering of storage chunks
Summary
• Compiler techniques
  – Support high-level abstractions on complex low-level data formats
  – Enable use of the same source code across a variety of data formats
  – Perform data-centric transformations automatically
  – Experimental results show that a minor change in strategy can affect performance significantly
• Future Work
  – Cost models to guide strategy and chunk-size selection
  – Compare performance with manual implementations
  – Parallelize data processing
  – Extend the applicability of the algorithm to a more general class of queries