1
Projection Indexes in HDF5
Rishi Rakesh Sinha
The HDF Group
2
Science Produces Large Datasets
Observation/experiment driven
Simulation driven
Information driven
144 MB/hr
200 GB/run
> 7GB/expt
3
Why Not Commercial DBMSs?
Proprietary format
Lack of portability
Low scalability
Lack of desirable access modes
Presence of expensive concurrency control and logging mechanisms
Expensive parallel versions
4
State of the Art Not Enough
Scientific file formats and associated I/O APIs
  Concentrating on HDF5
Data recovery is navigational
Subsetting only on a small set of attributes
5
Why Indexes?
Easy
Not So Easy
6
Previous Indexing Efforts
Implicit indexing in HDF5
JPL use of HDF Vdatas
HDF-EOS point data
PyTables
HDF5 internal B-Tree structures
7
Why a Standard Indexing API?
Avoid duplication of effort
  PyTables
Standardize indexing in HDF5
  Standard API can be implemented differently
Make indexes portable
  Store indexes in HDF5 files
8
H5IN API
Create_index
  Parameters: location of index, location of data, binning information, memory limits
  Returns: location of the index
Query
  Parameters: dataset to query, query string
  Returns: selection representing the subset of the data corresponding to the query
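The two calls above could be sketched as follows. This is a toy illustration, not the H5IN implementation: a Python dict stands in for the HDF5 file, and the function signatures, the `__indexes__` registry, the `lo < x < hi` query-string format, and the ignored `memory_limit` argument are all assumptions made for the example.

```python
from bisect import bisect_right


def create_index(store, index_path, data_path, bin_edges, memory_limit=None):
    """Build a binned index for store[data_path] and save it at index_path.

    memory_limit only mirrors the parameter list on the slide; this toy
    version ignores it.
    """
    data = store[data_path]
    # One bin id per element, kept in the same order as the data.
    bin_ids = [bisect_right(bin_edges, v) for v in data]
    store[index_path] = {"edges": list(bin_edges), "bins": bin_ids}
    # Hypothetical registry so query() can find the index for a dataset.
    store.setdefault("__indexes__", {})[data_path] = index_path
    return index_path


def query(store, data_path, query_string):
    """Return the positions matching a 'lo < x < hi' query (assumed format)."""
    lo_text, _, hi_text = query_string.split("<")
    lo, hi = float(lo_text), float(hi_text)
    index = store[store["__indexes__"][data_path]]
    lo_bin = bisect_right(index["edges"], lo)
    hi_bin = bisect_right(index["edges"], hi)
    data = store[data_path]
    selection = []
    for pos, b in enumerate(index["bins"]):
        if lo_bin < b < hi_bin:
            selection.append(pos)  # bin lies entirely inside (lo, hi)
        elif b in (lo_bin, hi_bin) and lo < data[pos] < hi:
            selection.append(pos)  # edge bin: check the raw value
    return selection


store = {"/temp": [52, 42, 57, 22, 67]}
index_location = create_index(store, "/temp_index", "/temp", [30, 50, 60])
selection = query(store, "/temp", "40 < x < 60")
```

Only elements that fall in the two edge bins need their raw values checked; every other bin is accepted or rejected from the index alone.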
9
Design Decisions
Limited scope of the prototype
Index stored in a separate dataset
Returns a selection
Projection index
Support for simple boolean queries
10
Limited Scope
1st indexing prototype in HDF5
  Presence of implicit indexing
Index on single datasets
Query over single datasets
  Conditions should be over a single dataset
  Result could be mapped to a separate dataset
11
Index Storage
Root Group: /
DAY1 DAY2 DAY3 DAY4
F3 F2 F1
Location Data  Pressure  Temperature
12
Index Storage
Root Group: /
DAY3
F3 F2 F1
Location Data
LD_INDEX
F1 F2
13
Index Storage
Root Group: /
DAY3
Pressure  Temperature
T_IN  P_IN
14
Returns a Selection
Temperature Pressure
Concise Storage
Efficient Boolean operations
FIND PRESSURE WHERE TEMP IN [100, 200]
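A minimal sketch of why a selection is a convenient return value, assuming a selection is simply a set of element positions. The data values, `select_range`, and `to_runs` are made up for illustration; the run encoding is one possible way to store a selection concisely, not necessarily the one used in the prototype.

```python
def select_range(values, low, high):
    """Positions whose value lies in the closed interval [low, high]."""
    return {i for i, v in enumerate(values) if low <= v <= high}


def to_runs(selection):
    """Compress a selection into (start, count) runs of consecutive positions."""
    runs = []
    for pos in sorted(selection):
        if runs and runs[-1][0] + runs[-1][1] == pos:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((pos, 1))  # start a new run
    return runs


# Illustrative values only; both datasets share the same element order.
temperature = [90, 120, 150, 180, 210, 130]
pressure = [30, 31, 29, 32, 28, 33]

# FIND PRESSURE WHERE TEMP IN [100, 200]
sel = select_range(temperature, 100, 200)
result = [pressure[i] for i in sorted(sel)]
runs = to_runs(sel)
```

The selection computed on Temperature is applied unchanged to Pressure, which is what lets a condition on one dataset pick elements out of another.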
15
Projection Index
Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27
A  D  F  A  F  D
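A projection index on a column is the sequence of that column's values stored in row order, so a condition on the column can be checked without reading the other columns. The sketch below uses the Temp/Category/Pressure rows from the table above; `build_projection_index` and the list-of-tuples layout are assumptions made for illustration.

```python
COLUMNS = ("Temp", "Category", "Pressure")
rows = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]


def build_projection_index(rows, column):
    """Copy one column out of the table, keeping the original row order."""
    col = COLUMNS.index(column)
    return [row[col] for row in rows]


category_index = build_projection_index(rows, "Category")

# Scan only the index to find the matching row positions ...
matches = [pos for pos, value in enumerate(category_index) if value == "D"]
# ... then fetch the wanted field from just those rows.
pressures = [rows[pos][COLUMNS.index("Pressure")] for pos in matches]
```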
16
Binning
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1-3 4-6 7-9 10-12 13-15
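The binning shown above, with the values 1 to 15 grouped into equal-width bins of three, can be reproduced with integer division. The helper names and the `first` and `width` defaults are assumptions chosen to match the example.

```python
def bin_of(value, first=1, width=3):
    """Zero-based bin number for a value, with bins starting at `first`."""
    return (value - first) // width


def bin_label(bin_number, first=1, width=3):
    """Human-readable 'low-high' label for a bin."""
    low = first + bin_number * width
    return f"{low}-{low + width - 1}"


bins = {}
for value in range(1, 16):
    bins.setdefault(bin_label(bin_of(value)), []).append(value)
```

Storing the bin number instead of the raw value makes each index entry smaller, at the cost of re-checking raw values for bins that only partly overlap a query range.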
17
Projection Index
Temp: 60 50 40
Pressure: 31 30 29
18
Why Projection Index?
Data is read only
  Mostly, a dataset once written is not changed
  Index does not need to be updated
Projection indexes are well suited
  Number of disk accesses is the same as in the case of a B-Tree
Multidimensional queries are not being considered
19
Only Simple Boolean Queries
Query Format
SELECT SELECTION
WHERE c11 < Attribute1 < c12
AND c21 < Attribute2 < c22
…
Because results are selections, boolean operations can be done inside the library
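A rough sketch of how such a query could be evaluated: each range condition yields a selection, and the AND is an intersection of selections. The parser, the function names, and the dict-of-lists data layout are assumptions for illustration; the attribute values reuse the Temp and Pressure columns from the projection index table.

```python
def parse_where(where):
    """Split 'lo < Name < hi AND ...' into (name, lo, hi) tuples."""
    conditions = []
    for clause in where.split(" AND "):
        low, name, high = (part.strip() for part in clause.split("<"))
        conditions.append((name, float(low), float(high)))
    return conditions


def evaluate(columns, where):
    """Return the sorted positions that satisfy every range condition."""
    selection = None
    for name, low, high in parse_where(where):
        matches = {i for i, v in enumerate(columns[name]) if low < v < high}
        # AND of two conditions is the intersection of their selections.
        selection = matches if selection is None else selection & matches
    return sorted(selection)


columns = {
    "Temp": [52, 42, 57, 22, 67],
    "Pressure": [32, 34, 21, 22, 27],
}
hits = evaluate(columns, "40 < Temp < 60 AND 30 < Pressure < 40")
```

This sketch assumes at least one well-formed condition and does no error handling.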
20
Conclusion
Developing a standard indexing API in HDF5
Creating a proof-of-concept prototype using projection indexes
Taking a first step towards developing a query language for HDF5
21
Future Work
Multi-dimensionality
Multiple datasets in same file
Multiple datasets across files
Indexes on attributes
Allow user to index subset of datasets