Towards Benchmarking Large Arrays in Databases
H. Stamerjohanns, P. Baumann
Computer Science, Jacobs University Bremen
WBDB12
H. Stamerjohanns, P. Baumann (Jacobs University Bremen)Benchmarking Large Arrays in Databases WBDB12 1 / 23
An Array DBMS: Rasdaman
Goal of rasdaman database:
  handle raster data
  massive n-dimensional
  Sensor-, Image-, Model & Statistics DB [1]
Tile-based architecture
  n-D array → set of n-D tiles
  adapting storage to access pattern (preserve locality of reference)
[Figure: timeline 1984-2012 of systems: Grid DataBlade, Rasdaman, TerraLib, PostGIS Raster, Oracle GeoRaster, ESRI ArcSDE, SciQL, SciDB, Paradise, PICDMS, SpatiaLite, Grid & Gridfield, AQuery, RAM, AML, AQL, EXTRA/EXCESS, OpenTSD, ExtaScid]
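The tiling idea above can be illustrated with a minimal Python sketch. It is not rasdaman's implementation: the function names `tile_domains` and `tiles_touched`, the half-open ranges, and the tile shapes are all assumptions chosen for illustration. It shows how the same array, cut into tiles of equal size but different shape, needs a different number of tile reads for the same query window.

```python
from itertools import product

def tile_domains(extent, tile_shape):
    """Split an n-D array domain into n-D tile domains.

    Each tile is a tuple of half-open (lo, hi) ranges, one per dimension.
    """
    if len(extent) != len(tile_shape):
        raise ValueError("extent and tile_shape must have the same dimensionality")
    per_axis = [
        [(lo, min(lo + step, size)) for lo in range(0, size, step)]
        for size, step in zip(extent, tile_shape)
    ]
    return list(product(*per_axis))

def tiles_touched(tiles, query):
    """Return the tiles that overlap a half-open n-D query box."""
    return [
        tile for tile in tiles
        if all(lo < q_hi and q_lo < hi
               for (lo, hi), (q_lo, q_hi) in zip(tile, query))
    ]

# A 1000 x 1000 image tiled two ways (illustrative shapes, 10,000 cells per tile).
square = tile_domains((1000, 1000), (100, 100))   # 100 square tiles
rows = tile_domains((1000, 1000), (10, 1000))     # 100 tiles of 10 full rows each

window = ((0, 100), (0, 100))                     # a 100 x 100 window
print(len(tiles_touched(square, window)))         # 1
print(len(tiles_touched(rows, window)))           # 10
```

With square tiles the window is served by one tile; with row-shaped tiles ten tiles must be read, most of whose cells are not needed. This is the sense in which storage is adapted to the access pattern.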
[1] Baumann 1992, Baumann VLDBJ 1994
An Array DBMS: Rasdaman
declarative, minimal, safe Array Algebra
  intensive user studies: statistics, image, signal processing
minimally invasive DBMS integration
  new attribute type: array<celltype,extent>
  maps d-dimensional Euclidean hypercube X onto value set V
Array is function a : X → V
An Array DBMS: Rasdaman
implements SQL-embedded DML with array operators
  select / insert / update / delete + partial update

  select img.scene.green[x0:x1,y0:y1] > 130
  from LandsatArchive as img
  where some_cells(img.scene.nir > 127)

Web mapping, image & signal processing, statistics, linear algebra, pattern mining, scientific analytics
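To make the semantics of the example query concrete, here is a hedged Python sketch of what it computes. It is not the rasdaman API: the archive is modelled as a plain list of dictionaries with hypothetical keys `green` and `nir`, arrays are nested lists, and both bounds of `[x0:x1,y0:y1]` are treated as inclusive.

```python
def some_cells(mask):
    """True if at least one cell of a 2-D boolean array is True."""
    return any(any(row) for row in mask)

def greater_than(band, threshold):
    """Cell-wise comparison: returns a boolean array of the same shape."""
    return [[cell > threshold for cell in row] for row in band]

def trim(band, x0, x1, y0, y1):
    """Subarray [x0:x1, y0:y1] with both bounds inclusive."""
    return [row[y0:y1 + 1] for row in band[x0:x1 + 1]]

def query(archive, x0, x1, y0, y1):
    """For every image passing the where-clause, return the thresholded green window."""
    results = []
    for img in archive:
        if some_cells(greater_than(img["nir"], 127)):        # where some_cells(nir > 127)
            window = trim(img["green"], x0, x1, y0, y1)      # green[x0:x1, y0:y1]
            results.append(greater_than(window, 130))        # ... > 130
    return results

# Toy stand-in for LandsatArchive (values invented for illustration).
archive = [
    {"green": [[120, 140], [131, 90]], "nir": [[100, 128], [10, 20]]},
    {"green": [[200, 200], [200, 200]], "nir": [[0, 0], [0, 127]]},
]
print(query(archive, 0, 1, 0, 0))   # [[[False], [True]]]
```

Only the first image qualifies, because the second has no NIR cell above 127. The result is itself an array (here boolean), not a scalar.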
What is Big Data?
somehow connected to volume
but volume is a moving target
not only petabytes are Big Data
What is Big Data?
unless you are reeeaally big, storage volume is not the biggest problem
  doing proper analysis is then the difficulty
  suboptimal access patterns show up
  → inability of existing DBs to scale
cardinality of data is typically small compared to volume
  repeated observations over time or space
  many datasets have inherent temporal or spatial dimensions
  but are not ordered accordingly to preserve locality
  analysis then results in random-access patterns → sloow.
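The locality argument can be sketched in a few lines of Python. This is an illustration under assumptions, not a measurement: the (time, location) layout, the array shape, and the helper names `row_major_offset` and `contiguous_runs` are hypothetical. It counts how many contiguous storage runs a query needs when data is linearized in row-major order.

```python
def row_major_offset(index, shape):
    """Linear storage offset of an n-D index in row-major order."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset

def contiguous_runs(offsets):
    """Count maximal runs of consecutive offsets (each run is one sequential read)."""
    runs = 0
    previous = None
    for off in sorted(offsets):
        if previous is None or off != previous + 1:
            runs += 1
        previous = off
    return runs

shape = (365, 1000)   # assumed layout: 365 days x 1000 locations, ordered by day

# All locations for one day: stored next to each other.
one_day = [row_major_offset((0, loc), shape) for loc in range(shape[1])]
# One location over the whole year: one cell out of every 1000.
one_location = [row_major_offset((t, 0), shape) for t in range(shape[0])]

print(contiguous_runs(one_day))        # 1
print(contiguous_runs(one_location))   # 365
```

The time series for a single location is smaller than one day of data, yet it needs 365 separate reads because the storage order does not match the analysis order.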
What is Big Data?
ETL may not be the right solution...
  big volumes need to be transferred for further processing
Meta-definition:
  "Any point in time when data volume forces us to look beyond the tried-and-true methods that are prevalent at that time" [2]
[2] A. Jacobs 2009
Array database domain
Diverse world
  different approaches to implementing arrays on databases exist
    MonetDB [3]
    SciDB [4]
  no unified query language available
  different usage scenarios
    (web-) service providing access to many users
    but also personal research tool to analyse data
[3] van Ballegooij et al., 2005, www.monetdb.org
[4] P. Cudre-Mauroux et al., 2009, www.scidb.org
Benchmarking Array DBMS
Benchmarks should be... [Gray 1993]
  relevant
    → map real-world needs
    → rather practice driven
    systematically cover features and data properties
    → apply to different application domains
  simple
    obviously some trade-off to previous point needed
  portable
    as no unified query language available
    → high level description of tasks to fulfill
  scalable
Benchmarking Array DBMS
Need to test (further details follow...)
  array features
    dimensionality, cell types
  data properties
    volume, sparsity
  array query operations
  domain specific features
    special operations, transformations
What needs to be tested... relevance
number of dimensions
  low-dimensional (1-D - 5-D)
    1-D environmental sensor time series
    2-D satellite images, seafloor maps
    3-D x/y/t image time series and x/y/z geophysics data
    4-D x/y/z/t climate and ocean data
  medium-dimensional (6-D - 12-D)
    OLAP
  high-dimensional (up to thousands)
    Data mining, collection of features
[Figure: precipitation, x/y/z/t]
What needs to be tested... relevance
Space-time cube
  satellite creates several scenes
  satellite scene referenced by latitude/longitude + time
  at least twice per year each point should be mapped
  set of scenes that have temporal and spatial overlap
Example query:
  give me the near-infrared (NIR) values between 2007 and 2009 in Vienna
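The first step of the example query, finding the scenes that overlap the requested region and period, might be sketched as follows. Everything here is an assumption for illustration: the `Scene` record, the helper names, the rough bounding box used for Vienna, and the sample scenes. Extracting the NIR cells from the selected scenes would follow as a trimming step.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Scene:
    name: str
    lat: tuple        # (south, north) in degrees
    lon: tuple        # (west, east) in degrees
    acquired: date    # acquisition date of the scene

def overlaps(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def select_scenes(scenes, lat, lon, start, end):
    """Scenes that overlap the bounding box and were acquired within [start, end]."""
    return [
        s for s in scenes
        if overlaps(s.lat, lat) and overlaps(s.lon, lon) and start <= s.acquired <= end
    ]

# Rough bounding box for Vienna (approximate values, for illustration only).
VIENNA_LAT = (48.1, 48.3)
VIENNA_LON = (16.2, 16.6)

scenes = [
    Scene("a", (47.5, 49.0), (15.5, 17.5), date(2008, 6, 1)),   # right place, right time
    Scene("b", (47.5, 49.0), (15.5, 17.5), date(2010, 6, 1)),   # right place, too late
    Scene("c", (52.0, 53.0), (13.0, 14.0), date(2008, 6, 1)),   # right time, elsewhere
]
hits = select_scenes(scenes, VIENNA_LAT, VIENNA_LON, date(2007, 1, 1), date(2009, 12, 31))
print([s.name for s in hits])   # ['a']
```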
What needs to be tested...
Dimensions and cell type constitute array model features
cell types
  single values
  records (e.g. colored pixel)
  domain specific data structures
What needs to be tested... scalability
Data properties
  Volume of data
    range MB to PB
  Sparsity of data
    sparse arrays like statistical data cubes
    dense arrays like satellite imagery
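The sparse/dense distinction can be quantified with a small sketch. The functions `density` and `to_sparse`, the choice of 0 as the null value, and the sample arrays are assumptions for illustration; real systems use their own null handling and storage formats.

```python
def density(array, null=0):
    """Fraction of cells in a 2-D array that hold a non-null value."""
    cells = [cell for row in array for cell in row]
    if not cells:
        raise ValueError("array has no cells")
    return sum(1 for cell in cells if cell != null) / len(cells)

def to_sparse(array, null=0):
    """Coordinate-keyed representation that stores only non-null cells."""
    return {
        (i, j): cell
        for i, row in enumerate(array)
        for j, cell in enumerate(row)
        if cell != null
    }

# Invented sample data: a mostly empty data cube slice and a fully populated image.
cube = [[0, 0, 0, 5], [0, 0, 0, 0], [7, 0, 0, 0], [0, 0, 0, 0]]
image = [[12, 40, 33, 8], [19, 21, 35, 2], [11, 9, 27, 14], [30, 31, 6, 17]]

print(density(cube))      # 0.125
print(density(image))     # 1.0
print(to_sparse(cube))    # {(0, 3): 5, (2, 0): 7}
```

A benchmark data generator would need a parameter like this density to cover both ends of the range.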
Relevance in array database domain
Array is function a : X → V
Query operations
  on X : trimming, slicing
  on V : pixel-wise addition of images
  on the function itself: histogram
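The three classes of operations can be sketched on a 2-D array held as nested Python lists. The function names and the inclusive bounds are assumptions for illustration, not the operators of any particular array DBMS.

```python
from collections import Counter

def trim(a, x0, x1, y0, y1):
    """On X: restrict the domain, keeping the dimensionality (bounds inclusive)."""
    return [row[y0:y1 + 1] for row in a[x0:x1 + 1]]

def slice_at(a, x):
    """On X: fix one coordinate, reducing a 2-D array to 1-D."""
    return list(a[x])

def add(a, b):
    """On V: cell-wise addition of two arrays over the same domain."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("arrays must share the same domain")
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]

def histogram(a):
    """On the function itself: how often each value occurs, regardless of position."""
    return Counter(cell for row in a for cell in row)

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(trim(a, 0, 1, 1, 2))                  # [[2, 3], [5, 6]]
print(slice_at(a, 1))                       # [4, 5, 6]
print(add(a, a))                            # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
print(dict(histogram([[1, 1], [2, 1]])))    # {1: 3, 2: 1}
```

Trimming and slicing only touch the domain X, addition only touches the values in V, and the histogram discards the domain entirely.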
Relevance in array database domain
Array is function a : X → V
Query operations
  de-arraying functions: aggregations
  querying irregular time axis (most rain in June in recent years)
Relevance in array database domain
Array is function a : X → V
Irregular time axis
  calendar is highly irregular: month lengths differ, leap years
  but need to analyse by month, season
  → create additional dimensions
  has effect on tiling strategies
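The step from a regular daily axis to additional (year, month) dimensions can be sketched as below. The rainfall series is synthetic and the function names are hypothetical; the point is only that the resulting month groups have different lengths, which is what makes the axis irregular.

```python
from collections import defaultdict
from datetime import date, timedelta

def by_year_month(start, daily_values):
    """Regroup a regular daily 1-D series into (year, month) -> list of daily values."""
    grouped = defaultdict(list)
    for offset, value in enumerate(daily_values):
        day = start + timedelta(days=offset)
        grouped[(day.year, day.month)].append(value)
    return dict(grouped)

def wettest_year(grouped, month):
    """Aggregate one month across all years and return the year with the largest total."""
    totals = {year: sum(values) for (year, m), values in grouped.items() if m == month}
    return max(totals, key=totals.get)

def synthetic_rain(day):
    """Invented data: 1 mm every day in 2011; in 2012 rain only in June, 2 mm per day."""
    if day.year == 2011:
        return 1.0
    return 2.0 if day.month == 6 else 0.0

start = date(2011, 1, 1)
n_days = (date(2013, 1, 1) - start).days      # 365 + 366 days
series = [synthetic_rain(start + timedelta(days=i)) for i in range(n_days)]

grouped = by_year_month(start, series)
print(len(grouped[(2011, 2)]), len(grouped[(2012, 2)]))   # 28 29
print(wettest_year(grouped, 6))                            # 2012
```

February has 28 cells in one year and 29 in the next, so the new month dimension is ragged. A tiling aligned to fixed-size blocks of days will not line up with these boundaries.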
Ease of use in array database domain
Array is function a : X → V
Query operation support
  natively supported?
  via User Defined Functions (UDF)?
    expertise needed
    additional costs involved
...how to implement in benchmark?
Suitability cube
Combination of assessments can be called a suitability cube
  addresses challenges from all relevant sides
  developers want to address all possibilities
  users want one single number...
Does modern technology help?
(modified image from qrarts.com)
Existing array DB benchmarks
Early attempts: Sequoia 2000 [5], Paradise [6]
Standard Science DBMS Benchmark (SS-DB) [7]
  applies space-science use case
  relevant, performs nine queries on astronomical data
    load data
    queries raw data
    creates derived data (cooking)
    queries derived data
  portable, source-code available (but difficult to find...)
    → repeatable
  scalable, covers small to big data volumes, data generator
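The four phases listed above suggest a simple timing harness. The sketch below is not the SS-DB code: the phase functions are toy stand-ins operating on an in-memory array, and `run_phases` is a hypothetical helper. It only illustrates timing each phase separately so that the result is a set of numbers rather than a single one.

```python
import time

def run_phases(phases, context=None):
    """Run named phases in order; return the shared context and seconds per phase."""
    context = {} if context is None else context
    timings = {}
    for name, step in phases:
        started = time.perf_counter()
        step(context)
        timings[name] = time.perf_counter() - started
    return context, timings

# Toy stand-ins for the benchmark phases (assumptions, not the real workload).
def load_data(ctx):
    ctx["raw"] = [[(x * y) % 7 for y in range(50)] for x in range(50)]

def query_raw(ctx):
    ctx["raw_max"] = max(max(row) for row in ctx["raw"])

def cook(ctx):
    # Derived data: one total per row of the raw array.
    ctx["derived"] = [sum(row) for row in ctx["raw"]]

def query_derived(ctx):
    ctx["derived_max"] = max(ctx["derived"])

PHASES = [
    ("load data", load_data),
    ("query raw data", query_raw),
    ("create derived data (cooking)", cook),
    ("query derived data", query_derived),
]

context, timings = run_phases(PHASES)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Because the phases are passed in as plain callables, the same harness could in principle drive different systems, which is one way to read the portability requirement.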
[5] Stonebraker 1993
[6] Patel et al. 1997
[7] Cudre-Mauroux et al. 2010
Existing array DB benchmarks, SS-DB
However...
  only single-user queries
  selection of queries seems rather limited
  does not address higher dimensions, such as 4-D, 5-D
    → does not fully cover other application domains, such as geophysics, climate and ocean data
  only regular time axis
Trade-off between simplicity and functional coverage
  ease of use, no analysis of array queries used
    natively supported?
    user defined functions
  result is not a single number...
Conclusion
arrays inherent in Big Data
benchmarks for big data should consider array operations as well
suitability cube tries to address many metrics
SS-DB good basis for discussion
benchmarks will make us work harder...