Using Tiling to Scale Parallel Datacube Implementation
Ruoming Jin, Karthik Vaidyanathan, Ge Yang, Gagan Agrawal
The Ohio State University
Introduction to Data Cube Construction
Data cube construction involves computing aggregates for all values across all possible subsets of the dimensions.
If the original dataset is n-dimensional, data cube construction includes computing and storing C(n, m) m-dimensional arrays for each m < n.
Three-dimensional data cube construction involves computing arrays AB, AC, BC, A, B, C and a scalar value all.
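As a minimal illustration (hypothetical toy data, using NumPy summation as the aggregate; not the paper's algorithm), the three-dimensional case can be sketched as:

```python
import numpy as np

# Hypothetical 2x3x4 input array, indexed by dimensions (A, B, C)
cube = np.arange(2 * 3 * 4, dtype=np.int64).reshape(2, 3, 4)

# Two-dimensional aggregates: collapse one dimension each
AB = cube.sum(axis=2)   # drop C
AC = cube.sum(axis=1)   # drop B
BC = cube.sum(axis=0)   # drop A

# One-dimensional aggregates, computed from a 2-D parent rather than raw data
A = AB.sum(axis=1)
B = AB.sum(axis=0)
C = AC.sum(axis=0)

# The scalar "all" aggregate
all_agg = A.sum()
print(AB.shape, AC.shape, BC.shape, all_agg)
```

Computing A, B and C from two-dimensional parents rather than the raw array is the point of the minimal-parent idea discussed later.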
Part I
Motivation
• Datasets for off-line processing are becoming larger.
  – A system storing and allowing analysis on such datasets is a data warehouse.
• Frequent queries on data warehouses require aggregation along one or more dimensions.
  – Data cube construction performs all aggregations in advance to facilitate fast responses to all queries.
• Data cube construction is a compute- and data-intensive problem.
  – Memory requirements become the bottleneck for sequential algorithms.
Construct data cubes in parallel in cluster environments!
Our Earlier Work
• Parallel Algorithms for Small Dimensional Cases and Use of a Cluster Middleware (CCGRID 2002, FGCS 2003)
• Parallel algorithms and theoretical results (ICPP 2003, HiPC 2003)
• Evaluating parallel algorithms (IPDPS 2003)
Using Tiling
• One important issue: memory requirements for intermediate results
  – From a sparse m-dimensional array, we compute m dense (m-1)-dimensional arrays.
• Tiling can help scale sequential and parallel datacube algorithms.
• Two important issues:
  – Algorithms for using tiling
  – How to tile so as to incur minimum overhead
Outline
• Main Issues and Data Structures
• Parallel algorithms without tiling
• Tiling for Sequential Datacube construction
• Theoretical analysis
• Tiling for Parallel Datacube construction
• Experimental evaluation
Main Issues
• Cache and memory reuse
  – Each portion of the parent array is read only once to compute its children; the corresponding portions of each child should be updated simultaneously.
• Using minimal parents
  – If a child has more than one parent, it uses the minimal parent, which requires less computation to obtain the child.
• Memory management
  – Write the output array back to disk if no child is computed from this array.
  – Manage available main memory effectively.
• Communication volume
  – Appropriately partition along one or more dimensions to guarantee minimal communication volume.
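The minimal-parent rule can be sketched as follows (hypothetical dimension sizes; the product of dimension sizes stands in for the cost of aggregating a parent down to a child):

```python
# Hypothetical sizes for the dimensions D1..D4 of a four-dimensional cube
dim_size = {1: 16, 2: 32, 3: 64, 4: 128}

def parents(child):
    """All arrays with one more dimension that contain the child's dimensions."""
    return [tuple(sorted(child + (d,))) for d in set(dim_size) - set(child)]

def minimal_parent(child):
    """The parent with the smallest array, i.e. the cheapest to aggregate from."""
    def size(dims):
        total = 1
        for d in dims:
            total *= dim_size[d]
        return total
    return min(parents(child), key=size)

print(minimal_parent((1, 2)))  # D1D2 is computed from D1D2D3, not D1D2D4
```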
Aggregation Tree
Given a set X = {1, 2, …, n} and a prefix tree P(n), the corresponding aggregation tree A(n) is constructed by complementing every node in P(n) with respect to X.
Part III
[Figure: prefix lattice, prefix tree, and aggregation tree]
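The complement construction above can be sketched directly (a toy version, assuming nodes are represented as sorted tuples of dimension indices):

```python
def prefix_tree(n):
    """Children of a prefix-tree node S are S + (d,) for each d greater than max(S)."""
    def children(s):
        start = (max(s) + 1) if s else 1
        return [s + (d,) for d in range(start, n + 1)]
    tree = {}
    stack = [()]
    while stack:
        node = stack.pop()
        tree[node] = children(node)
        stack.extend(tree[node])
    return tree

def aggregation_tree(n):
    """Complement every prefix-tree node with respect to X = {1, ..., n}."""
    full = set(range(1, n + 1))
    comp = lambda s: tuple(sorted(full - set(s)))
    return {comp(node): [comp(c) for c in kids]
            for node, kids in prefix_tree(n).items()}

tree = aggregation_tree(3)
print(tree[(1, 2, 3)])  # the root's children: D2D3, D1D3, D1D2
```

For n = 3 this reproduces the tree used in the examples that follow: the root D1D2D3 has children D2D3, D1D3 and D1D2.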
Theoretical Results
• For data cube construction using the aggregation tree:
  – The total memory requirement for holding the results is bounded.
  – The total communication volume is bounded.
  – It is guaranteed that all arrays are computed from their minimal parents.
  – A procedure for partitioning input datasets exists that minimizes interprocessor communication.
Level One Parallel Algorithm
Main ideas
• Each processor computes a portion of each child at the first level.
• Lead processors have the final results after interprocessor communication.
• If the output is not used to compute other children, write it back; otherwise compute children on lead processors.
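A minimal sketch of the lead-processor rule under the assumptions of the example that follows (a 2x2x2 processor grid; illustrative only, omitting the actual interprocessor communication):

```python
from itertools import product

def is_lead(coords, aggregated):
    """A processor is a lead for a child if its coordinate is 0 along
    every dimension the child aggregates away."""
    return all(coords[d] == 0 for d in aggregated)

# 2x2x2 processor grid; child D1D2 aggregates away dimension D3 (index 2)
leads = [p for p in product(range(2), repeat=3) if is_lead(p, aggregated=[2])]
print(leads)  # the four lead processors (l1, l2, 0)
```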
Example
• Assumption
  – 8 processors
  – Each of the three dimensions is partitioned in half
• Initially
  – Each processor computes partial results for each of D1D2, D1D3 and D2D3
[Figure: aggregation tree for the three-dimensional array D1D2D3]
Example (cont.)
• Lead processors for D1D2 have labels (l1, l2, 0); each receives from (l1, l2, 1):
  (0, 0, 0) ← (0, 0, 1)
  (0, 1, 0) ← (0, 1, 1)
  (1, 0, 0) ← (1, 0, 1)
  (1, 1, 0) ← (1, 1, 1)
• Write back D1D2 on lead processors
Example (cont.)
• Lead processors for D1D3 have labels (l1, 0, l3); each receives from (l1, 1, l3):
  (0, 0, 0) ← (0, 1, 0)
  (0, 0, 1) ← (0, 1, 1)
  (1, 0, 0) ← (1, 1, 0)
  (1, 0, 1) ← (1, 1, 1)
• Compute D1 from D1D3 on lead processors; write back D1D3 on lead processors
• Lead processors for D1 have labels (l1, 0, 0); each receives from (l1, 0, 1):
  (0, 0, 0) ← (0, 0, 1)
  (1, 0, 0) ← (1, 0, 1)
• Write back D1 on lead processors
Tiling-based Approach
• Motivation
  – Parallel machines are not always available
  – Memory of an individual computer is limited
• Tiling-based Approaches
  – Sequential: tile along dimensions on one processor
  – Parallel: partition among processors and, on each processor, tile along dimensions
Part IV
Sequential Tiling-based Algorithm
• Main Idea
  A portion of a node in the aggregation tree is expandable (can be used to compute its children) once enough tiles of the portion of this node have been processed.
• Main Mechanism
  Each tile is given a label.
• Example: 4 tiles, tiling along D2 and D3. Each tile is given a label (0, l2, l3):
  Tile 0 – (0, 0, 0)
  Tile 1 – (0, 0, 1)
  Tile 2 – (0, 1, 0)
  Tile 3 – (0, 1, 1)
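The tile labeling in this example can be sketched as (assuming zero-based tile indices along each tiled dimension):

```python
from itertools import product

def tile_labels(tiled_dims, splits, ndims=3):
    """Enumerate tile labels: 0 on untiled dimensions, and a tile index
    in [0, splits) on each tiled dimension."""
    labels = []
    for idx in product(range(splits), repeat=len(tiled_dims)):
        label = [0] * ndims
        for d, i in zip(tiled_dims, idx):
            label[d] = i
        labels.append(tuple(label))
    return labels

# Tile along D2 and D3 (indices 1 and 2), two pieces each -> 4 tiles
print(tile_labels(tiled_dims=[1, 2], splits=2))
```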
Example
Processing order (tiling along D2 and D3):
• Tile (0, 0, 0) done: portion 0 of D1D3 and portion 0 of D1D2 partially computed
• Tile (0, 0, 1) done: portion 1 of D1D3 partially computed; portion 0 of D1D2 merged & expanded
• Tile (0, 1, 0) done: portion 1 of D1D2 partially computed; portion 0 of D1D3 merged & expanded
• Tile (0, 1, 1) done: portion 1 of D1D3 merged & expanded; portion 1 of D1D2 merged & expanded
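The merge-and-expand bookkeeping in this example can be sketched by counting processed tiles per portion (hypothetical helper names; a portion becomes expandable once all tiles contributing to it are done):

```python
from collections import defaultdict

def process_tiles(tiles, portions_of, tiles_per_portion):
    """Replay the rule: a portion of a child array becomes expandable
    (ready to merge & expand) once all tiles contributing to it are done."""
    seen = defaultdict(int)
    events = []
    for label in tiles:
        for child, portion in portions_of(label):
            seen[(child, portion)] += 1
            if seen[(child, portion)] == tiles_per_portion[child]:
                events.append((label, child, portion))
    return events

# The 4-tile example: tiles labeled (0, l2, l3)
tiles = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]

def portions_of(label):
    _, l2, l3 = label
    # Each tile contributes to one portion of each first-level child
    return [("D2D3", (l2, l3)), ("D1D3", l3), ("D1D2", l2)]

# D1D2 and D1D3 portions each need two tiles; D2D3 portions need only one
need = {"D2D3": 1, "D1D3": 2, "D1D2": 2}
events = process_tiles(tiles, portions_of, need)
print(events)
```

Replaying the example, portion 0 of D1D2 becomes expandable after tile (0, 0, 1) and portion 0 of D1D3 after tile (0, 1, 0), matching the trace above.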
Tiling Overhead
• The tiling-based algorithm requires writing back and rereading portions of results.
• We want to tile so as to minimize this overhead.
• Suppose the dimension Di is tiled 2^ki times.
• The total tiling overhead can then be computed as a function of the parameters ki.
Minimizing Tiling Overhead
• Tile the largest dimension first, and update its effective size.
• Keep choosing the (currently) largest dimension until the memory requirements are below the available memory.
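The greedy selection above can be sketched as follows (a simplified memory model in which the footprint is just the product of the effective dimension sizes):

```python
def choose_tiling(sizes, mem_limit):
    """Greedily halve the largest effective dimension until the
    (simplified) memory estimate fits. Returns k with Di tiled 2**k[i] times."""
    eff = list(sizes)
    k = [0] * len(sizes)
    def footprint():
        total = 1
        for s in eff:
            total *= s
        return total
    while footprint() > mem_limit:
        i = max(range(len(eff)), key=lambda j: eff[j])
        eff[i] //= 2      # halve the effective size of the largest dimension
        k[i] += 1         # Di is now tiled in twice as many pieces
    return k

# A hypothetical 128^4 cube that must fit in 1/8 of its untiled footprint
print(choose_tiling([128, 128, 128, 128], mem_limit=128 ** 4 // 8))
```

Because the greedy rule always splits the currently largest dimension, the tiling spreads across several dimensions rather than cutting one dimension many times, which is what reduces the overhead in the experiments below.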
Parallel Tiling-based Algorithm
• Assumptions
  – Three-dimensional partition (0 1 1 1)
  – Two-dimensional tiling (0 0 1 1)
• Solutions
  – Apply the tiling-based approach to first-level nodes only
  – Apply the Level One Parallel Algorithm to the other nodes

[Figure: four-dimensional aggregation tree over D1D2D3D4]
Choosing Tiling Parameters

[Chart: execution time (s) vs. sparsity level (25% and 5%) for a 128^4 dataset on 1 processor with 8 tiles, comparing the sequential algorithm without tiling against three-, two-, and one-dimensional tiling]

• Tiling overhead exists.
• Tiling along multiple dimensions can reduce tiling overhead.
Parallel Tiling-based Algorithm Results

[Chart: execution time (s) vs. sparsity level (25% and 5%) for a 128^4 dataset on 8 processors with a three-dimensional partition, comparing tiling parameters (1 0 0 1), (0 0 1 1), and (0 0 0 2)]

The algorithm for choosing tiling parameters to reduce tiling overhead remains effective in parallel environments!
Conclusions
• Tiling can help scale parallel datacube construction.
• Our work provides algorithms for tiling-based sequential and parallel datacube construction, together with analytical results.