Post on 12-Jan-2016
1
Cube Computation and Indexes for Data Warehouses
CPS 196.03 Notes 7
2
Processing
ROLAP servers vs. MOLAP servers
Index Structures
Cube computation
What to Materialize?
Algorithms
[Figure: warehouse architecture diagram with the labels Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
3
ROLAP Server
Relational OLAP Server
[Figure: tools and utilities sit on a ROLAP server, which sits on a relational DBMS]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

Special indices, tuning
Schema is “denormalized”
4
MOLAP Server
Multi-Dimensional OLAP Server
[Figure: M.D. tools and utilities sit on a multi-dimensional server, which could also sit on a relational DBMS]
[Figure: Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B) and Date (1, 2, 3, 4)]
5
MOLAP
[Figure: sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); one cell is labeled "Total annual sales of TV in U.S.A."]
6
MOLAP
[Figure: three-dimensional array with dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), divided into chunks numbered 1 to 64]
7
Challenges in MOLAP
Storing large arrays for efficient access
  Row-major, column-major
  Chunking
  Compressing sparse arrays
Creating array data from data in tables
  Efficient techniques for cube computation
Topics are discussed in the paper for reading
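A minimal sketch of the storage ideas named on this slide: row-major linearization, chunk addressing, and dropping empty cells. The function names and the use of `None` for an empty cell are assumptions made for illustration; the assigned paper may use a different layout.

```python
def row_major_offset(index, shape):
    """Linear position of a cell in a dense array stored in row-major order."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of range")
        offset = offset * n + i
    return offset


def chunk_of(index, chunk_shape):
    """Chunk coordinates of a cell, plus the cell's position inside that chunk."""
    chunk = tuple(i // c for i, c in zip(index, chunk_shape))
    within = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk, within


def compress_sparse(dense_cells):
    """Keep only non-empty cells as (offset, value) pairs (None marks an empty cell)."""
    return [(pos, v) for pos, v in enumerate(dense_cells) if v is not None]


print(row_major_offset((1, 2, 3), (4, 4, 4)))   # 1*16 + 2*4 + 3 = 27
print(chunk_of((5, 2, 7), (4, 4, 4)))           # ((1, 0, 1), (1, 2, 3))
print(compress_sparse([None, 5, None, 7]))      # [(1, 5), (3, 7)]
```

Column-major order is the same computation with the index and shape traversed in reverse.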
8
Index Structures
Traditional Access Methods
  B-trees, hash tables, R-trees, grids, …
Popular in Warehouses
  inverted lists
  bit map indexes
  join indexes
  text indexes
9
Inverted Lists
[Figure: an age index (entries 18 to 26) points to inverted lists, which point to data records]

inverted lists
age  records
20   r4, r18, r34, r35
21   r5, r19, r37, r40

data records
rId  name   age
r4   joe    20
r18  fred   20
r19  sally  21
r34  nancy  20
r35  tom    20
r36  pat    25
r5   dave   21
r41  jeff   26
...
10
Using Inverted Lists
Query: Get people with age = 20 and name = “fred”
List for age = 20: r4, r18, r34, r35
List for name = “fred”: r18, r52
Answer is intersection: r18
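The lookup above can be sketched as a set intersection. This is an illustration only: it uses just the eight records visible on the Inverted Lists slide, so the list for “fred” contains r18 alone (r52 is not among the rows shown), and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

# rId -> (name, age), copied from the records shown on the slide.
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
}


def build_inverted_index(records, field):
    """Map each value of one field to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, row in records.items():
        index[row[field]].add(rid)
    return index


name_index = build_inverted_index(records, 0)
age_index = build_inverted_index(records, 1)

# age = 20 AND name = "fred": intersect the two inverted lists.
answer = age_index[20] & name_index["fred"]
print(sorted(answer))  # ['r18']
```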
11
Bit Maps
[Figure: an age index (entries 18 to 26) points to bit maps, which refer to positions in the data records]

bit maps
110110000
0010001011

data records
id  name   age
1   joe    20
2   fred   20
3   sally  21
4   nancy  20
5   tom    20
6   pat    25
7   dave   21
8   jeff   26
...
12
Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
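A small sketch of how the two indexes on this slide could be built from the base table: one bit vector per distinct column value, with bit i set when row i holds that value. The helper name `build_bitmap_index` and the list-of-ints representation are assumptions for illustration.

```python
# Base table from the slide: (Cust, Region, Type).
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Return {value: bit vector}; each vector has one bit per base-table row."""
    index = {}
    for i, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[i] = 1
    return index


region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

print(region_index["Asia"])   # [1, 0, 1, 0, 0]
print(type_index["Dealer"])   # [0, 1, 1, 0, 1]
```

The number of vectors grows with the number of distinct values, which is why the slide warns against high-cardinality domains.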
13
Using Bit Maps
Query: Get people with age = 20 and name = “fred”
List for age = 20: 1101100000
List for name = “fred”: 0100000001
Answer is intersection: 0100000000
Good if domain cardinality small
Bit vectors can be compressed
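The intersection can be checked with a bitwise AND. This sketch keeps the bit vectors as strings, as on the slide, and converts them to integers for the AND; `bitmap_and` and `matching_rows` are hypothetical helper names.

```python
age_20 = "1101100000"      # bit vector for age = 20, from the slide
name_fred = "0100000001"   # bit vector for name = "fred", from the slide


def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings, returned at the same length."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    result = int(a, 2) & int(b, 2)
    return format(result, "0{}b".format(len(a)))


def matching_rows(bitmap):
    """1-based positions of the set bits, i.e. the qualifying record ids."""
    return [i + 1 for i, bit in enumerate(bitmap) if bit == "1"]


answer = bitmap_and(age_20, name_fred)
print(answer)                 # 0100000000
print(matching_rows(answer))  # [2]
```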
14
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
15
Join Indexes
product
id  name  price  jIndex
p1  bolt  10     r1, r3, r5, r6
p2  nut   5      r2, r4

sale
rId  prodId  storeId  date  amt
r1   p1      c1       1     12
r2   p2      c1       1     11
r3   p1      c3       1     50
r4   p2      c2       1     8
r5   p1      c1       2     44
r6   p1      c2       2     4

[Figure: the jIndex column of product is the join index, pointing at the matching sale rows]
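A minimal sketch of the join index on this slide: for each product id, precompute the ids of the sale rows that join with it, so the join can later be answered by lookup instead of a scan. The dictionary layout and the names `build_join_index` and `sales_of` are assumptions for illustration.

```python
from collections import defaultdict

# rId -> (prodId, storeId, date, amt), from the slide.
sale = {
    "r1": ("p1", "c1", 1, 12),
    "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50),
    "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44),
    "r6": ("p1", "c2", 2, 4),
}
# id -> (name, price), from the slide.
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}


def build_join_index(sale_rows):
    """Map each product id to the list of sale row ids that reference it."""
    j_index = defaultdict(list)
    for rid, (prod_id, _store, _date, _amt) in sale_rows.items():
        j_index[prod_id].append(rid)
    return j_index


j_index = build_join_index(sale)


def sales_of(prod_id):
    """Joined rows for one product, fetched through the join index."""
    name, price = product[prod_id]
    return [(prod_id, name, price) + sale[rid][1:] for rid in j_index[prod_id]]


print(j_index["p1"])   # ['r1', 'r3', 'r5', 'r6']
print(sales_of("p2"))  # [('p2', 'nut', 5, 'c1', 1, 11), ('p2', 'nut', 5, 'c2', 1, 8)]
```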
16
Cube Computation for Data Warehouses
17
Counting Exercise
How many cuboids are there in a cube?
  The full or nothing case
  When dimension hierarchies are present
What is the size of each cuboid?
18
Lattice of Cuboids
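Lattice levels, from the base cuboid up to the apex:
city, product, date
city, product | city, date | product, date
city | product | date
all

Example cuboids (sales amounts):

city, product, date
day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

city, product
        c1   c2   c3
p1      56    4   50
p2      11    8

city
        c1   c2   c3
        67   12   50

all
129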
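To make the lattice concrete, here is a sketch that computes every cuboid from the six sale rows used throughout these notes by grouping on each subset of the dimensions. It is the naive approach (one pass over the base data per cuboid), not an efficient cube algorithm, and it treats the storeId column as the city dimension, as the lattice slide does.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "city", "date")

# (product, city, date, amt), from the sale table.
sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def cuboid(rows, group_by):
    """Sum amt over all rows that agree on the group_by dimensions."""
    positions = [DIMS.index(d) for d in group_by]
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in positions)
        cells[key] += row[-1]
    return dict(cells)


def full_cube(rows):
    """One cuboid per subset of DIMS, from the base cuboid down to 'all'."""
    return {
        group_by: cuboid(rows, group_by)
        for k in range(len(DIMS), -1, -1)
        for group_by in combinations(DIMS, k)
    }


cube = full_cube(sale)
print(len(cube))                   # 8 cuboids for 3 dimensions
print(cube[("product", "city")])   # includes ('p1', 'c1'): 56
print(cube[("city",)])             # c1: 67, c2: 12, c3: 50
print(cube[()])                    # {(): 129}
```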
19
Dimension Hierarchies
[Figure: hierarchy for the city dimension, from coarsest to finest: all, state, city]

cities
city  state
c1    CA
c2    NY
20
Dimension Hierarchies
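[Figure: lattice of cuboids extended with the state level. Nodes: (city, product, date); (city, product); (city, date); (product, date); (city); (product); (date); (all); (state, product, date); (state, product); (state, date); (state). Not all arcs shown.]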
21
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
  The bottom-most cuboid is the base cuboid
  The top-most cuboid (apex) contains only one cell
  How many cuboids in an n-dimensional cube with L levels?
    T = \prod_{i=1}^{n} (L_i + 1)
Materialization of data cube
  Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
  Selection of which cuboids to materialize
    Based on size, sharing, access frequency, etc.
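The count on this slide is the product of (L_i + 1) over the n dimensions, where L_i is taken here to be the number of hierarchy levels of dimension i and the extra 1 stands for the "all" level. A short sketch, with `count_cuboids` as a hypothetical helper name:

```python
def count_cuboids(levels_per_dimension):
    """Total cuboids T = product over dimensions of (L_i + 1)."""
    total = 1
    for levels in levels_per_dimension:
        total *= levels + 1
    return total


# Three dimensions with one level each (city, product, date): 2**3 cuboids.
print(count_cuboids([1, 1, 1]))  # 8

# Same cube when the city dimension has two levels (city, state).
print(count_cuboids([2, 1, 1]))  # 12
```

With one level per dimension the formula reduces to 2^n, the "full or nothing" case from the counting exercise.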
22
Derived Data
Derived Warehouse Data
  indexes
  aggregates
  materialized views (next slide)
When to update derived data?
Incremental vs. refresh
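A sketch of the two update strategies for one derived aggregate, total amount per product. Refresh recomputes from all base rows; incremental folds only the new rows into the stored totals. This works directly for SUM; the function names and the sample delta row are assumptions for illustration.

```python
def refresh(sales):
    """Recompute the aggregate from scratch over all (prodId, amt) rows."""
    totals = {}
    for prod_id, amt in sales:
        totals[prod_id] = totals.get(prod_id, 0) + amt
    return totals


def apply_increment(totals, new_sales):
    """Fold only the newly arrived rows into a copy of the stored aggregate."""
    updated = dict(totals)
    for prod_id, amt in new_sales:
        updated[prod_id] = updated.get(prod_id, 0) + amt
    return updated


base = [("p1", 12), ("p2", 11), ("p1", 50), ("p2", 8), ("p1", 44), ("p1", 4)]
delta = [("p2", 7)]  # hypothetical new sale, not from the slides

stored = refresh(base)
print(stored)                          # {'p1': 110, 'p2': 19}
print(apply_increment(stored, delta))  # {'p1': 110, 'p2': 26}
```

Both routes end in the same totals; the incremental one touches only the delta rows.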
23
Idea of Materialized Views
Define new warehouse tables/arrays
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
24
Efficient OLAP Processing
Determine which operations should be performed on available cuboids
  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine which materialized cuboid(s) should be selected for OLAP:
  Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available:
    1) {year, item_name, city}
    2) {year, brand, country}
    3) {year, brand, province_or_state}
    4) {item_name, province_or_state} where year = 2004
  Which should be selected to process the query?
Explore indexing structures & compressed vs. dense arrays in MOLAP
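One way to approach the cuboid-selection question is to first rule out cuboids that cannot answer the query at all. The sketch below assumes the concept hierarchies item_name < brand and city < province_or_state < country, which the slide does not state explicitly, and treats a cuboid as usable when it holds each needed dimension at the same or a finer level. All names here are hypothetical.

```python
# Assumed concept hierarchies, finest level first (not given on the slide).
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}


def locate(level):
    """Return (dimension, depth) for a level name; depth 0 is the finest."""
    for dim, levels in HIERARCHIES.items():
        if level in levels:
            return dim, levels.index(level)
    raise KeyError(level)


def derivable(level, cuboid_levels):
    """True if the cuboid holds the same dimension at this level or finer."""
    dim, depth = locate(level)
    return any(locate(c)[0] == dim and locate(c)[1] <= depth
               for c in cuboid_levels)


def can_answer(cuboid, query):
    c_levels, c_filter = cuboid
    q_levels, q_filter = query
    # A pre-filtered cuboid is usable only if the query asks for the same slice.
    if any(q_filter.get(k) != v for k, v in c_filter.items()):
        return False
    group_ok = all(derivable(level, c_levels) for level in q_levels)
    filter_ok = all(k in c_filter or derivable(k, c_levels) for k in q_filter)
    return group_ok and filter_ok


cuboids = {
    1: ({"year", "item_name", "city"}, {}),
    2: ({"year", "brand", "country"}, {}),
    3: ({"year", "brand", "province_or_state"}, {}),
    4: ({"item_name", "province_or_state"}, {"year": 2004}),
}
query = ({"brand", "province_or_state"}, {"year": 2004})

usable = [n for n, c in cuboids.items() if can_answer(c, query)]
print(usable)  # [1, 3, 4]
```

Under these assumptions cuboid 2 is excluded because country is coarser than province_or_state. Choosing among the remaining three depends on cuboid sizes and available indexes, which the slide leaves open.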
25
What to Materialize?
Store in warehouse results useful for common queries
Example:

city, product, date
day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

city, product
        c1   c2   c3
p1      56    4   50
p2      11    8

city
        c1   c2   c3
        67   12   50

product
p1  110
p2   19

total sales
129

. . .

[Figure label: materialize]
26
Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost
Will study a concrete algorithm later
27
Iceberg Cube
Computing only the cuboid cells whose count or other aggregate satisfies a condition like
  HAVING COUNT(*) >= minsup
Motivation
  Only a small portion of cube cells may be “above the water” in a sparse cube
  Only calculate “interesting” cells: data above a certain threshold
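A sketch of the iceberg condition applied to the six sale rows from these notes, with COUNT(*) as the aggregate. It is deliberately naive: it counts every cell and then filters, whereas the point of iceberg-cube algorithms is to avoid computing the cells below the threshold in the first place. The function name and the choice of minsup = 3 are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, minsup):
    """Keep, for every group-by, only the cells with COUNT(*) >= minsup."""
    result = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in group_by) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= minsup}
            if kept:
                result[tuple(dims[i] for i in group_by)] = kept
    return result


# (product, city, date) for the six sale rows.
rows = [
    ("p1", "c1", 1), ("p2", "c1", 1), ("p1", "c3", 1),
    ("p2", "c2", 1), ("p1", "c1", 2), ("p1", "c2", 2),
]

cube = iceberg_cube(rows, ("product", "city", "date"), minsup=3)
for group_by, cells in cube.items():
    print(group_by, cells)
```

With minsup = 3 only four cells survive: the apex (count 6), product p1 (4), city c1 (3) and date 1 (4). No two- or three-dimensional cell reaches the threshold.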
28
Challenges in MOLAP
Storing large arrays for efficient access Row-major, column major Chunking Compressing sparse arrays
Creating array data from data in tables Efficient techniques for Cube computation
Topics are discussed in the paper for reading