Towards Estimating the Number of Distinct Value
Combinations for a Set of Attributes
Xiaohui Yu¹, Calisto Zuzarte², Ken Sevcik¹
¹University of Toronto, ²IBM Toronto
November 3, 2005 CIKM 2005 2

Distinct value combinations

Country   City        Hotel Name
Germany   Bremen      Hilton
Germany   Bremen      Best Western
Germany   Frankfurt   InterCity
Canada    Toronto     Four Seasons
Canada    Toronto     Intercontinental

The attribute set (Country, City) has 3 distinct value combinations:
1. (Germany, Bremen)
2. (Germany, Frankfurt)
3. (Canada, Toronto)
COLSCARD (COlumn Set CARDinality) = 3
The problem: estimating COLSCARD for a given set of attributes
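
As a point of reference, the exact COLSCARD of the small table above can be computed by scanning the data. The following is a minimal Python sketch; the names `hotels` and `colscard` are illustrative assumptions, not part of the talk. The talk's goal is to estimate this number without such a scan.

```python
# Rows of the example table: (country, city, hotel_name).
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def colscard(rows, column_indexes):
    """Count the distinct value combinations over the chosen columns."""
    return len({tuple(row[i] for i in column_indexes) for row in rows})


print(colscard(hotels, [0, 1]))     # (Country, City) -> 3
print(colscard(hotels, [0, 1, 2]))  # all three columns -> 5
```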

Motivation
- Cardinality estimation for query optimization, e.g.,
  - Estimating the size of $\pi_{country, city}(Hotel)$
  - Estimating the size of the aggregation
      SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
      FROM sales
      GROUP BY sales_date, sales_person
- Approximate query answering, e.g., COUNT queries

Roadmap
- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

Related work
- Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS’95], [HS’98], [CCMN’00]
- A sampling approach is used
  - Estimation through sampling is difficult [CCMN’00]
- No existing statistical information is exploited

Our solution
- Considering multiple attributes
- Utilizing existing statistics on individual attributes
  - Readily available in most database systems
  - Does not require access to the data
- Granularity of statistics
  - Exact marginal frequency distributions
  - Approximate distributions: histograms, etc.

Estimation with known marginals
- Number of distinct values in attribute $A_i$: $d_i$ $(i = 1, 2, \ldots, m)$
- Frequency vector: $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{i d_i})$, with $\sum_{j=1}^{d_i} f_{ij} = 1$

Country   City        Hotel Name
Germany   Bremen      Hilton
Germany   Bremen      Best Western
Germany   Frankfurt   InterCity
Canada    Toronto     Four Seasons
Canada    Toronto     Intercontinental

$\mathbf{f}_1 = (0.6, 0.4)$ (Country)
$\mathbf{f}_2 = (0.4, 0.2, 0.4)$ (City)
$\mathbf{f}_3 = (0.2, 0.2, 0.2, 0.2, 0.2)$ (Hotel Name)
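
The frequency vectors for the example table can be derived by counting each column's values and dividing by the table size. The sketch below assumes Python 3.7+ (so that `Counter` keeps first-seen order); `hotels` and `marginal_frequencies` are hypothetical names used only for illustration.

```python
from collections import Counter

# Rows of the example table: (country, city, hotel_name).
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def marginal_frequencies(rows, column_index):
    """Return the relative frequency of each distinct value in one column,
    in the order the values first appear."""
    counts = Counter(row[column_index] for row in rows)
    total = len(rows)
    return [count / total for count in counts.values()]


for column in range(3):
    print(marginal_frequencies(hotels, column))
```

In a database system these vectors would normally come from existing catalog statistics rather than from a scan of the table.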

The naïve estimator
COLSCARD = $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$
- $\prod_{i=1}^{m} d_i$: number of possible value combinations
- $d_i$: the number of distinct values in attribute $A_i$
- Sanity bound: COLSCARD cannot be greater than the table size $N$
- The problem: some value combinations with low occurrence probabilities may not appear in the table!

Upper/Lower bounds
- Trivial bounds
  - Upper bound: $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$ (the naïve estimator)
  - Lower bound: $\max(d_1, d_2, \ldots, d_m)$
- Tighter bounds?
  - In the case of two attributes, tighter bounds are available.
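
The trivial bounds translate directly into code. This is a small sketch assuming Python 3.8+ for `math.prod`; the function name `trivial_bounds` is an illustrative assumption. It covers only the trivial bounds, not the tighter two-attribute bounds.

```python
from math import prod


def trivial_bounds(distinct_counts, n_tuples):
    """Return (lower, upper) bounds on COLSCARD.

    lower: the largest per-attribute distinct count, max(d_1, ..., d_m)
    upper: the naive estimator, min(d_1 * ... * d_m, N)
    """
    lower = max(distinct_counts)
    upper = min(prod(distinct_counts), n_tuples)
    return lower, upper


# Two attributes with three distinct values each, in a table of 10 tuples.
print(trivial_bounds([3, 3], 10))  # (3, 9)
```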
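
Tighter bounds
Example with two attributes, N = 10

A1: values a, b, c with frequencies 1, 1, 8
A2: values d, e, f with frequencies 4, 4, 2

- Naïve bounds: 3, 9
- The A1 value with frequency 8 pairs with between 2 and 3 distinct A2 values, [2, 3]: no A2 value occurs more than 4 times, so its 8 tuples need at least 2 of them, and A2 has only 3 values
- Each A1 value with frequency 1 pairs with exactly 1 A2 value
- Lower bound = 2+1+1 = 4
- Upper bound = 3+1+1 = 5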

Expected number of combinations
- Assumptions
  1. The data distributions of individual columns are independent
  2. The occurrence of each combination in the table is independent
- $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \cdots \otimes \mathbf{f}_m$
  - Each element of $\mathbf{f}$ represents the frequency of a specific value combination.
  - An estimate of the probability of occurrence

Estimator
- The probability of the i-th combination not appearing in a particular tuple is $(1 - f_i)$
- The probability of the i-th combination not appearing in the table (of size N) is $(1 - f_i)^N$
- The expected number of value combinations is
  $E[\mathrm{COLSCARD}] = M - \sum_{i=1}^{M} (1 - f_i)^N$, where $M = \prod_{j=1}^{m} d_j$

Example revisited
Estimate the COLSCARD for attribute set (A1, A2, A3), given
$\mathbf{f}_1 = (0.1, 0.3, 0.6)$, $\mathbf{f}_2 = (0.01, 0.99)$, $\mathbf{f}_3 = (0.05, 0.95)$, $N = 100$
$\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \mathbf{f}_3 = (0.00005, 0.00095, 0.00495, 0.09405, 0.00015, 0.00285, 0.01485, 0.28215, 0.00030, 0.00570, 0.02970, 0.56430)$
- Naïve estimate: 3*2*2 = 12
- New estimate: 5.94
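
The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch under the two independence assumptions stated earlier: the joint frequency of each combination is taken as the product of the marginal frequencies, and the estimate is M minus the sum of (1 - f_i)^N. The function name `expected_colscard` is an illustrative assumption, and `math.prod` requires Python 3.8+.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """Expected number of distinct value combinations in a table of
    n_tuples rows, given one frequency vector per attribute."""
    # One joint frequency per possible value combination (M entries).
    joint = [prod(combo) for combo in product(*marginals)]
    # Each combination is absent from the table with probability (1 - f)^N.
    expected_missing = sum((1.0 - f) ** n_tuples for f in joint)
    return len(joint) - expected_missing


marginals = [(0.1, 0.3, 0.6), (0.01, 0.99), (0.05, 0.95)]
n_tuples = 100

naive = min(prod(len(f) for f in marginals), n_tuples)
estimate = expected_colscard(marginals, n_tuples)
print(naive, round(estimate, 2))  # 12 5.94
```

Enumerating all M combinations is fine for a small example like this one; for attributes with many distinct values the product M grows quickly, so this direct form is only meant as an illustration.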

Estimation with histograms
- Histograms exist on individual attributes
- Two classes of histograms
  - Partition-based
  - End-biased
- Marginals can be (approximately) reconstructed from histograms
- Optimal histograms in each class?

Optimal histograms
- Minimizing the error incurred by histograms
  - ERR = $|EST_{hist} - EST_{exact}|$
- Partition-based histograms
  - A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.

Optimal end-biased histograms
- An end-biased histogram with B buckets stores
  - The exact frequencies of B-1 attribute values
  - The average of the remaining values
- Which B-1 values to store exactly?
  - Most widely used end-biased histograms store the frequencies of the most frequent values
  - Not always optimal for COLSCARD estimation!

Example
Attributes (A1, A2), N = 10
$\mathbf{f}_1 = (0.1, 0.9)$
Case 1: $\mathbf{f}_2 = (0.01, 0.02, 0.03, 0.94)$
Case 2: $\mathbf{f}_2' = (0.01, 0.29, 0.31, 0.39)$
Choose 1 frequency to store exactly

Error table
Index of the frequency stored   1      2      3      4
$\mathbf{f}_2$ (case 1)         1.68   2.01   2.17   0.15
$\mathbf{f}_2'$ (case 2)        0.01   1.10   1.09   1.02

Optimal end-biased histograms
- Exhaustive search takes time proportional to $\binom{d_j}{B-1}$
- We prove that the optimal choices can be one of the following
  - Most frequent values
  - Least frequent values
  - A combination of most frequent and least frequent values
- Only need to search both ends
  - Cost is linear in B, independent of $d_j$!
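
The "search both ends" idea can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes the error is ERR = |EST_hist - EST_exact| with both estimates computed by the expected-value estimator, that unstored values are all replaced by their average frequency, and that B-1 does not exceed the number of distinct values. All function names are hypothetical, and `math.prod` requires Python 3.8+.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_tuples):
    """M minus the expected number of combinations missing from the table."""
    joint = [prod(combo) for combo in product(*marginals)]
    return len(joint) - sum((1.0 - f) ** n_tuples for f in joint)


def end_biased(freqs, stored):
    """Keep the frequencies at the positions in `stored` exactly and
    replace every other frequency by the average of the unstored ones."""
    rest = [f for i, f in enumerate(freqs) if i not in stored]
    average = sum(rest) / len(rest) if rest else 0.0
    return [f if i in stored else average for i, f in enumerate(freqs)]


def best_end_biased(other_marginals, freqs, n_buckets, n_tuples):
    """Try every split of the B-1 exact slots between the least frequent
    and the most frequent values; return (error, stored positions)."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i])  # ascending
    exact = expected_colscard(other_marginals + [list(freqs)], n_tuples)
    slots = n_buckets - 1
    best = None
    for low in range(slots + 1):  # B candidate splits in total
        high = slots - low
        stored = set(order[:low]) | set(order[len(order) - high:])
        approx = end_biased(freqs, stored)
        estimate = expected_colscard(other_marginals + [approx], n_tuples)
        error = abs(estimate - exact)
        if best is None or error < best[0]:
            best = (error, sorted(stored))
    return best


f1 = [0.1, 0.9]
case1 = [0.01, 0.02, 0.03, 0.94]
case2 = [0.01, 0.29, 0.31, 0.39]
print(best_end_biased([f1], case1, 2, 10)[1])  # [3]: the most frequent value
print(best_end_biased([f1], case2, 2, 10)[1])  # [0]: the least frequent value
```

The loop evaluates B candidates regardless of how many distinct values the attribute has, which is the sense in which the search is linear in B. The sketch only shows which end is chosen for the two example distributions; its absolute error values depend on the assumptions listed above.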

Experiments – Data sets
- Synthetic data
  - Skew: Zipfian parameter z=0 (uniform) to 4 (highly skewed)
  - Number of tuples: 10K to 1M
- Real data
  - Cover Type: 581,012 tuples, 10 attributes
  - Census Income: 32,561 tuples, 14 attributes
- Error measure: ratio error
  - ERR = max{true/est-1, est/true-1}
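
The ratio error is symmetric in over- and under-estimation: an estimate that is half or double the true value gives the same ERR of 1. A minimal sketch (the function name `ratio_error` and the positivity check are assumptions added for illustration):

```python
def ratio_error(true_value, estimate):
    """ERR = max(true/est - 1, est/true - 1); 0 means a perfect estimate."""
    if true_value <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


print(ratio_error(100, 100))  # 0.0
print(ratio_error(100, 50))   # 1.0
print(ratio_error(100, 200))  # 1.0
```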

Effect of data skew
[Bar chart: ERR for four skew settings; N = 100K, d_i = 1K]

Skew setting      Proposed estimator   Naive estimator
z1 = 0, z2 = 0    0.000237             0.0516
z1 = 0, z2 = 2    0.000933             6.5171
z1 = 0, z2 = 4    0.000982             5.9423
z1 = 4, z2 = 4    0.0654               8.4921

Effect of number of tuples
[Figure: ERR versus number of tuples N (1,000 to 1,000,000), with one series each for z=0, z=2, and z=4]
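
Results on real data
[Pie charts: number of attribute pairs falling in each ERR range]
ERR ranges: ERR≤0.05, 0.05<ERR≤0.1, 0.1<ERR≤0.5, 0.5<ERR≤1, ERR>1
(a) Cover Type: 45 pairs; slice counts 31, 4, 3, 5, 2
(b) Census Income: 91 pairs; slice counts 59, 19, 10, 2, 1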

Accuracy of end-biased histograms
[Figure: ERR versus number of buckets (10, 20, 30, 50)]
Results on the “capital-gain” attribute of the Census Income data set

Conclusions
- Utilizing existing knowledge maintained in database systems
- Proposed upper/lower bounds as well as an estimator
- Considered two cases
  - Exact marginal frequencies
  - Histograms: optimal histograms
- Experimental results show the effectiveness of the proposed method