Page 1: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes

Xiaohui Yu 1, Calisto Zuzarte 2, Ken Sevcik 1

1 University of Toronto
2 IBM Toronto Lab

Page 2: Distinct value combinations

November 3, 2005 CIKM 2005

Country   City        Hotel Name
Germany   Bremen      Hilton
Germany   Bremen      Best Western
Germany   Frankfurt   InterCity
Canada    Toronto     Four Seasons
Canada    Toronto     Intercontinental

The attribute set (Country, City) has 3 distinct value combinations:

1. (Germany, Bremen)
2. (Germany, Frankfurt)
3. (Canada, Toronto)

COLSCARD (COlumn Set CARDinality) = 3

The problem: estimating COLSCARD for a given set of attributes

Page 3: Motivation

- Cardinality estimation for query optimization, e.g.,
  - Estimating the size of $\pi_{\mathrm{country},\,\mathrm{city}}(\mathrm{Hotel})$
  - Estimating the size of the aggregation

    SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
    FROM sales
    GROUP BY sales_date, sales_person

- Approximate query answering, e.g., COUNT queries

Page 4: Roadmap

- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

Page 5: Related work

- Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS’95], [HS’98], [CCMN’00]
  - A sampling approach is used.
  - Estimation through sampling is difficult [CCMN’00].
- No existing statistical information is exploited.

Page 6: Our solution

- Considering multiple attributes
- Utilizing existing statistics on individual attributes
  - Readily available in most database systems
  - Does not require access to the data
- Granularity of statistics
  - Exact marginal frequency distributions
  - Approximate distributions: histograms etc.

Page 7: Estimation with known marginals

- Number of distinct values in attribute $A_i$: $d_i$ $(i = 1, 2, \ldots, m)$
- Frequency vector of $A_i$: $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{id_i})$, with $\sum_{j=1}^{d_i} f_{ij} = 1$

Country   City        Hotel Name
Germany   Bremen      Hilton
Germany   Bremen      Best Western
Germany   Frankfurt   InterCity
Canada    Toronto     Four Seasons
Canada    Toronto     Intercontinental

$\mathbf{f}_1 = (0.6, 0.4)$ (Country)
$\mathbf{f}_2 = (0.4, 0.2, 0.4)$ (City)
$\mathbf{f}_3 = (0.2, 0.2, 0.2, 0.2, 0.2)$ (Hotel Name)
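The frequency vectors above can be recomputed directly from the five example rows. The following is a minimal Python sketch; the names `ROWS`, `marginal_frequencies`, and `colscard` are illustrative choices made here and do not come from the slides.

```python
from collections import Counter

# The five example rows from the slide: (Country, City, Hotel Name).
ROWS = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def marginal_frequencies(rows, column):
    """Return (values, f) for one column: its distinct values in
    first-seen order and their relative frequencies."""
    counts = Counter(row[column] for row in rows)  # keeps insertion order
    n = len(rows)
    return list(counts), [c / n for c in counts.values()]


def colscard(rows, columns):
    """Exact number of distinct value combinations over the given columns."""
    return len({tuple(row[c] for c in columns) for row in rows})


for col in range(3):
    values, f = marginal_frequencies(ROWS, col)
    print(f"d_{col + 1} = {len(values)}, f_{col + 1} = {f}")
print("COLSCARD(Country, City) =", colscard(ROWS, (0, 1)))
```

Note that `colscard` scans the data, which is exactly what the estimation approach in the talk tries to avoid; it is included only as ground truth for the toy table.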

Page 8: The naïve estimator

COLSCARD = $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$

- $\prod_{i=1}^{m} d_i$: number of possible value combinations
- $d_i$: the number of distinct values in attribute $A_i$
- Sanity bound $N$: COLSCARD cannot be greater than the table size
- The problem: some value combinations with low occurrence probabilities may not appear in the table!
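As a small illustration of the naïve estimator, here is a Python sketch; the function name `naive_colscard` is an assumption made for this example.

```python
from math import prod


def naive_colscard(distinct_counts, n_rows):
    """Naive estimate: product of the per-attribute distinct counts,
    capped by the table size (the sanity bound)."""
    return min(prod(distinct_counts), n_rows)


# Hotel table from the earlier slide: Country has 2 distinct values,
# City has 3, and the table has 5 rows.
print(naive_colscard([2, 3], 5))
```

For the hotel table the estimate is capped at the table size of 5, while the true COLSCARD of (Country, City) is 3, which illustrates the overestimation the slide points out.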

Page 9: Upper/Lower bounds

- Trivial bounds
  - Upper bound: $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$ (the naïve estimator)
  - Lower bound: $\max(d_1, d_2, \ldots, d_m)$
- Tighter bounds?
  - In the case of two attributes, tighter bounds are available.

Page 10: Tighter bounds

N = 10

A1: value  freq      A2: value  freq
    a      8             d      4
    b      1             e      4
    c      1             f      2

- Naïve bounds: 3, 9
- Value a (frequency 8) must co-occur with between 2 and 3 values of A2: [2, 3]; b and c (frequency 1 each) co-occur with exactly 1 value each
- Lower bound = 2+1+1 = 4
- Upper bound = 3+1+1 = 5
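One way to arrive at the numbers on this slide is a per-value counting argument: a value that occurs x times needs at least as many partner values as it takes for the largest partner counts to cover x tuples, and at most min(x, number of partner values). The Python sketch below implements that reading. It is an assumption that this matches the derivation in the paper, and the name `pairwise_bounds` is hypothetical.

```python
from itertools import accumulate


def pairwise_bounds(counts_a, counts_b):
    """Bounds on the number of distinct (A, B) pairs, given per-value
    tuple counts for A and B (both must sum to the table size)."""
    assert sum(counts_a) == sum(counts_b), "both columns cover the same rows"

    def one_side(xs, ys):
        # prefix[k-1] = total count of the k most frequent partner values
        prefix = list(accumulate(sorted(ys, reverse=True)))
        lower = upper = 0
        for x in xs:
            # Fewest partners: smallest k whose k largest counts cover x tuples.
            lower += next(k for k, total in enumerate(prefix, start=1) if total >= x)
            # Most partners: one per tuple, limited by the number of partner values.
            upper += min(x, len(ys))
        return lower, upper

    lo_a, hi_a = one_side(counts_a, counts_b)
    lo_b, hi_b = one_side(counts_b, counts_a)
    n_rows = sum(counts_a)
    return max(lo_a, lo_b), min(hi_a, hi_b, n_rows)


print(pairwise_bounds([8, 1, 1], [4, 4, 2]))
```

Applying the argument from both attributes and keeping the tighter result is a design choice of this sketch; for the slide's example the A1 side alone already gives 4 and 5.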

Page 11: Expected number of combinations

- Assumptions
  1. The data distributions of individual columns are independent
  2. The occurrence of each combination in the table is independent
- $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \cdots \otimes \mathbf{f}_m$
  - Each element of $\mathbf{f}$ represents the frequency of a specific value combination
  - An estimate of the probability of occurrence

Page 12: Estimator

- The probability of the i-th combination not appearing in a particular tuple is $(1 - f_i)$
- The probability of the i-th combination not appearing in the table (of size N) is $(1 - f_i)^N$
- The expected number of value combinations is

  $E[\mathrm{COLSCARD}] = M - \sum_{i=1}^{M} (1 - f_i)^N$, where $M = \prod_{j=1}^{m} d_j$
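The expectation follows from linearity of expectation over the M possible combinations. The short derivation below spells this out under the two independence assumptions of the talk; the indicator variable $X_i$ is notation introduced here for illustration and does not appear on the slides.

```latex
% X_i = 1 if the i-th value combination appears in at least one of the N tuples
\begin{align*}
\Pr[X_i = 0] &= (1 - f_i)^N \\
E[X_i] &= \Pr[X_i = 1] = 1 - (1 - f_i)^N \\
E[\mathrm{COLSCARD}] &= E\Big[\sum_{i=1}^{M} X_i\Big]
  = \sum_{i=1}^{M} \big(1 - (1 - f_i)^N\big)
  = M - \sum_{i=1}^{M} (1 - f_i)^N
\end{align*}
```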

Page 13: Example revisited

Estimate the COLSCARD for attribute set (A1, A2, A3), given

$\mathbf{f}_1 = (0.1, 0.3, 0.6)$, $\mathbf{f}_2 = (0.01, 0.99)$, $\mathbf{f}_3 = (0.05, 0.95)$, $N = 100$

$\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \mathbf{f}_3 = (0.00005, 0.00095, 0.00495, 0.09405, 0.00015, 0.00285, 0.01485, 0.28215, 0.00030, 0.00570, 0.02970, 0.56430)$

- Naïve estimate: 3*2*2 = 12
- New estimate: 5.94
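The example can be checked numerically. The following Python sketch builds the joint frequency vector as a flattened outer product of the marginals and applies the expected-value formula; the function names `joint_frequencies` and `expected_colscard` are illustrative assumptions, not names from the talk.

```python
from functools import reduce
from math import prod


def joint_frequencies(marginals):
    """Flattened outer product of the marginal frequency vectors,
    i.e. the joint frequencies under the independence assumption."""
    return reduce(lambda acc, f: [a * x for a in acc for x in f], marginals, [1.0])


def expected_colscard(marginals, n_rows):
    """E[COLSCARD] = M - sum_i (1 - f_i)^N over all M combinations."""
    joint = joint_frequencies(marginals)
    return len(joint) - sum((1.0 - p) ** n_rows for p in joint)


f1, f2, f3 = (0.1, 0.3, 0.6), (0.01, 0.99), (0.05, 0.95)
N = 100

estimate = expected_colscard([f1, f2, f3], N)
naive = min(prod(len(f) for f in (f1, f2, f3)), N)
print(f"proposed: {estimate:.2f}, naive: {naive}")
```

The sketch enumerates all M combinations, so it is only practical when the product of the distinct counts is moderate.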

Page 14: Roadmap

- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

Page 15: Estimation with histograms

- Histograms exist on individual attributes
- Two classes of histograms
  - Partition-based
  - End-biased
- Marginals can be (approximately) reconstructed from histograms
- Optimal histograms in each class?

Page 16: Optimal histograms

- Minimizing the error incurred by histograms
  - $\mathrm{ERR} = |\mathrm{EST}_{hist} - \mathrm{EST}_{exact}|$
- Partition-based histograms
  - A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.
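For orientation, the sketch below shows the standard V-optimal dynamic program, which splits a frequency sequence into contiguous buckets so as to minimize the total squared error within buckets. This is the textbook formulation, not the COLSCARD-specific variant from the talk, whose objective is the ERR defined above; the function name `v_optimal_boundaries` is an assumption.

```python
def v_optimal_boundaries(freqs, n_buckets):
    """Split freqs into n_buckets contiguous buckets minimizing the sum of
    squared deviations from each bucket's mean.
    Returns (cost, bucket_start_indices). Requires n_buckets <= len(freqs)."""
    n = len(freqs)
    # Prefix sums of values and squares give O(1) bucket cost.
    s, ss = [0.0], [0.0]
    for x in freqs:
        s.append(s[-1] + x)
        ss.append(ss[-1] + x * x)

    def sse(i, j):
        """Squared error of one bucket covering freqs[i:j]."""
        total = s[j] - s[i]
        return (ss[j] - ss[i]) - total * total / (j - i)

    inf = float("inf")
    # best[b][j]: minimal cost of covering the first j values with b buckets.
    best = [[inf] * (n + 1) for _ in range(n_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(n_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, n_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # the last bucket is freqs[i:j]
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j], cut[b][j] = cost, i

    # Walk the stored cut points backwards to recover bucket starts.
    starts, j = [], n
    for b in range(n_buckets, 0, -1):
        j = cut[b][j]
        starts.append(j)
    return best[n_buckets][n], starts[::-1]


print(v_optimal_boundaries([1, 1, 1, 10, 10, 10], 2))
```

The triple loop costs O(B d^2) for d values and B buckets; adapting it to the talk's objective would mean replacing `sse` with a bucket cost derived from the COLSCARD estimate.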

Page 17: Optimal end-biased histograms

- An end-biased histogram with B buckets stores
  - The exact frequencies of B-1 attribute values
  - The average frequency of the remaining values
- Which B-1 values to store exactly?
  - The most widely used end-biased histograms store the frequencies of the most frequent values
  - Not always optimal for COLSCARD estimation!!
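To make the reconstruction step concrete, here is a minimal Python sketch of how an approximate marginal could be rebuilt from an end-biased histogram. It assumes the remaining values are known by identity and simply share their average frequency; the names `end_biased_marginal` and `most_frequent_indices` are hypothetical.

```python
def end_biased_marginal(freqs, exact_indices):
    """Approximate marginal: frequencies at exact_indices are kept,
    all other values get the average of the remaining frequencies."""
    exact = set(exact_indices)
    rest = [f for i, f in enumerate(freqs) if i not in exact]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if i in exact else avg for i, f in enumerate(freqs)]


def most_frequent_indices(freqs, k):
    """Indices of the k largest frequencies (the conventional choice)."""
    return sorted(range(len(freqs)), key=lambda i: freqs[i], reverse=True)[:k]


f = [0.01, 0.02, 0.03, 0.94]
B = 2  # one exact frequency plus one bucket for the rest
approx = end_biased_marginal(f, most_frequent_indices(f, B - 1))
print(approx)
```

Because the averaged bucket preserves the total mass of the values it replaces, the reconstructed vector still sums to 1.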

Page 18: Example

Attributes (A1, A2), N = 10

$\mathbf{f}_1 = (0.1, 0.9)$
Case 1: $\mathbf{f}_2 = (0.01, 0.02, 0.03, 0.94)$
Case 2: $\mathbf{f}_2' = (0.01, 0.29, 0.31, 0.39)$

Choose 1 frequency to store exactly.

Error table

Index of the frequency stored   1      2      3      4
f_2                             1.68   2.01   2.17   0.15
f_2'                            0.01   1.10   1.09   1.02

Page 19: Optimal end-biased histograms

- Exhaustive search takes time proportional to $C_{d_j}^{B-1}$
- We prove that the optimal choice is one of the following
  - Most frequent values
  - Least frequent values
  - A combination of most frequent and least frequent values
- Only need to search both ends
  - Cost is linear in B, independent of $d_j$!
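The "search both ends" idea can be sketched as follows: with B-1 exact slots, try every split into k most frequent and B-1-k least frequent values, which gives only B candidates. The Python sketch below does this for two attributes, using the expected-value estimator from the earlier slides to score each candidate. How the non-stored values are averaged is an assumption made here, so the absolute errors it produces need not match the error table of the preceding example; all function names are hypothetical.

```python
def expected_colscard(marginals, n_rows):
    """E[COLSCARD] = M - sum_i (1 - f_i)^N under independence."""
    joint = [1.0]
    for f in marginals:
        joint = [p * x for p in joint for x in f]
    return len(joint) - sum((1.0 - p) ** n_rows for p in joint)


def end_biased(freqs, exact_indices):
    """Keep the frequencies at exact_indices, average the rest."""
    exact = set(exact_indices)
    rest = [f for i, f in enumerate(freqs) if i not in exact]
    avg = sum(rest) / len(rest) if rest else 0.0
    return [f if i in exact else avg for i, f in enumerate(freqs)]


def histogram_error(f_other, f_target, exact_indices, n_rows):
    """ERR = |EST_hist - EST_exact| when only f_target is approximated."""
    est_exact = expected_colscard([f_other, f_target], n_rows)
    est_hist = expected_colscard([f_other, end_biased(f_target, exact_indices)], n_rows)
    return abs(est_hist - est_exact)


def both_ends_search(f_other, f_target, n_buckets, n_rows):
    """Try the B candidates made of `top` most frequent plus
    (B-1-top) least frequent values; return the best index set."""
    order = sorted(range(len(f_target)), key=lambda i: f_target[i], reverse=True)
    k = n_buckets - 1  # number of exactly stored frequencies (assumed <= len(f_target))
    candidates = [order[:top] + order[len(order) - (k - top):] for top in range(k + 1)]
    return min(candidates, key=lambda c: histogram_error(f_other, f_target, c, n_rows))


f1 = [0.1, 0.9]
case1 = [0.01, 0.02, 0.03, 0.94]
case2 = [0.01, 0.29, 0.31, 0.39]
print(both_ends_search(f1, case1, n_buckets=2, n_rows=10))
print(both_ends_search(f1, case2, n_buckets=2, n_rows=10))
```

Under these assumptions the search picks the most frequent value for the first distribution and the least frequent value for the second, which agrees qualitatively with the point made in the example.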

Page 20: Roadmap

- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions

Page 21: Experiments – Data sets

- Synthetic data
  - Skew: Zipfian parameter z=0 (uniform) to 4 (highly skewed)
  - Number of tuples: 10K to 1M
- Real data
  - Cover Type: 581,012 tuples, 10 attributes
  - Census Income: 32,561 tuples, 14 attributes
- Error measure: ratio error ERR = max{true/est-1, est/true-1}
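The ratio error is symmetric in over- and underestimation: an estimate that is off by a factor of two in either direction scores 1. A minimal Python sketch follows, with `ratio_error` as an assumed function name.

```python
def ratio_error(true_value, estimate):
    """ERR = max{true/est - 1, est/true - 1}; both arguments must be positive."""
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


print(ratio_error(100, 100))  # perfect estimate
print(ratio_error(100, 50))   # underestimate by a factor of 2
print(ratio_error(100, 200))  # overestimate by a factor of 2
```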

Page 22: Effect of data skew

[Figure: ERR (axis 0 to 9) of the proposed and naive estimators for four skew settings; N = 100K, d_i = 1K]

Skew                 z1=0, z2=0   z1=0, z2=2   z1=0, z2=4   z1=4, z2=4
Proposed estimator   0.000237     0.000933     0.000982     0.0654
Naive estimator      0.0516       6.5171       5.9423       8.4921

Page 23: Effect of number of tuples

[Figure: ERR (axis 0 to 0.14) versus number of tuples N (1000, 10000, 100000, 1000000), one series each for z=0, z=2, z=4]

Page 24: Results on real data

[Figure: two charts of attribute pairs grouped by ERR range; legend: ERR≤0.05, 0.05<ERR≤0.1, 0.1<ERR≤0.5, 0.5<ERR≤1, ERR>1]

(a) Cover Type: 45 pairs; segment counts 31, 4, 3, 5, 2
(b) Census Income: 91 pairs; segment counts 59, 19, 10, 2, 1

Page 25: Accuracy of end-biased histograms

[Figure: ERR (axis 0 to 0.3) versus number of buckets (10, 20, 30, 50)]

Results on the “capital-gain” attribute of Census Income data set

Page 26: Conclusions

- Utilizing existing knowledge maintained in database systems
- Proposed upper/lower bounds as well as an estimator
- Considered two cases
  - Exact marginal frequencies
  - Histograms: optimal histograms
- Experimental results show the effectiveness of the proposed method

