
BITMAPS & Starjoins

Indexing datacubes

Objective: speed queries up.

Traditional databases (OLTP): B-Trees

• Time and space logarithmic in the number of indexed keys.

• Dynamic, stable and exhibit good performance under updates. (But OLAP is not about updates….)

Bitmaps:

• Space efficient

• Difficult to update (but we don’t care in DW).

• Can effectively prune searches before looking at data.

Bitmaps
R = (…, A, …, M)

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     0  0  0  0  0  1  0  0  0
 2     0  0  0  0  0  0  1  0  0
 1     0  0  0  0  0  0  0  1  0
 2     0  0  0  0  0  0  1  0  0
 8     1  0  0  0  0  0  0  0  0
 2     0  0  0  0  0  0  1  0  0
 2     0  0  0  0  0  0  1  0  0
 0     0  0  0  0  0  0  0  0  1
 7     0  1  0  0  0  0  0  0  0
 5     0  0  0  1  0  0  0  0  0
 6     0  0  1  0  0  0  0  0  0
 4     0  0  0  0  1  0  0  0  0
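As a concrete illustration, here is a minimal Python sketch (assumed code, not from the slides) of how such an equality-encoded value-list bitmap index could be built for the A column above; build_bitmap_index is an illustrative name, and each bitmap is packed into a Python integer.

# Minimal sketch: build an equality-encoded (value-list) bitmap index for a column.
# One bitmap per distinct value; bit i is set iff row i holds that value.

def build_bitmap_index(column, cardinality):
    """Return {value: bitmap} with bitmaps packed into Python ints (bit i = row i)."""
    index = {v: 0 for v in range(cardinality)}
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index

# The A column from the slide, in row order.
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
bitmaps = build_bitmap_index(A, cardinality=9)

# B3 has a single bit set, for row 0 (the only row with A = 3).
print(format(bitmaps[3], '012b'))   # -> 000000000001 (row 0 is the least significant bit)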

Query optimization

Consider a high-selectivity-factor query with predicates on two attributes.

Query optimizer: builds plans

(P1) Full relation scan (filter as you go).

(P2) Index scan on the predicate with the lower selectivity factor, followed by a temporary relation scan to filter out non-qualifying tuples using the other predicate. (Works well if the data is clustered on the first index key.)

(P3) Index scan for each predicate (separately), followed by a merge of the RID lists.

Query optimization (continued)

[Figure: plan (P2) scans the index on Pred1, fetches the corresponding blocks of data (tuples t1 … tn), and filters them with Pred2 to produce the answer; plan (P3) scans the index on Pred1 and the index on Pred2, producing Tuple list1 and Tuple list2, which are merged into a single list.]

Query optimization (continued)

When using bitmap indexes (P3) can be an easy winner!

CPU operations in bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps

(With B-trees, you would have to scan the two RID lists and select the tuples that appear in both: a merge operation.)

Of course, you can build a B-tree on the compound key, but we would need one for every compound predicate (exponential number of trees…).

Bitmaps and predicates

A = a1 AND B = b2

Bitmap for a1 AND Bitmap for b2 = Bitmap for (a1 and b2)
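Continuing the hypothetical sketch above, the compound predicate becomes a single bitwise AND (the bitmap values here are made up):

# Evaluating "A = a1 AND B = b2" with bitmap indexes (continuing the sketch above):
# a single bitwise AND of the two bitmaps yields the bitmap of qualifying rows.

bitmap_a1 = 0b0011_0101_0010          # rows where A = a1 (illustrative values)
bitmap_b2 = 0b0010_0100_0110          # rows where B = b2
result    = bitmap_a1 & bitmap_b2     # rows satisfying both predicates

matching_rows = [i for i in range(result.bit_length()) if result >> i & 1]
print(matching_rows)                  # -> [1, 6, 9] for these illustrative bitmaps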

Tradeoffs

Small dimension cardinality: dense bitmaps.

Large dimension cardinality: sparse bitmaps, so compression helps (at the cost of decompression before the bitmaps are used).

[Figure: one bitmap per product value; with many products each bitmap is sparse and is stored compressed.]

Query strategy for star joins

Maintain join indexes between the fact table and the dimension tables.

[Figure: fact table with columns Product, Type, Location and a Product dimension table; one bitmap per dimension value, e.g. a bitmap for each product type a … k and a bitmap for each location.]

Strategy example

Aggregate all sales for products in any of several given locations: OR the bitmap of each qualifying location; the result is the bitmap for the predicate.

Bitmap for first location OR Bitmap for second location OR Bitmap for third location = Bitmap for the predicate

Star-Joins

Select F.S, D1.A1, D2.A2, …, Dn.An
from F, D1, D2, …, Dn
where F.A1 = D1.A1 and F.A2 = D2.A2 … and F.An = Dn.An
and D1.B1 = 'c1' and D2.B2 = 'p2' …

Likely strategy:

For each Di, find the values of Ai such that Di.Bi = 'xi' (unless you have a bitmap index for Bi). Use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps).

At this stage you have n such bitmaps; the result can be found by AND-ing them.
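A minimal sketch of this strategy (assumed code; the dimension values and bitmaps are made up), OR-ing within each dimension and AND-ing across dimensions:

# Sketch of the star-join strategy above (hypothetical data; bitmaps are Python ints).
# Step 1: per dimension Di, OR the fact-table bitmaps of the qualifying Ai values.
# Step 2: AND the n per-dimension bitmaps to get the qualifying fact rows.

from functools import reduce

def rows_matching(dim_bitmaps, qualifying_values):
    """OR together the bitmaps of the dimension values that satisfy Di.Bi = 'xi'."""
    return reduce(lambda acc, v: acc | dim_bitmaps[v], qualifying_values, 0)

# bitmap index on the fact table's foreign keys, one dict per dimension (toy values)
product_bitmaps  = {'p1': 0b0110, 'p2': 0b1001}
location_bitmaps = {'l1': 0b0011, 'l2': 0b1100}

per_dimension = [
    rows_matching(product_bitmaps,  ['p1', 'p2']),   # products satisfying the first predicate
    rows_matching(location_bitmaps, ['l1']),         # locations satisfying the second predicate
]
answer = reduce(lambda a, b: a & b, per_dimension)   # fact rows in the star-join result
print(format(answer, '04b'))                         # -> 0011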

Example

Selectivity per predicate = 0.01 (predicates on the dimension tables); n statistically independent predicates; total selectivity = 10^(-2n).

Fact table = 10^8 rows, n = 3, so tuples in the answer = 10^8 × 10^-6 = 100 rows. In the worst case that is 100 blocks… Still better than reading all the blocks of the relation (e.g., assuming 100 tuples/block, that would be 10^6 blocks!).

Design Space of Bitmap Indexes

The basic bitmap design is called the Value-List index. Its focus is on the columns. If we change the focus to the rows, the index becomes, for each tuple (row), a set of attribute values (integers) that can be represented in a particular way.

5 → 0 0 0 1 0 0 0 0 0

We can encode this row in many ways...

Attribute value decomposition

C = attribute cardinality. Consider a value v of the attribute and a sequence of base numbers <b_{n-1}, b_{n-2}, …, b_1>; define b_n = ⌈C / (b_{n-1} × … × b_1)⌉. Then v can be decomposed into a sequence of n digits <v_n, v_{n-1}, …, v_1> as follows:

v = V_1
  = V_2 b_1 + v_1
  = V_3 (b_2 b_1) + v_2 b_1 + v_1
  …
  = v_n (b_{n-1} × … × b_1) + … + v_i (b_{i-1} × … × b_1) + … + v_2 b_1 + v_1

where v_i = V_i mod b_i and V_i = ⌊V_{i-1} / b_{i-1}⌋, with V_1 = v.

<10,10,10> (the decimal system!)

576 = 5 × 10 × 10 + 7 × 10 + 6

576 / 100 = 5, remainder 76
76 / 10 = 7, remainder 6
6

Number systems

How do you write 576 in:

<2,2,2,2,2,2,2,2,2>

576 = 1×2^9 + 0×2^8 + 0×2^7 + 1×2^6 + 0×2^5 + 0×2^4 + 0×2^3 + 0×2^2 + 0×2^1 + 0×2^0

576/2^9 = 1, rem 64;  64/2^8 = 0, rem 64;  64/2^7 = 0, rem 64;  64/2^6 = 1, rem 0;
0/2^5 = 0;  0/2^4 = 0;  0/2^3 = 0;  0/2^2 = 0;  0/2^1 = 0;  0/2^0 = 0

<7,7,5,3>

576/(7×7×5×3) = 576/735 = 0, rem 576;  576/(7×5×3) = 576/105 = 5, rem 51

576 = 5 × (7×5×3) + 51

51/(5×3) = 51/15 = 3, rem 6

576 = 5 × (7×5×3) + 3 × (5×3) + 6

6/3 = 2, rem 0

576 = 5 × (7×5×3) + 3 × (5×3) + 2 × 3 + 0
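The decomposition can be sketched in a few lines (assumed code; decompose is an illustrative name):

# Sketch of the attribute-value decomposition above: repeatedly divide by the
# products of the remaining base digits (bases listed high-to-low, e.g. <7,7,5,3>).

def decompose(v, bases):
    """Return the digits <v_n, ..., v_1> of v in the (possibly mixed-radix) base sequence."""
    digits = []
    for i in range(len(bases)):
        weight = 1
        for b in bases[i + 1:]:
            weight *= b                 # product of the lower-order bases
        digits.append(v // weight)      # current digit
        v %= weight                     # remainder carried to the next position
    return digits

print(decompose(576, [10, 10, 10]))     # -> [5, 7, 6]
print(decompose(576, [2] * 10))         # -> [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(decompose(576, [7, 7, 5, 3]))     # -> [5, 3, 2, 0], i.e. 5*(7*5*3) + 3*(5*3) + 2*3 + 0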

Bitmaps: R = (…, A, …, M), value-list index

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     0  0  0  0  0  1  0  0  0
 2     0  0  0  0  0  0  1  0  0
 1     0  0  0  0  0  0  0  1  0
 2     0  0  0  0  0  0  1  0  0
 8     1  0  0  0  0  0  0  0  0
 2     0  0  0  0  0  0  1  0  0
 2     0  0  0  0  0  0  1  0  0
 0     0  0  0  0  0  0  0  0  1
 7     0  1  0  0  0  0  0  0  0
 5     0  0  0  1  0  0  0  0  0
 6     0  0  1  0  0  0  0  0  0
 4     0  0  0  0  1  0  0  0  0

Example: sequence <3,3>, value-list index (equality)

R(A)          B2^2 B1^2 B0^2   B2^1 B1^1 B0^1
 3 (=1×3+0)    0    1    0      0    0    1
 2 (=0×3+2)    0    0    1      1    0    0
 1 (=0×3+1)    0    0    1      0    1    0
 2             0    0    1      1    0    0
 8 (=2×3+2)    1    0    0      1    0    0
 2             0    0    1      1    0    0
 2             0    0    1      1    0    0
 0 (=0×3+0)    0    0    1      0    0    1
 7 (=2×3+1)    1    0    0      0    1    0
 5 (=1×3+2)    0    1    0      1    0    0
 6 (=2×3+0)    1    0    0      0    0    1
 4 (=1×3+1)    0    1    0      0    1    0

Encoding scheme

Equality encoding: all bits set to 0 except the one that corresponds to the value.

Range encoding: the v_i rightmost bits set to 0, the remaining bits set to 1.
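A small sketch (assumed code) contrasting the two encodings for a single component of base b; the output matches the base-9 rows for value 3 shown above and below:

# Sketch: equality vs. range encoding of one component digit v (0 <= v < b).
# Bits are listed high-to-low, i.e. [B_{b-1}, ..., B_1, B_0].

def equality_bits(v, b):
    """All bits 0 except the one corresponding to the value."""
    return [1 if j == v else 0 for j in range(b - 1, -1, -1)]

def range_bits(v, b):
    """The v rightmost bits 0, the remaining bits 1 (i.e. B_j = 1 iff j >= v)."""
    return [1 if j >= v else 0 for j in range(b - 1, -1, -1)]

print(equality_bits(3, 9))  # -> [0, 0, 0, 0, 0, 1, 0, 0, 0]
print(range_bits(3, 9))     # -> [1, 1, 1, 1, 1, 1, 0, 0, 0]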

Range encoding: single component, base 9

R(A)  B8 B7 B6 B5 B4 B3 B2 B1 B0
 3     1  1  1  1  1  1  0  0  0
 2     1  1  1  1  1  1  1  0  0
 1     1  1  1  1  1  1  1  1  0
 8     1  0  0  0  0  0  0  0  0
 0     1  1  1  1  1  1  1  1  1
 7     1  1  0  0  0  0  0  0  0
 5     1  1  1  1  0  0  0  0  0
 6     1  1  1  0  0  0  0  0  0
 4     1  1  1  1  1  0  0  0  0

Example (revisited): sequence <3,3>, value-list index (equality)

R(A)          B2^2 B1^2 B0^2   B2^1 B1^1 B0^1
 3 (=1×3+0)    0    1    0      0    0    1
 2             0    0    1      1    0    0
 1             0    0    1      0    1    0
 2             0    0    1      1    0    0
 8             1    0    0      1    0    0
 2             0    0    1      1    0    0
 2             0    0    1      1    0    0
 0             0    0    1      0    0    1
 7             1    0    0      0    1    0
 5             0    1    0      1    0    0
 6             1    0    0      0    0    1
 4             0    1    0      0    1    0

Example: sequence <3,3>, range-encoded index

R(A)  B1^2 B0^2   B1^1 B0^1
 3     1    0      1    1
 2     1    1      0    0
 1     1    1      1    0
 2     1    1      0    0
 8     0    0      0    0
 2     1    1      0    0
 2     1    1      0    0
 0     1    1      1    1
 7     0    0      1    0
 5     1    0      0    0
 6     0    0      1    1
 4     1    0      1    0

Design Space

[Diagram: the number of components ranges from 1 (the Value-List index, single base b = C) down to log2 C (the Bit-Sliced index, base <2, 2, …, 2>), with intermediate choices such as <b2, b1> in between; each point of the space can use either equality or range encoding.]

RangeEval

Evaluates each range predicate by computing two bitmaps: BEQ bitmap and either BGT or BLT

RangeEval-Opt uses only <=

A < v is the same as A <= v-1

A > v is the same as Not( A <= v)

A >= v is the same as Not (A <= v-1)
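A small sketch (assumed code) of these rewrites; leq_bitmap is an assumed helper that returns the bitmap of rows with A <= v, which a range-encoded index provides cheaply:

# Sketch of the RangeEval-Opt rewrites above: every comparison is reduced to "<=",
# assuming a helper leq_bitmap(v) that returns the bitmap of rows with A <= v.

ALL_ROWS = (1 << 12) - 1                # mask for a 12-row relation (illustrative)

def evaluate(op, v, leq_bitmap):
    if op == '<=':
        return leq_bitmap(v)
    if op == '<':                       # A < v   ==  A <= v-1
        return leq_bitmap(v - 1)
    if op == '>':                       # A > v   ==  NOT (A <= v)
        return ~leq_bitmap(v) & ALL_ROWS
    if op == '>=':                      # A >= v  ==  NOT (A <= v-1)
        return ~leq_bitmap(v - 1) & ALL_ROWS
    raise ValueError(op)

# toy leq_bitmap built directly from the A column of the earlier slides
A = [3, 2, 1, 2, 8, 2, 2, 0, 7, 5, 6, 4]
leq = lambda v: sum(1 << i for i, a in enumerate(A) if a <= v)
print(format(evaluate('>', 4, leq), '012b'))   # rows with A > 4: rows 4, 8, 9, 10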


Classification vs. Prediction

• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

• Prediction:
– models continuous-valued functions, i.e., predicts unknown or missing values

• Typical applications:
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis

• Pros:
– Fast.
– Rules are easy to interpret.
– Handles high-dimensional data.

• Cons:
– No correlations between attributes.
– Axis-parallel cuts only.

• Supervised learning (classification)
– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set

• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution

• Decision tree generation consists of two phases
– Tree construction
  • At start, all the training examples are at the root
  • Partition examples recursively based on selected attributes
– Tree pruning
  • Identify and remove branches that reflect noise or outliers

• Use of decision tree: classifying an unknown sample
– Test the attribute values of the sample against the decision tree

Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down, recursive, divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
– There are no samples left
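The loop above can be sketched compactly. Below is a minimal, hypothetical Python sketch of the greedy top-down induction just described; the split-selection heuristic is passed in as a parameter (e.g., the information gain defined on the following slides), and all names are illustrative.

# A minimal sketch (not from the slides) of greedy top-down decision tree induction.
# Rows are dicts; choose_split is the selection heuristic (e.g., information gain).

from collections import Counter

def build_tree(rows, attributes, target, choose_split, default=None):
    if not rows:                                     # no samples left
        return ('leaf', default)
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes or len(set(labels)) == 1:      # pure node or nothing left to split on
        return ('leaf', majority)
    attr = choose_split(rows, attributes, target)    # heuristic picks the test attribute
    children = {
        value: build_tree([r for r in rows if r[attr] == value],
                          [a for a in attributes if a != attr],
                          target, choose_split, default=majority)
        for value in set(r[attr] for r in rows)      # categorical split: one branch per value
    }
    return ('node', attr, children)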

Decision tree algorithms

• Building phase:
– Recursively split nodes using the best splitting attribute and value for the node

• Pruning phase:
– A smaller (yet imperfect) tree achieves better prediction accuracy.
– Prune leaf nodes recursively to avoid over-fitting.

DATA TYPES

• Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.)
• Categorical: takes values from a finite set with no natural ordering. (E.g., color.)
• Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.)

Some probability...

S = a set of cases
freq(Ci, S) = number of cases in S that belong to class Ci
Prob("this case belongs to Ci") = freq(Ci, S) / |S|
Information conveyed: -log(freq(Ci, S) / |S|)
Entropy = expected information = - Σi (freq(Ci, S) / |S|) × log(freq(Ci, S) / |S|) = info(S)

Gain is an entropy-based measure defined from info(S):

GAIN

Test X:

info_X(T) = Σi (|Ti| / |T|) × info(Ti)

gain(X) = info(T) - info_X(T)

PROBLEM: What is the best predictor to segment on: windy or the outlook?
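As an illustration (assumed code, not from the slides), the measures can be computed as follows; rows are dicts keyed by attribute name, and these functions could back the choose_split parameter of the earlier induction sketch.

# Sketch of the measures above: info(S), info_X(T) and gain(X).

from collections import Counter
from math import log2

def info(rows, target):
    """Entropy of the class distribution in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_x(rows, attr, target):
    """Expected information after partitioning rows on attribute attr."""
    n = len(rows)
    total = 0.0
    for value in set(r[attr] for r in rows):
        part = [r for r in rows if r[attr] == value]
        total += len(part) / n * info(part, target)
    return total

def gain(rows, attr, target):
    return info(rows, target) - info_x(rows, attr, target)

# e.g. compare gain(rows, 'Windy', 'Class') with gain(rows, 'Outlook', 'Class')
# to answer "windy or the outlook?" for a given training set.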

Problem with Gain

Strong bias towards tests with many outcomes.

Example: Z = Name

|Ti| = 1 (each name unique)

info_Z(T) = Σi (1/|T|) × info(Ti) = 0, since each Ti contains a single case and so info(Ti) = 0.

Maximal gain!! (but a useless division: overfitting)

Split

split-info(X) = - Σi (|Ti| / |T|) × log(|Ti| / |T|)

gain-ratio(X) = gain(X) / split-info(X)

gain(X) <= log(k) (k = number of classes), while split-info(X) can grow up to log(n) (n = number of outcomes of the test), so for tests like Name the gain ratio becomes small.
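A short continuation of the same hypothetical sketch, adding split-info and gain-ratio (it assumes the gain and info_x functions defined above):

# Continuation of the previous sketch: split-info and gain-ratio.

from math import log2

def split_info(rows, attr):
    n = len(rows)
    sizes = {}
    for r in rows:
        sizes[r[attr]] = sizes.get(r[attr], 0) + 1
    return -sum(s / n * log2(s / n) for s in sizes.values())

def gain_ratio(rows, attr, target):
    si = split_info(rows, attr)
    return gain(rows, attr, target) / si if si > 0 else 0.0   # gain() from the sketch above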

• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Results in poor accuracy for unseen samples

• Two approaches to avoid overfitting
– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
– Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
  • Use a set of data different from the training data to decide which is the "best pruned tree"

• Approaches to determine the final tree size
– Separate training (2/3) and testing (1/3) sets
– Use cross validation, e.g., 10-fold cross validation
– Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
– Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

Gini Index (IBM IntelligentMiner)

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - Σj pj^2

where pj is the relative frequency of class j in T.

• If the data set T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

• The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points need to be enumerated for each attribute).

Training set

Age  Car Type  Risk   (tuple)
23   Family    H       0
17   Sports    H       1
43   Sports    H       2
68   Family    L       3
32   Truck     L       4
20   Family    H       5

Attribute lists (one sorted list per attribute)

Age  Risk  Tuple        Car Type  Risk  Tuple
17   H     1            Family    H     0
20   H     5            Sports    H     1
23   H     0            Sports    H     2
32   L     4            Family    L     3
43   H     2            Truck     L     4
68   L     3            Family    H     5

Problem: What is the best way to determine risk? Is it Age or Car Type?
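As a worked illustration (assumed code; the split point is just the one drawn on the Splits slide, not an assertion about which split wins), gini and gini_split can be computed for a candidate split of the Age list above:

# Sketch: gini index and gini_split for a candidate split of the Age list above.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    freq = {}
    for c in labels:
        freq[c] = freq.get(c, 0) + 1
    return 1.0 - sum((cnt / n) ** 2 for cnt in freq.values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

age_list = [(17, 'H'), (20, 'H'), (23, 'H'), (32, 'L'), (43, 'H'), (68, 'L')]

# candidate split Age < 27.5 (the one illustrated on the next slide)
left  = [risk for age, risk in age_list if age < 27.5]
right = [risk for age, risk in age_list if age >= 27.5]
print(gini_split(left, right))      # -> 0.2222...  (3 H's on the left; 1 H, 2 L's on the right)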

Splits

The split Age < 27.5 partitions both attribute lists:

Group 1 (Age < 27.5)            Group 2 (Age >= 27.5)

Age  Risk  Tuple                Age  Risk  Tuple
17   H     1                    32   L     4
20   H     5                    43   H     2
23   H     0                    68   L     3

Car Type  Risk  Tuple           Car Type  Risk  Tuple
Family    H     0               Sports    H     2
Sports    H     1               Family    L     3
Family    H     5               Truck     L     4

Histograms

For continuous attributes, a pair of class histograms (C_above, C_below) is associated with the node: C_below counts the class distribution of the tuples already processed and C_above that of the tuples still to be processed while scanning the sorted attribute list.

ANSWER

The winner is Age <= 18.5

Splitting on Age <= 18.5:

Y branch → leaf H (tuple 1, Age 17)

N branch → attribute lists for the remaining tuples:

Age  Risk  Tuple        Car Type  Risk  Tuple
20   H     5            Family    H     0
23   H     0            Sports    H     2
32   L     4            Family    L     3
43   H     2            Truck     L     4
68   L     3            Family    H     5

Summary

• Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)

• Classification is probably one of the most widely used data mining techniques, with a lot of extensions

• Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic

• Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

Association rules: Apriori paper – student-plays-basketball example

Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Summary

Association Rules

• Market basket data: your ``supermarket'' basket contains {bread, milk, beer, diapers, …}

• Find rules that correlate the presence of one set of items X with another set Y.
– Ex: X = diapers, Y = beer; X => Y with confidence 98%
– Maybe constrained: e.g., consider only female customers.

Applications

• Market basket analysis: tell me how I can improve my sales by attaching promotions to “best seller” itemsets.

• Marketing: “people who bought this book also bought…”

• Fraud detection: a claim for immunizations always comes with a claim for a doctor’s visit on the same day.

• Shelf planning: given the “best sellers,” how do I organize my shelves?

Association Rule: Basic Concepts

• Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)

• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive services done

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (based on the types of values handled)

– buys(x, “SQLServer”) ^ buys(x, “DMBook”) => buys(x, “DBMiner”) [0.2%, 60%]

– age(x, “30..39”) ^ income(x, “42..48K”) => buys(x, “PC”) [1%, 75%]

• Single-dimensional vs. multi-dimensional associations (see the examples above)

Road-map (continuation)

• Single-level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?

• Various extensions
– Correlation, causality analysis
  • Association does not necessarily imply correlation or causality
  Causality: Does Beer => Diapers or Diapers => Beer? (I.e., did the customer buy the diapers because he bought the beer, or was it the other way around?)
  Correlation: 90% buy coffee, 25% buy tea, 20% buy both: the support is less than the expected support = 0.9 × 0.25 = 0.225
– Maxpatterns and closed itemsets
– Constraints enforced
  • E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Summary

Mining Association Rules—An Example

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
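A tiny sketch (assumed code) that reproduces these numbers from the four transactions:

# Sketch: support and confidence for A => C over the four transactions above.

transactions = [
    {'A', 'B', 'C'},    # 2000
    {'A', 'C'},         # 1000
    {'A', 'D'},         # 4000
    {'B', 'E', 'F'},    # 5000
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({'A', 'C'}))          # -> 0.5        (50%)
print(confidence({'A'}, {'C'}))     # -> 0.666...   (66.6%)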

Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

• Use the frequent itemsets to generate association rules.

Problem decomposition

Two phases:

• Generate all itemsets whose support is above a threshold. Call them large (or hot) itemsets. (Any other itemset is small.)

How? Generate all combinations? (exponential!) (HARD.)

• For a given large itemset

Y = {I1, I2, …, Ik}, k >= 2

generate (at most k) rules X => Ij, where X = Y - {Ij}.

confidence of X => Ij = support(Y) / support(X)

So, given a confidence threshold c, decide which rules you keep. (EASY.)
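A minimal sketch of this second phase (assumed code; supports stands for the itemset supports produced by the first phase):

# Sketch of the rule-generation phase: for a large itemset Y, emit X => Ij
# (X = Y - {Ij}) and keep the rules whose confidence reaches the threshold c.

def rules_from_itemset(Y, supports, c):
    rules = []
    for item in Y:
        X = Y - {item}
        conf = supports[Y] / supports[X]        # confidence of X => {item}
        if conf >= c:
            rules.append((X, item, conf))
    return rules

supports = {frozenset({'a', 'c'}): 0.5, frozenset({'a'}): 0.75, frozenset({'c'}): 0.5}
print(rules_from_itemset(frozenset({'a', 'c'}), supports, c=0.8))
# -> [(frozenset({'c'}), 'a', 1.0)], i.e. only c => a survives at c = 80%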

Examples

Tid   Items
1     {a, b, c}
2     {a, b, d}
3     {a, c}
4     {b, e, f}

Minimum support 50%: the frequent 2-itemsets are {a, b} and {a, c}.

Rules:
a => b with support 50% and confidence 66.6%
a => c with support 50% and confidence 66.6%
c => a with support 50% and confidence 100%
b => a with support 50% and confidence 66.6%

Assume s = 50% and c = 80%: only c => a qualifies.

The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1 with itself

• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
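A compact, assumed Python sketch of this pseudo-code (level-wise join, subset pruning, and one counting scan per level); it is a sketch, not a reference implementation:

# Compact sketch of the Apriori pseudo-code above (counts via a scan per level).

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_support}   # L1
    frequent = set(L)
    k = 1
    while L:
        # join step + prune step
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in transactions) for c in C}   # scan the database
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent |= L
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(D, 0.5))))
# -> [[1], [1, 3], [2], [2, 3], [2, 3, 5], [2, 5], [3], [3, 5], [5]]  (matches the example below)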

The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:                      L1:
itemset  sup.                     itemset  sup.
{1}      2                        {1}      2
{2}      3                        {2}      3
{3}      3                        {3}      3
{4}      1                        {5}      3
{5}      3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:          L2:
itemset  sup                      itemset  sup
{1 2}    1                        {1 3}    2
{1 3}    2                        {2 3}    2
{1 5}    1                        {2 5}    3
{2 3}    2                        {3 5}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}  →  Scan D: {2 3 5} sup 2  →  L3: {2 3 5}

How to Generate Candidates?

• Suppose the items in Lk-1 are listed in an order

• Step 1: self-joining Lk-1

insert into Ck

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

Candidate generation (example)

C2 (with counts)                  L2
itemset  sup                      itemset  sup
{1 2}    1                        {1 3}    2
{1 3}    2                        {2 3}    2
{1 5}    1                        {2 5}    3
{2 3}    2                        {3 5}    2
{2 5}    3
{3 5}    2

Candidate 3-itemsets: {1 2 3}, {1 3 5}, {2 3 5}

C3 = { {2 3 5} }, since {1 2 3} and {1 3 5} are pruned: their subsets {1, 2} and {1, 5} do not have enough support.

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
– Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
– Use database scans and pattern matching to collect counts for the candidate itemsets

• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
– Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern

Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoids costly database scans

• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database tests only!

Construct FP-tree from a Transaction DB

min_support = 0.5

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Header table (item : frequency): f 4, c 4, a 3, b 3, m 3, p 3

Steps:

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)

2. Order frequent items in frequency-descending order

3. Scan the DB again, construct the FP-tree

[Figure: the resulting FP-tree has root {} with branches f:4 → c:3 → a:3 → (m:2 → p:2 and b:1 → m:1), f:4 → b:1, and c:1 → b:1 → p:1; the header table links to the nodes of each item.]
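A compact, assumed sketch of these three steps (class and variable names are illustrative; ties between equally frequent items are broken alphabetically here, whereas the slide orders f before c):

# Compact sketch of the three FP-tree construction steps above.

from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_count):
    # Step 1: one scan to find frequent items and their counts
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None), {i: [] for i in freq}        # header table: item -> node links
    for t in transactions:
        # Step 2: keep only frequent items, in descending frequency order
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # Step 3: insert the ordered transaction into the tree, sharing prefixes
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [set('facdgimp'), set('abcflmo'), set('bfhjo'), set('bcksp'), set('afcelpmn')]
root, header = build_fp_tree(db, min_count=3)
# per-item node counts sum to the header-table frequencies: f 4, c 4, a 3, b 3, m 3, p 3
print({i: sum(n.count for n in header[i]) for i in header})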


Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Constraint-based association mining

• Summary


Interestingness Measurements

• Objective measures: two popular measurements are support and confidence.

• Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Criticism to Support and Confidence

• Example 1: (Aggarwal & Yu, PODS'98)
– Among 5000 students:
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
– play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence.

              basketball   not basketball   sum(row)
cereal        2000         1750             3750
not cereal    1000         250              1250
sum(col.)     3000         2000             5000

Criticism to Support and Confidence (Cont.)

• We need a measure of dependent or correlated events

corr(A, B) = P(AB) / (P(A) × P(B)) = P(B|A) / P(B)

• If corr < 1, A is negatively correlated with B (A discourages B)
• If corr > 1, A and B are positively correlated
• P(AB) = P(A) P(B) if the itemsets are independent (corr = 1)
• P(B|A) / P(B) is also called the lift of the rule A => B (we want lift > 1!)
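A quick check of the basketball/cereal numbers above with this formula (assumed code):

# Checking the basketball/cereal example above with the correlation (lift) formula.

n_total      = 5000
n_basketball = 3000
n_cereal     = 3750
n_both       = 2000

p_a, p_b, p_ab = n_basketball / n_total, n_cereal / n_total, n_both / n_total
corr = p_ab / (p_a * p_b)          # = P(B|A)/P(B), the lift of basketball => cereal
print(corr)                        # -> 0.888..., i.e. < 1: negatively correlated, as argued above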

Chapter 6: Mining Association Rules in Large Databases

• Association rule mining

• Mining single-dimensional Boolean association rules from transactional databases

• Mining multilevel association rules from transactional databases

• Mining multidimensional association rules from transactional databases and data warehouse

• From association mining to correlation analysis

• Summary

Why Is the Big Pie Still There?

• More on constraint-based mining of associations
– Boolean vs. quantitative associations
  • Association on discrete vs. continuous data

– From association to correlation and causal structure analysis
  • Association does not necessarily imply correlation or causal relationships

– From intra-transaction associations to inter-transaction associations
  • E.g., break the barriers of transactions (Lu, et al., TOIS'99)

– From association analysis to classification and clustering analysis
  • E.g., clustering association rules

Summary

• Association rule mining
– probably the most significant contribution from the database community to KDD
– a large number of papers have been published

• Many interesting issues have been explored

• An interesting research direction
– Association analysis in other types of data: spatial data, multimedia data, time series data, etc.

Some Products and Free Software Available for Association Rule Mining

Business Miner      http://www.businessobjects.com
Clementine          http://www.isl.co.uk/clem.html
Darwin              http://www.oracle.com/ip/analyze/warehouse/datamining/
Data Surveyor       http://www.ddi.nl/
DBMiner             http://db.cs.sfu.ca/DBMiner
Delta Miner         http://www.bissantz.de
Decision Series     http://www.neovista.com
IDIS                http://wwwdatamining.com
Intelligent Miner   http://www.software.ibm.com/data/intelli-mine
MineSet             http://www.sgi.com/software/mineset/
MLC++               http://www.sgi.com/Technology/mlc/
MSBN                http://www.research.microsoft.com/research./dtg/msbn
SuperQuery          http://www.azmy.com
Weka                http://www.cs.waikato.ac.nz/ml/weka
Apriori             http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html

K-means clustering

BIRCH uses summary information – bonus question
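For reference, a minimal, generic k-means (Lloyd's algorithm) sketch with Euclidean distance; this is assumed code rather than anything from the slides, and it follows the same procedure as study question 4 below without working out that question:

# Minimal k-means (Lloyd's algorithm) sketch with Euclidean distance.

from math import dist

def kmeans(points, centers, rounds=100):
    for _ in range(rounds):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            clusters[min(range(len(centers)), key=lambda i: dist(p, centers[i]))].append(p)
        # update step: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:      # converged: assignments no longer change
            return centers, clusters
        centers = new_centers
    return centers, clusters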

STUDY QUESTIONS

Some sample questions on the data mining part. You may practice by yourself; no need to hand in.

1. Given the transaction table:

TID   List of items

T1 1, 2, 5

T2 2, 4

T3 2,3

T4 1, 2, 4

T5 1, 3

T6 2, 3

T7 1, 3

T8 1, 2, 3, 5

T9 1, 2, 3

1) If min_sup = 2/9, apply the Apriori algorithm to get all the frequent itemsets; show the steps.
2) If min_con = 50%, show all the association rules generated from L3 (the large itemsets containing 3 items).

STUDY QUESTIONS

2. Assume we have the following association rules with min_sup = s and min_con = c: A=>B (s1, c1) B=>C (s2,c2) C=>A (s3,c3)

Express P(A), P(B), P(C), P(AB), P(BC), P(AC), P(B|A), P(C|B), P(C|A) in terms of these quantities.
Show the conditions under which we can get A => C.

STUDY QUESTIONS

3. Given the following table:

Outlook    Temp  Humidity  Windy  Class
sunny      75    70        Y      Play
sunny      80    90        Y      Don't
sunny      85    85        N      Don't
sunny      72    95        N      Don't
sunny      69    70        N      Play
overcast   72    90        Y      Play
overcast   83    78        N      Play
overcast   64    65        Y      Play
overcast   81    75        N      Play
rain       71    80        Y      Don't
rain       65    70        Y      Don't
rain       75    80        Y      Play
rain       68    80        N      Play
rain       70    96        N      Play

Apply the SPRINT algorithm to build a decision tree. (The measure is gini.)

STUDY QUESTIONS

4. Apply k-means to cluster the following 8 points into 3 clusters. The distance function is Euclidean distance. Assume that initially we assign A1, B1, and C1 as the centers of the three clusters, respectively. The 8 points are: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
Show:
- the three cluster centers after the first round of execution;
- the final three clusters.