
Fast Vertical Mining Using Diffsets Mohammed J. Zaki Karam Gouda.

Date post: 21-Dec-2015

Fast Vertical Mining Using Diffsets

Mohammed J. Zaki, Karam Gouda


Amir Epstein

Outline

• Introduction
• Problem Setting and Notations
• Equivalence Classes & Diffsets
• Algorithms for Mining Frequent, Closed and Maximal Patterns
• Experimental Results
• Conclusions


Introduction

• Horizontal methods (most are Apriori variants)
• Mining maximal frequent patterns (All-MFS, MaxMiner, DepthProject, FP-Growth)
• Mining closed sets (A-Close, Closet, Charm)
• Vertical methods
• Vertical approach problems
• Diffsets


Notations

• I – set of items
• T – database of transactions
• Tid – transaction identifier
• Itemset – a set $X \subseteq I$
• Tidset – a set $Y \subseteq T$
• k-itemset – an itemset with k items
• Support of an itemset X, denoted $\sigma(X)$ – the number of transactions in which X occurs as a subset


Notation

• Frequent itemset – X is frequent if $\sigma(X) \ge min\_sup$
• Powerset P(I) – search space enumeration
• Maximal frequent itemset – a frequent itemset that is not a subset of any other frequent itemset
• Closed frequent itemset X – a frequent itemset for which there does not exist a proper superset $Y \supset X$ with $\sigma(Y) = \sigma(X)$
• Closure of an itemset X, denoted c(X) – the smallest closed set that contains X
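The definitions above can be checked by brute force on a tiny database. The sketch below is only an illustration: it uses the five item tidsets from the tidset example in these slides, and it assumes a threshold of min_sup = 3 (the slides do not state one). The names `TIDSETS`, `tidset` and `support` are made up for this example.

```python
from itertools import combinations

# Tidsets of the five items, as listed in the tidset example of these slides.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed threshold (50% of the 6 transactions)


def tidset(itemset):
    """t(X): the transactions that contain every item of X."""
    tids = set(range(1, 7))
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def support(itemset):
    """sigma(X) = |t(X)|."""
    return len(tidset(itemset))


# Enumerate the powerset P(I) (without the empty set), keep frequent itemsets.
items = sorted(TIDSETS)
frequent = {
    frozenset(c): support(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= MIN_SUP
}

# Closed: no proper superset has the same support.
closed = {
    x for x, s in frequent.items()
    if not any(x < y and frequent[y] == s for y in frequent)
}

# Maximal: no proper superset is frequent.
maximal = {x for x in frequent if not any(x < y for y in frequent)}
```

Under these assumptions every maximal itemset is also closed, and every closed itemset is frequent, so the three collections are nested.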


The Problem

• Find all frequent itemsets, i.e., all itemsets whose support is at least min_sup


Database Example


Frequent, Closed and Maximal Itemsets


Data Formats


Equivalence Classes

• Define a function $p : P(I) \times N \to P(I)$, where $p(X, k) = X[1:k]$, the k-length prefix of X
• Define an equivalence relation $\theta_k$ (prefix-based): $\forall X, Y \in P(I),\ X \equiv_{\theta_k} Y \iff p(X, k) = p(Y, k)$
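As a small illustration of the prefix function and the relation it induces, the sketch below groups all non-empty itemsets over the items A, C, D, T, W by their 1-length prefix. The function names `prefix` and `equivalence_classes` are hypothetical, and items are assumed to be ordered lexicographically.

```python
from collections import defaultdict
from itertools import combinations


def prefix(itemset, k):
    """p(X, k) = X[1:k]: the first k items of X in sorted order."""
    return tuple(sorted(itemset))[:k]


def equivalence_classes(itemsets, k):
    """Group itemsets whose k-length prefixes are equal (relation theta_k)."""
    classes = defaultdict(list)
    for x in itemsets:
        if len(x) >= k:  # shorter itemsets have no k-length prefix
            classes[prefix(x, k)].append("".join(sorted(x)))
    return dict(classes)


items = "ACDTW"
powerset = [c for n in range(1, len(items) + 1) for c in combinations(items, n)]
classes_1 = equivalence_classes(powerset, 1)
```

Each class can be mined independently of the others, which is what makes the prefix-based decomposition useful for the vertical algorithms that follow.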


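Example

[Figure: prefix-based subset tree over the items A, C, D, T, W. Each node is listed with the items that can still be appended to it; one line per level.]

{} {A,C,D,T,W}
A {C,D,T,W}   C {D,T,W}   D {T,W}   T {W}   W
AC {D,T,W}   AD {T,W}   AT {W}   AW   CD {T,W}   CT {W}   CW   DT {W}   DW   TW
ACD {T,W}   ACT {W}   ACW   ADT {W}   ADW   ATW   CDT {W}   CDW   CTW   DTW
ACDT {W}   ACDW   ACTW   ADTW   CDTW
ACDTW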


Compute Subset Class

• Let $[P] = \{X_1, X_2, \ldots, X_n\}$ be a class with prefix P
• Perform intersection of $PX_i$ with all $PX_j$ with $j > i$ to obtain a new class $[PX_i]$ with elements $X_j$, where $PX_iX_j$ is frequent
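The class-extension step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `subset_class` is hypothetical, the item tidsets are those of the tidset example in these slides, and min_sup = 3 is assumed.

```python
def subset_class(members, i, min_sup):
    """Build the class [P X_i] from the class [P].

    members: ordered list of (X_j, t(P X_j)) pairs for one class [P].
    Returns the (X_j, t(P X_i X_j)) pairs with j > i that are frequent.
    """
    _, tids_i = members[i]
    new_class = []
    for name_j, tids_j in members[i + 1:]:
        tids_ij = tids_i & tids_j        # t(P X_i X_j) = t(P X_i) ∩ t(P X_j)
        if len(tids_ij) >= min_sup:      # keep only frequent extensions
            new_class.append((name_j, tids_ij))
    return new_class


# Root class (empty prefix): the items with their tidsets.
root = [
    ("A", {1, 3, 4, 5}),
    ("C", {1, 2, 3, 4, 5, 6}),
    ("D", {2, 4, 5, 6}),
    ("T", {1, 3, 5, 6}),
    ("W", {1, 2, 3, 4, 5}),
]
class_A = subset_class(root, 0, min_sup=3)      # members of [A]
class_AC = subset_class(class_A, 0, min_sup=3)  # members of [AC]
```

With these inputs AD is dropped from [A] because its tidset {4, 5} has only two entries.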


Tidset Intersections (example)

Items
A: 1 3 4 5
C: 1 2 3 4 5 6
D: 2 4 5 6
T: 1 3 5 6
W: 1 2 3 4 5

2-itemsets
AC: 1 3 4 5
AD: 4 5
AT: 1 3 5
AW: 1 3 4 5
CD: 2 4 5 6
CT: 1 3 5 6
CW: 1 2 3 4 5
DT: 5 6
DW: 2 4 5
TW: 1 3 5

3-itemsets
ACT: 1 3 5
ACW: 1 3 4 5
ATW: 1 3 5
CDW: 2 4 5
CTW: 1 3 5

4-itemsets
ACTW: 1 3 5


Diffsets

Difference of the prefix tidset and a class member tidset

• Consider a class with prefix P
• Let t(X) denote the tidset of element X
• Let d(X) denote the diffset of element X, with respect to the prefix tidset
• Let PX and PY be class members of P
• By the definition of support, $t(PX) \subseteq t(P)$ and $t(PY) \subseteq t(P)$


Diffsets

• Then $t(PXY) = t(PX) \cap t(PY)$
• Define diffset $d(PX) = t(P) - t(X)$
• Then $\sigma(PX) = \sigma(P) - |d(PX)|$


Diffsets

• How to calculate $\sigma(PXY)$ using d(PX) and d(PY)?
  – $\sigma(PXY) = \sigma(PX) - |d(PXY)|$
  – $d(PXY) = t(PX) - t(PXY)$
  – $d(PXY) = t(PX) - t(PXY) = t(PX) - t(PY) = t(PX) - t(PY) + t(P) - t(P) = [t(P) - t(Y)] - [t(P) - t(X)] = d(PY) - d(PX)$
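The recurrence can be tried on concrete numbers. The sketch below is illustrative only: the helper name `combine` is made up, and the item tidsets are the ones from the tidset example in these slides. It derives d(AD) and d(ACT) from diffsets alone and cross-checks the resulting support against a plain tidset intersection.

```python
ALL_TIDS = {1, 2, 3, 4, 5, 6}
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

# Level 1: the prefix P is empty, so t(P) is the whole database
# and d(X) = t(P) - t(X).
d = {x: ALL_TIDS - t for x, t in TIDSETS.items()}
sigma = {x: len(t) for x, t in TIDSETS.items()}


def combine(d_px, d_py, sigma_px):
    """Return (d(PXY), sigma(PXY)) from the diffsets of class members PX, PY."""
    d_pxy = d_py - d_px                    # d(PXY) = d(PY) - d(PX)
    return d_pxy, sigma_px - len(d_pxy)    # sigma(PXY) = sigma(PX) - |d(PXY)|


d_AD, sup_AD = combine(d["A"], d["D"], sigma["A"])
d_AC, sup_AC = combine(d["A"], d["C"], sigma["A"])
d_AT, sup_AT = combine(d["A"], d["T"], sigma["A"])
# One level deeper: prefix A, class members AC and AT.
d_ACT, sup_ACT = combine(d_AC, d_AT, sup_AC)
```

Note that after the first level no tidset is touched any more; each new diffset is computed from two sibling diffsets.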


Example

[Figure: Venn diagram of t(P), t(X) and t(Y), with the regions d(PX), d(PY), d(PXY) and t(PXY) marked.]


Diffset Intersections (example)

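TIDSET database
A: 1 3 4 5
C: 1 2 3 4 5 6
D: 2 4 5 6
T: 1 3 5 6
W: 1 2 3 4 5

DIFFSET database
A: 2 6
C: (empty)
D: 1 3
T: 2 4
W: 6

Diffsets of 2-itemsets
AC: (empty)
AD: 1 3
AT: 4
AW: (empty)
CD: 1 3
CT: 2 4
CW: 6
DT: 2 4
DW: 6
TW: 6

Diffsets of 3-itemsets
ACT: 4
ACW: (empty)
ATW: (empty)
CDW: 6
CTW: 6

Diffsets of 4-itemsets
ACTW: (empty)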


Diffset Example

• Diffset calculation
  – $d(AD) = t(A) - t(D) = \{1, 3\}$
  – $d(AD) = d(D) - d(A) = \{1, 3\} - \{2, 6\} = \{1, 3\}$
• Support calculation
  – $\sigma(AD) = \sigma(A) - |d(AD)| = 4 - 2 = 2$


Diffset Example

• Database Size
  – Tidset database size = 23
  – Diffset database size = 7
• Total Size
  – Tidsets size = 76
  – Diffsets size = 22
• Size By Length

k-itemset (k)   Avg. tidset length   Avg. diffset length
2               3.8                  1
3               3.2                  0.6
4               3                    0
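The database sizes and the per-length averages can be recomputed from the item tidsets of the example. The sketch below is a hedged illustration: it assumes min_sup = 3, averages over frequent itemsets only, and takes the diffset of an itemset relative to its prefix (the itemset without its last item). The function names are hypothetical.

```python
from itertools import combinations

ALL_TIDS = set(range(1, 7))
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
MIN_SUP = 3  # assumed threshold


def tidset(itemset):
    """t(X); the empty itemset maps to the whole database."""
    tids = set(ALL_TIDS)
    for item in itemset:
        tids &= TIDSETS[item]
    return tids


def diffset(itemset):
    """d(PX) = t(P) - t(PX), where P is the itemset without its last item."""
    return tidset(itemset[:-1]) - tidset(itemset)


def average_sizes(k):
    """Average tidset and diffset length over the frequent k-itemsets."""
    freq = [c for c in combinations(sorted(TIDSETS), k)
            if len(tidset(c)) >= MIN_SUP]
    avg_t = sum(len(tidset(c)) for c in freq) / len(freq)
    avg_d = sum(len(diffset(c)) for c in freq) / len(freq)
    return avg_t, avg_d


tid_db_size = sum(len(t) for t in TIDSETS.values())
diff_db_size = sum(len(ALL_TIDS - t) for t in TIDSETS.values())
```

For k = 2 this gives an average tidset length of 3.75, which the table rounds to 3.8.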


Experimental Study

• Compare diffsets versus tidsets in terms of database sizes
• Method
  – Real datasets (usually dense)
  – Synthetic datasets (sparse)


Size Of Database


Average Diffset / Tidset Size By length


Average Diffset / Tidset Size

Database      Min_sup (%)   Max Length   Avg. Diffset Size   Avg. Tidset Size   Reduction Ratio
chess         0.5           16           26                  1820               70
connect       90            12           143                 62204              435
mushroom      5             17           60                  622                10
pumsb*        35            15           301                 18977              63
pumsb         90            8            330                 45036              136
T10I4D100K    0.025         11           14                  86                 6
T20I16D100K   0.1           14           31                  230                11
T40I10D100K   0.5           18           96                  755                8


When To Use diffsets

• Usually there is a cross-over point
• For dense datasets, start with the diffset format
• For sparse datasets, start with the tidset format


Reduction Ratio

• Let P be a class
• Let PX and PY be class members with tidsets t(PX) and t(PY)
• Consider the new itemset PXY in class PX
• PXY can be stored as t(PXY) or d(PXY)
• Definition: reduction ratio $r = |t(PXY)| / |d(PXY)|$
• Benefit if $r \ge 1$, or $|t(PXY)| / |d(PXY)| \ge 1$
• Substituting $d(PXY) = t(PX) - t(PY)$: $|t(PXY)| / |t(PX) - t(PY)| \ge 1$


Reduction Ratio

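• Or, since $t(PX) - t(PY) = t(PX) - t(PXY)$: $|t(PXY)| / (|t(PX)| - |t(PXY)|) \ge 1$
• Simplifying: $|t(PX)| / |t(PXY)| \le 2$, i.e., diffsets pay off when the support of PXY is at least half the support of PX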
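The switching condition can be written as a two-line test. The sketch below is an illustration with hypothetical function names; it treats an empty diffset as an infinite reduction ratio, which is an assumption made here to avoid dividing by zero.

```python
def reduction_ratio(t_pxy, d_pxy):
    """r = |t(PXY)| / |d(PXY)|; treated as infinite when the diffset is empty."""
    return float("inf") if not d_pxy else len(t_pxy) / len(d_pxy)


def prefer_diffset(t_px, t_pxy):
    """Equivalent test on supports only: |t(PX)| / |t(PXY)| <= 2."""
    return len(t_px) <= 2 * len(t_pxy)


# A dense case: PX = A and PXY = AD from the example database.
t_A, t_AD = {1, 3, 4, 5}, {4, 5}
r_dense = reduction_ratio(t_AD, t_A - t_AD)

# A sparse case (made-up numbers): the support drops from 6 to 1.
t_px, t_pxy = {1, 2, 3, 4, 5, 6}, {5}
r_sparse = reduction_ratio(t_pxy, t_px - t_pxy)
```

The second form is convenient because it needs only the two supports, which are known before deciding which representation to materialize.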


Compressed Bitvectors

• Classical way: run-length encoding (RLE) – not appropriate for association mining
• Skinning encoding scheme (used by Viper)
  – Worst-case compression ratio asymptotically reaches 2.91
  – Best-case compression ratio asymptotically reaches 32


GenMax: Mining Maximal Frequent Itemsets

• Uses a backtracking search technique
• Optimizations
  – Initially sort items in increasing order of their combine-set size and then in increasing order of support (i. first explore items with small combine sets; ii. remove a node from the search tree as early as possible)
  – Superset checking
• More Optimizations
  – Progressive focusing to improve superset checking
  – Vertical database format to improve frequency checking using tidsets, which is further improved by diffsets
• Memory Handling
  – Store at most k = m + l tidsets (diffsets) in memory, where m is the length of the longest combine-set and l is the length of the longest maximal itemset


GenMax(Dataset T):
1. Calculate $F_1$. Calculate $F_2$.
2. For each item $i \in F_1$ calculate $c(i)$, its combine-set.
3. Sort items in $F_1$ in INCREASING cardinality of $c(i)$ and then INCREASING $\sigma(i)$.
4. Sort each $c(i)$ in order of $F_1$.
5. $c(i) = c(i) - \{j : j < i$ in sorted order of $F_1\}$.
6. $M = \{\}$; // Maximal Frequent Itemsets.
7. for each $i \in F_1$ do
8.    $Z = \{x \in M : i \in x\}$
9.    for each $j \in c(i)$ do
10.      $H = \{x : x$ is $j$ or $x$ follows $j$ in $c(i)\}$
11.      if $H$ has a superset in $Z$ then break
12.      $I = \{i, j\}$
13.      $X = c(i) \cap c(j)$; $d(I) = t(i) - t(j)$
14.      $Y = \{x \in Z : j \in x\}$
15.      Extend($I$, $X$, $Y$)
16.      $Z = Z \cup Y$
17.   $M = M \cup Z$
18. Return $M$


// I is the itemset to be extended, X is the set of items
// that can be added to I, i.e., the combine set, and
// Y is the set of relevant maximal itemsets found so far,
// i.e., all maximal itemsets which contain I.
Procedure Extend(I, X, Y)
1. extendflg = 0
2. for each $j \in X$ do
3.    if ($Y \ne \emptyset$) then
4.       $G = \{x : x$ is $j$ or $x$ follows $j$ in $X\}$
5.       if $G$ has a superset in $Y$ then
6.          extendflg = 1; break;
7.    $NewI = I \cup \{j\}$; $d(NewI) = d(j) - d(I)$
8.    if ($NewI$ is frequent) then
9.       $NewX = X \cap c(j)$
10.      extendflg = 1
11.      if ($NewX = \emptyset$) then
12.         $Y = Y \cup \{NewI\}$
13.      else
14.         $NewY = \{x \in Y : j \in x\}$
15.         Extend($NewI$, $NewX$, $NewY$)
16.         $Y = Y \cup NewY$
17. if (extendflg = 0 and $Y = \emptyset$) then
18.    $Y = Y \cup \{I\}$
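To make the backtracking idea concrete, here is a deliberately simplified sketch. It is not GenMax itself: it uses plain tidset intersections instead of diffsets, sorts items by support only, and checks supersets against the full list of maximal itemsets found so far rather than using progressive focusing. The function name `maximal_itemsets` and min_sup = 3 are assumptions.

```python
def maximal_itemsets(tidsets, min_sup):
    """Backtracking search for maximal frequent itemsets (simplified)."""
    # Explore items in increasing order of support (ties broken by name).
    order = sorted((i for i in tidsets if len(tidsets[i]) >= min_sup),
                   key=lambda i: (len(tidsets[i]), i))
    maximal = []

    def extend(itemset, tids, combine):
        # Superset check: if even itemset ∪ combine is covered by a known
        # maximal itemset, this subtree cannot contain anything new.
        if any(itemset | set(combine) <= m for m in maximal):
            return
        extended = False
        for pos, j in enumerate(combine):
            new_tids = tids & tidsets[j]
            if len(new_tids) >= min_sup:          # itemset ∪ {j} is frequent
                extended = True
                extend(itemset | {j}, new_tids, combine[pos + 1:])
        # A non-extendable itemset is maximal unless already subsumed.
        if itemset and not extended and not any(itemset <= m for m in maximal):
            maximal.append(itemset)

    all_tids = set().union(*tidsets.values())
    extend(frozenset(), all_tids, order)
    return maximal


TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
result = maximal_itemsets(TIDSETS, min_sup=3)
```

On the example database most branches are cut by the superset check right after the first maximal itemset is found, which is the effect the sorting heuristic aims for.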


dEclat: Mining All Frequent Itemsets

• Performs bottom-up search
• The equivalence class lattice is traversed in a BFS order
• Input: class members
• Frequent itemsets (F.I.) are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemset
• Stores in memory intermediate diffsets (tidsets) of at most two levels


DiffEclat([P]):
1. for all $X_i \in [P]$ do
2.    for all $X_j \in [P]$, with $j > i$ do
3.       $R = X_i \cup X_j$;
4.       $d(R) = d(X_j) - d(X_i)$;
5.       if $\sigma(R) \ge min\_sup$ then
6.          $T_i = T_i \cup \{R\}$; // $T_i$ initially empty
7. for all $T_i \ne \emptyset$ do DiffEclat($T_i$);
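A runnable sketch of this procedure is given below. It is an illustration under assumptions rather than the authors' implementation: each class member carries its full itemset, its diffset and its support, the recursion descends as soon as each T_i is complete, and the names `diff_eclat` and `mine` as well as min_sup = 3 are made up for the example.

```python
def diff_eclat(members, min_sup, out):
    """members: list of (itemset, diffset, support) for one class [P]."""
    for i, (x_i, d_i, sup_i) in enumerate(members):
        new_class = []                       # T_i, initially empty
        for x_j, d_j, _ in members[i + 1:]:
            r = x_i | x_j                    # R = X_i ∪ X_j
            d_r = d_j - d_i                  # d(R) = d(X_j) - d(X_i)
            sup_r = sup_i - len(d_r)         # sigma(R) = sigma(X_i) - |d(R)|
            if sup_r >= min_sup:
                new_class.append((r, d_r, sup_r))
                out[r] = sup_r
        if new_class:
            diff_eclat(new_class, min_sup, out)


def mine(tidsets, min_sup):
    """Convert item tidsets to diffsets once, then mine with diffsets only."""
    all_tids = set().union(*tidsets.values())
    root = [(frozenset(x), all_tids - t, len(t))
            for x, t in sorted(tidsets.items()) if len(t) >= min_sup]
    out = {x: s for x, _, s in root}
    diff_eclat(root, min_sup, out)
    return out


TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
frequent = mine(TIDSETS, min_sup=3)
```

The returned dictionary maps each frequent itemset to its support; tidsets are only read once, when the root class is built.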


dCharm: Mining Frequent Closed Itemsets

• Performs bottom-up search
• Eliminates branches and grows itemsets using subset relationships


Subset Relationships

Theorem: Let $X_i \times d(X_i)$ and $X_j \times d(X_j)$ be any two members of a class $[P]$, with $X_i \le_f X_j$, where $f$ is a total order (e.g., lexicographic or support-based). The following four properties hold:
1. If $d(X_i) = d(X_j)$, then $c(X_i) = c(X_j) = c(X_i \cup X_j)$
2. If $d(X_i) \supset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_i) = c(X_i \cup X_j)$
3. If $d(X_i) \subset d(X_j)$, then $c(X_i) \ne c(X_j)$, but $c(X_j) = c(X_i \cup X_j)$
4. If $d(X_i) \ne d(X_j)$, then $c(X_i) \ne c(X_j) \ne c(X_i \cup X_j)$
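The four cases can be checked empirically on the example database. The sketch below is only a sanity check, not a proof: closures are computed directly from the item tidsets (an item belongs to c(X) when its tidset contains t(X)), itemsets are written as strings, and the helper name `check_pair` is hypothetical.

```python
from itertools import combinations

ALL_TIDS = set(range(1, 7))
TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def tidset(x):
    """t(X) for an itemset given as a string of item letters."""
    tids = set(ALL_TIDS)
    for item in x:
        tids &= TIDSETS[item]
    return tids


def closure(x):
    """c(X): every item that occurs in all transactions containing X."""
    t_x = tidset(x)
    return frozenset(i for i, t in TIDSETS.items() if t_x <= t)


def check_pair(prefix, xi, xj):
    """Return the case (1-4) that applies, asserting what that case claims."""
    t_p = tidset(prefix)
    d_i = t_p - tidset(prefix + xi)
    d_j = t_p - tidset(prefix + xj)
    c_i, c_j = closure(prefix + xi), closure(prefix + xj)
    c_ij = closure(prefix + xi + xj)
    if d_i == d_j:
        assert c_i == c_j == c_ij
        return 1
    if d_i > d_j:                      # proper superset
        assert c_i != c_j and c_i == c_ij
        return 2
    if d_i < d_j:                      # proper subset
        assert c_i != c_j and c_j == c_ij
        return 3
    assert c_i != c_j and c_j != c_ij and c_i != c_ij
    return 4


cases = {a + b: check_pair("", a, b) for a, b in combinations("ACDTW", 2)}
```

Within the class [A], the members AC and AW have equal (empty) diffsets, which is an instance of the first case.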


DiffCharm(P):
1. for all $X_i \in [P]$ do
2.    $X = X_i$
3.    for all $X_j \in [P]$, with $j > i$ do
4.       $R = X \cup X_j$
5.       $d(R) = d(X_j) - d(X_i)$
6.       if $\sigma(R) < min\_sup$ then continue
7.       if $d(X_i) = d(X_j)$ then Remove $X_j$ from Nodes; Replace all $X_i$ with $R$
8.       if $d(X_i) \supset d(X_j)$ then Replace all $X_i$ with $R$
9.       if $d(X_i) \subset d(X_j)$ then Remove $X_j$ from Nodes; Add $R$ to NewN
10.      if $d(X_i) \ne d(X_j)$ then Add $R$ to NewN
11.   if $NewN \ne \emptyset$ then DiffCharm(NewN)


Optimized Initialization

• Computation of $F_2$
• Let $n$ be the number of frequent items
• Let $l$ be the average tidset size
• Amount of data read is $l \cdot n(n-1)/2$
• Number of intersections is $n(n-1)/2$
• In the horizontal approach, the amount of data read is $n \cdot l$


Improvement

• Compute frequent itemsets of length 2
• Combine items $I_1$ and $I_2$ only if $I_1 \cup I_2$ is frequent
• Now the number of intersections in practice is closer to $O(n)$ rather than $O(n^2)$
• Computation of frequent itemsets of length 2
  – Perform a vertical to horizontal transformation
  – Update the count of pairs of items
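The two steps for computing the frequent 2-itemsets can be sketched as below. This is a minimal illustration with a hypothetical function name `frequent_pairs`; it uses the item tidsets from the example in these slides and assumes min_sup = 3.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_pairs(tidsets, min_sup):
    """Compute F2 with one pass over a horizontal view of the data."""
    # Vertical -> horizontal: rebuild each transaction from the tidsets of
    # the frequent items (items are visited in sorted order, so every
    # transaction list ends up sorted).
    transactions = defaultdict(list)
    for item in sorted(tidsets):
        if len(tidsets[item]) >= min_sup:
            for tid in tidsets[item]:
                transactions[tid].append(item)
    # Update the count of every pair of items that co-occur.
    counts = Counter()
    for items in transactions.values():
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_sup}


TIDSETS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}
f2 = frequent_pairs(TIDSETS, min_sup=3)
```

With F2 in hand, a miner would intersect the tidsets (or subtract the diffsets) of two items only when their pair appears in `f2`.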


Experimental Results

• Times include all costs, including horizontal to vertical database conversion
• Method
  – Real datasets (usually dense)
  – Synthetic datasets (sparse)


Database Characteristics

Database      # Items   Avg. trans. Length   # Records
chess         76        37                   3,196
connect       130       43                   67,557
mushroom      120       23                   8,124
pumsb*        7117      50                   49,046
pumsb         7117      74                   49,046
T10I4D100K    1000      10                   100,000
T20I16D100K   1000      40                   100,000


Length Of the Longest Itemset


Cardinality Of F.I , C.F.I and M.F.I


Improvements using Diffsets


Mining Frequent Itemsets


Mining Closed Itemsets


Mining Maximal Itemsets


Conclusions

• Diffsets dramatically cut down the size of memory required to store intermediate results

• Diffsets increase performance significantly when incorporated into previous vertical mining methods

• Diffsets can deliver over an order of magnitude performance improvement over the best previous methods

