Post on 09-Jan-2016
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining
Takeaki Uno
Masashi Kiyomi
Hiroki Arimura
National Institute of Informatics, JAPAN
Hokkaido University, JAPAN
20/Aug/2005 Open Source Data Mining ’05
Computation of Pattern Mining
(#iterations) is not much larger than (#solutions) (linearly bounded!)
• linear time in #solutions
• polynomial delay
We focus on the data structure and the computation in an iteration
TIME = (#iterations) × (time of an iteration) + I/O (coding techniques)
• frequency counting
• data structure reconstruction
• closure operation
• pruning, ...

For frequent itemsets and closed itemsets, enumeration methods are almost optimal
Goal: clarify the features of
• enumeration algorithms
• real-world data sets
For which cases (parameters) is which technique good? "Theoretical intuitions/evidence" are important.
Motivation
• Some data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting)
• Each has its own advantages and disadvantages
1. Bitmap — Good: dense data, large support; Bad: sparse data, small support
2. Prefix Tree — Good: non-sparse data, structured data; Bad: sparse data, non-structured data
3. Array List (with deletion of duplicates) — Good: non-dense data; Bad: very dense data

Datasets have both dense parts and sparse parts
How can we fit?
[Figure: the same transaction database represented three ways — as a bitmap (rows of 0/1), as a prefix tree over items a–g, and as weighted array lists, e.g. 1×{a,b}, 4×{a,c,d,e,f}, 5×{c,d}, 1×{a,c,e,f}]
Observations
• Usually, databases follow a power law: a small part consisting of few items is dense, and the rest is very sparse
• Using reduced conditional databases, the database is very small in almost all iterations

Quick operations on small databases are very efficient
[Figure: a transaction database (items × transactions) with a dense part and a sparse part; the database shrinks as the recursion depth grows]
Idea of Combination
• Choose a constant c
• Let F = the c items of largest frequency
• Split each transaction T into two parts: the dense part, composed of the items in F, and the sparse part, composed of the items not in F
• Store the dense part as a bitmap, and the sparse part as an array list
• Use bitmaps and array lists for the dense and sparse parts
• Use a prefix tree of constant size for frequency counting

We can take all their advantages
[Figure: each transaction split at the c most frequent items into a dense part (bitmap) and a sparse part (array list)]
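The split described above can be sketched as follows. This is a minimal illustration, not LCM's actual code: transactions are assumed to be item lists, and `split_database` is my own name for the routine.

```python
# Sketch: split each transaction into a dense part (bitmap over the c most
# frequent items F) and a sparse part (array list of the remaining items).
from collections import Counter

def split_database(transactions, c):
    """Return (frequent_items, dense_bitmaps, sparse_lists)."""
    freq = Counter(item for t in transactions for item in t)
    F = [item for item, _ in freq.most_common(c)]      # c most frequent items
    pos = {item: i for i, item in enumerate(F)}        # bit position per dense item
    dense, sparse = [], []
    for t in transactions:
        bits, rest = 0, []
        for item in t:
            if item in pos:
                bits |= 1 << pos[item]                 # dense part: set the item's bit
            else:
                rest.append(item)                      # sparse part: keep as a list
        dense.append(bits)
        sparse.append(sorted(rest))
    return F, dense, sparse
```

The dense part of every transaction then fits in one machine word, while the long tail of rare items stays in compact lists.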
Complete Prefix Tree

We use the complete prefix tree: the prefix tree including all patterns.

[Figure: the complete prefix tree over four items a–d; each node corresponds to one of the 16 bit patterns 0000–1111]
The parent of a pattern is obtained by clearing its highest bit (e.g. 0101 → 0001), so no pointers are needed.

Ex) transactions {a,b,c}, {a,c,d}, {c,d}
We construct the complete prefix tree for the dense parts of the transactions. If c is small, then its size 2^c is not huge.
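The pointer-free parent rule can be sketched in a few lines. This is my own illustration (the helper name `parent` is hypothetical), showing that walking from any node to the root needs only bit operations:

```python
# In the complete prefix tree, the parent of a pattern is the pattern
# with its highest set bit cleared, so no parent pointers are stored.
def parent(pattern):
    """Clear the highest set bit of a nonzero bit pattern."""
    high = 1 << (pattern.bit_length() - 1)  # isolate the highest set bit
    return pattern ^ high                   # clear it

# Walking from a pattern up to the root (the empty itemset):
p = 0b1101
chain = [p]
while p:
    p = parent(p)
    chain.append(p)
# chain == [0b1101, 0b0101, 0b0001, 0b0000]
```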
Any prefix tree is a subtree of the complete prefix tree.

Ex) transactions {a,b,c,d}, {a}, {a,d}

[Figure: the prefix tree of these transactions embedded as a subtree of the complete prefix tree]
Frequency counting
• Frequency of a pattern (vertex) = # descendant leaves
• Occurrences obtained by adding item i = the patterns whose i-th bit is 1

A bottom-up sweep is good
Linear time in the size of the prefix tree

[Figure: bottom-up sweep over the complete prefix tree; each node is labeled with its number of descendant leaves, i.e. its frequency]
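The bottom-up sweep can be sketched as follows. This is a minimal illustration under the assumption that the dense parts are stored as integer bitmaps; `descendant_leaf_counts` is my own name, not LCM's:

```python
# Bottom-up sweep over the complete prefix tree: each transaction sits at
# the node of its own bit pattern, and every node passes its count up to
# its parent (highest bit cleared). One pass over all 2^c nodes suffices,
# i.e. time linear in the size of the tree.
def descendant_leaf_counts(dense_bitmaps, c):
    """count[p] = # transactions at node p or in its subtree."""
    count = [0] * (1 << c)
    for bits in dense_bitmaps:
        count[bits] += 1                 # leaf weight: transactions at this pattern
    for p in range((1 << c) - 1, 0, -1): # sweep from the deepest patterns upward
        high = 1 << (p.bit_length() - 1)
        count[p ^ high] += count[p]      # add to parent (highest bit cleared)
    return count
```

The root (pattern 0) ends up holding the total number of transactions.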
"Constant Size" Dominates
• How many iterations receive a "constant-size database" as input?
support                                      | Small supp. (1M solutions)          | Large supp. (1K solutions)
dataset                                      | const.-size DB | strategy changes   | const.-size DB | strategy changes
pumsb                                        | 99.9%          | 0.1%               | 99%            | 0.2%
pumsb*, connect, chess, accidents            | 99%            | 0.1 - 0.5%         | 95 - 99%       | 0.2 - 1%
kosarak, mushroom, BMS-WebView2, T40I10D100K | 90 - 99%       | 1 - 2%             | 30 - 90%       | 2 - 4%
retail, BMS-pos, T10I4D100K                  | 30 - 90%       | 3 - 5%             | 30 - 90%       | 3 - 5%
"Small iterations" dominate the computation time; the "strategy change" is not a heavy task
More Advantages
• Reconstruction of prefix trees is a heavy task — the complete prefix tree needs no reconstruction
• Coding prefix trees is not easy — the complete prefix tree is easy to code
• The radix sort used for detecting identical transactions is heavy when the data is dense — bitmaps for the dense parts accelerate the radix sort
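The duplicate-detection step can be sketched as follows. This is my own illustration, not LCM's radix sort: a plain comparison sort stands in for it, but it shows the point that the dense part compares as a single machine word, so dense columns no longer slow the sort down.

```python
# Sketch: detect identical transactions by sorting (dense bitmap, sparse list)
# pairs and merging equal neighbors into one weighted transaction.
def merge_duplicates(dense, sparse):
    """Return unique (bitmap, sparse_list) pairs with their multiplicities."""
    order = sorted(range(len(dense)), key=lambda i: (dense[i], sparse[i]))
    merged = []
    for i in order:
        key = (dense[i], sparse[i])
        if merged and merged[-1][0] == key:
            merged[-1] = (key, merged[-1][1] + 1)  # duplicate: bump its weight
        else:
            merged.append((key, 1))                # new distinct transaction
    return merged
```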
For Closed/Maximal Itemsets
• Closure/maximality can be computed by storing the previously obtained itemsets — then no additional function is needed
• Depth-first search (closure extension type) needs the prefix of each itemset
By taking the intersection / weighted union of the prefixes at each node of the prefix tree, we can compute them efficiently (from LCM v2)

[Figure: a prefix attached to each node of the complete prefix tree]
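For reference, the closure operation itself can be sketched directly: the closure of an itemset is the intersection of all transactions that contain it. This is my own naive illustration assuming set-valued transactions; LCM's prefix-based computation at the tree nodes is an optimization of exactly this.

```python
# Naive closure: intersect every transaction containing the itemset.
def closure(itemset, transactions):
    """Return the closure of `itemset` (assumes it occurs at least once)."""
    occ = [t for t in transactions if itemset <= t]  # occurrences of the itemset
    result = set(occ[0])
    for t in occ[1:]:
        result &= t                                  # intersect all occurrences
    return result
```

For the transactions {a,b,c}, {a,c,d}, {c,d}, the closure of {a} is {a,c}: every transaction containing a also contains c.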
Experiments
CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
Compared with: FP-growth, afopt, MAFIA, PatriciaMine, kDCI, nonordfp, aim2, DCI-closed (all of which scored highly at the FIMI04 competition)
14 datasets from the FIMI repository
Memory usage decreased to half for dense datasets, but not for sparse datasets
We applied the data structure to LCM2
Experimental Results
Discussion and Future Work
• Combining bitmaps and array lists reduces memory usage efficiently for dense datasets
• Using prefix trees over a constant number of items is sufficient to speed up frequency counting on non-sparse datasets
• The data structure is orthogonal to other methods for closed/maximal itemset mining: maximality checks, pruning, closure operations, etc.
• Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structures give huge variations, which are hard to represent by bits or to share)
• Simplify the techniques so that they can be applied easily
• Stable memory allocation (no need for dynamic allocation)
Future work: other pattern mining problems