Post on 09-Jan-2016
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining
Takeaki Uno
Masashi Kiyomi
Hiroki Arimura
National Institute of Informatics, JAPAN
Hokkaido University, JAPAN
20/Aug/2005 Open Source Data Mining ’05
Computation of Pattern Mining
(#iterations) is not much larger than (#solutions) (linearly bounded!)
• linear time in #solutions
• polynomial delay
We focus on the data structure and the computation in an iteration
TIME = (#iterations) × (time of an iteration) + I/O (coding techniques)
• frequency counting
• data structure reconstruction
• closure operation
• pruning, ...

For frequent itemsets and closed itemsets, enumeration methods are almost optimal
Goal: clarify the features of
• enumeration algorithms
• real-world data sets
For which cases (parameters) is which technique good? "Theoretical intuitions/evidence" are important.
Motivation
• Some data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting)
• Each has its own advantages and disadvantages
1. Bitmap — Good: dense data, large support; Bad: sparse data, small support
2. Prefix Tree — Good: non-sparse data, structured data; Bad: sparse data, non-structured data
3. Array List (with deletion of duplicates) — Good: non-dense data; Bad: very dense data

Datasets have both dense parts and sparse parts
How can we fit?
[Figure: the same transaction database represented three ways — as a bitmap (rows of 0/1), as a prefix tree over items a–g, and as weighted array lists, e.g. 1×{a,b}, 4×{a,c,d,e,f}, 5×{c,d}, 1×{a,c,e,f}]
Observations
• Usually, databases follow a power law: a small part consisting of few items is dense, and the rest is very sparse
• Using reduced conditional databases, the database is very small in almost all iterations

Quick operations on small databases are very efficient
[Figure: a transaction database (items × transactions) with a dense part and a sparse part; the database shrinks as the recursion depth grows]
Idea of Combination
• Choose a constant c
• Let F = the c items of largest frequency
• Split each transaction T into two parts: the dense part, composed of the items in F, and the sparse part, composed of the items not in F
• Store the dense part as a bitmap, and the sparse part as an array list
• Use bitmaps and array lists for the dense and sparse parts
• Use a prefix tree of constant size for frequency counting

We can take all their advantages
[Figure: each transaction split at the c most frequent items into a dense part (bitmap) and a sparse part (array list)]
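The split described above can be sketched as follows. This is a minimal illustration, not LCM's actual code: transactions are assumed to be item lists, and `split_database` is my own name for the routine.

```python
# Sketch: split each transaction into a dense part (bitmap over the c most
# frequent items F) and a sparse part (array list of the remaining items).
from collections import Counter

def split_database(transactions, c):
    """Return (frequent_items, dense_bitmaps, sparse_lists)."""
    freq = Counter(item for t in transactions for item in t)
    F = [item for item, _ in freq.most_common(c)]      # c most frequent items
    pos = {item: i for i, item in enumerate(F)}        # bit position per dense item
    dense, sparse = [], []
    for t in transactions:
        bits, rest = 0, []
        for item in t:
            if item in pos:
                bits |= 1 << pos[item]                 # dense part: set the item's bit
            else:
                rest.append(item)                      # sparse part: keep as a list
        dense.append(bits)
        sparse.append(sorted(rest))
    return F, dense, sparse
```

The dense part of every transaction then fits in one machine word, while the long tail of rare items stays in compact lists.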
Complete Prefix Tree

We use the complete prefix tree: the prefix tree including all patterns.

[Figure: the complete prefix tree over four items a–d; each node corresponds to one of the 16 bit patterns 0000–1111]
The parent of a pattern is obtained by clearing its highest bit (e.g. 0101 → 0001), so no pointers are needed.

Ex) transactions {a,b,c}, {a,c,d}, {c,d}
We construct the complete prefix tree for the dense parts of the transactions. If c is small, then its size 2^c is not huge.
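The pointer-free parent rule can be sketched in a few lines. This is my own illustration (the helper name `parent` is hypothetical), showing that walking from any node to the root needs only bit operations:

```python
# In the complete prefix tree, the parent of a pattern is the pattern
# with its highest set bit cleared, so no parent pointers are stored.
def parent(pattern):
    """Clear the highest set bit of a nonzero bit pattern."""
    high = 1 << (pattern.bit_length() - 1)  # isolate the highest set bit
    return pattern ^ high                   # clear it

# Walking from a pattern up to the root (the empty itemset):
p = 0b1101
chain = [p]
while p:
    p = parent(p)
    chain.append(p)
# chain == [0b1101, 0b0101, 0b0001, 0b0000]
```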
Any prefix tree is a subtree of the complete prefix tree.

Ex) transactions {a,b,c,d}, {a}, {a,d}

[Figure: the prefix tree of these transactions embedded as a subtree of the complete prefix tree]
Frequency counting
• Frequency of a pattern (vertex) = # descendant leaves
• Occurrences obtained by adding item i = the patterns whose i-th bit is 1

A bottom-up sweep is good
Linear time in the size of the prefix tree

[Figure: bottom-up sweep over the complete prefix tree; each node is labeled with its number of descendant leaves, i.e. its frequency]
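The bottom-up sweep can be sketched as follows. This is a minimal illustration under the assumption that the dense parts are stored as integer bitmaps; `descendant_leaf_counts` is my own name, not LCM's:

```python
# Bottom-up sweep over the complete prefix tree: each transaction sits at
# the node of its own bit pattern, and every node passes its count up to
# its parent (highest bit cleared). One pass over all 2^c nodes suffices,
# i.e. time linear in the size of the tree.
def descendant_leaf_counts(dense_bitmaps, c):
    """count[p] = # transactions at node p or in its subtree."""
    count = [0] * (1 << c)
    for bits in dense_bitmaps:
        count[bits] += 1                 # leaf weight: transactions at this pattern
    for p in range((1 << c) - 1, 0, -1): # sweep from the deepest patterns upward
        high = 1 << (p.bit_length() - 1)
        count[p ^ high] += count[p]      # add to parent (highest bit cleared)
    return count
```

The root (pattern 0) ends up holding the total number of transactions.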
"Constant Size" Dominates
• How many iterations receive a "constant-size database" as input?
support                                      | Small supp. (1M solutions)          | Large supp. (1K solutions)
dataset                                      | const.-size DB | strategy changes   | const.-size DB | strategy changes
pumsb                                        | 99.9%          | 0.1%               | 99%            | 0.2%
pumsb*, connect, chess, accidents            | 99%            | 0.1 - 0.5%         | 95 - 99%       | 0.2 - 1%
kosarak, mushroom, BMS-WebView2, T40I10D100K | 90 - 99%       | 1 - 2%             | 30 - 90%       | 2 - 4%
retail, BMS-pos, T10I4D100K                  | 30 - 90%       | 3 - 5%             | 30 - 90%       | 3 - 5%
"Small iterations" dominate the computation time; the "strategy change" is not a heavy task
More Advantages
• Reconstruction of prefix trees is a heavy task — the complete prefix tree needs no reconstruction
• Coding prefix trees is not easy — the complete prefix tree is easy to code
• The radix sort used for detecting identical transactions is heavy when the data is dense — bitmaps for the dense parts accelerate the radix sort
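The duplicate-detection step can be sketched as follows. This is my own illustration, not LCM's radix sort: a plain comparison sort stands in for it, but it shows the point that the dense part compares as a single machine word, so dense columns no longer slow the sort down.

```python
# Sketch: detect identical transactions by sorting (dense bitmap, sparse list)
# pairs and merging equal neighbors into one weighted transaction.
def merge_duplicates(dense, sparse):
    """Return unique (bitmap, sparse_list) pairs with their multiplicities."""
    order = sorted(range(len(dense)), key=lambda i: (dense[i], sparse[i]))
    merged = []
    for i in order:
        key = (dense[i], sparse[i])
        if merged and merged[-1][0] == key:
            merged[-1] = (key, merged[-1][1] + 1)  # duplicate: bump its weight
        else:
            merged.append((key, 1))                # new distinct transaction
    return merged
```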
For Closed/Maximal Itemsets
• Closure/maximality can be computed by storing the previously obtained itemsets — then no additional function is needed
• Depth-first search (closure extension type) needs the prefix of each itemset
By taking the intersection / weighted union of the prefixes at each node of the prefix tree, we can compute them efficiently (from LCM v2)

[Figure: a prefix attached to each node of the complete prefix tree]
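For reference, the closure operation itself can be sketched directly: the closure of an itemset is the intersection of all transactions that contain it. This is my own naive illustration assuming set-valued transactions; LCM's prefix-based computation at the tree nodes is an optimization of exactly this.

```python
# Naive closure: intersect every transaction containing the itemset.
def closure(itemset, transactions):
    """Return the closure of `itemset` (assumes it occurs at least once)."""
    occ = [t for t in transactions if itemset <= t]  # occurrences of the itemset
    result = set(occ[0])
    for t in occ[1:]:
        result &= t                                  # intersect all occurrences
    return result
```

For the transactions {a,b,c}, {a,c,d}, {c,d}, the closure of {a} is {a,c}: every transaction containing a also contains c.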
Experiments
CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux
Compared with: FP-growth, afopt, MAFIA, PatriciaMine, kDCI, nonordfp, aim2, DCI-closed (all of which scored highly at the FIMI04 competition)
14 datasets from the FIMI repository
Memory usage decreased to half for dense datasets, but not for sparse datasets
We applied the data structure to LCM2
Experimental Results
Discussion and Future Work
• Combining bitmaps and array lists reduces memory usage efficiently for dense datasets
• Using prefix trees over a constant number of items is sufficient to speed up frequency counting on non-sparse datasets
• The data structure is orthogonal to other methods for closed/maximal itemset mining: maximality checks, pruning, closure operations, etc.
• Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structures give huge variations, which are hard to represent by bits or to share)
• Simplify the techniques so that they can be applied easily
• Stable memory allocation (no need for dynamic allocation)
Future work: other pattern mining problems