Date post: | 11-May-2015 |
Upload: | fabio-fumarola |
L.A.C.A.M.KDDE
A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
Fabio Fumarola and Donato Malerba
Dipartimento di INFORMATICA
Department of Computer Science
University of Bari “Aldo Moro”
via Orabona, 4, I-70125 Bari, Italy
Outline
• Frequent Pattern Mining
• Frequent Itemset Mining using MapReduce
– Issues related to data distribution
• Discover “Approximate” frequent itemsets with a statistical error guarantee
• MrAdam
– Chernoff Bound
– Local Model Discovery
– Global Combination and Interpolation
• Experiments
• Conclusions
Fumarola and Malerba – A parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce
Frequent Pattern Analysis
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al. in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?
– What kinds of DNA are sensitive to this new drug?
• Application: Basket data analysis, cross-marketing, catalog design, Web log analysis, and DNA sequence analysis
Why Is Frequent Pattern Mining Important?
Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
Basic Concepts: Frequent Patterns
• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
• (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]
Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk
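The definitions above can be checked on the five transactions of the table with a short Python sketch. The function names `support_count` and `relative_support` and the threshold `minsup = 0.5` are illustrative choices, not part of the original slides.

```python
# Transactions from the table, keyed by Tid.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for items in transactions.values() if itemset <= items)


def relative_support(itemset, transactions):
    """Relative support: fraction of transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)


minsup = 0.5  # hypothetical threshold, chosen only for this example

beer_diaper = relative_support({"Beer", "Diaper"}, transactions)  # Tids 10, 20, 30
nuts_eggs = relative_support({"Nuts", "Eggs"}, transactions)      # Tids 40, 50

print("{Beer, Diaper}:", beer_diaper, "frequent:", beer_diaper >= minsup)
print("{Nuts, Eggs}:", nuts_eggs, "frequent:", nuts_eggs >= minsup)
```

With this threshold, {Beer, Diaper} (3 of 5 transactions) is frequent, while {Nuts, Eggs} (2 of 5) is not.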
FIM & MapReduce
• Several algorithms have been proposed for Frequent Itemset Mining using MapReduce.
• However, their computation time is negatively affected by:
– Inter-communication costs,
– In-process synchronizations,
– The need for a balanced data distribution, and
– Input parameter tuning.
Problem
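• Frequent Itemset Mining is not Map-Reducible.
[Figure: ten transactions split across three partitions: {a,b,c}, {a,b}, {a,b,c}, {a,c,d}, {a,b,c}, {a}, {a,b,c}, {a,c}, {a}, {a,b,c}]
minsup >= 0.5
Local supports: {a,b} = 3/4, {a,b} = 2/3, {a,b} = 1/3
Combined: {a,b} = (3/4 + 2/3 + 0)/3 < 0.5 (the third partition does not report {a,b}, since 1/3 < minsup), although {a,b} occurs in 6 of the 10 transactions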
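The effect can be reproduced with a small Python sketch. The slide does not say which transaction belongs to which partition, so the assignment below is an assumption chosen only to match the local supports 3/4, 2/3, and 1/3. The names `local_support`, `naive_estimate`, and `true_support` are illustrative.

```python
MINSUP = 0.5
TARGET = frozenset({"a", "b"})

# Hypothetical assignment of the ten transactions to three partitions,
# chosen so that the local supports of {a,b} are 3/4, 2/3 and 1/3.
partitions = [
    [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a", "c", "d"}],
    [{"a", "b"}, {"a", "b", "c"}, {"a"}],
    [{"a", "b", "c"}, {"a", "c"}, {"a"}],
]


def local_support(itemset, partition):
    """Fraction of the partition's transactions that contain itemset."""
    return sum(1 for t in partition if itemset <= t) / len(partition)


local = [local_support(TARGET, p) for p in partitions]

# Each partition only reports itemsets that are locally frequent,
# so a locally infrequent itemset contributes 0 to the combination.
reported = [s if s >= MINSUP else 0.0 for s in local]
naive_estimate = sum(reported) / len(partitions)

# Ground truth computed on the whole dataset.
all_transactions = [t for p in partitions for t in p]
true_support = local_support(TARGET, all_transactions)

print("local supports:", local)
print("naive combined estimate:", naive_estimate)  # below MINSUP
print("true global support:", true_support)        # above MINSUP
```

The combined estimate falls below minsup even though the itemset is globally frequent, which is why independently mined partitions cannot simply be merged.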
Research Goals
Can we still make reasonable decisions in the absence of perfect answers?
(Yes…)
If we introduce an error, is the result still acceptable? (Not always)
Contribution
• To mine “approximate” frequent itemsets from Big Data with statistical error guarantees.
• We want to avoid:
– Inter-communication costs,
– In-process synchronizations,
– Additional input parameter tuning.
Literature (DDM)
• In 1996, Agrawal and Shafer proposed approaches based on: Count Distribution, Data Distribution, and Candidate Distribution
• In 2002, Orlando et al. proposed the Partition Algorithm
• In 2004, Ashrafi et al. proposed ODAM, where the mining process is synchronized via message passing
• Still, all the proposed solutions are based on the Apriori approach
– Issues: synchronization, data balancing
Literature: MapReduce
• However, Google introduced
– 2003 - Google File System,
– 2004 - MapReduce,
• In 2008: Li et al. proposed Parallel FP-Growth (PFP)
1. Parallel and distributed counting of frequent items
2. MapReduce round to generate group-dependent transactions
3. FP-growth applied to each group
• VLDB 2010, DBKA 2010, KDD 2011 evolutions
Issues: data replication, inter-communication and synchronizations
Literature: MapReduce
• Riondato et al. [CIKM 2012] proposed an approach, which:
– Creates samples of the dataset
– Extracts frequent itemsets with support ≥ (minsup – ε/2) from the samples
– Combines the frequent itemsets
• Parameters: number of transactions, allowed error ε, probability δ, replication parameter Φ, and the type of Mapper to be used (Partition, Binomial, Count Flipper, Input Sample)
• Matlab script for parameter tuning
• Composed of 2 MapReduce rounds
MrAdam
• Given a dataset stored on HDFS, MrAdam takes:
– Input: 2 parameters, minsup and a value for the reliability parameter δ.
– Output: a (1 – δ) approximation of the exact set of frequent itemsets
• MrAdam combines:
– A statistical approach based on the Chernoff bound
– A MapReduce based algorithm
– The SE-Tree to combine intermediate results
– An interpolation function to estimate the support of an itemset
Chernoff bound
• It bounds the probability that a random variable with Binomial distribution deviates from its expected value.
• We can use it to express the allowed error ε in terms of δ.
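The slides do not give the exact form of the bound, so the following is only a sketch under an assumption. It uses the common two-sided Chernoff-Hoeffding form Pr(|s' − s| ≥ ε) ≤ 2·exp(−2nε²), where s' is the relative support measured on n transactions, and solves it for ε. The function name `epsilon_from_delta` and the sample values are hypothetical, and MrAdam may rely on a different variant of the bound.

```python
import math


def epsilon_from_delta(n, delta):
    """Solve 2 * exp(-2 * n * eps**2) = delta for eps.

    n     -- number of transactions the support is measured on
    delta -- allowed probability that the error exceeds eps
    Assumes the two-sided Hoeffding form of the bound (an assumption,
    not necessarily the form used by MrAdam).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


eps = epsilon_from_delta(n=10_000, delta=0.05)
print(f"eps for n=10000, delta=0.05: {eps:.4f}")
```

Under this form, ε shrinks as the number of transactions grows and widens as δ gets smaller, so a stricter reliability requirement costs a looser support threshold.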
MapReduce
• We modeled the computation as Map and Reduce functions.
1. map(key: LongWritable, value: Text, context: Context)
2. reduce(key: Text, vals: Iterable[LongWritable], context: Context)
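As a rough illustration of how these two signatures fit together, the sketch below simulates the map, shuffle, and reduce phases in plain Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, the key passed to the mapper plays the role of the byte offset (LongWritable), the value is one text line, and the mapper simply emits every item of the line with count 1, which is an assumption made for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: byte offset of the line; value: one transaction as text."""
    for item in sorted(set(value.split())):
        yield item, 1  # (Text, LongWritable) pair in Hadoop terms


def reduce_fn(key, vals):
    """key: item; vals: all counts emitted for that item."""
    yield key, sum(vals)


def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    offset = 0
    for line in lines:
        for k, v in map_fn(offset, line):
            groups[k].append(v)
        offset += len(line) + 1  # +1 for the newline character
    result = {}
    for k in sorted(groups):
        for out_key, out_val in reduce_fn(k, groups[k]):
            result[out_key] = out_val
    return result


counts = run_job(["a b c", "a b", "a c d"])
print(counts)
```

The grouping step in `run_job` stands in for the shuffle that the framework performs between the two functions.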
1. Local Model Discovery
Input: minsup, δ, HDFS folder
1. It uses the Chernoff bound to compute the maximum acceptable error ε
2. It executes a Map step with FP-growth as a routine and returns the collection of frequent itemsets with support ≥ (minsup – ε)
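A minimal sketch of step 2 on a single partition, under stated assumptions: a brute-force subset enumeration is swapped in for FP-growth to keep the example short, ε is passed in directly rather than derived from δ, and the name `local_frequent_itemsets` and the sample partition are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_frequent_itemsets(partition, minsup, eps):
    """Return {itemset: count} for itemsets with local support >= minsup - eps.

    Brute-force enumeration of all subsets of each transaction; this is a
    stand-in for FP-growth and is only practical for tiny examples.
    """
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[subset] += 1
    threshold = (minsup - eps) * len(partition)
    return {s: c for s, c in counts.items() if c >= threshold}


partition = [{"a", "b", "c"}, {"a", "c"}, {"a"}]

strict = local_frequent_itemsets(partition, minsup=0.5, eps=0.0)
relaxed = local_frequent_itemsets(partition, minsup=0.5, eps=0.2)

print("strict :", sorted(strict))
print("relaxed:", sorted(relaxed))
```

With the lowered threshold, {a,b} (local support 1/3) is kept by this partition instead of being dropped, so its count stays available for the later combination step.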
2. Global Combination
• The partial results are aggregated using an SE-Tree
• The SE-Tree enumerates the ordered collection of the discovered frequent itemsets
• Baseline Approach: the support is computed by summing the values in the SE-Tree nodes
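A minimal sketch of the baseline combination, assuming a simplified stand-in for the SE-Tree: a prefix tree over itemsets whose items are kept in sorted order, where each node stores the summed local count. `SENode`, `insert`, `combine`, `enumerate_counts`, and the sample local results are hypothetical and are not taken from MrAdam.

```python
class SENode:
    """One node of a simplified set-enumeration (prefix) tree."""

    def __init__(self):
        self.count = 0      # summed support count of the itemset ending here
        self.children = {}  # next item -> SENode


def insert(root, itemset, count):
    """Add count to the node reached by following the sorted items."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, SENode())
    node.count += count


def combine(local_results):
    """Baseline: sum the counts reported by every partition."""
    root = SENode()
    for result in local_results:
        for itemset, count in result.items():
            insert(root, itemset, count)
    return root


def enumerate_counts(node, prefix=()):
    """Yield (itemset, count) pairs in the tree's enumeration order."""
    for item in sorted(node.children):
        child = node.children[item]
        path = prefix + (item,)
        yield path, child.count
        yield from enumerate_counts(child, path)


# Hypothetical local results from three partitions.
local_results = [
    {("a",): 4, ("b",): 3, ("a", "b"): 3},
    {("a",): 3, ("b",): 2, ("a", "b"): 2},
    {("a",): 3},  # this partition did not report {a,b}
]

totals = dict(enumerate_counts(combine(local_results)))
print(totals)
```

In this example the third partition reports nothing for {a,b}, so the summed count can underestimate the true global count when an itemset is infrequent on some partition.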
3. Structural Interpolated Support:
Let S be an SE-Tree, X a candidate k-itemset infrequent on Di, and P(X) the set of (k-1)-subitemsets of X
• The structural interpolated support of X is accepted if:
1. None of the (k-1)-subitemsets of X is marked as approximated
2.
Method: MrAdam-Chernoff
EXPERIMENTS
Experiment: Goals
1. Measure the runtime overhead of MrAdam on Hadoop w.r.t. FP-Growth
2. Evaluate the performance of MrAdam w.r.t. the literature.
Experimental Settings
• MrAdam implemented using Hadoop 2.2
• Private cluster composed of 8 VMs
– Intel Xeon 2.2GHz
– 8GB of RAM
• One VM equipped with 32GB of RAM
Mushroom
Pumsb
Accident
LARGE DATASETS
Mushroom Large
Pumsb
Scalability
Conclusions
We presented the MrAdam algorithm, which:
– Does not require any time-consuming communication or synchronization activity,
– Generates in parallel the sets of locally frequent itemsets, and
– Returns a (1-δ) approximation of the collection of the globally frequent itemsets by aggregation and interpolation.
– Experiments show that MrAdam is from 2 to 100 times faster than PFP.