A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce

L.A.C.A.M.KDDE

Fabio Fumarola and Donato Malerba

Department of Computer Science

University of Bari “Aldo Moro”

via Orabona, 4, I-70125 Bari, Italy

Outline

• Frequent Pattern Mining
• Frequent Itemset Mining using MapReduce
  – Issues related to data distribution
• Discover “Approximate” frequent itemsets with a statistical error guarantee
• MrAdam
  – Chernoff Bound
  – Local Model Discovery
  – Global Combination and Interpolation
• Experiments
• Conclusions

Fumarola and Malerba – A parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce

Frequent Pattern Analysis

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al. in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  – What products were often purchased together?
  – What kinds of DNA are sensitive to this new drug?
• Applications: basket data analysis, cross-marketing, catalog design, Web log analysis, and DNA sequence analysis

Why Is Frequent Pattern Mining Important?

Foundation for many essential data mining tasks:
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering


Basic Concepts: Frequent Patterns

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: the frequency (number of occurrences) of the itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

[Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
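
The definitions of absolute and relative support can be checked against the five transactions in the table. The sketch below is a brute-force illustration only: the threshold `MINSUP = 0.5`, the cap `max_size=2`, and all function names are assumptions made for this example and do not come from the slides.

```python
from itertools import combinations

# Transactions from the table, keyed by Tid.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, db):
    """Absolute support: number of transactions containing every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def relative_support(itemset, db):
    """Relative support: fraction of transactions containing `itemset`."""
    return support_count(itemset, db) / len(db)


def frequent_itemsets(db, minsup, max_size=2):
    """Brute-force enumeration of itemsets up to `max_size` items (toy data only)."""
    items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = relative_support(set(candidate), db)
            if s >= minsup:
                result[candidate] = s
    return result


MINSUP = 0.5  # assumed threshold, not stated on the slide
frequent = frequent_itemsets(transactions, MINSUP)
for itemset, s in frequent.items():
    print(itemset, s)
```

Under this assumed threshold, the only frequent pair is {Beer, Diaper}, which appears in 3 of the 5 transactions.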


FIM & MapReduce

• Several algorithms have been proposed for Frequent Itemset Mining using MapReduce.

• However, their computation time is negatively affected by:
  – Inter-communication costs,
  – In-process synchronizations,
  – The need for a balanced data distribution, and
  – Input parameter tuning.


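Problem

• Frequent Itemset Mining is not Map-Reducible.

minsup = 0.5

Partition 1: {a,b,c} {a,b} {a,b,c} {a,c,d}    {a,b} = 3/4
Partition 2: {a,b,c} {a} {a,b,c}              {a,b} = 2/3
Partition 3: {a,c} {a} {a,b,c}                {a,b} = 1/3

{a,b} is not locally frequent in Partition 3, so that partition reports nothing for it and contributes 0 to the combined estimate:

{a,b} = (3/4 + 2/3 + 0)/3 < 0.5

The itemset is therefore discarded, although its exact support over all 10 transactions is 6/10 ≥ 0.5.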
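
The arithmetic of this example can be replayed in a few lines. The sketch below is not part of the slides: it assumes one assignment of the ten transactions to three partitions that is consistent with the stated local supports (3/4, 2/3, 1/3), and the variable and function names are illustrative.

```python
from fractions import Fraction

# Assumed partitioning of the ten transactions, consistent with the
# local supports 3/4, 2/3 and 1/3 given in the example.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}],
    [{"a", "b", "c"}, {"a"}, {"a", "b", "c"}],
    [{"a", "c"}, {"a"}, {"a", "b", "c"}],
]
MINSUP = Fraction(1, 2)
target = {"a", "b"}


def support(itemset, transactions):
    """Exact relative support of `itemset` in a list of transactions."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


# A mapper reports an itemset only when it is locally frequent;
# a missing report is treated as 0 when results are combined.
reported = []
for part in partitions:
    s = support(target, part)
    reported.append(s if s >= MINSUP else Fraction(0))

combined = sum(reported) / len(partitions)

# Exact support computed on the whole dataset, for comparison.
all_transactions = [t for part in partitions for t in part]
exact = support(target, all_transactions)

print("reported per partition:", reported)
print("combined estimate:", combined, "frequent:", combined >= MINSUP)
print("exact support:", exact, "frequent:", exact >= MINSUP)
```

The combined estimate falls below minsup while the exact support does not, which is the false negative the slide illustrates.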


Research Goals

Can we still make reasonable decisions in the absence of perfect answers?

(Yes…)

If we introduce an error, is the result still acceptable? (Not always)


Contribution

• To mine “approximate” frequent itemsets from Big Data with statistical error guarantees.

• We want to avoid:
  – Inter-communication costs,
  – In-process synchronizations,
  – Additional input parameter tuning.


Literature (DDM)

• In 1996, Agrawal and Shafer proposed approaches based on Count Distribution, Data Distribution, and Candidate Distribution
• In 2002, Orlando et al. proposed the Partition algorithm
• In 2004, Ashrafi et al. proposed ODAM, where the mining process is synchronized via message passing
• Still, all the proposed solutions are based on the Apriori approach
  – Issues: synchronization, data balancing


Literature: MapReduce

• However, Google introduced:
  – 2003: Google File System
  – 2004: MapReduce
• In 2008, Li et al. proposed Parallel FP-Growth (PFP):
  1. Parallel and distributed counting of frequent items
  2. MapReduce round to generate group-dependent transactions
  3. FP-growth applied to each group
• VLDB 2010, DBKA 2010, KDD 2011: evolutions
• Issues: data replication, inter-communication, and synchronization


Literature: MapReduce

• Riondato et al. [CIKM 2012] proposed an approach which:
  – Creates samples of the dataset
  – Extracts frequent itemsets with support ≥ (minsup – ε/2) from the samples
  – Combines the frequent itemsets
• Parameters: number of transactions, allowed error ε, probability δ, replication parameter Φ, and the type of Mapper to be used (Partition, Binomial, Count Flipper, Input Sample)
• MATLAB script for parameter tuning
• Composed of 2 MapReduce rounds


MrAdam

• Given a dataset stored on HDFS, MrAdam takes:
  – Input: 2 parameters, minsup and a value for the reliability parameter δ
  – Output: a (1 – δ) approximation of the exact set of frequent itemsets
• MrAdam combines:
  – A statistical approach based on the Chernoff bound
  – A MapReduce-based algorithm
  – The SE-Tree to combine intermediate results
  – An interpolation function to estimate the support of an itemset


Chernoff bound

• It bounds the probability that a random variable with a Binomial distribution deviates from its expected value by more than a given error.

• We can use it to express the allowed error ε in terms of δ.
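
The slides do not show which form of the bound MrAdam uses. As a rough illustration, the sketch below assumes the common additive (Hoeffding) form, P(|p̂ − p| ≥ ε) ≤ 2·exp(−2nε²), for the observed frequency p̂ of an itemset over n transactions, and solves it for ε. The function name and the choice of n are assumptions.

```python
import math


def chernoff_epsilon(n, delta):
    """Smallest eps with 2 * exp(-2 * n * eps**2) <= delta.

    Assumes the additive (Hoeffding) form of the Chernoff bound for the
    mean of n independent Bernoulli trials; MrAdam may use another form.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


# Example: error allowed with probability at least 1 - delta = 0.95
eps = chernoff_epsilon(n=10_000, delta=0.05)
print(f"eps = {eps:.4f}")
```

In this form, ε shrinks with the square root of n: quadrupling the number of transactions halves the allowed error for the same δ.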


MapReduce

• We modeled the computation as Map and Reduce functions.

1. map(key: LongWritable, value: Text, context: Context)

2. reduce(key: Text, vals: Iterable[LongWritable], context: Context)
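
Only the two signatures are given, so the function bodies below are assumptions. This in-memory Python sketch mimics the data flow those signatures imply: the map function receives a byte offset and one line of text, the reduce function receives a text key with an iterable of counts. For brevity it counts single items rather than full itemsets, and `run_job` stands in for the Hadoop shuffle.

```python
from collections import defaultdict


def map_fn(key, value, emit):
    """key: byte offset of the line; value: one transaction as text."""
    for item in sorted(set(value.split())):
        emit(item, 1)


def reduce_fn(key, vals, emit):
    """key: an item as text; vals: the counts emitted for that key."""
    emit(key, sum(vals))


def run_job(lines):
    """Hypothetical local driver: map every line, group by key, reduce."""
    shuffled = defaultdict(list)

    def collect_map(k, v):
        shuffled[k].append(v)

    offset = 0
    for line in lines:
        map_fn(offset, line, collect_map)
        offset += len(line) + 1  # +1 for the newline separator

    output = {}

    def collect_reduce(k, v):
        output[k] = v

    for k in sorted(shuffled):
        reduce_fn(k, shuffled[k], collect_reduce)
    return output


counts = run_job(["a b c", "a b", "a c d"])
print(counts)
```

On a real cluster the grouping step is performed by the framework between the two phases; here a dictionary of lists plays that role.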


1. Local Model Discovery

Input: minsup, δ, HDFS folder

1. It uses the Chernoff bound to compute the maximum acceptable error ε
2. It executes a Map step with FP-growth as its routine and returns the collection of itemsets with support of at least (minsup – ε)
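
The effect of lowering the local threshold to (minsup – ε) can be sketched on a single partition. The code below is an illustration under stated assumptions: a brute-force level-wise miner replaces FP-growth, ε is passed in directly, and the name `mine_local` is hypothetical.

```python
from itertools import combinations


def mine_local(partition, minsup, eps):
    """Return {itemset: count} for itemsets with support >= minsup - eps.

    Brute-force level-wise search standing in for FP-growth; suitable
    only for toy partitions.
    """
    threshold = max(minsup - eps, 0.0) * len(partition)
    items = sorted(set().union(*partition))
    found = {}
    k = 1
    while True:
        level = {}
        for cand in combinations(items, k):
            count = sum(1 for t in partition if set(cand) <= t)
            if count > 0 and count >= threshold:
                level[cand] = count
        if not level:
            break
        found.update(level)
        # Only items that occur in some frequent k-itemset can extend it.
        items = sorted({i for cand in level for i in cand})
        k += 1
    return found


partition = [{"a", "c"}, {"a"}, {"a", "b", "c"}]
strict = mine_local(partition, minsup=0.5, eps=0.0)
relaxed = mine_local(partition, minsup=0.5, eps=0.2)
print(strict)
print(relaxed)
```

With ε = 0 the partition does not report {a,b}; with the relaxed threshold it does, so the count is available when the partial results are later combined.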


2. Global Combination

• The partial results are aggregated using an SE-Tree

• The SE-Tree enumerates the ordered collection of the discovered frequent itemsets

• Baseline Approach: the support is computed by summing the values in the SE-Tree nodes
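
A minimal sketch of the baseline approach follows. It assumes that the SE-Tree can be modelled as a prefix tree over lexicographically sorted items, with each node holding the summed count of the itemset spelled by its path. The class names, the `combine` helper, and the input format are assumptions, not the authors' implementation.

```python
class SENode:
    def __init__(self):
        self.children = {}  # item -> SENode
        self.count = 0      # summed count over all partitions
        self.reports = 0    # number of partitions that reported the itemset


class SETree:
    """Simplified set-enumeration tree keyed by sorted items."""

    def __init__(self):
        self.root = SENode()

    def add(self, itemset, count):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, SENode())
        node.count += count
        node.reports += 1

    def itemsets(self):
        """Yield (itemset, summed count) in lexicographic enumeration order."""
        stack = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.reports:
                yield prefix, node.count
            for item in sorted(node.children, reverse=True):
                stack.append((prefix + (item,), node.children[item]))


def combine(local_results, n_total, minsup):
    """Baseline: sum the local counts and keep globally frequent itemsets."""
    tree = SETree()
    for result in local_results:
        for itemset, count in result.items():
            tree.add(itemset, count)
    return {x: c for x, c in tree.itemsets() if c >= minsup * n_total}


local_results = [
    {("a",): 4, ("a", "b"): 3},
    {("a",): 3, ("a", "b"): 2},
    {("a",): 3},  # {a,b} not reported by this partition
]
print(combine(local_results, n_total=10, minsup=0.5))
```

Summing only the reported counts can underestimate an itemset's support when some partition stays silent, which is the gap the interpolation step is meant to address.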


3. Structural Interpolated Support

Let S be an SE-Tree, X a candidate k-itemset infrequent on Di, and P(X) the set of (k-1)-subitemsets of X.

• The structural interpolated support of X is accepted if:
  1. None of the (k-1)-subitemsets of X is marked as approximated
  2. [second condition not preserved in the source text]

Method: MrAdam-Chernoff

EXPERIMENTS


Experiment: Goal

1. Measure the runtime overhead of MrAdam on Hadoop w.r.t. FP-Growth

2. Evaluate the performance of MrAdam w.r.t. the literature


Experimental Settings

• MrAdam implemented using Hadoop 2.2
• Private cluster composed of 8 VMs
  – Intel Xeon 2.2GHz
  – 8GB of RAM
• One VM equipped with 32GB of RAM

Mushroom


Pumsb


Accident


LARGE DATASETS


Mushroom Large


Pumsb


Scalability


Conclusions

We presented the MrAdam algorithm, which:
– Does not require any time-consuming communication and synchronization activity,
– Generates in parallel the sets of locally frequent itemsets, and
– Returns a (1-δ) approximation of the collection of the globally frequent itemsets by aggregation and interpolation.

Experiments show that MrAdam is from 2 to 100 times faster than PFP.


Date post:	11-May-2015
Category:	Engineering
Upload:	fabio-fumarola