[IEEE Seventh IEEE International Conference on Data Mining (ICDM 2007) - Omaha, NE, USA...


Mining statistical information of frequent fault-tolerant patterns in transactional databases

Ardian Kristanto Poernomo
Nanyang Technological University
Singapore, 639798

Vivekanand Gopalkrishnan
Nanyang Technological University
Singapore, 639798

Abstract

Constraints applied on classic frequent patterns are too strict and may cause interesting patterns to be missed. Hence, researchers have proposed to mine a more relaxed version of frequent patterns, where transactions are allowed to miss some items in the itemset they support. Patterns exhibiting such “faults” are called frequent fault-tolerant patterns (FFT-patterns) if they are significant in number. In this paper, the term “pattern” is distinguished from “itemset” as referring to a pair (tidset × itemset).

Unlike classical frequent patterns, the number of FFT-patterns grows exponentially not only with the number of items, but also with the number of transactions. Since the latter may reach millions, mining FFT-patterns by enumerating them becomes infeasible. Hence, the challenge is to represent FFT-patterns concisely without losing any useful information. To address this, we draw on the observation that, in transactional databases, the transactions themselves are not important from the data mining point of view; i.e. researchers are interested in finding itemsets contained in many transactions, rather than in the transactions per se. Therefore, we propose to mine only the frequent itemsets along with the statistical information of the supporting transaction sets, rather than enumerate entire FFT-patterns. We then present our approach, the BIAS framework, consisting of a backtracking algorithm, Integer Linear Programming (ILP) constraints, and aggregation statistics to solve this problem. Algorithms under this framework not only increase the efficiency of the FFT-pattern mining process by more than an order of magnitude, but also provide a more comprehensive analysis of FFT-patterns.

1 Introduction

The frequent pattern mining problem [2] has been widely studied in data mining. Some researchers use the terms pattern and itemset interchangeably, but in this paper, we distinguish the term pattern as a pair (tidset × itemset) instead of an itemset only.

TID   Itemset
1     c, d
2     b, c
3     a, b, d
4     a, b

TID   a  b  c  d
1     0  0  1  1
2     0  1  1  0
3     1  1  0  1
4     1  1  0  0

Figure 1. Transactional database: table and matrix form (in the matrix, 1 marks an item contained in the transaction)

Yang et al. [14] argued that analysts would find it useful to mine groups of similar transactions that share most of the items, but these cannot be discovered by traditional approaches due to their harsh definition of support. Hence, they proposed to mine a more relaxed version of frequent patterns, where transactions are allowed to miss some items in the itemset they support.

Patterns significantly supported in such a manner are called frequent fault-tolerant patterns (FFT-patterns for short). Let us consider the transactional database shown in Figure 1, which will be used as a running example in the rest of this paper. The database consists of a set of transactions (henceforth called tidset for brevity) over an itemset, and is illustrated in table and matrix forms. Here, we highlight some examples of FFT-patterns. We can see that when supporting transactions are allowed to miss one item in the itemset they support, three transactions, viz. 2, 3, 4, support itemset {a, b}; when supported items are allowed to be missed by one transaction in their supporting set, itemset {a, b, c, d} is supported by tidsets {1, 3}, {1, 4} and {2, 3}; and when three such transaction-item misses are allowed in a pair (tidset × itemset), ({2, 3, 4} × {a, b, c}), ({1, 2, 3} × {b, c, d}) and ({2, 3} × {a, b, c, d}) satisfy the criteria. These examples, while not comprehensive, illustrate the variety of FFT-patterns.

Seventh IEEE International Conference on Data Mining
1550-4786/07 $25.00 © 2007 IEEE
DOI 10.1109/ICDM.2007.48

Consequently, several definitions of FFT-patterns have been proposed [9, 10, 12, 3, 6, 7, 8, 11] to accommodate this diversity. However, a general problem with FFT-patterns is that their number grows exponentially not only with the number of items, but also with the number of transactions. Since the number of transactions may reach millions, mining FFT-patterns by enumerating them is infeasible. Several approaches have been proposed to tackle this problem: [9, 14] relax the constraints, while [3, 10] add more constraints, and [14, 7, 8] perform approximation. Nevertheless, all of these approaches either still produce too many results, or might lose interesting patterns.

In transactional databases, the transactions themselves are not important from the data mining point of view, i.e. the interest is in finding itemsets which are contained in many transactions, instead of finding the transactions themselves. Therefore, we propose to mine all frequent itemsets, along with the statistical information of the supporting transaction sets (formally defined in section 3.1). We choose this approach because (1) it does not lose any interesting itemsets (no false negatives), and (2) we find that the statistical information can concisely describe the characteristics of huge data.

In this paper, we formulate the general definitions of FFT-patterns in terms of basic relaxation criteria, and show how new and existing variants of such patterns can be formed from combinations of those criteria. Then we present our BIAS framework¹, consisting of a backtracking algorithm, Integer Linear Programming (ILP) constraints, and aggregation statistics to solve this problem, and demonstrate its application to all problem variants. Note that ILP is used here to define the constraints of a combinatorial problem, and not to optimize an objective function as in usual LP. This framework can be used to mine statistical information of the FFT-patterns efficiently without generating the patterns themselves.

The rest of this paper is organized as follows. Section 2 discusses previous work related to the problem of FFT-pattern mining, and links it to the variants described in our paper. Section 3 presents the general problem definition, and shows how variants of FFT-patterns can be derived from constraints. Section 4 describes the BIAS framework to mine frequent itemsets as well as the statistics of their supporting tidsets. Algorithms under this framework are explained in detail for each variant in section 5, where they are also compared with related techniques, and section 6 shows the experimental results of these comparisons. Finally, section 7 concludes with some directions for future work.

¹ BIAS is an acronym for Backtracking, Integer linear programming, and Aggregation Statistics.

2 Related Work

Pioneering work in FFT-pattern mining was done by Yang et al. [14] in 2001, where the amount of relaxation is proportional to the size of the pattern. An FFT-pattern, also called strong-ETI, is defined as a sub-matrix of itemset and transaction-set, where each transaction contains a certain proportion of the itemset. Besides [14], other papers mining FFT-patterns using proportion are [10, 7, 8]. In [10] and [14], the objective is to find interesting patterns, while [7] and [8] aim to approximate the pure frequent itemsets in noisy data.

FFT-pattern mining with proportional relaxation, though natural, is intractable [3], because this problem lacks the anti-monotonicity property² of itemsets. With the anti-monotonicity property, we can safely stop extending an itemset once it is not frequent, so the mining complexity is proportional to the number of frequent itemsets. This is not the case when mining using proportional relaxation. Moreover, techniques using fixed-value relaxation [9, 12, 3, 11] demonstrate that interesting itemsets can be mined in real-world data. Since our paper focuses on fixed-value fault-tolerance, we skip details on these other variants.

The first piece of work using a fixed value as relaxation in FFT-pattern mining was authored by Pei et al. [9]. The definition of FFT-pattern there is similar to the one given in [14], but they employ fixed-value relaxation instead of proportional relaxation, i.e. every transaction is allowed to miss a fixed number of items. This definition is equivalent to that of the FFT_T-pattern in our paper (cf. section 3.2). To mine such patterns, the authors used an apriori-based method. Koh and Yo [6] tackled the same problem using a backtracking approach, and empirically showed that their approach improves the efficiency of the previous approach by an order of magnitude.

A more complicated problem was tackled by Sim et al. [11], where the objective is to mine maximal quasi-bicliques in bipartite graphs. A quasi-biclique is maximal if it is not a proper subset of any other quasi-biclique. In [11], the relaxation criterion is applied to both sides of the bipartite graph, i.e. each vertex is allowed to be disconnected from at most ε vertices in the other partite-set. To solve this problem, they proposed the algorithm CompleteQB, which is based on the backtracking approach. The constraints for this problem are exactly the same as for our FFT_TI-pattern (cf. section 3.2).

² The anti-monotonicity property means that if f(X) is true for a function f and a set X, then f(X′) is also true for every X′ ⊆ X.

Besson et al. [3] relate transactional databases to formal concept analysis in lattice theory. They compared several definitions of FFT-patterns (DRBS, CBS, and FBS) with the objective of finding interesting patterns. However, all those definitions incorporate a consistency constraint, which restricts all transactions (resp. items) outside the pattern to contain fewer items (resp. transactions) than those inside the pattern. The implications of this constraint are illustrated in Example 1.

Example 1. Consider the database depicted in Figure 1. Let the minimum support be 3, and the allowed number of misses in each transaction and item be 1. No FFT-pattern with itemset {b, c} can include transactions 3 and 4 together, since they both miss item c (the number of misses for item c becomes 2). Conversely, neither of them can be excluded from any pattern that contains transaction 2, as it would violate the consistency constraint. As a result, {b, c} is not considered interesting (frequent). On the other hand, if the dataset had contained only transactions 1, 2 and 3 (or 1, 2 and 4), we would find patterns that support itemset {b, c}, which would then be considered interesting. Therefore we conclude that such constraints are not appropriate for mining itemsets with fixed tolerance in transactional databases.

3 Problem Definition

3.1 Basic Definition and Notation

In this paper, we deal with transactional databases. A transactional database D consists of transactions T over a set of items I, and can be represented by a (T × I) matrix of transactions and items.

T(x ∈ I) (resp. T(X ⊆ I)) represents the tidset in which each transaction contains item x (resp. itemset X). Similarly, I(y ∈ T) (resp. I(Y ⊆ T)) represents the itemset in which each item is contained in transaction y (resp. tidset Y).

Definition 1. [PATTERN] A pattern is defined as a sub-matrix ((Y ⊆ T) × (X ⊆ I)) of D. A pattern (Y × X) is said to be associated with itemset X.

In classic frequent itemset mining, a pattern P is interesting if all its transactions contain all its items, i.e. all (tid, item) pairs in the pattern are present in the database. In fault-tolerant mining, we relax this constraint by allowing P to contain a constant number of (tid, item) pairs which are actually missing in the database. This relaxation is formally defined by the following relaxation criteria, where we distinguish between the allowed number of misses in each transaction, each item, and the entire pattern.

Transaction relaxation. In a pattern (Y × X), the number of misses in each transaction is at most ε_t, i.e. $\forall y \in Y,\ |I(y) \cap X| \ge |X| - \varepsilon_t$. This relaxation permits more transactions to be included in the pattern.

Item relaxation. In a pattern (Y × X), the number of misses in each item is at most ε_i, i.e. $\forall x \in X,\ |T(x) \cap Y| \ge |Y| - \varepsilon_i$. This relaxation allows more items to be included in the pattern than would be allowed by the classic definition.

Pattern relaxation. In a pattern (Y × X), the total number of misses in the pattern is at most ε_p, i.e. $\sum_{y \in Y} |I(y) \cap X| \ge |Y| \cdot |X| - \varepsilon_p$. This relaxation deals with the associativity of transactions and items in the pattern.
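The three criteria can be written as simple predicates over the running example of Figure 1. The following is only a minimal sketch, not the authors' implementation; the dictionary `DB` and all function names are illustrative assumptions.

```python
# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}

def misses_per_transaction(Y, X):
    """Number of items of X missing from each transaction y in Y."""
    return {y: len(X - DB[y]) for y in Y}

def misses_per_item(Y, X):
    """Number of transactions of Y that miss each item x in X."""
    return {x: sum(1 for y in Y if x not in DB[y]) for x in X}

def transaction_relaxed(Y, X, eps_t):
    # Every transaction misses at most eps_t items of X.
    return all(m <= eps_t for m in misses_per_transaction(Y, X).values())

def item_relaxed(Y, X, eps_i):
    # Every item is missed by at most eps_i transactions of Y.
    return all(m <= eps_i for m in misses_per_item(Y, X).values())

def pattern_relaxed(Y, X, eps_p):
    # The whole sub-matrix (Y x X) has at most eps_p missing cells.
    return sum(misses_per_transaction(Y, X).values()) <= eps_p
```

With these predicates, the introductory examples can be checked directly, e.g. `transaction_relaxed({2, 3, 4}, {"a", "b"}, 1)` for the first one.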

The set of all patterns satisfying the relaxation criteria may be very large and undesirable. In order to reduce this, we consider the utility of the patterns in terms of the support for their itemsets and the maximality of their tidsets, where the support of a pattern is calculated as the size of its tidset. The set of such patterns can be limited by the following constraints.

Support constraint. A pattern (Y × X) is frequent, i.e. |Y| ≥ ξ, where ξ is the minimum support threshold desired by the mining process. This constraint ensures that any interesting behavior captured in the pattern is observed by a substantial number of transactions.

Maximality constraint. A pattern (Y × X) relaxed by any combination of the relaxation criteria is maximal. This implies that ∄(Y′ × X), Y ⊂ Y′ ⊆ T, satisfying the given combination of relaxation criteria.

The support constraint and the maximality constraint are denoted as utility constraints. Note that these constraints only restrict support and maximality in the tidsets, and not in the entire pattern as in usual terminology.

Definition 2. [FFT-PATTERN] An FFT-pattern is a pattern satisfying all utility constraints and any given combination of relaxation criteria.

Definition 3. [SUPPORT-TIDSET] Tidset Y supports itemset X if (Y × X) is an FFT-pattern. The set of all support-tidsets of X is denoted as S(X).

Definition 4. [FREQUENT ITEMSET] An itemset X is frequent if S(X) ≠ Ø.

As mentioned earlier, instead of mining all FFT-patterns, we propose to mine all frequent itemsets along with the statistics of their support-tidsets. In this work, we calculate the maximum support, the average support, and the number of FFT-patterns associated with an itemset as the statistical information, which we consider useful for itemset analysis. Other statistical information such as median and mode can also be mined without adding much complexity.

In a market basket database, while |I| may reach thousands, the maximum size of any practical itemset is around 10 [13]. We denote the maximum size of an FFT-itemset as X_max and use it to provide a tight upper bound for some operations (cf. section 5), since X_max ≪ |I| in practice.


3.2 Variants of the FFT-pattern

In this paper, combinations of the above relaxation criteria are used to form three variants of the FFT-pattern, viz. the FFT_T-pattern (using transaction relaxation), the FFT_TI-pattern (using transaction relaxation and item relaxation), and the FFT_P-pattern (using pattern relaxation). Variables subscripted with T, TI, or P belong to the corresponding variant of FFT-pattern.

The aim of introducing three different variants is mainly to show how the BIAS framework addresses each relaxation criterion. Depending on the dataset, some relaxation criterion by itself might not be pertinent, as the resulting pattern may be sparse. However, combining the criteria can produce good variants, like the FFT_T-pattern [9] and the FFT_TI-pattern [11], which have already been successfully applied to real-world problems. Differences among the variants in terms of the patterns mined are shown in Example 2.

Example 2. Consider the database illustrated in Figure 1. Let ε_t = ε_i = ε_p = 1, and ξ = 1. Itemset X = {b, c} is associated with (a) one unique FFT_T-pattern, ({1, 2, 3, 4} × X), and its corresponding S_T(X) is a singleton {{1, 2, 3, 4}}; (b) two FFT_TI-patterns, ({1, 2, 3} × X) and ({1, 2, 4} × X), with S_TI(X) = {{1, 2, 3}, {1, 2, 4}}; and (c) three FFT_P-patterns, ({1, 2} × X), ({2, 3} × X), and ({2, 4} × X), with S_P(X) = {{1, 2}, {2, 3}, {2, 4}}.
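The support-tidsets of Example 2 can be reproduced by brute force on the four-transaction database. The sketch below enumerates every tidset, which is exactly what the paper argues is infeasible at scale, so it serves only as a reference check. The names `DB`, `satisfies` and `support_tidsets` are illustrative assumptions, and a single tolerance `eps` is used for all criteria, as in the example.

```python
from itertools import combinations

# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}

def satisfies(Y, X, variant, eps):
    """Check the relaxation criteria of one variant ('T', 'TI' or 'P')."""
    row = [len(X - DB[y]) for y in Y]                   # misses per transaction
    col = [sum(x not in DB[y] for y in Y) for x in X]   # misses per item
    if variant == "T":
        return all(m <= eps for m in row)
    if variant == "TI":
        return all(m <= eps for m in row) and all(m <= eps for m in col)
    if variant == "P":
        return sum(row) <= eps
    raise ValueError(variant)

def support_tidsets(X, variant, eps=1, xi=1):
    """Brute-force S(X): maximal tidsets of size >= xi satisfying the criteria."""
    tids = sorted(DB)
    ok = [frozenset(Y)
          for r in range(1, len(tids) + 1)
          for Y in combinations(tids, r)
          if satisfies(Y, X, variant, eps)]
    # Maximality: no strict superset also satisfies the relaxation criteria.
    maximal = [Y for Y in ok if not any(Y < Z for Z in ok)]
    return sorted(sorted(Y) for Y in maximal if len(Y) >= xi)

for variant in ("T", "TI", "P"):
    print(variant, support_tidsets({"b", "c"}, variant))
```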

Despite these differences, all the FFT-pattern variants share the anti-monotonicity property.

Theorem 1. FFT-pattern is anti-monotone.

Proof. Since every relaxation criterion on FFT-patterns is based on a fixed value, the allowed number of misses is a constant. The allowed number of misses is thus invariant to the size of the pattern, and the addition of items and/or transactions never reduces the number of misses.

This property is very important for designing algorithms to mine FFT-itemsets. If an itemset is not frequent (regardless of the definition), then none of its supersets is frequent. Based on this property, we can enumerate all results using the pattern growth approach [2], and safely prune when the current itemset is not frequent.

4 BIAS framework

The BIAS framework, consisting of a backtracking algorithm, Integer Linear Programming (ILP), and aggregation statistics, is designed as follows. First, the backtracking algorithm iterates over all necessary and relevant itemsets, while concurrently maintaining the necessary aggregation statistics of the transactions. For each itemset X, the statistical information of S(X) is determined by ILP incorporating the corresponding aggregation statistics.

Algorithm 1: Backtracking algorithm

    call Backtrack(Ø);

    Function Backtrack(X) :: void
        for each item i extension of X
            extend itemset X with item i while maintaining aggregation statistics;
            form ILP based on the aggregation statistics;
            if X is frequent
                output the statistical information of S(X);
                call Backtrack(X);
            exclude item i from X while maintaining aggregation statistics;
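A runnable sketch of this enumeration scheme is given below for the transaction-relaxation variant only. It is an illustration under stated assumptions rather than the authors' code: the ILP step is replaced by a direct support count (valid for FFT_T, where section 5.1 notes the supporting tidset is unique), and `DB`, `ITEMS`, `mine` and `count_a` are hypothetical names.

```python
# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}
ITEMS = sorted(set().union(*DB.values()))

def mine(eps_t, xi):
    """Depth-first enumeration; returns {itemset: support} for frequent itemsets."""
    count_a = {t: 0 for t in DB}   # aggregation statistic: items of X present in t
    results = {}

    def backtrack(X, start):
        for k in range(start, len(ITEMS)):
            item = ITEMS[k]
            X.append(item)                         # extend X with the item
            for t in DB:
                if item in DB[t]:
                    count_a[t] += 1
            # A transaction supports X if it misses at most eps_t items of X.
            support = sum(1 for t in DB if len(X) - count_a[t] <= eps_t)
            if support >= xi:                      # otherwise prune the subtree
                results[tuple(X)] = support
                backtrack(X, k + 1)
            for t in DB:                           # undo the extension
                if item in DB[t]:
                    count_a[t] -= 1
            X.pop()

    backtrack([], 0)
    return results

print(mine(eps_t=1, xi=3)[("a", "b")])
```

Items are extended in a fixed order so that each itemset is visited once; pruning relies on the anti-monotonicity stated in Theorem 1.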

4.1 Backtracking Algorithm

The backtracking algorithm (Algorithm 1) enumerates the itemsets by inserting items one by one. Once it finishes enumerating all itemsets containing an item, it backtracks, then inserts another item. The complete process can be seen as an enumeration tree [4]. As FFT-itemset is anti-monotone, once an itemset is not frequent, we can safely prune it by not traversing its subtree.

Since we enumerate all FFT-itemsets, this backtracking algorithm, combined with pruning, traverses all necessary and relevant result states. This means that we cannot expect any algorithm that iterates over the itemsets to be asymptotically more efficient than this algorithm.

4.2 ILP Constraints

Given an itemset X generated by the backtracking step, the aim of ILP is to find the statistical information of the associated FFT-patterns. We first present the algorithm to mine all S(X), and then show how to calculate the statistics without enumerating all tidsets.

The problem of enumerating S(X) can be translated to a system of binary linear inequalities where the variables represent the transactions. We denote this system of integer linear inequalities as an integer linear program (ILP). This ILP consists of |T| variables, denoted by the integer array A = (a_1, a_2, ..., a_|T|). Each variable a_i, 1 ≤ i ≤ |T|, represents one transaction and is constrained to be 0 or 1 (binary). A value of 1 for variable a_i means that transaction i is present in the tidset.

The relaxation criteria and utility constraints on the pattern (cf. section 3.1), presented in terms of these variables, are as follows.

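Transaction Relaxation (ILP_T)

\[ \forall i \in \{1, \dots, |T|\}:\quad a_i \cdot |X \setminus I(t_i)| \le \varepsilon_t \tag{1} \]

Item Relaxation (ILP_I)

\[ \forall x \in X:\quad \sum_{j,\ t_j \notin T(x)} a_j \le \varepsilon_i \tag{2} \]

Pattern Relaxation (ILP_P)

\[ \sum_{i=1}^{|T|} a_i \cdot |X \setminus I(t_i)| \le \varepsilon_p \tag{3} \]

Maximality Constraint (ILP_max)

Let $A'_k = (a'_1, a'_2, \dots, a'_{|T|})$, where

\[ a'_i = \begin{cases} a_i & \text{if } i \neq k; \\ a_i + 1 & \text{if } i = k. \end{cases} \]

\[ \nexists k:\ A'_k \text{ satisfies the relaxation criteria} \tag{4} \]

Support Constraint (ILP_sup)

\[ \sum_{i=1}^{|T|} a_i \ge \xi \tag{5} \]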
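To make constraints (1) to (5) concrete, the sketch below checks whether a given binary vector A is a solution for the running example of Figure 1. It is a direct, unoptimized reading of the inequalities and not an ILP solver; `DB`, `TIDS`, `relaxed` and `is_solution` are illustrative names, and a tolerance left as `None` means the corresponding criterion is not part of the variant.

```python
# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}
TIDS = sorted(DB)

def relaxed(A, X, eps_t=None, eps_i=None, eps_p=None):
    """Constraints (1)-(3) for the binary tuple A over transactions TIDS."""
    miss = [len(X - DB[t]) for t in TIDS]            # |X \ I(t_i)|
    if eps_t is not None and any(a * m > eps_t for a, m in zip(A, miss)):
        return False                                 # violates (1)
    if eps_i is not None:
        for x in X:
            if sum(a for a, t in zip(A, TIDS) if x not in DB[t]) > eps_i:
                return False                         # violates (2)
    if eps_p is not None and sum(a * m for a, m in zip(A, miss)) > eps_p:
        return False                                 # violates (3)
    return True

def is_solution(A, X, xi, **eps):
    """Relaxation criteria plus support (5) and maximality (4)."""
    if not relaxed(A, X, **eps) or sum(A) < xi:
        return False
    for k, a in enumerate(A):
        # Maximality: switching on any absent transaction must break a criterion.
        if a == 0 and relaxed(A[:k] + (1,) + A[k + 1:], X, **eps):
            return False
    return True

print(is_solution((1, 1, 1, 0), {"b", "c"}, 1, eps_t=1, eps_i=1))
```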

Note that we have not changed the problem statement; we have only represented it in a mathematical form. Analyzing the variables, we note that this problem is intractable, since the number of transactions can be large and integer programming is NP-hard in general.

This problem can be resolved by reducing the number of variables. Equivalent variables are substituted by one combined variable representing their equivalence class. Under any definition of the relaxation criteria, two transactions t_a and t_b are equivalent if they contain the same items of X, i.e. I(t_a) ∩ X = I(t_b) ∩ X. The reduction is even stronger for the FFT_T-pattern and the FFT_P-pattern, where two transactions are equivalent if they contain the same number of items of X, i.e. |I(t_a) ∩ X| = |I(t_b) ∩ X|. After this reduction, the number of variables equals the number of equivalence classes. Let us call the ILP with reduced variables ILP^red, with variables B = (b_1, b_2, ..., b_|B|), where |B| is the number of equivalence classes. A value k of variable b_i means that there are k transactions from equivalence class i in the tidset, hence b_i is constrained to be at most the number of variables it substitutes. Variable reduction is illustrated in Example 3.

Example 3. Consider mining the FFT_T-pattern from the database in Figure 1. Let the itemset X = {a, b}. As described previously, transactions containing the same number of items in X are considered equivalent. Hence, transaction 3 (resp. variable a_3) is equivalent to transaction 4 since they both contain 2 items, while transactions 1 and 2 are different since they contain 0 and 1 item respectively. Hence, there are 3 equivalence classes, b_1 representing tidset {1}, b_2 representing tidset {2}, and b_3 representing tidset {3, 4}. These variables are constrained to be at most 1, 1, and 2 respectively.
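The grouping of Example 3 can be sketched as follows. Both the general equivalence (same projection onto X) and the coarser one used for FFT_T and FFT_P (same number of contained items) are shown; `DB` and the function names are illustrative assumptions.

```python
from collections import defaultdict

# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}

def classes_by_count(X):
    """Group transactions by |I(t) ∩ X| (sufficient for FFT_T and FFT_P)."""
    groups = defaultdict(list)
    for t in sorted(DB):
        groups[len(DB[t] & X)].append(t)
    return dict(groups)

def classes_by_projection(X):
    """Group transactions by I(t) ∩ X (the general equivalence)."""
    groups = defaultdict(list)
    for t in sorted(DB):
        groups[frozenset(DB[t] & X)].append(t)
    return dict(groups)

# Keys are the number of items of X contained; values are the class members.
print(classes_by_count({"a", "b"}))
```

The upper bound of each reduced variable is simply the length of the corresponding list.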

Using this technique, the problem becomes tractable since the number of variables is generally small, as is shown in section 5. Given a solution B of ILP^red, the number of support-tidsets satisfying B, denoted as |S_B(X)|, can be computed using equation (6), where n_i is the number of transactions in equivalence class i.

\[ |S_B(X)| = \prod_{i=1}^{|B|} \binom{n_i}{b_i} \tag{6} \]

Example 4. Continuing Example 3, we have n_1 = 1, n_2 = 1, and n_3 = 2. If we consider one transaction of class 1 (b_1 = 1), none of class 2 (b_2 = 0), and one of class 3 (b_3 = 1), we obtain B = (1, 0, 1), representing $\binom{1}{1}\binom{1}{0}\binom{2}{1} = 2$ tidsets, viz. {1, 3} and {1, 4}.

The total size of all tidsets in S_B(X) can be calculated as $|S_B(X)| \times \left(\sum_{i=1}^{|B|} b_i\right)$. Summing these sizes over all solutions B of ILP^red gives the total size of all support-tidsets, from which the average support can be computed.
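The counting step can be illustrated with a few lines of arithmetic. This is a sketch of equation (6) and of the total-size formula above, assuming the class sizes n and the reduced solutions are already known; the function names are hypothetical.

```python
from math import comb, prod

def tidset_count(n, b):
    """Equation (6): number of tidsets that realise a reduced solution B."""
    return prod(comb(n_i, b_i) for n_i, b_i in zip(n, b))

def total_size(n, b):
    """Sum of the sizes of all tidsets realising B; each has sum(b) transactions."""
    return tidset_count(n, b) * sum(b)

def average_support(n, solutions):
    """Average tidset size over all tidsets described by the given solutions."""
    count = sum(tidset_count(n, b) for b in solutions)
    size = sum(total_size(n, b) for b in solutions)
    return size / count

# Example 4: class sizes n = (1, 1, 2) and B = (1, 0, 1).
print(tidset_count((1, 1, 2), (1, 0, 1)), total_size((1, 1, 2), (1, 0, 1)))
```

No tidset is ever materialised here, which is the point of working with the reduced variables.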

Since there can be multiple solutions A and B for ILP and ILP^red, the sets of all distinct solutions A and B are denoted as 𝒜 and ℬ respectively. To summarize, the objective of ILP is to enumerate ℬ, in order to find the statistical information of 𝒜.

In order to obtain the equivalence classes of transactions, we need to maintain an aggregation statistic which concisely represents the connectivity in the database.

4.3 Aggregation Statistic

The choice of the aggregation statistic depends on the variant of FFT-patterns being considered. Example 5 gives a broad view of what aggregation statistics are maintained and how to calculate them.

Example 5. Consider the problem of mining S_T(X). In this problem, all transactions containing the same number of items are equivalent. Hence, we maintain the number of misses of every transaction. To speed up the variable reduction process, we can also maintain the number of transactions containing the same number of items as another aggregation statistic. The speed-up occurs because instead of enumerating all transactions, we only need to enumerate the equivalence classes.

Let us consider the current itemset X = {a, b} in the database in Figure 1. We maintain the statistic that transaction 1 contains zero items, transaction 2 contains one item, and transactions 3 and 4 contain both items. To speed up the reduction process, we can also maintain the statistic that there are two transactions (regardless of which transactions they are) that contain both items, one transaction that contains one, and one transaction that contains no items in X.

These aggregation statistics can be maintained efficiently. Since we are using a backtracking algorithm, at most one item is added or removed at a time. Thus, in the worst case this operation needs O(|T|) iterations. Details of the specific aggregation statistic used in each variant are explained in section 5.

5 Algorithms

In this section, we present algorithms based on the BIAS framework for the three FFT-pattern variants described in section 3.2.

5.1 FFT_T-pattern

Recalling the definition, a pattern (Y × X) is an FFT_T-pattern if it satisfies the criterion of transaction relaxation as stated in section 3.2 (ILP_T in section 4.2). Since there is no relaxation on the number of misses on each item, the transactions are independent of each other. Hence the tidset Y supporting X is unique, i.e. the set of all transactions containing at least |X| − ε_t items in X.

In this problem, the information about the number of appearances or the number of misses of each transaction is sufficient to determine S_T(X). These two variables can be used interchangeably since the number of appearances equals the size of the itemset less the number of misses. For the sake of efficiency, we use the number of appearances as the aggregation statistic.

5.1.1 Number of appearances as aggregation statistic

Given itemset X, one way to calculate its frequency is by maintaining the number of appearances of each transaction, denoted as count^a. If x is the last added item in the backtracking step, count^a can be maintained by increasing the count of all transactions containing x. Hence, every update needs |T(x)| iterations. The complexity is O(|T|) in the worst case but O(γ · |T|) in the expected case, where γ is the density of the database. This approach has complexity proportional to the database's density, which is desirable since the database is generally sparse.

To deduce the supporting tidsets, we need to maintain another aggregation statistic, which stores the number of transactions for every distinct value in count^a. This is denoted as count^c, i.e. count^c_x = |{t | count^a_t = x}| (see Example 5). This statistic can be maintained concurrently with the former, therefore the additional cost needed is constant. The support of an itemset can be calculated by summing count^c from index |X| − ε_t to |X|, which needs ε_t + 1 iterations. This technique requires |T| integers to store count^a, and X_max integers to store count^c.
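The two statistics can be sketched as a small class. This is an illustrative reading of the description above, not the authors' data structure: `DB`, `TID_LISTS`, `AppearanceStats` and its attribute names are assumptions, and the per-item tid-lists are precomputed so that each update touches only the |T(x)| transactions containing x.

```python
# Running example of Figure 1: TID -> set of items (illustrative encoding).
DB = {1: {"c", "d"}, 2: {"b", "c"}, 3: {"a", "b", "d"}, 4: {"a", "b"}}

# T(x): tid-list of every item, built once.
TID_LISTS = {}
for tid, items in DB.items():
    for x in items:
        TID_LISTS.setdefault(x, []).append(tid)

class AppearanceStats:
    """count_a[t] = items of X contained in t; count_c[v] = #transactions with count_a == v."""

    def __init__(self, x_max):
        self.size = 0                               # |X|
        self.count_a = {t: 0 for t in DB}
        self.count_c = [len(DB)] + [0] * x_max      # initially X is empty

    def _shift(self, x, delta):
        for t in TID_LISTS.get(x, []):              # |T(x)| iterations
            self.count_c[self.count_a[t]] -= 1
            self.count_a[t] += delta
            self.count_c[self.count_a[t]] += 1

    def add(self, x):
        self.size += 1
        self._shift(x, +1)

    def remove(self, x):
        self.size -= 1
        self._shift(x, -1)

    def support(self, eps_t):
        # Sum count_c over indices |X| - eps_t .. |X| (at most eps_t + 1 terms).
        low = max(0, self.size - eps_t)
        return sum(self.count_c[low:self.size + 1])

stats = AppearanceStats(x_max=4)
stats.add("a")
stats.add("b")
print(stats.count_c, stats.support(eps_t=1))
```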

5.1.2 Bitmap representation of aggregation statistic

Another way to represent the aggregation statistic is by us-ing Xmax bitmaps, where bit t in bitmap c is set if andonly if transaction t misses c items. The insertion processis shown in following pseudo-code. A similar algorithmcan be derived for the removal process. As before, x cor-responds to the last added item, and missx is the bitmapcorresponding to transactions not containing item x. Thisoperation needs O(α · Xmax · |T |) iterations for each up-date and O(α ·Xmax · |T |) space. The symbol α representsthe ratio between bit-wise and integer operations which isroughly 1/32 in a 32 bits processor, where 32 operationscan be performed in one clock cycle.

count_{X_max} := count_{X_max} ∧ ¬miss_x
for i := X_max − 1 down to 0
    count_{i+1} := count_{i+1} ∨ (count_i ∧ miss_x)
    count_i := count_i ∧ ¬miss_x
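The insertion pseudo-code above can be tried out with the following Python sketch, which uses arbitrary-precision integers as bitmaps (bit t stands for transaction t). The function names insert_item and support are illustrative assumptions, and a real implementation would use fixed-width machine words instead.

```python
def insert_item(count, miss_x):
    """Update count in place after adding item x.

    count[i] is an integer bitmap of the transactions missing exactly i items;
    miss_x is the bitmap of the transactions that do not contain x.
    """
    x_max = len(count) - 1
    # Transactions that would exceed x_max misses are dropped.
    count[x_max] &= ~miss_x
    for i in range(x_max - 1, -1, -1):
        # Transactions in level i that miss x move up to level i + 1.
        count[i + 1] |= count[i] & miss_x
        count[i] &= ~miss_x


def support(count, eps_t):
    """Number of transactions missing at most eps_t items."""
    merged = 0
    for level in count[: eps_t + 1]:
        merged |= level
    return bin(merged).count("1")
```

Iterating from the highest level downwards ensures that a transaction moved into level i + 1 is not moved again during the same update.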

By sacrificing space, we can reduce X_max to ε_t by maintaining ε_t + 1 bitmaps in all recurrence levels. In this case, the maximum space allocated is O(α · X_max · ε_t · |T|), but each update only needs O(α · (ε_t + 1) · |T|) iterations. This approach, even though asymptotically inefficient, is very fast in practice because of the architecture of current processors.

5.1.3 Comparisons with other techniques

A similar problem has been tackled by Pei et al. in [9] and Koh and Yo in [6]. Both their algorithms, like ours, iterate over the itemsets and then find their supporting transactions. Pei et al. proposed the FT-Apriori algorithm [9], which enumerates itemsets using an Apriori-like approach. FT-Apriori first finds all frequent itemsets of size ε_t + 1. Then, for each k, it generates candidates of size k + 1 whose k-subsets are frequent. Once these candidates are accumulated, they are all checked using one pass of the database.

To generate one candidate X, the algorithm checks whether all its subsets are frequent in O(|X|^2) time, and then verifies whether X is frequent in another pass of the database taking O(|T| · |X|) time.

The number of candidates generated in this approach is relatively small compared to that in backtracking. However, generation of the candidates itself is an expensive process, consuming most of the mining time. Another disadvantage of this approach is that it needs another database pass while processing itemsets of each size.

Approach          #Iterations             Space
BIAS(appear)      |T(x)| + ε_t            |T| + |X|
BIAS(bit)         α · (ε_t + 1) · |T|     α · (ε_t + 1) · |T| · |X|
FT-Apriori [9]    |X|^2 + |T| · |X|       N · |X|
VB-FT-Mine [6]    α · (ε_t + 1) · |T|     α · (ε_t + 1) · |T| · |X|

Table 1. FFT_T-pattern mining approaches

Koh and Yo presented VB-FT-Mine [6], which uses a backtracking technique similar to ours. However, instead of using an aggregation statistic, they analyzed the recurrence relation of the support-tidset. Let P(X, d) be the set of transactions which support itemset X when ε_t = d. They show that P(X ∪ {x}, d) = P(X, d − 1) ∪ (P(X, d) ∩ T(x)); hence this problem can be solved using dynamic programming [4]. The time and space complexity of VB-FT-Mine is exactly the same as that of our bitmap approach. This algorithm is fast because the recurrence relation happens to be simple, and using bitmaps is efficient.
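The recurrence P(X ∪ {x}, d) = P(X, d − 1) ∪ (P(X, d) ∩ T(x)) can be illustrated with a small Python sketch. It uses Python sets in place of bit vectors for readability and takes P(X, −1) to be empty; the function name extend is an assumption made for this illustration and is not taken from [6].

```python
def extend(p, t_x):
    """Apply the recurrence once.

    p[d] is the set of tids supporting X with at most d misses (d = 0..eps_t);
    t_x is the tidset of the new item x. Returns the list for X ∪ {x}.
    """
    # d = 0: P(X, -1) is empty, so only the intersection term remains.
    new_p = [p[0] & t_x]
    for d in range(1, len(p)):
        new_p.append(p[d - 1] | (p[d] & t_x))
    return new_p
```

Starting from the empty itemset, every P(∅, d) is the full set of transactions, and one call to extend is made per item added during backtracking.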

Both approaches require each item x ∈ X to be contained in at least δ supporting transactions. Our technique can easily be adjusted to this requirement by iterating over each item x ∈ X, as is done in [9] and [6]. However, this constraint is neglected for the sake of consistency among all variants discussed in this paper.

A summary comparison among these three approaches is presented in Table 1. Our approaches representing the aggregation statistic as appearances and bitmaps are denoted by BIAS(appear) and BIAS(bit) respectively. N denotes the total number of frequent itemsets, and the constant α represents the ratio of time taken for bit operations to integer operations, which in this case is roughly 1/32.

Table 1 shows that no approach is strictly superior to the others. Given a sparse database and a large tolerance ε_t, BIAS(appear) seems promising, while BIAS(bit) and VB-FT-Mine [6] might be faster in a dense database with small ε_t. These results must be viewed in light of the main focus of this paper, which is to mine the statistical information of support-tidsets. This variant is incorporated to illustrate how the BIAS framework handles the criterion of transaction relaxation.

5.2 FFT_TI-pattern

An FFT_TI-pattern is defined as a pattern that satisfies the criteria of item relaxation and transaction relaxation as defined in section 3 (ILP_T and ILP_I in section 4.2). This problem is seemingly more complicated than the FFT_T-pattern problem, since each itemset might have multiple support-tidsets.

Since the approach used for FFT_T-patterns can be applied to handle the transaction relaxation, the remaining problem is to generate FFT-patterns that satisfy the criterion of item relaxation.

5.2.1 Finding the maximum sized S_TI

A pattern is frequent if the size of its largest support-tidset is greater than or equal to ξ. Therefore, finding the largest support-tidset verifies whether the pattern is frequent. This problem can be expressed as ILP_I with the objective function to maximize $\sum_{i=1}^{|T|} a_i$. Clearly, finding the maximum sized S_TI is at most as difficult as generating S_TI. While it may seem easier, we show that finding the maximum result in ILP_I is an NP-Complete problem.

Lemma 2. Finding the maximum $\sum_{i=1}^{|T|} a_i$ in ILP_I is NP-Complete.

Proof. To prove NP-Hardness, we take the problem of finding a Maximum Independent Set (MIS) in a graph, which is a well-known NP-complete problem, and prove that MIS can be reduced to this problem. Given a graph G, the objective of MIS is to find a maximum set of vertices such that there is no edge between any pair of them. The reduction is as follows. For every edge e(u, v), add the inequality u + v ≤ 1 to ILP_I. This reduction is complete since the result of ILP_I will be the maximum set of variables (vertices) which are not connected. Since this case is a special case of ILP_I, the ILP_I problem is at least as hard as MIS, implying that it is NP-Hard. Proving the completeness of its decision problem ($\sum_{i=1}^{k} a_i \ge K$) is straightforward.
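To make the reduction concrete, the Python sketch below encodes a small graph as the 0/1 program used in the proof (one variable per vertex, one inequality u + v ≤ 1 per edge) and solves it by exhaustive search. The function name and the brute-force solver are assumptions chosen only for illustration; the exponential enumeration is consistent with the hardness result rather than a practical method.

```python
from itertools import product


def max_independent_set_via_ilp(num_vertices, edges):
    """Maximize the sum of 0/1 variables subject to a[u] + a[v] <= 1 per edge."""
    best = 0
    for a in product((0, 1), repeat=num_vertices):
        # Feasible assignments select no two adjacent vertices.
        if all(a[u] + a[v] <= 1 for u, v in edges):
            best = max(best, sum(a))
    return best
```

For a path on four vertices, for instance, the constraints are a_0 + a_1 ≤ 1, a_1 + a_2 ≤ 1 and a_2 + a_3 ≤ 1, and the optimum selects two non-adjacent vertices.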

Since this problem is NP-Complete, we cannot expect to have a practically efficient solution unless P = NP. In our implementation, we check whether itemsets are frequent by calculating the statistical information of the FFT-pattern.

5.2.2 Finding statistical information of S_TI

As in section 4.2, we reduce the number of variables by finding the equivalence classes. In this variant, two transactions are equivalent if they contain exactly the same items of X. It is easy to see that the maximum number of equivalence classes is bounded by min(k, 2^|X|). Since |X| is less than 10 in many real-world datasets [13], the number of variables is at most 1024, hence enumerating all solutions is feasible.
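The grouping into equivalence classes can be sketched in Python by projecting every transaction onto X and counting identical projections. The function name equivalence_classes and the representation of transactions as Python sets are assumptions made for this example.

```python
from collections import Counter


def equivalence_classes(transactions, itemset):
    """Map each distinct projection of a transaction onto X to its class size."""
    x = frozenset(itemset)
    # Two transactions fall into the same class iff they share the same items of X.
    return Counter(frozenset(t) & x for t in transactions)
```

The number of keys in the returned counter can exceed neither the number of transactions nor 2^|X|, which is the bound used above.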

To make this process more efficient, we introduce some heuristics: we iterate from the most constrained variable and from the highest possible value (most constraining) to reduce the search space as much as possible. Our experimental results show that all solutions can be found quickly.

Once a solution B for ILP^red_I is obtained (section 4.2), we can calculate the desired statistic using equation (6).


5.3 FFT_P-pattern

A pattern (Y × X) is an FFT_P-pattern if it satisfies the criterion of pattern relaxation as given in section 3 (ILP_P in section 4.2). In this problem, transactions are equivalent if they miss the same number of items in X. Let variable b_i represent the equivalence class for transactions missing i items in X. Note that only transactions that miss one to ε_p items are important. Therefore, the number of equivalence classes, which is also the number of variables in ILP^red_P, is equal to ε_p. ILP^red_P itself is formulated as follows:

$$\sum_{i=1}^{\varepsilon_p} b_i \times i \le \varepsilon_p \qquad (7)$$

5.3.1 Finding the maximum sized S_P

Let Y_max be the largest tidset in S_P. The problem of finding Y_max can be solved optimally using a greedy algorithm which repeatedly takes transactions with the lowest number of misses until taking another one would make the total number of misses exceed ε_p. This algorithm takes O(|T|) iterations. By maintaining the number of transactions that miss the same number of items, the complexity can be reduced to O(ε_p).
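The O(ε_p) variant of the greedy algorithm can be sketched in Python as follows. The sketch assumes that the counts n[i] of transactions missing exactly i items are already available and that transactions missing no item are always included; the function name max_support_size is hypothetical.

```python
def max_support_size(n, eps_p):
    """Size of the largest tidset whose total number of misses is at most eps_p.

    n[i] is the number of transactions missing exactly i items of X.
    """
    size = n[0] if n else 0          # assumption: zero-miss transactions are free
    budget = eps_p
    for i in range(1, min(eps_p, len(n) - 1) + 1):
        take = min(n[i], budget // i)
        size += take
        budget -= take * i
        if take < n[i]:
            # The budget cannot pay for one more transaction with i misses,
            # so it cannot pay for any transaction with more misses either.
            break
    return size
```

Processing the classes in increasing order of misses is what makes the greedy choice safe: each accepted transaction consumes the smallest possible share of the budget.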

5.3.2 Finding statistical information of S_P

As in the FFT_TI-pattern case, we can find the statistical information by enumerating B in ILP^red_P (cf. (7) and section 4.2). To show the bound on the number of solutions, we first introduce the term partition number [1].

Definition 5. [PARTITION NUMBER] The partition number of an integer n, denoted as part(n), is the number of distinct multi-sets of positive integers that add up to n. For example, part(3) = 3, since 3 = 3, 2+1, and 1+1+1; part(4) = 5, since 4 = 4, 3+1, 2+2, 2+1+1, and 1+1+1+1; and so on. Each multi-set that adds up to n is called a partition of n.
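The values of part(n) quoted in this section can be checked with a short dynamic program. The Python sketch below is a standard way to count partitions (building them from parts of size 1, 2, ..., n) and is given only as an aid to the reader; the function name part mirrors the notation of Definition 5.

```python
def part(n):
    """Number of multi-sets of positive integers that add up to n."""
    ways = [1] + [0] * n                  # ways[s] = partitions of s found so far
    for k in range(1, n + 1):             # allow parts of size k
        for s in range(k, n + 1):
            ways[s] += ways[s - k]
    return ways[n]
```

Adding part sizes one at a time counts every multi-set exactly once, regardless of the order in which its elements are written.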

Let the number of transactions missing i items be denoted as n_i. We show that |B| is bounded by part(ε_p).

Lemma 3. |B| is bounded by part(ε_p).

Proof. It should be clear that if the number of transactions that miss i items is unbounded (n_i = ∞ for all i), then |B| is equal to part(ε_p). Now if the total number of misses in B equals ε_p, then it is a partition of ε_p. Otherwise, b_1 = n_1, i.e. all transactions missing one item are already considered in B. This is true because if b_1 < n_1, we can always increase b_1 (i.e. we can add transactions missing one item to B), implying that B was not maximal. Assuming we have n_1 = ∞, we can always increase b_1 until the number of misses in B equals ε_p, so that it becomes a partition of ε_p. This partition is a unique extension for any maximal B. Hence, all B are subsets of distinct partitions of ε_p, implying that |B| is at most part(ε_p).

Lemma 3 shows that the number of maximal B is bounded by part(ε_p), which is quite small for small ε_p. For example, part(10) = 42 and part(20) = 627. In a practical situation, we don’t need ε_p to be larger than 20, and hence enumerating all solutions is feasible.

The algorithm to generate B is the same as the one in section 5.2.2. We iterate over the variables starting from the most constrained one (having the largest number of misses) and try the most constraining value first (the highest value, so that we start with the largest number of transactions possible). The number of iterations is bounded by O(ε_p × part(ε_p)). By assuming we have enough n_1, it can be shown that this algorithm generates subsets of partitions of ε_p (similar to the proof of Lemma 3).
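A simplified version of this enumeration is sketched below in Python. It visits the variables from b_{ε_p} down to b_1, tries the highest feasible value first, and keeps an assignment only if no b_i can be increased without violating (7). The function name, the maximality test, and the input layout (n[i] for i = 0..ε_p, with n[0] unused) are assumptions of this sketch, which does not include the pruning needed to reach the stated iteration bound.

```python
def maximal_solutions(n, eps_p):
    """Enumerate maximal (b_1, ..., b_eps_p) with sum(i * b_i) <= eps_p, b_i <= n[i]."""
    solutions = []
    b = [0] * (eps_p + 1)

    def is_maximal(budget):
        # No class can accept one more transaction within the remaining budget.
        return all(b[i] == n[i] or i > budget for i in range(1, eps_p + 1))

    def search(i, budget):
        if i == 0:
            if is_maximal(budget):
                solutions.append(tuple(b[1:]))
            return
        # Highest value first (most constraining choice).
        for value in range(min(n[i], budget // i), -1, -1):
            b[i] = value
            search(i - 1, budget - value * i)
        b[i] = 0

    search(eps_p, eps_p)
    return solutions
```

When every n[i] is large enough, the solutions returned are exactly the partitions of ε_p, in line with Lemma 3; with fewer transactions available, the list can only be shorter.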

6 Experimental Results

All approaches are implemented in C++ and evaluated on an AMD 2.2 GHz machine with 2GB main memory. In all experiments the items are sorted in non-decreasing order of appearances in the database. This ordering performed best compared to other ordering techniques in our empirical studies.

For the experiments we used synthetic datasets generated using the IBM market-basket data generator. We used two different datasets: a small database (400 items, 10,000 transactions) to show the effect of parameters and the comparison between approaches, and a practical-sized database (1,000 items, 100,000 transactions) to show the scalability of our approach in real-world applications. The density of the databases is varied from 0.5% to 4%, as in real-world market basket data.

6.1 FFT_T-pattern

We demonstrate this problem using the small database. The result of the FT-Apriori algorithm isn’t plotted since it takes too long. Figure 2(a) shows the scalability of the algorithms while varying the minimum support. We observe that in this case, BIAS(appear) performs best, followed closely by VB-FT-Mine and then BIAS(bit). This is due to the complexity of the operations: while VB-FT-Mine needs only one “and” and one “or” operation, BIAS(bit) needs 2 “and”s, 2 “or”s, and 1 “not” operation (refer to section 5.1).

However, VB-FT-Mine only finds the number of transactions which miss at most ε_t items, while both BIAS(appear) and BIAS(bit) maintain the number of transactions for each number of misses from 0 to ε_t. When this statistic is useful, as in the other variants, VB-FT-Mine cannot be used directly (i.e. it needs post-processing).


[Figure 2: eight plots, not reproduced here.]
(a) Density = 1%, ε_t = 2. Time (sec) vs. minimum support for BIAS(appear), BIAS(bit), and VB-FT-Mine.
(b) ξ = 2%, ε_t = 1. Time (sec) vs. density for BIAS(appear), BIAS(bit), and VB-FT-Mine.
(c) ε_t = ε_i = 1. Time (sec) vs. minimum support for density = 1%, 2%, and 4%.
(d) Density = 4%, ε_t = 1. |FI| / |FP| vs. minimum support for ε_i = 1, 5, and 10.
(e) ξ = 2%, ε_t = 2. |FI| vs. ε_i for density = 1%, 2%, and 4%.
(f) ε_p = 5. Time (sec) vs. minimum support for density = 1%, 2%, and 4%.
(g) Density = 4%. |FI| / |FP| vs. minimum support for ε_p = 2, 5, 10, and 20.
(h) ξ = 0.5%. |FI| (thousands) vs. ε_p for density = 1%, 2%, and 4%.

Figure 2. Experimental results for FFT-pattern mining. Figures (a)-(b) show the FFT_T-pattern variant, figures (c)-(e) show the FFT_TI-pattern variant, and figures (f)-(h) show the FFT_P-pattern variant.

Figure 2(b) shows that BIAS(appear) is more affected by density than BIAS(bit). From this experiment, we can conclude that if we have a sparse database, we should use BIAS(appear) to maintain the aggregation statistic, and otherwise use BIAS(bit). All of these behaviors are as expected, as they follow the characteristics given in Table 1.

As we’ve already mentioned, this problem tries to reduce the number of FT-patterns by relaxing the constraint (it doesn’t consider item relaxation). In a practical-sized database with ε_t > 1, this problem is intractable. To conclude, this problem is not scalable in terms of database size and ε_t, as has also been discussed in [9].

6.2 FFT_TI-pattern

For this problem, we use the practical-sized dataset. Figure 2(c) shows the scalability of our algorithm. Even though CompleteQB [11] doesn’t tackle exactly the same problem, it is used as a competitor because, to the best of our knowledge, there are no other relevant references. In our work, the objective is to mine all frequent itemsets along with the statistical information of associated patterns, while that of [11] is to mine all maximal quasi-bicliques.

It can be seen that the number of frequent itemsets increases as the density increases and as the minimum support decreases. Using the BIAS framework, the time taken is almost linear in the number of frequent itemsets. The CompleteQB [11] algorithm takes longer than one hour (the operation was terminated), because the number of FFT-patterns is much larger than the number of frequent itemsets, as shown in Figure 2(d). The results also show that enumerating all results in a market-basket database is meaningless, since we might get as many as 10^72 (!) FFT-patterns while there are only 1,600 frequent itemsets.

Experimenting with the value of ε_i, it is interesting to observe that the value of ε_i doesn’t increase the number of frequent itemsets much, even though the number of FFT-patterns grows exponentially (Figure 2(e)). This fact cannot be discovered if we enumerate all patterns as in CompleteQB.

6.3 FFT_P-pattern

Since this variant hasn’t been mentioned previously, we can’t compare the efficiency of BIAS with any other approach. However, the experimental results on a practical-sized dataset indicate that this algorithm is efficient and scalable. Figure 2(f) shows that the time needed is approximately linear in the number of frequent itemsets. The number of frequent itemsets itself scales similarly to classical frequent itemsets, growing exponentially as the density increases and/or as the minimum support decreases.

In this problem, the number of frequent itemsets is also much smaller than the number of FFT_P-patterns, even though the ratio is not as dramatic as for FFT_TI-patterns. For example, in a database with 4% density and 0.5% minimum support, with ε_p = 20, the proportion is 46K frequent itemsets to 637K FFT-patterns, which is around 7%.

As seen in Figure 2(g), the ratio becomes smaller as ε_p increases and the minimum support decreases. Similar to the previous problem, the value of ε_p doesn’t affect the number of FFT-itemsets much (Figure 2(h)).

7 Conclusion

We have generalized the problem of mining fixed-value FFT-patterns into relaxation criteria and constraints, and have shown that these criteria and constraints can be used to derive several available variants of FFT-patterns, as well as many new definitions. For FFT-pattern mining in transactional databases, we proposed to mine only statistical information of the corresponding FFT-patterns, and we have shown with several experiments that such statistics are sufficient to describe the characteristics of an itemset.

We have developed a new framework, called the BIAS framework, for mining FFT-patterns based on a backtracking algorithm, ILP constraints, and aggregation statistics. We have also shown that this framework can be used to efficiently mine several variants of FFT-patterns.

For future work, we plan to analyze the utility of statistical information in real-world data. We are also interested in analyzing the impact of each relaxation criterion, and in researching the problem using proportional relaxation. This technique may also be applied to other problems such as sequential mining, stream mining, and graph mining.

References

[1] M. Abramowitz and I. A. Stegun, eds. Handbook of Mathematical Functions. Dover Publications, New York, 1964.

[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. of SIGMOD, pages 207–216, 1993.

[3] J. Besson, R. G. Pensa, C. Robardet, and J-F. Boulicaut. Constraint-based mining of fault-tolerant patterns from boolean data. In Proc. of KDID, pages 55–71, 2005.

[4] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, 2001.

[5] S. I. Gass. Linear Programming: Methods and Applications. McGraw-Hill, New York, 1985.

[6] J-L. Koh and P.-W. Yo. An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In Proc. of DASFAA, pages 568–575, 2005.

[7] J. Liu, S. Paulsen, X. Sun, W. Wang, A. B. Nobel, and J. Prins. Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis. In Proc. of SDM, 2006.

[8] K. Narita and H. Kitagawa. Mining frequent itemsets from noisy data. In Proc. of ICDEW, page 117, 2006.

[9] J. Pei, A. K. H. Tung, and J. Han. Fault-tolerant frequent pattern mining: Problems and challenges. In Proc. of DMKD, 2001.

[10] J. K. Seppanen and H. Mannila. Dense itemsets. In Proc. of SIGKDD, pages 683–688, 2004.

[11] K. Sim, J. Li, V. Gopalkrishnan, and G. Liu. Mining maximal quasi-bicliques to co-cluster stocks and financial ratios for value investment. In Proc. of ICDM, pages 1059–1063, 2006.

[12] M. Steinbach, P.-N. Tan, and V. Kumar. Support envelopes: a technique for exploring the structure of association patterns. In Proc. of SIGKDD, pages 296–305, 2004.

[13] A. Veloso, B. Gusmao Rocha, M. de Carvalho, and W. Meira Jr. Real World Association Rule Mining. In Proc. of BNCOD, pages 77–89, 2002.

[14] C. Yang, U. M. Fayyad, and P. S. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In Proc. of SIGKDD, pages 194–203, 2001.
