Efficient Closed Pattern Mining in the Presence of Tough Block Constraints
Krishna Gade, Computer Science &
Outline
Introduction
Problem Definition and Motivation
Contributions
Block Constraints
Matrix-Projection based Approach
Search Space Pruning Techniques
Experimental Evaluation
Conclusions
Introduction to Pattern Mining
What is a frequent pattern?
Why is frequent pattern mining a fundamental task in data mining?
Closed, Maximal and Constrained Extensions
State-of-the-art algorithms
Limitations of the current solutions
What is a frequent pattern?
A frequent pattern is a set of items, a sequence, or a graph that occurs frequently in a database.
It can also be a spatial, geometric or topological pattern, depending on the database.
Given a transaction database and a support threshold min_sup, an itemset X is frequent if Sup(X) >= min_sup.
Sup(X), the support of X, is the fraction of the transactions in the database that contain X.
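The support definition above can be made concrete with a short sketch; the `support` helper and the toy database are ours, not from the thesis:

```python
# Hypothetical helper illustrating the definition: Sup(X) is the
# fraction of transactions that contain every item of X.
def support(database, itemset):
    itemset = set(itemset)
    hits = sum(1 for t in database if itemset <= set(t))
    return hits / len(database)

db = [{"a", "b", "c", "e"},
      {"b", "c", "d", "e"},
      {"a", "d", "e"},
      {"c", "d", "e"}]

# {c, e} appears in 3 of the 4 transactions.
print(support(db, {"c", "e"}))  # 0.75
```

With min_sup = 0.5, {c, e} would therefore be reported as frequent.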
Why is frequent pattern mining so fundamental to data mining?
It is the foundation for several essential data mining tasks:
Association, correlation and causality analysis
Classification based on association rules
Pattern-based and pattern-preserving clustering
Support is a simple yet (in many cases) effective measure of the significance of a pattern, and it correlates with most other statistical measures.
Closed, Maximal & Constrained Extensions to Frequent Patterns
A frequent pattern X is closed if no superset of X has the same supporting set.
It is maximal if no superset of X is frequent.
It is constrained if it satisfies some constraint defined on the items it contains or on the transactions that support it, e.g., length(X) >= min_l.
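The three notions can be checked by brute force on a tiny database (the function names `frequent_itemsets` and `closed_and_maximal` are hypothetical helpers, not algorithms from the thesis):

```python
from itertools import chain, combinations

# Brute-force sketch (fine only for tiny databases) of frequent,
# closed and maximal itemsets as defined above.
def frequent_itemsets(db, min_sup):
    items = sorted(set().union(*db))
    candidates = chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))
    freq = {}
    for X in candidates:
        cover = frozenset(i for i, t in enumerate(db) if set(X) <= t)
        if len(cover) / len(db) >= min_sup:
            freq[frozenset(X)] = cover  # itemset -> supporting set
    return freq

def closed_and_maximal(freq):
    closed, maximal = set(), set()
    for X, cover in freq.items():
        sups = [Y for Y in freq if X < Y]        # frequent proper supersets
        if all(freq[Y] != cover for Y in sups):  # none with same supporting set
            closed.add(X)
        if not sups:                             # no frequent superset at all
            maximal.add(X)
    return closed, maximal

db = [{"a","b","c","e"}, {"b","c","d","e"}, {"a","d","e"}, {"c","d","e"}]
closed, maximal = closed_and_maximal(frequent_itemsets(db, 0.5))
```

On this database, {d} is frequent but not closed (its superset {d,e} has the same supporting set), while {a,e}, {b,c,e} and {c,d,e} are maximal; every maximal pattern is also closed.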
State-of-the-art Algorithms for Pattern Discovery
Itemsets:
Frequent: FP-Growth, Apriori, OP, Inverted Matrix
Closed: Closet+, Charm, FPClose, LCM
Maximal: Mafia, FPMax
Constrained: LPMiner, Bamboo
Sequences: SPAM, SLPMiner, BIDE, etc.
Graphs: FSG, gFSG, gSpan, etc.
Limitations of Support
It may not capture the semantics of user interest.
There are too many frequent patterns if the support threshold is too low. Closed and maximal frequent patterns address this, but there may be loss of information (in the maximal case).
Support is only one measure of the interestingness of a pattern; there can be others, such as length. E.g., one may be interested in finding patterns whose length decreases with their support.
Definitions
A block is a 2-tuple B = (I,T), consisting of an itemset I and its supporting set T.
A weighted block is a block with a weight function w, where w: I x T -> R+.
B is a closed block iff there exists no block B' = (I',T') where I' is a proper superset of I with the same supporting set T' = T. (If such a B' exists, it is a super-block of B, and B is its sub-block.)
The size of a block B is defined as
  BSize(B) = |I| x |T|
The sum of a weighted block B is defined as
  BSum(B) = Σ_{t in T, i in I} w(i,t)
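The two definitions can be sketched in a few lines; the uniform weight function w(i, t) = 1 is an assumption made only to keep the example concrete:

```python
# Minimal sketch of BSize and BSum; `bsize`/`bsum` are our helper names.
def bsize(I, T):
    return len(I) * len(T)  # BSize(B) = |I| x |T|

def bsum(I, T, w):
    # BSum(B) = sum of w(i, t) over all i in I, t in T
    return sum(w(i, t) for i in I for t in T)

I, T = {"c", "d"}, {"T2", "T4"}
print(bsize(I, T))                   # 4
print(bsum(I, T, lambda i, t: 1))    # 4 (with unit weights, BSum = BSize)
```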
Example of a Block

Example database:
T1 {a,b,c,e}
T2 {b,c,d,e}
T3 {a,d,e}
T4 {c,d,e}

Matrix representation of the database:
   a b c d e
T1 1 1 1 0 1
T2 0 1 1 1 1
T3 1 0 0 1 1
T4 0 0 1 1 1

B1 = ({a,b},{T1}) and B2 = ({c,d},{T2,T4}) are examples of blocks.
The sub-matrix highlighted (in red) on the slide is not a block.
Block Constraints
Let T be the set of all transactions in the database and I be the set of all items. A block constraint C is a predicate C : 2^I x 2^T -> {true, false}.
A block B is a valid block for C if B satisfies C, i.e., C(B) is true.
C is a tough block constraint if there is no dependency between the satisfaction (violation) of C by a block B and its satisfaction (violation) by B's super- or sub-blocks.
This thesis explores 3 different tough block constraints:
Block-size, Block-sum, Block-similarity
Monotonicity and Anti-monotonicity of Constraints
Monotone constraint: C is monotone iff C(X) = true implies that for every Y such that Y is a superset of X, C(Y) = true.
E.g.: Sup(X) <= v is monotone. Benefit: if Sup(X) > v, prune all subsets Y of X.
Anti-monotone constraint: C is anti-monotone iff C(X) = true implies that for every Y such that Y is a subset of X, C(Y) = true.
E.g.: Sup(X) >= v is anti-monotone. Benefit: if Sup(X) < v, prune all supersets Y of X.
Why Block-size is a Tough Constraint - An Illustration

   a b c d e
T1 1 1 1 0 1
T2 0 1 1 1 1
T3 1 0 0 1 1
T4 0 0 1 1 1

For the constraint BSize >= 4:
• ({b,c},{T1,T2}) is a valid block, but ({b,c,d},{T2}) is invalid, so the block-size constraint is not monotone.
• Neither ({b},{T1,T2}) nor ({c},{T1,T2,T4}) is valid, so the block-size constraint is not anti-monotone.
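The four block-size values in the illustration can be reproduced directly; `cover` and `bsize` are illustrative helpers over the example database:

```python
# cover(X) is the supporting set of X; bsize is |X| * |cover(X)|.
def cover(db, X):
    return {t for t, items in db.items() if set(X) <= items}

def bsize(db, X):
    return len(set(X)) * len(cover(db, X))

db = {"T1": {"a","b","c","e"}, "T2": {"b","c","d","e"},
      "T3": {"a","d","e"}, "T4": {"c","d","e"}}

# With the constraint BSize >= 4:
print(bsize(db, {"b", "c"}))       # 4 -> valid
print(bsize(db, {"b", "c", "d"}))  # 3 -> invalid, so not monotone
print(bsize(db, {"b"}))            # 2 -> invalid, so not anti-monotone
print(bsize(db, {"c"}))            # 3 -> invalid
```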
Block-size and Block-sum Constraints
Block-size constraint. Motivation: find a set of itemsets, each of which accounts for a certain fraction of the overall number of transactions performed in a period of time.
Block-sum constraint. Motivation: identify product groups that account for a certain fraction of the overall sales, profits, etc.
The constraints are defined as
  BSize(B) >= v * N, where 0 < v <= 1 and N = Σ_t length(t),
  BSum(B) >= v * W, where 0 < v <= 1 and W = Σ_{i,t} w(i,t).
Block-similarity Definition
Motivation: finding groups of thematically related words in large document datasets. The importance of a group of words can be measured by its contribution to the overall similarity between the documents in the collection.
Here the transactions are the tf-idf scaled and normalized unit-length document vectors, and the items are the distinct terms in the collection.
The block-similarity of a weighted block B is defined as the loss in the aggregate pairwise similarity of the documents in the collection resulting from zeroing out the entries corresponding to B:
BSim(B) = S - S', where S and S' are the aggregate pairwise similarities before and after removing B.
Block Similarity - Illustration

Original matrix:
   a  b  c  d  e
D1 .1 .1 .1 0  .1
D2 0  .1 .1 .3 .4
D3 .1 0  0  .3 .3
D4 0  0  .3 .2 .1

After removing ({b,c},{D1,D2}):
   a  b  c  d  e
D1 .1 0  0  0  .1
D2 0  0  0  .3 .4
D3 .1 0  0  .3 .3
D4 0  0  .3 .2 .1

({b,c},{D1,D2}) is removed here to calculate its block-similarity, by measuring the loss in the aggregate similarity.
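The illustration can be replayed in a few lines; `pairwise_sim` and `bsim` are our helper names, and on this D1-D4 data the loss for ({b,c},{D1,D2}) comes out to 0.08:

```python
def pairwise_sim(vecs):
    """Aggregate pairwise dot-product similarity over distinct document pairs."""
    names = sorted(vecs)
    return sum(sum(a * b for a, b in zip(vecs[x], vecs[y]))
               for k, x in enumerate(names) for y in names[k + 1:])

def bsim(vecs, I, T, dims):
    """Loss in aggregate similarity after zeroing the block's entries."""
    cols = [dims.index(i) for i in I]
    pruned = {d: [0.0 if (d in T and j in cols) else v
                  for j, v in enumerate(vec)]
              for d, vec in vecs.items()}
    return pairwise_sim(vecs) - pairwise_sim(pruned)

dims = ["a", "b", "c", "d", "e"]
docs = {"D1": [.1, .1, .1, 0, .1], "D2": [0, .1, .1, .3, .4],
        "D3": [.1, 0, 0, .3, .3],  "D4": [0, 0, .3, .2, .1]}
print(round(bsim(docs, {"b", "c"}, {"D1", "D2"}, dims), 4))  # 0.08
```

(The slide's vectors are not unit-length; the sketch simply follows the numbers shown.)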
Block-similarity contd.,
The similarity of any two documents is measured as the dot product of their unit-length vectors (cosine similarity).
For the given collection, we define the composite vector D to be the sum of all document vectors in the collection:
  D = Σ_d d
We define the composite vector B_I for a weighted block B = (I,T) to be the vector formed by adding the vectors of the documents in T along the dimensions in I only. Then, with the aggregate pairwise similarity
  S = Σ_{d_i, d_j, i != j} d_i · d_j
the block-similarity is
  BSim(B) = S - S', where S' is the aggregate similarity of the pruned vectors (whose composite vector is D' = D - B_I).
The block-similarity constraint is now defined as
  BSim(B) >= v * S, where 0 < v <= 1.
Key Features of the Algorithm
Follows the widely used projection-based pattern mining paradigm.
Adopts a depth-first traversal of the lattice of the complete set of itemsets, with the items ordered non-decreasingly by frequency.
Represents the transaction/document database as a matrix, with transactions (documents) as rows and items (terms) as columns.
Employs efficient compressed sparse matrix storage and access schemes to achieve high computational efficiency.
Matrix-projection based pattern enumeration shares ideas with the recently developed array-projection based method H-Mine.
Prunes potentially invalid rows and columns at each node during the traversal of the lattice (shown on the next page), as determined by our row-pruning, column-pruning and matrix-pruning tests.
Adapts various closed itemset mining optimization techniques, like column fusing and redundant pattern pruning from CHARM and Closet+, to the block constraints.
The hash-table consists of only closed patterns, hashed by the sum of the transaction-ids of the transactions in their supporting sets.
The itemset lattice over {a,b,c,d} traversed by the algorithm:
Ø
Level 1: a  b  c  d
Level 2: ab  ac  ad  bc  bd  cd
Level 3: abc  abd  acd  bcd
Level 4: abcd
Pattern Enumeration
Visits each node in the lattice in depth-first order. Each node represents a distinct pattern p.
At a node labeled p in the lattice, we report and store p in the hash table as a closed pattern if p is closed and valid under the given block constraint.
We build a p-projected matrix by pruning any potentially invalid columns and rows, as determined by our pruning tests.
Matrix-Projection
A p-projected matrix is the matrix containing only the rows that contain p, and the columns that appear after p in the predefined item order.
Projecting the matrix is linear in the number of non-zeros in the projected matrix.

Given matrix:
   a b c d e
T1 1 1 1 0 1
T2 0 1 1 1 1
T3 1 0 0 1 1
T4 0 0 1 1 1

{b}-projected matrix:
   c d e
T1 1 0 1
T2 1 1 1
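The projection step can be sketched as follows, assuming rows are stored as item sets; `project` is a hypothetical helper, not the CBMiner implementation:

```python
def project(matrix, items_order, p_item):
    """Build the {p_item}-projected matrix: keep only the rows that
    contain p_item, and in them only the columns that appear strictly
    after p_item in the predefined item order."""
    later = set(items_order[items_order.index(p_item) + 1:])
    return {row: items & later
            for row, items in matrix.items() if p_item in items}

matrix = {"T1": {"a","b","c","e"}, "T2": {"b","c","d","e"},
          "T3": {"a","d","e"}, "T4": {"c","d","e"}}
order = ["a", "b", "c", "d", "e"]
proj = project(matrix, order, "b")
# Only rows T1 and T2 survive, restricted to columns c, d, e.
```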
Compressed Sparse Representation
The CSR format uses two one-dimensional arrays:
The index array stores the non-zero elements of the matrix, row by row (or column by column).
The pointer array stores the indices at which each row (or column) begins.
We maintain both row- and column-based representations, for efficient projection and frequency counting.

CSR format for the example matrix:
   a b c d e
T1 1 1 1 0 1
T2 0 1 1 1 1
T3 1 0 0 1 1
T4 0 0 1 1 1

Row-based CSR:
Pointer array (rows T1 T2 T3 T4, plus end): 0 4 8 11 14
Index array (positions 0-13): a b c e | b c d e | a d e | c d e
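The two CSR arrays can be rebuilt with a small sketch (the `to_csr` helper is ours); on the example matrix it reproduces the pointer and index arrays shown above:

```python
def to_csr(matrix, row_order):
    """Row-based CSR: `indices` holds the items of each row back-to-back,
    and `ptr[k]` marks where row k starts (plus one trailing end marker)."""
    indices, ptr = [], [0]
    for r in row_order:
        indices.extend(sorted(matrix[r]))
        ptr.append(len(indices))
    return indices, ptr

matrix = {"T1": {"a","b","c","e"}, "T2": {"b","c","d","e"},
          "T3": {"a","d","e"}, "T4": {"c","d","e"}}
indices, ptr = to_csr(matrix, ["T1", "T2", "T3", "T4"])
print(indices)  # ['a','b','c','e','b','c','d','e','a','d','e','c','d','e']
print(ptr)      # [0, 4, 8, 11, 14]
```

Row T3, for example, occupies `indices[ptr[2]:ptr[3]]`, i.e. positions 8-10.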
Search Space Pruning - Column Pruning
Given a pattern p and its p-projected matrix, we derive a necessary condition for the columns that can form a valid block with p, and eliminate all columns of the p-projected matrix that do not satisfy it.
Block-size: let A_x be the local supporting set of column x, and rlen(t) the local row-length of row t. Then
  Prune x if BSize(p, A_x) + Σ_{t in A_x} rlen(t) < v

Example: {b}-projected matrix
   c d e
T1 1 0 1
T2 1 1 1
Let BSize >= 5 be the constraint. Column 'd' gets pruned, as it can never form a block of size >= 5 with its prefix {b}: the maximum block-size possible with 'd' is 4.
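The block-size column-pruning test can be sketched as below, assuming the bound is |p| * |A_x| plus the local row-lengths over A_x; `prune_column` is an illustrative helper:

```python
def prune_column(p_len, proj, x, v):
    """Block-size column-pruning test: with A_x the local supporting set
    of column x and rlen(t) the local row-length, the largest block x can
    form with the prefix p has size at most p_len*|A_x| + sum of rlen(t)."""
    A_x = {t for t, items in proj.items() if x in items}
    bound = p_len * len(A_x) + sum(len(proj[t]) for t in A_x)
    return bound < v

proj = {"T1": {"c", "e"}, "T2": {"c", "d", "e"}}  # {b}-projected matrix
v = 5
print(prune_column(1, proj, "d", v))  # True  -> 'd' is pruned (bound = 4)
print(prune_column(1, proj, "c", v))  # False -> 'c' survives (bound = 7)
```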
Search Space Pruning contd., Column Pruning
Block-sum: let rsum(t) be the local row-sum of row t. Then
  Prune x if BSum(p, A_x) + Σ_{t in A_x} rsum(t) < v
Block-similarity: let d_max be the maximum value of the composite vector D, r_max the local maximum row-sum, freq(x) the frequency of x, and a = 2·D·B_p. Then
  Prune x if freq(x) < (v - a) / (2 · d_max · r_max)
Search Space Pruning contd., Row Pruning
Smallest Valid Extension (SVE): SVE(p) is the length of the smallest possible extension q to p such that the resulting block formed by p and q is valid.
Prune rows in the p-projected matrix whose length is smaller than SVE(p).
For a generic block constraint BSxxx, the SVE is
  SVE(p) = (v - BSxxx(B)) / z
where:
Block-size: z = size of the supporting set of p.
Block-sum: z = maximum column sum in the p-projected matrix.
Block-similarity: z = maximum column similarity in the p-projected matrix.
Search Space Pruning contd., Row Pruning Example / Matrix Pruning
Matrix pruning: prune the entire p-projected matrix if what remains cannot form a valid block with p, i.e., if
Block-size: the sum of the row-lengths in the projected matrix is insufficient.
Block-sum: the sum of the row-sums is insufficient.
Block-similarity: the sum of the column-similarities is insufficient.

Row pruning example: {b}-projected matrix
   c d e
T1 1 0 1
T2 1 1 1
Let BSum >= 7 be the constraint. Since SVE >= 3, row T1 (of length 2) gets pruned.
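The SVE computation for the block-sum example can be sketched as below, assuming unit weights, BSum({b},{T1,T2}) = 2 for the current block, and rounding the ratio up to the next integer (as the example implies); `sve_bsum` is our helper name:

```python
from math import ceil

def sve_bsum(v, current_bsum, proj):
    """Smallest valid extension under a block-sum constraint:
    z is the maximum column sum in the p-projected matrix (unit weights)."""
    cols = {}
    for items in proj.values():
        for i in items:
            cols[i] = cols.get(i, 0) + 1
    z = max(cols.values())
    return ceil((v - current_bsum) / z)

proj = {"T1": {"c", "e"}, "T2": {"c", "d", "e"}}  # {b}-projected matrix
sve = sve_bsum(7, 2, proj)
print(sve)                                               # 3
print([t for t, row in proj.items() if len(row) < sve])  # ['T1'] is pruned
```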
Pattern Closure Check and Optimizations
Closure check: the hash-table consists of closed patterns, with the sums of transaction-ids as hash-keys; at each node p of the lattice, p is checked against the stored patterns.
Column fusing: fuse the fully dense columns of the p-projected matrix into p; also fuse columns that have identical supporting sets with one another.
Redundant pattern pruning: if p is a proper subset of an already mined closed pattern with the same support, it can safely be pruned, and no pattern extending p needs to be explored, as that has already been done. Hence p is a redundant pattern.
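The hash-table closure check can be sketched as follows; `ClosedPatternTable` is an illustrative class (not the CBMiner data structure), and it uses the stricter same-supporting-set test for subsumption:

```python
# Closed patterns are kept in a hash-table keyed by the sum of the
# transaction-ids of their supporting set; a new pattern p is redundant
# if a stored pattern properly contains p with the same supporting set.
class ClosedPatternTable:
    def __init__(self):
        self.table = {}  # key: sum of tids -> list of (itemset, tidset)

    def is_redundant(self, itemset, tidset):
        for stored_items, stored_tids in self.table.get(sum(tidset), []):
            if itemset < stored_items and stored_tids == tidset:
                return True
        return False

    def add(self, itemset, tidset):
        self.table.setdefault(sum(tidset), []).append((itemset, tidset))

tbl = ClosedPatternTable()
tbl.add(frozenset({"d", "e"}), frozenset({2, 3, 4}))
print(tbl.is_redundant(frozenset({"d"}), frozenset({2, 3, 4})))  # True
print(tbl.is_redundant(frozenset({"c"}), frozenset({1, 2, 4})))  # False
```

The tid-sum key makes the lookup cheap: only patterns whose supporting sets sum to the same value need to be compared.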
Experimental Setup

Data        #Trans      #Items   Avg.(Max.) trans. len
Gazelle     59601       498      2.5 (267)
Pumsb*      49046       2089     50.5 (63)
Big-market  838466      38336    3.12 (90)
Sports      8580        126373   258.3 (2344)
T10I4Dx     200k-1000k  10000    10 (31)
Notation :
CBMiner – Closed Block Miner Algorithm
CLOSET+ - State-of-the-art closed frequent itemset mining algorithm
CP – Column Pruning, RP – Row Pruning, MP – Matrix Pruning.
Experimental Results
Comparisons with Closet+ on Gazelle
Experimental Results contd.,
Comparisons with Closet+ on Sports
Experimental Results contd.,
Comparisons of Pruning Techniques on Gazelle (left) and Pumsb*(right)
No pruning: Gazelle 1578.48 (BSize >= 0.1); Pumsb* 1330.03 (BSum >= 6.0)
Experimental Results contd.,
Comparisons Closed & All Valid Block Mining Big-Market
Experimental Results contd.,
Comparison of Pruning Techniques on Big-Market
Scalability Test on T10I4Dx
Time for No Pruning : 3560 seconds
Micro Concept Discovery
Scaled the document vectors using tf-idf and normalized them to unit length using the L2-norm.
Applied the CBMiner algorithm for each of the three constraints.
Chose the top-1000 patterns ranked on the constraint function value.
Computed the entropies of the documents that form the supporting set of each block.
Also ran CLOSET+ to get the top-1000 patterns ranked on frequency.
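The entropy used to score a block's supporting set can be sketched as below (the `entropy` helper is ours); lower values mean purer, more thematic groups:

```python
from math import log2

def entropy(labels):
    """Entropy of the class labels of the documents in a block's
    supporting set; 0 means all documents share one class."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return sum(-(k / n) * log2(k / n) for k in counts.values())

print(entropy(["sports"] * 4))              # 0.0 (pure group)
print(entropy(["sports", "politics"] * 2))  # 1.0 (evenly mixed, 2 classes)
```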
Micro Concept Discovery contd.,
Data     #docs   #terms   #classes
Classic  7089    12009    4
Sports   8580    18324    7
LA1      3204    31472    6
The average entropies of the four schemes are quite low.
Block-similarity outperforms the rest, as it leads to the lowest entropies, i.e., the purest clusters.
The block-size and itemset-frequency constraints do not account for the weights associated with the terms and hence are inconsistent.
Block-sum, however, performs reasonably well, as it accounts for the term weights provided by the tf-idf scaling and L2 normalization.
Conclusions
Proposed a new class of constraints called "tough" block constraints, and a matrix-projection based framework, CBMiner, for mining closed block patterns.
Block constraints discussed: Block-size, Block-sum, Block-similarity.
Three novel pruning techniques: column pruning, row pruning and matrix pruning.
CBMiner is order(s) of magnitude faster than traditional closed frequent itemset mining algorithms, and finds far fewer patterns.
Thank You !!