1

Efficient Mining of Recurrent Rules from a Sequence Database

David Lo†*

Joint work with: Siau-Cheng Khoo† and Chao Liu‡

† Prog. Lang. & Sys. Lab, Dept of Comp. Science, National Uni. of Singapore
  (Current: Sch. of Info. Systems, Singapore Management Uni.)

‡ Data Mining Group, Department of Computer Science, Uni. of Illinois at Urbana-Champaign
  (Current: Microsoft Research, Redmond)

2

Motivation

o A huge amount of data exists; we want to mine knowledge from it.

o Recurrent rules: “Whenever a series of precedent events (pre) occurs, eventually another series of consequent events (post) occurs.”
  Denoted as: pre->post

o We want to mine recurrent rules from a sequence database.

3

Recurrent Rules – Intuitive Examples

o Locking Protocol
  “Whenever a lock is acquired, eventually it is released.”

o Internet Banking
  “Whenever a connection to a bank server is made and authentication is completed, a money transfer command is issued and verified, eventually money is transferred and a notification is displayed.”

4

Soft. Specifications & Recurrent Rule

o Recurrent rule
  – Corresponds to a family of program properties useful for software verification
o Formalized in Linear Temporal Logic
o Software specs are often incomplete or outdated, which motivates mining for them [ABL02,DSB04,LKL07]
o Mining specifications helps in:
  – Understanding existing/legacy systems
  – Helping verification tools ensure correctness of systems and detect bugs

5

Problem Statements

Problem 1
“Given a set of sequences, find rules that recur (are satisfied) a significant number of times within a sequence and across multiple sequences. A rule is significant if it satisfies minimum thresholds of supports and confidence.”

Problem 2
“Mine a set of non-redundant significant recurrent rules.”

6

Extending Sequential Rules [S99]

o Sequential rule pre->post:
  – Rules formed by composing sequential patterns [AS95,YHA03,WH04]: series of events supported by (i.e., a sub-sequence of) a significant number of sequences.
  – Whenever a sequence is a super-sequence of pre, it will also be a super-sequence of pre++post.

o Recurrent rule:
  – Multiple occurrences of the rule’s premise and consequent, both within a sequence and across multiple sequences, are considered.

7

Extending Episode Rules [MTV97]

o Episode rule pre->post:
  – Episode: series of events occurring close together (e.g., in a window).
  – Whenever a window is a super-sequence of pre, it will also be a super-sequence of pre++post.

o Recurrent rule:
  – Handles multiple sequences
  – We want to break the window barrier
    – It is hard to tell the right window size
    – A lock may be separated from its unlock by an arbitrary number of events
  – We mine a non-redundant set of rules

8

Preliminaries

9

Linear Temporal Logic (LTL)

o Formalism to precisely specify temporal requirements.
o It works on paths [HR03]
o There are a number of operators:
  – G p: Globally, at every point in time, p holds
  – F p: At that point in time or eventually (Finally), p holds
  – X p: p holds at the neXt point in time

Rule              LTL
a -> b            G(a->XF(b))
<a,b> -> <c,d>    G(a->XG(b->XF(c^XF(d))))
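The two rows of the Rule/LTL table suggest a recursive translation pattern. The sketch below is a hypothetical helper that reproduces exactly those two rows. Its behaviour on longer rules is an extrapolation from the table, not something the slides state. The function names `post_to_ltl` and `rule_to_ltl` are illustrative only.

```python
def post_to_ltl(post):
    """Consequent <e1,e2,...> becomes XF(e1^XF(e2^...)) (pattern assumed from the table)."""
    head, rest = post[0], post[1:]
    if not rest:
        return f"XF({head})"
    return f"XF({head}^{post_to_ltl(rest)})"


def rule_to_ltl(pre, post):
    """Premise <p1,p2,...> becomes G(p1->XG(p2->...<consequent>)) (pattern assumed from the table)."""
    head, rest = pre[0], pre[1:]
    if not rest:
        return f"G({head}->{post_to_ltl(post)})"
    return f"G({head}->X{rule_to_ltl(rest, post)})"


print(rule_to_ltl(["a"], ["b"]))            # G(a->XF(b))
print(rule_to_ltl(["a", "b"], ["c", "d"]))  # G(a->XG(b->XF(c^XF(d))))
```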

10

Checking or Verifying Temporal Logics

Program:
  main(x){
    if (lock=0) lock;use;unlock;lock;
    else for i: 1 to 10 lock;use;unlock
  }

Transform: the program is transformed into an automata model.

Possible Traces or Sequences:
  main lock use unlock lock end
  main lock use unlock lock use unlock end
  main lock use unlock end
  …

LTL property to check: <main,lock> -> <unlock,end>

[Figure: automata model of the program (states main, lock, use, unlock, end), checked against the LTL property; a violation is flagged.]
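As a rough illustration of what the check amounts to on finite traces, the sketch below tests the rule <main,lock> -> <unlock,end> against the three traces listed on the slide. It assumes a finite-trace reading of the rule: after every point where the premise has just completed, the consequent must appear later as a sub-sequence. The helper names are hypothetical, and this is not the model-checking procedure a verification tool would run on the automaton.

```python
def is_subseq(p, s):
    """True if p is a (not necessarily contiguous) sub-sequence of s."""
    it = iter(s)
    return all(x in it for x in p)


def occurrences(p, s):
    """1-based points j where p is a sub-sequence of s[:j] and s[j] is the last event of p."""
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def satisfies(pre, post, trace):
    """The rule holds if post follows every point where pre has just completed."""
    return all(is_subseq(post, trace[j:]) for j in occurrences(pre, trace))


traces = [
    "main lock use unlock lock end".split(),
    "main lock use unlock lock use unlock end".split(),
    "main lock use unlock end".split(),
]
pre, post = ["main", "lock"], ["unlock", "end"]
for t in traces:
    print(" ".join(t), "->", "ok" if satisfies(pre, post, t) else "violation")
```

Under this reading only the first trace is a violation: its second lock is never followed by an unlock.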

11

Concepts, Definitions and Rules Semantics

12

Temporal Points

“Whenever a series of precedent events occurs at a point in time or temporal point, eventually another series of consequent events occurs.”

- Peek at interesting temporal points and see what series of events are likely to happen next
- Temporal points in a sequence S
  - The indices in S, starting from 1.
  - Consider the sequence <a,b,a,b,a,c>. There are 6 temporal points in the sequence.
  - For a temporal point j in S = <a1,…,an>, the prefix <a1,…,aj> of S is called the j-prefix of S.

13

Occurrences & Instances

o Consider a pattern P and a sequence S
o The set of all occurrences of P in S, Occ(P,S), is the set:
  {j | P ⊑ j-prefix of S && last(P) = S[j]}, where ⊑ means “is a sub-sequence of”
o The set of all instances of P in S, Inst(P,S), is the set:
  {j-prefix of S | j is in Occ(P,S)}

o Consider the sequence <A,B,A,B,A,B>
  – The set of occurrences of <A,B> is {2,4,6}
  – The instances of <A,B> are: {<A,B>, <A,B,A,B>, <A,B,A,B,A,B>}
  – These correspond to the temporal points to be checked for rules with <A,B> as premise
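The definitions of Occ and Inst can be written down almost literally. The following is a minimal sketch in which sequences are assumed to be Python lists of event labels; the function names `occ` and `inst` are illustrative.

```python
def is_subseq(p, s):
    """True if p is a (not necessarily contiguous) sub-sequence of s."""
    it = iter(s)
    return all(x in it for x in p)


def occ(p, s):
    """Occ(P,S): 1-based temporal points j with P a sub-sequence of the j-prefix and last(P) = S[j]."""
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def inst(p, s):
    """Inst(P,S): the j-prefixes of S for every j in Occ(P,S)."""
    return [s[:j] for j in occ(p, s)]


S = list("ABABAB")
print(occ(["A", "B"], S))   # [2, 4, 6]
print(inst(["A", "B"], S))  # the prefixes of length 2, 4 and 6
```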

14

Projected and Projected-all DB

o A database SeqDB projected on pattern P is defined as:
  SeqDB_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is the minimal prefix of s containing P}

SeqDB
ID    Sequence
S1    <a,b,e,a,b,c>
S2    <a,c,b,e,a,e,b,c>

SeqDB_{<a,b>}
ID    Sequence
S1    <e,a,b,c>
S2    <e,a,e,b,c>

15

Projected and Projected-all DB

o A database SeqDB projected-all on pattern P is defined as:
  SeqDB^{all}_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is an instance of P}

o Returns the temporal points to check

SeqDB
ID    Sequence
S1    <a,b,e,a,b,c>
S2    <a,c,b,e,a,e,b,c>

SeqDB^{all}_{<a,b>}
ID      Sequence
S1i     <e,a,b,c>
S1ii    <c>
S2i     <e,a,e,b,c>
S2ii    <c>
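The two projections differ only in how many prefixes are cut off per sequence: the plain projection keeps the suffix after the minimal prefix containing P, while projected-all keeps one suffix per instance of P. The sketch below assumes a database stored as a dict from sequence ID to a list of events; `project` and `project_all` are hypothetical names.

```python
def is_subseq(p, s):
    it = iter(s)
    return all(x in it for x in p)


def occ(p, s):
    """1-based temporal points where an instance of p ends in s."""
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def project(seqdb, p):
    """Suffix after the minimal prefix containing p, for each sequence that contains p."""
    out = {}
    for sid, s in seqdb.items():
        points = occ(p, s)
        if points:
            out[sid] = s[points[0]:]  # the first occurrence ends the minimal prefix
    return out


def project_all(seqdb, p):
    """One (sequence ID, suffix) entry per instance of p."""
    return [(sid, s[j:]) for sid, s in seqdb.items() for j in occ(p, s)]


seqdb = {"S1": list("abeabc"), "S2": list("acbeaebc")}
print(project(seqdb, ["a", "b"]))
print(project_all(seqdb, ["a", "b"]))
```

On the example database this yields the two suffixes of the SeqDB_{<a,b>} table and the four entries of the projected-all table.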

16

Counting Supports and Confidence

o Consider the rule pre->post
o Sequence support (s-sup): the number of sequences where the premise pre appears.
o Instance support (i-sup): the number of instances of pre++post.
o Confidence (conf): the likelihood that post appears after pre. This can be found by computing the ratio of the number of instances of pre where post eventually occurs afterwards to the number of instances of pre:

\[
\mathrm{conf}(pre \to post) = \frac{|(SeqDB^{all}_{pre})_{post}|}{|SeqDB^{all}_{pre}|}
\]

17

Counting Supports and Confidence

Seq ID    Sequence
S1        <a,b,c,e,a,b,c>
S2        <a,c,b,e,a,e,b,c>

s-sup(<a,b>-><c>) = 2
i-sup(<a,b>-><c>) = 3
conf(<a,b>-><c>) = 1.0
conf(<a,b>-><e>) = 0.5
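The four values above can be recomputed directly from the definitions. The sketch below is a naive, unoptimized rendering that assumes sequences are lists of event labels and that s-sup counts the sequences containing the premise; the function names are illustrative.

```python
def is_subseq(p, s):
    it = iter(s)
    return all(x in it for x in p)


def occ(p, s):
    """1-based temporal points where an instance of p ends in s."""
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def s_sup(pre, db):
    """Number of sequences in which the premise appears."""
    return sum(1 for s in db if is_subseq(pre, s))


def i_sup(pre, post, db):
    """Number of instances of pre++post over all sequences."""
    return sum(len(occ(pre + post, s)) for s in db)


def conf(pre, post, db):
    """Fraction of instances of pre that are eventually followed by post."""
    suffixes = [s[j:] for s in db for j in occ(pre, s)]  # the projected-all database
    if not suffixes:
        return 0.0
    return sum(1 for sx in suffixes if is_subseq(post, sx)) / len(suffixes)


db = [list("abceabc"), list("acbeaebc")]
ab = ["a", "b"]
print(s_sup(ab, db), i_sup(ab, ["c"], db), conf(ab, ["c"], db), conf(ab, ["e"], db))
# 2 3 1.0 0.5
```

For conf(<a,b>-><e>), the second instance of <a,b> in each sequence is followed only by c, so two of the four instances fail.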

18

Properties, Theorems, and Algorithms

19

Apriori Properties – Support & Conf.

Theorem 1. Consider two rules Rx = p->c and Ry = q->c. If p ⊑ q (p is a sub-sequence of q) and s-sup(Rx) < min-s-sup, then s-sup(Ry) < min-s-sup.

  Rx: a -> z ; s-sup(Rx) < min_s-sup
  Non-significant Rys: a,b -> z ; a,b,c -> z ; a,c -> z ; a,b,d -> z ; ….

Theorem 2. Consider two rules Rx = p->c and Ry = p->d. If c ⊑ d (c is a sub-sequence of d) and conf(Rx) < min-conf, then conf(Ry) < min-conf.

  Rx: a -> z ; conf(Rx) < min_conf
  Non-significant Rys: a -> b,z ; a -> b,c,z ; a -> c,z ; a -> b,d,z ; ….

20

Rule Redundancy

o Consider two rules Rx = p->c and Ry = q->d. Rx is redundant if the following conditions hold:
  1. Rx is a sub-sequence of Ry (i.e., p++c ⊑ q++d)
  2. Rx and Ry have the same sup. and conf. values.

o Redundant rules are identified and removed early during the mining process.

Example: given the rule a -> b,c,d, the rules a -> b ; a -> c ; a -> b,c ; a -> b,d ; …. are redundant iff their sup and conf are the same as those of a -> b,c,d.
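The two redundancy conditions can be checked pairwise by brute force. The sketch below makes two assumptions that the slide does not spell out: "same sup." is taken to mean both s-sup and i-sup, and two rules with identical concatenations are never treated as redundant with respect to each other. Rules are represented as (pre, post) pairs of tuples, and the example database is the one from the supports and confidence slide.

```python
def is_subseq(p, s):
    it = iter(s)
    return all(x in it for x in p)


def occ(p, s):
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def stats(rule, db):
    """(s-sup, i-sup, conf) of a rule given as a (pre, post) pair of tuples."""
    pre, post = rule
    s_sup = sum(is_subseq(pre, s) for s in db)
    i_sup = sum(len(occ(pre + post, s)) for s in db)
    suffixes = [s[j:] for s in db for j in occ(pre, s)]
    conf = sum(is_subseq(post, sx) for sx in suffixes) / len(suffixes) if suffixes else 0.0
    return s_sup, i_sup, conf


def is_redundant(rx, ry, db):
    """Condition 1: rx's concatenation is a proper sub-sequence of ry's. Condition 2: equal stats."""
    cx, cy = rx[0] + rx[1], ry[0] + ry[1]
    return cx != cy and is_subseq(cx, cy) and stats(rx, db) == stats(ry, db)


db = [tuple("abceabc"), tuple("acbeaebc")]
big = (("a", "b"), ("c",))
print(is_redundant((("b",), ("c",)), big, db))  # True: same s-sup, i-sup and conf
print(is_redundant((("a",), ("c",)), big, db))  # False: i-sup differs (4 vs 3)
```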

21

Theorem 3. Given two pre-conditions PX and PY where PX ⊑ PY (PX is a sub-sequence of PY), if SeqDB_{PX} = SeqDB_{PY} then for all sequences of events post, the rule PX -> post is rendered redundant by PY -> post.

  <a,b,c,d> -> post
  Redundant Rules: <a,d> -> post ; <a,c,d> -> post ; ….

Theorem 4. Given two rules RX (pre -> CX) and RY (pre -> CY), if CX ⊑ CY and (SeqDB^{all}_{pre})_{CX} = (SeqDB^{all}_{pre})_{CY} then RX is rendered redundant by RY and can be pruned.

  pre -> <b,c,d,e>
  Redundant Rules: pre -> <c,d,e> ; pre -> <d,e> ; ….

22

Algorithm

o Step 1: Mine a pruned set of pre-conditions
  – Satisfy min-s-sup threshold
  – Use Theorems 1 & 3

o Step 2: For each pre-cond. pre, create SeqDB^{all}_{pre}.

o Step 3: Mine a pruned set of post-conditions
  – Corresponding rules satisfy min-conf.
  – Use Theorems 2 & 4

o Step 4: Remove rules that don’t satisfy min-i-sup.

o Step 5: Filter any remaining redundant rules.
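To make the five steps concrete, here is a deliberately naive sketch of the pipeline. It is not the authors' implementation: it uses only the apriori pruning of Theorems 1 and 2, omits the early redundancy pruning of Theorems 3 and 4, and removes redundant rules by pairwise comparison at the end, keeping rules whose concatenations tie. It assumes min_s_sup >= 1 and min_conf > 0 so that the enumeration terminates. All names are hypothetical.

```python
from itertools import chain


def is_subseq(p, s):
    it = iter(s)
    return all(x in it for x in p)


def occ(p, s):
    return [j for j in range(1, len(s) + 1)
            if s[j - 1] == p[-1] and is_subseq(p, s[:j])]


def s_sup(pre, db):
    return sum(is_subseq(pre, s) for s in db)


def conf(post, proj_all):
    return sum(is_subseq(post, sx) for sx in proj_all) / len(proj_all)


def grow(events, keep, prefix=()):
    """Depth-first pattern growth; a pattern is extended only while keep(pattern) holds."""
    for e in events:
        p = prefix + (e,)
        if keep(p):
            yield p
            yield from grow(events, keep, p)


def mine(db, min_s_sup, min_i_sup, min_conf):
    events = sorted(set(chain.from_iterable(db)))
    rules = {}
    # Step 1: pre-conditions meeting min-s-sup (extensions pruned per Theorem 1)
    for pre in grow(events, lambda p: s_sup(p, db) >= min_s_sup):
        # Step 2: projected-all database of the pre-condition
        proj = [s[j:] for s in db for j in occ(pre, s)]
        # Step 3: post-conditions meeting min-conf (extensions pruned per Theorem 2)
        for post in grow(events, lambda c: conf(c, proj) >= min_conf):
            # Step 4: keep only rules meeting min-i-sup
            isup = sum(len(occ(pre + post, s)) for s in db)
            if isup >= min_i_sup:
                rules[(pre, post)] = (s_sup(pre, db), isup, conf(post, proj))

    # Step 5: drop rules subsumed by a strictly longer rule with identical statistics
    def redundant(r):
        cr = r[0] + r[1]
        return any(len(cr) < len(q[0] + q[1]) and rules[r] == rules[q]
                   and is_subseq(cr, q[0] + q[1]) for q in rules)

    return {r: v for r, v in rules.items() if not redundant(r)}


db = [tuple("abceabc"), tuple("acbeaebc")]
result = mine(db, min_s_sup=2, min_i_sup=1, min_conf=1.0)
print(result[(("a", "b"), ("c",))])      # (2, 3, 1.0)
print((("b",), ("c",)) in result)        # False: made redundant by <a,b> -> <c>
```

The design choice here is clarity over speed: every support and confidence value is recomputed from scratch, whereas the approach described on the slides prunes redundant pre- and post-conditions during the search.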

23

Equiv. Proj DB & LS-Set Patterns

o From Theorem 3 (& 4), a pre- (post-) condition is not pruned iff there does not exist any super-sequence pattern having the same projected database.

o Also referred to as projected-database closed or LS-Set (Yan and Han, 2003)

o We generate this set by modifying BIDE (Wang and Han, 2004)
  – Keep the search space pruning strategy
  – Remove the closure checks
  – Proof of completeness in technical report

24

Mine Pruned Pre-Conds

Mine Pruned Post-Conds

Check Instance Support & Remove Remaining Red. Rules

25

Performance & Case Study

26

Synthetic Dataset D5C20N10S20

[Figure: two plots against min_s-sup (%) from 0.4 to 0.6, comparing Full and NR: runtime (s), log scale; and |Rules|, log scale.]

147x Faster, 8500x More Compact

27

Gazelle Dataset (KDD Cup 2000)

Full set of significant rules is not minable.

[Figure: two plots against min_s-sup (%) from 0.034 to 0.062, comparing Full and NR: |Rules|, log scale; and runtime (s), log scale. Full is marked “not mine-able” in both plots.]

28

JBoss Security

Premise:
  XLoginConfImpl.getCfgEntry()
  AuthenticationInfo.getName()

Consequent:
  ClientLoginModule.initialize()
  ClientLoginModule.login()
  ClientLoginModule.commit()
  SecAssocActs.setPrincipalInfo()
  SetPrincipalInfoAction.run()
  SecAssocActs.pushSubjectCtx()
  SubjectThdLocalStack.push()
  SimplePrincipal.toString()
  SecAssoc.getPrincipal()
  SecAssoc.getCredential()
  SecAssoc.getPrincipal()
  SecAssoc.getCredential()

Whenever login configuration information is checked, eventually invocations of authentication events, binding of principal to subject, and utilization of subject & principal information occur.

29

Conclusion

o We propose a novel framework to mine a non-redundant set of significant recurrent rules: “Whenever a series of precedent events occurs, eventually a series of consequent events occurs”

o Employ 2 apriori properties and 2 redundancy theorems

o Major speedup and reduction of rules by non-redundant rule mining strategy.

o We show the utility in mining behavior of JBoss Security

Future Work

o Improve mining speed
o More case studies and applications to DM/SE problems