1
Efficient Mining of Recurrent Rules from a Sequence Database
David Lo†*
Joint work with: Siau-Cheng Khoo† and Chao Liu‡
† Prog. Lang. & Sys. Lab, Dept. of Comp. Science, National Uni. of Singapore (Current: Sch. of Info. Systems, Singapore Management Uni.)
‡ Data Mining Group, Department of Computer Science, Uni. of Illinois at Urbana-Champaign (Current: Microsoft Research, Redmond)
2
Motivation
o Huge amounts of data exist; we want to mine knowledge from that data.
o Recurrent rules: “Whenever a series of precedent events (pre) occurs, eventually another series of consequent events (post) occurs.”
– Denoted as: pre -> post
o We want to mine recurrent rules from a sequence database.
3
Recurrent Rules – Intuitive Examples
o Locking Protocol
– “Whenever a lock is acquired, eventually it is released.”
o Internet Banking
– “Whenever a connection to a bank server is made and authentication is completed, money transfer command is issued and verified, eventually money is transferred and notification is displayed.”
4
Soft. Specifications & Recurrent Rule
o Recurrent rule
– Corresponds to a family of program properties useful for software verification
– Formalized in Linear Temporal Logic
o Software specs are often incomplete or outdated, which motivates mining them [ABL02,DSB04,LKL07]
o Mining specifications helps in:
– Understanding existing/legacy systems
– Helping verification tools ensure correctness of systems and detect bugs
5
Problem Statements
Problem 1
“Given a set of sequences, find rules that recur (are satisfied) a significant number of times within a sequence and across multiple sequences. A rule is significant if it satisfies minimum thresholds of supports and confidence.”
Problem 2
“Mine a set of non-redundant significant recurrent rules.”
6
Extending Sequential Rules [S99]
o Sequential rule pre->post:
– Rules formed by composing sequential patterns [AS95,YHA03,WH04]: series of events supported by (i.e., a sub-sequence of) a significant number of sequences.
– Whenever a sequence is a super-seq. of pre, it will also be a super-seq. of pre++post.
o Recurrent rule:
– Multiple occurrences of the rule’s premise and consequent, both within a sequence and across multiple sequences, are considered.
7
Extending Episode Rules [MTV97]
o Episode rule pre->post:
– Episode: series of events occurring close together (e.g., in a window).
– Whenever a window is a super-seq. of pre, it will also be a super-seq. of pre++post.
o Recurrent rule:
– Handles multiple sequences
– We want to break the window barrier
– It is hard to tell the right window size
– A lock may be separated from its unlock by an arbitrary number of events
– We mine a non-redundant set of rules
8
Preliminaries
9
Linear Temporal Logic (LTL)
o Formalism to precisely specify temporal requirements.
o It works on paths [HR03]
o There are a number of operators:
– G p: Globally, at every point in time, p holds
– F p: At that point in time or eventually (Finally), p holds
– X p: p holds at the neXt point in time

Rule              LTL
a -> b            G(a->XF(b))
<a,b> -> <c,d>    G(a->XG(b->XF(c^XF(d))))
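The two rule-to-LTL rows above suggest a regular pattern: each premise event opens an implication, and each consequent event is reached through XF. The sketch below generalizes that pattern to premises and consequents of any length. The generalization itself and the helper names `post_to_ltl` and `rule_to_ltl` are assumptions made for illustration; only the two rows shown on the slide are taken from the source.

```python
def post_to_ltl(post):
    """Nest the consequent events: [c, d] -> 'c^XF(d)' (assumed pattern)."""
    head, *rest = post
    return head if not rest else f"{head}^XF({post_to_ltl(rest)})"


def rule_to_ltl(pre, post):
    """Build an LTL string for the recurrent rule pre -> post.

    The last premise event guards 'XF(post)'; every earlier premise
    event wraps the formula built so far in 'ev->XG(...)'.
    """
    body = f"{pre[-1]}->XF({post_to_ltl(post)})"
    for ev in reversed(pre[:-1]):
        body = f"{ev}->XG({body})"
    return f"G({body})"


print(rule_to_ltl(["a"], ["b"]))            # G(a->XF(b))
print(rule_to_ltl(["a", "b"], ["c", "d"]))  # G(a->XG(b->XF(c^XF(d))))
```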
10
Checking or Verifying Temporal Logics
Program:
main(x){ if (lock=0) lock;use;unlock;lock; else for i: 1 to 10 lock;use;unlock}
Transform -> Automata Model (states: main, lock, use, unlock, …, end)
Possible Traces or Sequences:
main lock use unlock lock end
main lock use unlock lock use unlock end
main lock use unlock end
…
LTL property to check: <main,lock> -> <unlock,end>
Result of the check: Violation
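To make the check concrete, the sketch below evaluates the property <main,lock> -> <unlock,end> directly on the three finite traces listed on this slide. It is a simplified finite-trace reading of the rule, not a model checker over the automaton, and the function names are hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def occurrences(pattern, seq):
    """1-based points j where pattern is a sub-sequence of the j-prefix
    of seq and the j-th event equals the last event of pattern."""
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])]


def rule_holds(pre, post, trace):
    """After every point where pre has just occurred, post must still
    occur somewhere in the remainder of the trace."""
    return all(is_subsequence(post, trace[j:]) for j in occurrences(pre, trace))


traces = [
    "main lock use unlock lock end".split(),
    "main lock use unlock lock use unlock end".split(),
    "main lock use unlock end".split(),
]
results = [rule_holds(["main", "lock"], ["unlock", "end"], t) for t in traces]
print(results)  # [False, True, True]
```

Under this reading, the first trace is the violating one: its second lock is followed only by end, so no unlock comes after it.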
11
Concepts, Definitions
and Rules Semantics
12
Temporal Points
“Whenever a series of precedent events occurs at a point in time or temporal point, eventually another series of consequent events occurs.”
o Peek at interesting temporal points & see what series of events are likely to happen next
o Temporal points in a sequence S
– The indices in S, starting from 1.
– Consider a sequence <a,b,a,b,a,c>. There are 6 temporal points in the sequence.
– For a temporal point j in S = <a1,…,an>, the prefix <a1,…,aj> of S is called the j-prefix of S.
13
Occurrences & Instances
o Consider a pattern P and a sequence S
o The set of all occurrences of P in S, Occ(P,S), is the set: {j | P is a sub-sequence of the j-prefix of S && last(P) = S[j]}
o The set of all instances of P in S, Inst(P,S), is the set: {j-prefix of S | j is in Occ(P,S)}
o Consider the sequence <A,B,A,B,A,B>
– The set of occurrences of <A,B> is {2,4,6}
– The set of instances of <A,B> is {<A,B>, <A,B,A,B>, <A,B,A,B,A,B>}
– These correspond to the temporal points to be checked for rules with <A,B> as premise
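The definitions of Occ(P,S) and Inst(P,S) translate almost literally into code. The following is a minimal sketch with hypothetical function names that reproduces the <A,B,A,B,A,B> example; sequences are plain Python lists of event names.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def occurrences(pattern, seq):
    """Occ(P,S): 1-based temporal points j such that P is a sub-sequence
    of the j-prefix of S and last(P) equals S[j]."""
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])]


def instances(pattern, seq):
    """Inst(P,S): the j-prefixes of S for every j in Occ(P,S)."""
    return [seq[:j] for j in occurrences(pattern, seq)]


S = list("ABABAB")
print(occurrences(["A", "B"], S))  # [2, 4, 6]
print(instances(["A", "B"], S))
```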
14
Projected and Projected-all DB
o A database SeqDB projected on pattern P is defined as:
SeqDB_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is the minimal prefix of s containing P}

SeqDB
ID    Sequence
S1    <a,b,e,a,b,c>
S2    <a,c,b,e,a,e,b,c>

SeqDB_<a,b>
ID    Sequence
S1    <e,a,b,c>
S2    <e,a,e,b,c>
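The projection can be computed with a single left-to-right scan per sequence: match the events of P greedily, and keep whatever follows the last matched event. This is a sketch under the assumption that sequences not containing P are simply left out of the projected database; the function name `project` and the dict layout are illustrative choices.

```python
def project(seq_db, pattern):
    """SeqDB_P: for each sequence, the suffix that follows the minimal
    prefix containing the (non-empty) pattern.

    seq_db maps a sequence ID to a list of events. Sequences that do
    not contain the pattern are omitted (an assumption of this sketch).
    """
    projected = {}
    for sid, seq in seq_db.items():
        matched = 0  # number of pattern events matched so far
        for pos, ev in enumerate(seq):
            if ev == pattern[matched]:
                matched += 1
                if matched == len(pattern):
                    projected[sid] = seq[pos + 1:]
                    break
    return projected


seq_db = {"S1": list("abeabc"), "S2": list("acbeaebc")}
print(project(seq_db, ["a", "b"]))
# S1 -> e,a,b,c and S2 -> e,a,e,b,c, as in the table above
```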
15
Projected and Projected-all DB
o A database SeqDB projected-all on pattern P is defined as:
SeqDB^{all}_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is an instance of P}
o Returns the temporal points to check

SeqDB
ID    Sequence
S1    <a,b,e,a,b,c>
S2    <a,c,b,e,a,e,b,c>

SeqDB^{all}_<a,b>
ID      Sequence
S1i     <e,a,b,c>
S1ii    <c>
S2i     <e,a,e,b,c>
S2ii    <c>
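Projected-all differs from the plain projection only in that every instance of P contributes a suffix, not just the first one. A minimal sketch, with hypothetical names and a list of (ID, suffix) pairs as the assumed output layout:

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def occurrences(pattern, seq):
    """1-based temporal points where an instance of pattern ends."""
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])]


def project_all(seq_db, pattern):
    """SeqDB^{all}_P: one (ID, suffix) pair per instance of the pattern."""
    return [(sid, seq[j:])
            for sid, seq in seq_db.items()
            for j in occurrences(pattern, seq)]


seq_db = {"S1": list("abeabc"), "S2": list("acbeaebc")}
for sid, suffix in project_all(seq_db, ["a", "b"]):
    print(sid, suffix)
```

On the example database this yields four entries, two per sequence, matching S1i, S1ii, S2i and S2ii.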
16
Counting Supports and Confidence
o Consider the rule pre->post
o Sequence support (s-sup): The number of sequences where the premise pre appears.
o Instance support (i-sup): The number of instances of pre++post.
o Confidence (conf): The likelihood that post appears after pre. This can be found by computing the ratio:
conf(pre->post) = |Instances of pre where post eventually occurs afterwards| / |Instances of pre| = |(SeqDB^{all}_{pre})_{post}| / |SeqDB^{all}_{pre}|
17
Counting Supports and Confidence
Seq ID    Sequence
S1        <a,b,c,e,a,b,c>
S2        <a,c,b,e,a,e,b,c>

s-sup(<a,b>-><c>) = 2
i-sup(<a,b>-><c>) = 3
conf(<a,b>-><c>) = 1.0
conf(<a,b>-><e>) = 0.5
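The three statistics can be recomputed from their definitions to confirm the numbers on this slide. The sketch below is a naive, unoptimized reading of those definitions (function names are hypothetical, and returning 0.0 when pre has no instance is an assumption); it is not the mining algorithm itself.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def occurrences(pattern, seq):
    """1-based temporal points where an instance of pattern ends."""
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subsequence(pattern, seq[:j])]


def s_sup(pre, seq_db):
    """Number of sequences in which the premise appears."""
    return sum(1 for s in seq_db if is_subsequence(pre, s))


def i_sup(pre, post, seq_db):
    """Total number of instances of pre++post over all sequences."""
    return sum(len(occurrences(pre + post, s)) for s in seq_db)


def conf(pre, post, seq_db):
    """Fraction of instances of pre that are eventually followed by post."""
    total = satisfied = 0
    for s in seq_db:
        for j in occurrences(pre, s):
            total += 1
            satisfied += is_subsequence(post, s[j:])
    return satisfied / total if total else 0.0  # 0.0 on no instance: assumption


seq_db = [list("abceabc"), list("acbeaebc")]
print(s_sup(["a", "b"], seq_db))         # 2
print(i_sup(["a", "b"], ["c"], seq_db))  # 3
print(conf(["a", "b"], ["c"], seq_db))   # 1.0
print(conf(["a", "b"], ["e"], seq_db))   # 0.5
```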
18
Properties, Theorems, and
Algorithms
19
Apriori Properties – Support & Conf.
Theorem 1. Consider two rules Rx = p->c & Ry = q->c. If p ⊑ q (p is a sub-sequence of q) and s-sup(Rx) < min-s-sup, then s-sup(Ry) < min-s-sup.
Example: Rx: a -> z ; s-sup(Rx) < min-s-sup
Non-significant Rys: a,b -> z ; a,b,c -> z ; a,c -> z ; a,b,d -> z ; ….
Theorem 2. Consider two rules Rx = p->c & Ry = p->d. If c ⊑ d (c is a sub-sequence of d) and conf(Rx) < min-conf, then conf(Ry) < min-conf.
Example: Rx: a -> z ; conf(Rx) < min-conf
Rys: a -> b,z ; a -> b,c,z ; a -> c,z ; a -> b,d,z ; ….
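Theorem 1 is what lets a miner stop growing a premise as soon as its sequence support drops below the threshold. The sketch below shows that pruning in a simple level-wise enumeration that appends one event at a time. This enumeration is an illustrative stand-in chosen for brevity, not the search strategy used in the talk; `frequent_preconditions` and `max_len` are hypothetical names.

```python
def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def s_sup(pattern, seq_db):
    """Number of sequences that contain the pattern."""
    return sum(1 for s in seq_db if is_subsequence(pattern, s))


def frequent_preconditions(seq_db, min_s_sup, max_len):
    """Enumerate premises with s-sup >= min_s_sup, up to max_len events.

    A premise below the threshold is never extended: by Theorem 1,
    every super-sequence of it is below the threshold as well.
    """
    alphabet = sorted({ev for s in seq_db for ev in s})
    frequent, frontier = [], [()]
    for _ in range(max_len):
        next_frontier = []
        for p in frontier:
            for ev in alphabet:
                q = p + (ev,)
                if s_sup(q, seq_db) >= min_s_sup:
                    next_frontier.append(q)
        frequent.extend(next_frontier)
        frontier = next_frontier
    return frequent


print(frequent_preconditions([list("ab"), list("ac")], 2, 3))  # [('a',)]
```

In the usage line, b and c each appear in only one sequence, so they are discarded at length 1 and no premise containing them is ever generated.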
20
Rule Redundancy
o Consider two rules Rx = p->c and Ry = q->d. Rx is redundant if the following conditions hold:
1. Rx is a sub-seq. of Ry (i.e., p++c ⊑ q++d)
2. Rx & Ry have the same support and confidence values.
o Redundant rules are identified and removed early during the mining process.
o Example: given the rule a -> b,c,d, each of the rules a -> b ; a -> c ; a -> b,c ; a -> b,d ; …. is redundant iff its sup and conf are the same.
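The redundancy definition can be applied as a post-filter over a list of already-mined rules. The sketch below does exactly that with a quadratic comparison. The `Rule` record and `non_redundant` function are hypothetical, and two choices are assumptions of this sketch: "same support" is read as equal s-sup and i-sup, and only strictly longer rules may render a rule redundant, so rules with identical concatenations are both kept.

```python
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    pre: Tuple[str, ...]
    post: Tuple[str, ...]
    s_sup: int
    i_sup: int
    conf: float


def is_subsequence(pattern, seq):
    """True if pattern can be embedded in seq, preserving order."""
    it = iter(seq)
    return all(ev in it for ev in pattern)


def is_redundant(rx, ry):
    """rx is redundant w.r.t. ry: pre++post of rx is a proper sub-sequence
    of pre++post of ry, and the support and confidence values coincide."""
    x, y = rx.pre + rx.post, ry.pre + ry.post
    return (len(x) < len(y) and is_subsequence(x, y)
            and (rx.s_sup, rx.i_sup, rx.conf) == (ry.s_sup, ry.i_sup, ry.conf))


def non_redundant(rules):
    """Keep only rules not rendered redundant by another rule in the list."""
    return [rx for rx in rules if not any(is_redundant(rx, ry) for ry in rules)]


rules = [
    Rule(("a",), ("b",), 2, 3, 1.0),      # redundant w.r.t. a -> b,c
    Rule(("a",), ("b", "c"), 2, 3, 1.0),
    Rule(("a",), ("c",), 2, 4, 1.0),      # kept: i-sup differs
]
print(non_redundant(rules))
```

The numbers in the usage example are made up to exercise both branches and do not come from a real database.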
21
Theorem 3. Given two pre-conditions PX and PY where PX ⊑ PY (PX is a sub-sequence of PY), if SeqDB_PX = SeqDB_PY then, for all sequences of events post, the rule PX -> post is rendered redundant by PY -> post.
Example: <a,b,c,d> -> post
Redundant Rules: <a,d> -> post ; <a,c,d> -> post ; ….
Theorem 4. Given two rules RX (pre -> CX) and RY (pre -> CY), if CX ⊑ CY and (SeqDB^{all}_{pre})_CX = (SeqDB^{all}_{pre})_CY then RX is rendered redundant by RY and can be pruned.
Example: pre -> <b,c,d,e>
Redundant Rules: pre -> <c,d,e> ; pre -> <d,e> ; ….
22
Algorithm
o Step 1: Mine a pruned set of pre-conditions
– Satisfy min-s-sup threshold
– Use Theorems 1 & 3
o Step 2: For each pre-cond. pre, create SeqDB^{all}_{pre}.
o Step 3: Mine a pruned set of post-conditions
– Corresponding rules satisfy min-conf.
– Use Theorems 2 & 4
o Step 4: Remove rules that don’t satisfy min-i-sup.
o Step 5: Filter any remaining redundant rules.
23
Equiv. Proj DB & LS-Set Patterns
o From Theorem 3 (& 4), a pre- (post-) condition is not pruned iff there does not exist any super-sequence pattern having the same projected database.
o Also referred to as projected-database closed or LS-Set (Yan and Han, 2003)
o We generate this set by modifying BIDE (Wang and Han, 2004)
– Keep the search space pruning strategy
– Remove the closure checks
– Proof of completeness in technical report
24
Mine Pruned Pre-Conds
Mine Pruned Post-Conds
Check Instance Support & Remove Remaining Red. Rules
25
Performance &
Case Study
26
Synthetic Dataset D5C20N10S20
[Figure: Runtime (s), log scale, vs. min_s-sup (%) from 0.4 to 0.6; series Full and NR]
[Figure: |Rules|, log scale, vs. min_s-sup (%) from 0.4 to 0.6; series Full and NR]
147x Faster, 8500x More Compact
27
Gazelle Dataset (KDD Cup 2000)
Full set of significant rules is not mineable
[Figure: |Rules|, log scale, vs. min_s-sup (%) from 0.034 to 0.062; series Full and NR; Full not mineable]
[Figure: Runtime (s), log scale, vs. min_s-sup (%) from 0.034 to 0.062; series Full and NR; Full not mineable]
28
JBoss Security
Premise:
– XLoginConfImpl.getCfgEntry()
– AuthenticationInfo.getName()
Consequent:
– ClientLoginModule.initialize()
– ClientLoginModule.login()
– ClientLoginModule.commit()
– SecAssocActs.setPrincipalInfo()
– SetPrincipalInfoAction.run()
– SecAssocActs.pushSubjectCtx()
– SubjectThdLocalStack.push()
– SimplePrincipal.toString()
– SecAssoc.getPrincipal()
– SecAssoc.getCredential()
– SecAssoc.getPrincipal()
– SecAssoc.getCredential()
Whenever login configuration information is checked, eventually invocations of authentication events, binding of principal to subject, and utilization of subject & principal information occur.
29
Conclusion
o We propose a novel framework to mine a non-redundant set of significant recurrent rules: “Whenever a series of precedent events occurs, eventually a series of consequent events occurs”
o Employ 2 apriori properties and 2 redundancy theorems
o Major speedup and reduction in the number of rules from the non-redundant rule mining strategy.
o We show the utility in mining the behavior of JBoss Security
Future Work
o Improve mining speed
o More case studies and applications to DM/SE problems