Association Rule and Sequential Pattern Mining for Episode Extraction
Jonathan Yip
Introduction to Association Rules
• Associating multiple objects/events together
• Example: a customer who buys a laptop also buys a wireless LAN card (a 2-itemset)

Laptop → Wireless LAN Card
Association Rule (con't)
Measures of Rule Interestingness
• Support = P(Laptop ∪ LAN card)
  Probability that all studied items occur together (the customer buys both)
• Confidence = P(LAN card | Laptop)
  = P(Laptop ∪ LAN card) / P(Laptop)
  Conditional probability that a customer who bought a laptop also bought a wireless LAN card

Thresholds:
Minimum Support: 25%
Minimum Confidence: 30%

Laptop → Wireless LAN Card [Support = 40%, Confidence = 60%]
Association Rule (e.g.)

TID | Items
1 | Bread, Coke, Milk
2 | Chips, Bread
3 | Coke, Eggs, Milk
4 | Bread, Eggs, Milk, Coke
5 | Coke, Eggs, Milk

Min_Sup = 25%
Min_Conf = 25%

Rule: Milk → Eggs
Support: P(Milk ∪ Eggs) = 3/5 = 60%
Confidence: P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk)
P(Milk) = 4/5 = 80%
P(Eggs | Milk) = 60% / 80% = 75%
(75% confidence that a customer who buys milk also buys eggs)
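The support and confidence arithmetic above can be checked with a short script. This is a sketch: the transaction table is the one from this slide, and the function names `support` and `confidence` are made up for illustration.

```python
# Support and confidence for the rule Milk -> Eggs over the five
# example transactions above; itemsets are plain Python sets.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A and C together) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Milk", "Eggs"}, transactions))       # 0.6  (60%)
print(confidence({"Milk"}, {"Eggs"}, transactions))  # 0.75 (75%)
```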
Types of Association
• Boolean vs. Quantitative
• Single dimension vs. Multiple dimension
• Single level vs. Multiple level Analysis

Examples:
1.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW Sedan")
2.) Income(X, ">50K") → Buys(X, "BMW Sedan")
3.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW 540i")
Association Rule (DB Miner)
Apriori Algorithm
• Purpose: to mine frequent itemsets for boolean association rules
• Uses prior knowledge to predict future values
• An itemset has to be frequent (Support > Min_Sup)
• Anti-monotone property: if a set cannot pass the min_sup test, all of its supersets will fail as well
Apriori Algorithm Pseudo-Code
Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
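The pseudo-code above can be fleshed out into a runnable sketch. This is illustrative, not the slides' implementation: it combines the candidate-generation join with the anti-monotone prune described on the previous slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch following the pseudo-code above.

    `transactions` is a list of sets; `min_sup` is a fractional threshold.
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # C1: candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count candidates and keep those meeting min_sup (Lk)
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent.update(survivors)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        # Prune step (anti-monotone): every k-subset must itself be frequent
        current = {
            c for c in current
            if all(frozenset(s) in survivors for s in combinations(c, k))
        }
        k += 1
    return frequent
```

Run on the bread/coke/milk example with min_sup = 0.25, this reproduces the frequent itemsets derived step by step in the next slides, e.g. {Bread, Coke, Milk} at 40% and {Coke, Milk, Eggs} at 60%.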
Apriori Algorithm Procedures

Example revisited: 5 items, 5 transactions
Min_Sup = 25%
Min Support Count = 2 items
Min_Conf = 25%

TID | Items
1 | Bread, Coke, Milk
2 | Chips, Bread
3 | Coke, Eggs, Milk
4 | Bread, Eggs, Milk, Coke
5 | Coke, Eggs, Milk

Step 1: Scan & find the support of each item (C1):
Items | support
Bread | 3
Coke | 4
Milk | 4
Chips | 1 (fail)
Eggs | 3

Step 2: Compare with Min_Sup and eliminate (prune) items < Min_Sup (L1):
Items | support
Bread | 3
Coke | 4
Milk | 4
Eggs | 3
Apriori Algorithm (con't)

Step 3: Join (L1 ⋈ L1), then repeat the elimination step: prune items < min_sup (C2):

L1 set: Bread, Coke, Milk, Eggs

Supports (C2):
Bread & Coke: 2/5 = 40%
Bread & Milk: 2/5 = 40%
Bread & Eggs: 1/5 = 20% (fail)
Coke & Milk: 4/5 = 80%
Coke & Eggs: 2/5 = 40%
Milk & Eggs: 3/5 = 60%

L2 set: Bread & Coke, Bread & Milk, Coke & Milk, Coke & Eggs, Milk & Eggs

Join L2 ⋈ L2, compare with Min_Sup, then eliminate (prune) items < Min_sup:
Items | Support
Bread & Coke & Milk | 2
Bread & Coke & Eggs | 1 (fail)
Bread & Coke & Milk & Eggs | 1 (fail)
Coke & Milk & Eggs | 3

Conclusion:
• Bread & Coke & Milk have a strong correlation
• Coke & Milk & Eggs have a strong correlation
Sequential Pattern Mining

Introduction
• Mining of frequently occurring patterns related to time or other sequences

Examples
• 70% of customers rent "Star Wars", then "The Empire Strikes Back", and then "Return of the Jedi"

Star Wars → Empire Strikes Back → Return of the Jedi

Applications
• Intrusion detection on computers
• Web access patterns
• Predicting disease from a sequence of symptoms
• Many other areas
Sequential Pattern Mining (con't)

Steps:
• Sort Phase: sort by Cust_ID, Transaction_ID
• Litemset Phase: find large itemsets
• Transform Phase: eliminate items < min_sup
• Sequence Phase: find the desired sequences
• Maximal Phase: find the maximal sequences among the set of large sequences
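The sort phase can be sketched in a few lines of Python. The rows below mirror the example database on the next slide; the tuple layout and the name `to_sequences` are assumptions for illustration.

```python
from collections import defaultdict

# Sort phase sketch: group transactions by customer, order by time,
# producing each customer's sequence of itemsets.
rows = [
    (1, "2002-06-25", {3}), (1, "2002-06-30", {9}),
    (2, "2002-06-10", {1, 2}), (2, "2002-06-15", {3}), (2, "2002-06-20", {4, 6, 7}),
    (3, "2002-06-25", {3, 5, 7}),
    (4, "2002-06-25", {3}), (4, "2002-06-30", {4, 7}), (4, "2002-07-25", {9}),
    (5, "2002-06-12", {9}),
]

def to_sequences(rows):
    """Map each customer ID to their time-ordered list of itemsets."""
    by_cust = defaultdict(list)
    for cust, time, items in sorted(rows):  # sorts by (cust, time)
        by_cust[cust].append(items)
    return dict(by_cust)

print(to_sequences(rows)[4])  # [{3}, {4, 7}, {9}]
```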
Sequential Pattern Mining (con't)

Example: database sorted by Cust_ID & Transaction Time (Min_sup = 25%)

Cust ID | Trans. Time | Items Bought
1 | June 25 '02 | 3
1 | June 30 '02 | 9
2 | June 10 '02 | 1, 2
2 | June 15 '02 | 3
2 | June 20 '02 | 4, 6, 7
3 | June 25 '02 | 3, 5, 7
4 | June 25 '02 | 3
4 | June 30 '02 | 4, 7
4 | July 25 '02 | 9
5 | June 12 '02 | 9

Organized format with Cust_ID:
Cust ID | Original Sequence
1 | {(3) (9)}
2 | {(1,2) (3) (4,6,7)}
3 | {(3,5,7)}
4 | {(3) (4,7) (9)}
5 | {(9)}
Sequential Pattern Mining (con't)

Step 1: Sort (examples of several transactions):
Cust ID | Original Sequence | Items to study | Support Count
1 | {(3) (9)} | {(3)} {(9)} {(3,9)} | 3, 3, 2
5 | {(9)} | {(9)} | 1

Conclusion:
Sequences > 25% Min_sup: {(3) (9)} and {(3) (4,7)}

Step 2: Litemset phase
Large Item | Mapped To
(3) | 1
(4) | 2
(7) | 3
(4,7) | 4
(9) | 5

Data sequence of each customer:
Cust ID | Original Sequence | Transformed Cust. Sequence | After mapping
1 | {(3) (9)} | {(3)} {(9)} | ({1} {5})
2 | {(1,2) (3) (4,6,7)} | {(3)} {(4) (7) (4,7)} | ({1} {2 3 4})
3 | {(3,5,7)} | {(3) (7)} | ({1, 3})
4 | {(3) (4,7) (9)} | {(3)} {(4) (7) (4,7)} {(9)} | ({1} {2 3 4} {5})
5 | {(9)} | {(9)} | ({5})

Sequences < min_support: {(1,2) (3)}, {(3)}, {(4)}, {(7)}, {(9)}, {(3) (4)}, {(3) (7)}, {(4) (7)}
Support > 25%: {(3) (9)} and {(3) (4,7)}
The rightmost column shows each customer's buying pattern.
Sequential Pattern Mining Algorithm

Algorithms
• AprioriAll
  Counts all large sequences, including those that are not maximal

Pseudo-code:
Ck: candidate sequence of size k
Lk: frequent or large sequence of size k

L1 = {large 1-sequences}; // result of litemset phase
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1;
    for each customer sequence c in database do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with min_support
end
Answer = maximal sequences in ∪k Lk;
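The counting step above hinges on the test of whether a candidate is "contained in" a customer sequence. A minimal Python sketch of that test and of sequence support, run against the example customer sequences from the earlier slides (`contains` and `seq_support` are illustrative names):

```python
def contains(cust_seq, candidate):
    """True if `candidate` (list of sets) is a subsequence of `cust_seq`:
    each candidate element must be a subset of a strictly later transaction."""
    pos = 0
    for elem in candidate:
        while pos < len(cust_seq) and not elem <= cust_seq[pos]:
            pos += 1
        if pos == len(cust_seq):
            return False
        pos += 1  # the next element must match a later transaction
    return True

def seq_support(candidate, db):
    """Fraction of customer sequences that contain `candidate`."""
    return sum(contains(s, candidate) for s in db) / len(db)

# Customer sequences from the example database
db = [
    [{3}, {9}],
    [{1, 2}, {3}, {4, 6, 7}],
    [{3, 5, 7}],
    [{3}, {4, 7}, {9}],
    [{9}],
]
print(seq_support([{3}, {9}], db))     # 0.4 -> frequent at Min_sup = 25%
print(seq_support([{3}, {4, 7}], db))  # 0.4 -> frequent at Min_sup = 25%
```

Both results match the frequent sequences {(3) (9)} and {(3) (4,7)} found in the worked example.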
• AprioriSome
  Generates every candidate sequence, but skips counting some large sequences (Forward Phase). Then discards candidates that are not maximal and counts the remaining large sequences (Backward Phase).
Episode Extraction
• A partially ordered collection of events occurring together
• Goal: to analyze sequences of events and to discover recurrent episodes
• First find small frequent episodes, then progressively look for larger episodes
• Types of episodes:
  Serial – E occurs before F (E → F)
  Parallel – no constraint on the relative order of A & B
  Non-serial/non-parallel – the occurrence of A & B precedes C
Episode Extraction (con't)

A sequence of events:
E D F A B C E F C D B A D C E F C B E A E C F A
30   35   40   45   50   55   60   65

s = {(A1,t1), (A2,t2), …, (An,tn)}, e.g. s = {(E,31), (D,32), (F,33), …, (A,65)}

• A time window is set to bound the interestingness
  W(s,5) slides along and takes snapshots of the whole sequence
  e.g. the window (w, 35, 40) contains the events A, B, C, E; some episodes occur within it and others do not
• The user specifies how many windows an episode has to occur in to be frequent

Formula:
fr(α, s, win) = |{w ∈ W(s, win) : α occurs in w}| / |W(s, win)|
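The window-frequency formula can be sketched for a parallel episode as follows. This is an assumption-laden illustration: the windowing convention (sliding half-open windows that overlap the sequence) follows Mannila et al., and the function name is made up.

```python
def window_frequency(events, episode, win):
    """fr(episode, s, win): fraction of width-`win` sliding windows over
    event sequence s that contain every event type of a parallel episode.
    `events` is a list of (event_type, time) pairs."""
    times = [t for _, t in events]
    lo, hi = min(times), max(times)
    # Half-open windows [start, start + win) that overlap the sequence
    starts = range(lo - win + 1, hi + 1)
    hits = 0
    for s in starts:
        inside = {e for e, t in events if s <= t < s + win}
        if set(episode) <= inside:
            hits += 1
    return hits / len(starts)

events = [("E", 31), ("D", 32), ("F", 33)]
print(window_frequency(events, {"E", "F"}, 3))  # 0.2 (1 of 5 windows)
```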
Episode Extraction (con't)

Minimal occurrences
• Look at exact occurrences of episodes & the relationships between occurrences
• Can modify the width of the window
• Eliminates unnecessary repetition of the recognition effort
• Example: mo(α) = {[35,38), [46,48), [57,60)}
• When one episode is a subepisode of another, this relation is used for discovering all frequent episodes
Applications of Episode Extraction
• Computer Security
• Bioinformatics
• Finance
• Market Analysis
• And more…
References
• Discovery of Frequent Episodes in Event Sequences (Mannila, Toivonen, Verkamo)
• Mining Sequential Patterns (Agrawal, Srikant)
• Principles of Data Mining (Hand, Mannila, Smyth), 2001
• Data Mining Concepts and Techniques (Han, Kamber), 2001
END