Association Rule and Sequential Pattern Mining for Episode Extraction
Jonathan Yip
Introduction to Association Rules
• Associating multiple objects/events together
• Example: a customer who buys a laptop also buys a wireless LAN card (a 2-itemset)

Laptop → Wireless LAN Card
Association Rule (con't)
Measures of Rule Interestingness
• Support = P(Laptop ∪ LAN card)
  Probability that all studied items occur together (the customer buys both)
• Confidence = P(LAN card | Laptop)
  = P(Laptop ∪ LAN card) / P(Laptop)
  Conditional probability that a customer who bought a laptop also bought a wireless LAN card

Thresholds:
Minimum Support: 25%
Minimum Confidence: 30%

Laptop → Wireless LAN Card [Support = 40%, Confidence = 60%]
Association Rule (e.g.)

TID | Items
1 | Bread, Coke, Milk
2 | Chips, Bread
3 | Coke, Eggs, Milk
4 | Bread, Eggs, Milk, Coke
5 | Coke, Eggs, Milk

Min_Sup = 25%
Min_Conf = 25%

Rule: Milk → Eggs
Support: P(Milk ∪ Eggs) = 3/5 = 60%
Confidence: P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk)
P(Milk) = 4/5 = 80%
P(Eggs | Milk) = 60% / 80% = 75%
(75% confidence that a customer who buys milk also buys eggs)
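The support and confidence arithmetic above can be checked with a short script. This is a sketch: the transaction table is the one from this slide, and the function names `support` and `confidence` are made up for illustration.

```python
# Support and confidence for the rule Milk -> Eggs over the five
# example transactions above; itemsets are plain Python sets.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = support(A and C together) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Milk", "Eggs"}, transactions))       # 0.6  (60%)
print(confidence({"Milk"}, {"Eggs"}, transactions))  # 0.75 (75%)
```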
Types of Association
• Boolean vs. Quantitative
• Single dimension vs. Multiple dimension
• Single level vs. Multiple level Analysis

Examples:
1.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW Sedan")
2.) Income(X, ">50K") → Buys(X, "BMW Sedan")
3.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW 540i")
Association Rule (DB Miner)
Apriori Algorithm
• Purpose: to mine frequent itemsets for boolean association rules
• Uses prior knowledge to predict future values
• An itemset has to be frequent (Support > Min_Sup)
• Anti-monotone property: if a set cannot pass the min_sup test, all of its supersets will fail as well
Apriori Algorithm Pseudo-Code
Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
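The pseudo-code above can be fleshed out into a runnable sketch. This is illustrative, not the slides' implementation: it combines the candidate-generation join with the anti-monotone prune described on the previous slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch following the pseudo-code above.

    `transactions` is a list of sets; `min_sup` is a fractional threshold.
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # C1: candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count candidates and keep those meeting min_sup (Lk)
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent.update(survivors)
        # Join step: build (k+1)-candidates from pairs of frequent k-itemsets
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        # Prune step (anti-monotone): every k-subset must itself be frequent
        current = {
            c for c in current
            if all(frozenset(s) in survivors for s in combinations(c, k))
        }
        k += 1
    return frequent
```

Run on the bread/coke/milk example with min_sup = 0.25, this reproduces the frequent itemsets derived step by step in the next slides, e.g. {Bread, Coke, Milk} at 40% and {Coke, Milk, Eggs} at 60%.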
Apriori Algorithm Procedures

Example revisited: 5 items, 5 transactions
Min_Sup = 25%
Min Support Count = 2 items
Min_Conf = 25%

TID | Items
1 | Bread, Coke, Milk
2 | Chips, Bread
3 | Coke, Eggs, Milk
4 | Bread, Eggs, Milk, Coke
5 | Coke, Eggs, Milk

Step 1: Scan & find the support of each item (C1):
Items | support
Bread | 3
Coke | 4
Milk | 4
Chips | 1 (fail)
Eggs | 3

Step 2: Compare with Min_Sup and eliminate (prune) items < Min_Sup (L1):
Items | support
Bread | 3
Coke | 4
Milk | 4
Eggs | 3
Apriori Algorithm (con't)

Step 3: Join (L1 ⋈ L1), then repeat the elimination step: prune items < min_sup (C2):

L1 set: Bread, Coke, Milk, Eggs

Supports (C2):
Bread & Coke: 2/5 = 40%
Bread & Milk: 2/5 = 40%
Bread & Eggs: 1/5 = 20% (fail)
Coke & Milk: 4/5 = 80%
Coke & Eggs: 2/5 = 40%
Milk & Eggs: 3/5 = 60%

L2 set: Bread & Coke, Bread & Milk, Coke & Milk, Coke & Eggs, Milk & Eggs

Join L2 ⋈ L2, compare with Min_Sup, then eliminate (prune) items < Min_sup:
Items | Support
Bread & Coke & Milk | 2
Bread & Coke & Eggs | 1 (fail)
Bread & Coke & Milk & Eggs | 1 (fail)
Coke & Milk & Eggs | 3

Conclusion:
• Bread & Coke & Milk have a strong correlation
• Coke & Milk & Eggs have a strong correlation
Sequential Pattern Mining

Introduction
• Mining of frequently occurring patterns related to time or other sequences

Examples
• 70% of customers rent "Star Wars", then "The Empire Strikes Back", and then "Return of the Jedi"

Star Wars → Empire Strikes Back → Return of the Jedi

Applications
• Intrusion detection on computers
• Web access patterns
• Predicting disease from a sequence of symptoms
• Many other areas
Sequential Pattern Mining (con't)

Steps:
• Sort Phase: sort by Cust_ID, Transaction_ID
• Litemset Phase: find large itemsets
• Transform Phase: eliminate items < min_sup
• Sequence Phase: find the desired sequences
• Maximal Phase: find the maximal sequences among the set of large sequences
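The sort phase can be sketched in a few lines of Python. The rows below mirror the example database on the next slide; the tuple layout and the name `to_sequences` are assumptions for illustration.

```python
from collections import defaultdict

# Sort phase sketch: group transactions by customer, order by time,
# producing each customer's sequence of itemsets.
rows = [
    (1, "2002-06-25", {3}), (1, "2002-06-30", {9}),
    (2, "2002-06-10", {1, 2}), (2, "2002-06-15", {3}), (2, "2002-06-20", {4, 6, 7}),
    (3, "2002-06-25", {3, 5, 7}),
    (4, "2002-06-25", {3}), (4, "2002-06-30", {4, 7}), (4, "2002-07-25", {9}),
    (5, "2002-06-12", {9}),
]

def to_sequences(rows):
    """Map each customer ID to their time-ordered list of itemsets."""
    by_cust = defaultdict(list)
    for cust, time, items in sorted(rows):  # sorts by (cust, time)
        by_cust[cust].append(items)
    return dict(by_cust)

print(to_sequences(rows)[4])  # [{3}, {4, 7}, {9}]
```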
Sequential Pattern Mining (con't)

Example: database sorted by Cust_ID & Transaction Time (Min_sup = 25%)

Cust ID | Trans. Time | Items Bought
1 | June 25 '02 | 3
1 | June 30 '02 | 9
2 | June 10 '02 | 1, 2
2 | June 15 '02 | 3
2 | June 20 '02 | 4, 6, 7
3 | June 25 '02 | 3, 5, 7
4 | June 25 '02 | 3
4 | June 30 '02 | 4, 7
4 | July 25 '02 | 9
5 | June 12 '02 | 9

Organized format with Cust_ID:
Cust ID | Original Sequence
1 | {(3) (9)}
2 | {(1,2) (3) (4,6,7)}
3 | {(3,5,7)}
4 | {(3) (4,7) (9)}
5 | {(9)}
Sequential Pattern Mining (con't)

Step 1: Sort (examples of several transactions):
Cust ID | Original Sequence | Items to study | Support Count
1 | {(3) (9)} | {(3)} {(9)} {(3,9)} | 3, 3, 2
5 | {(9)} | {(9)} | 1

Conclusion:
Sequences > 25% Min_sup: {(3) (9)} and {(3) (4,7)}

Step 2: Litemset phase
Large Item | Mapped To
(3) | 1
(4) | 2
(7) | 3
(4,7) | 4
(9) | 5

Data sequence of each customer:
Cust ID | Original Sequence | Transformed Cust. Sequence | After mapping
1 | {(3) (9)} | {(3)} {(9)} | ({1} {5})
2 | {(1,2) (3) (4,6,7)} | {(3)} {(4) (7) (4,7)} | ({1} {2 3 4})
3 | {(3,5,7)} | {(3) (7)} | ({1, 3})
4 | {(3) (4,7) (9)} | {(3)} {(4) (7) (4,7)} {(9)} | ({1} {2 3 4} {5})
5 | {(9)} | {(9)} | ({5})

Sequences < min_support: {(1,2) (3)}, {(3)}, {(4)}, {(7)}, {(9)}, {(3) (4)}, {(3) (7)}, {(4) (7)}
Support > 25%: {(3) (9)} and {(3) (4,7)}
The rightmost column shows each customer's buying pattern.
Sequential Pattern Mining Algorithm

Algorithms
• AprioriAll
  Counts all large sequences, including those that are not maximal

Pseudo-code:
Ck: candidate sequence of size k
Lk: frequent or large sequence of size k

L1 = {large 1-sequences}; // result of litemset phase
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1;
    for each customer sequence c in database do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with min_support
end
Answer = maximal sequences in ∪k Lk;
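The counting step above hinges on the test of whether a candidate is "contained in" a customer sequence. A minimal Python sketch of that test and of sequence support, run against the example customer sequences from the earlier slides (`contains` and `seq_support` are illustrative names):

```python
def contains(cust_seq, candidate):
    """True if `candidate` (list of sets) is a subsequence of `cust_seq`:
    each candidate element must be a subset of a strictly later transaction."""
    pos = 0
    for elem in candidate:
        while pos < len(cust_seq) and not elem <= cust_seq[pos]:
            pos += 1
        if pos == len(cust_seq):
            return False
        pos += 1  # the next element must match a later transaction
    return True

def seq_support(candidate, db):
    """Fraction of customer sequences that contain `candidate`."""
    return sum(contains(s, candidate) for s in db) / len(db)

# Customer sequences from the example database
db = [
    [{3}, {9}],
    [{1, 2}, {3}, {4, 6, 7}],
    [{3, 5, 7}],
    [{3}, {4, 7}, {9}],
    [{9}],
]
print(seq_support([{3}, {9}], db))     # 0.4 -> frequent at Min_sup = 25%
print(seq_support([{3}, {4, 7}], db))  # 0.4 -> frequent at Min_sup = 25%
```

Both results match the frequent sequences {(3) (9)} and {(3) (4,7)} found in the worked example.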
• AprioriSome
  Generates every candidate sequence, but skips counting some large sequences (Forward Phase). Then discards candidates that are not maximal and counts the remaining large sequences (Backward Phase).
Episode Extraction
• A partially ordered collection of events occurring together
• Goal: to analyze sequences of events and to discover recurrent episodes
• First find small frequent episodes, then progressively look for larger episodes
• Types of episodes:
  Serial – E occurs before F (E → F)
  Parallel – no constraint on the relative order of A & B
  Non-serial/non-parallel – the occurrence of A & B precedes C
Episode Extraction (con't)

A sequence of events:
E D F A B C E F C D B A D C E F C B E A E C F A
30   35   40   45   50   55   60   65

s = {(A1,t1), (A2,t2), …, (An,tn)}, e.g. s = {(E,31), (D,32), (F,33), …, (A,65)}

• A time window is set to bound the interestingness
  W(s,5) slides along and takes snapshots of the whole sequence
  e.g. the window (w, 35, 40) contains the events A, B, C, E; some episodes occur within it and others do not
• The user specifies how many windows an episode has to occur in to be frequent

Formula:
fr(α, s, win) = |{w ∈ W(s, win) : α occurs in w}| / |W(s, win)|
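The window-frequency formula can be sketched for a parallel episode as follows. This is an assumption-laden illustration: the windowing convention (sliding half-open windows that overlap the sequence) follows Mannila et al., and the function name is made up.

```python
def window_frequency(events, episode, win):
    """fr(episode, s, win): fraction of width-`win` sliding windows over
    event sequence s that contain every event type of a parallel episode.
    `events` is a list of (event_type, time) pairs."""
    times = [t for _, t in events]
    lo, hi = min(times), max(times)
    # Half-open windows [start, start + win) that overlap the sequence
    starts = range(lo - win + 1, hi + 1)
    hits = 0
    for s in starts:
        inside = {e for e, t in events if s <= t < s + win}
        if set(episode) <= inside:
            hits += 1
    return hits / len(starts)

events = [("E", 31), ("D", 32), ("F", 33)]
print(window_frequency(events, {"E", "F"}, 3))  # 0.2 (1 of 5 windows)
```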
Episode Extraction (con't)

Minimal occurrences
• Look at exact occurrences of episodes & the relationships between occurrences
• Can modify the width of the window
• Eliminates unnecessary repetition of the recognition effort
• Example: mo(α) = {[35,38), [46,48), [57,60)}
• When one episode is a subepisode of another, this relation is used for discovering all frequent episodes
Applications of Episode Extraction
• Computer Security
• Bioinformatics
• Finance
• Market Analysis
• And more…
References
• Discovery of Frequent Episodes in Event Sequences (Mannila, Toivonen, Verkamo)
• Mining Sequential Patterns (Agrawal, Srikant)
• Principles of Data Mining (Hand, Mannila, Smyth), 2001
• Data Mining Concepts and Techniques (Han, Kamber), 2001
END