Page 1: Rule Mining by akshay rele

Association Rule Mining

- AKSHAY RELE, M.E. CMPN

ROLL NO: 03

Page 2: Rule Mining by akshay rele

What is Association Rule Mining?

• It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories.

• Association rules are widely used in areas such as telecommunication networks, market and risk management, and inventory control.

• Association rule mining finds all association rules that satisfy a predefined minimum support and minimum confidence in a given database.

Page 3: Rule Mining by akshay rele

The Task

Two ways of defining the task:

• General
– Input: a collection of instances
– Output: rules to predict the values of any attribute(s) (not just the class attribute) from the values of other attributes
– E.g., if temperature = cool then humidity = normal
– If the right-hand side of a rule has only the class attribute, then the rule is a classification rule

• Specific: market-basket analysis
– Input: a collection of transactions
– Output: rules to predict the occurrence of any item(s) from the occurrence of other items in a transaction
– E.g., {Milk, Cheese} → {Kellogs}

• General rule structure: Antecedent → Consequent

Page 4: Rule Mining by akshay rele

The model: rules

• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
• An itemset is a set of items.
– E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
– E.g., {milk, bread, cereal} is a 3-itemset.

Page 5: Rule Mining by akshay rele

Rule strength measures

• Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
– sup = Pr(X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
– conf = Pr(Y | X)
• An association rule is a pattern stating that when X occurs, Y occurs with a certain probability.

Page 6: Rule Mining by akshay rele

Support and Confidence

• Support count: The support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions.
• Then,

support = (X ∪ Y).count / n

confidence = (X ∪ Y).count / X.count
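As a quick illustration of these two formulas, here is a minimal Python sketch; the function names are ours, and the sample transactions are the market-basket table that appears later in the deck:

```python
def support_count(itemset, transactions):
    """X.count: the number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    """support(X -> Y) = (X u Y).count / n"""
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """confidence(X -> Y) = (X u Y).count / X.count"""
    return support_count(X | Y, transactions) / support_count(X, transactions)

T = [{"Bread", "Milk"},
     {"Bread", "Cheese", "Kellogs", "Eggs"},
     {"Milk", "Cheese", "Kellogs", "Coke"},
     {"Bread", "Milk", "Cheese", "Kellogs"},
     {"Bread", "Milk", "Cheese", "Coke"}]

print(support({"Milk", "Cheese"}, {"Kellogs"}, T))     # 0.4
print(confidence({"Milk", "Cheese"}, {"Kellogs"}, T))  # 0.666...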

Page 7: Rule Mining by akshay rele

Goal and key features

• Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).

• Key features:
– Completeness: find all rules.
– No target item(s) on the right-hand side.
– Mining with data on hard disk (not in memory).

Page 8: Rule Mining by akshay rele

Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Cheese, Kellogs, Eggs
3    Milk, Cheese, Kellogs, Coke
4    Bread, Milk, Cheese, Kellogs
5    Bread, Milk, Cheese, Coke

Example of Association Rules:

{Cheese} → {Kellogs}
{Milk, Bread} → {Eggs, Coke}
{Kellogs, Bread} → {Milk}

Implication means co-occurrence.

Page 9: Rule Mining by akshay rele

Definition: Frequent Itemset

• Itemset
– A collection of one or more items
– Example: {Milk, Bread, Cheese}
– A k-itemset is an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Cheese}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Cheese}) = 2/5
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Cheese, Kellogs, Eggs
3    Milk, Cheese, Kellogs, Coke
4    Bread, Milk, Cheese, Kellogs
5    Bread, Milk, Cheese, Coke

Page 10: Rule Mining by akshay rele

Definition: Association Rule

TID  Items
1    Bread, Milk
2    Bread, Cheese, Kellogs, Eggs
3    Milk, Cheese, Kellogs, Coke
4    Bread, Milk, Cheese, Kellogs
5    Bread, Milk, Cheese, Coke

• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Cheese} → {Kellogs}

• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Cheese} → {Kellogs}

s = σ(Milk, Cheese, Kellogs) / |T| = 2/5 = 0.4

c = σ(Milk, Cheese, Kellogs) / σ(Milk, Cheese) = 2/3 = 0.67

Page 11: Rule Mining by akshay rele

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds

Page 12: Rule Mining by akshay rele

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Cheese, Kellogs, Eggs
3    Milk, Cheese, Kellogs, Coke
4    Bread, Milk, Cheese, Kellogs
5    Bread, Milk, Cheese, Coke

Example of Rules:

{Milk, Cheese} → {Kellogs} (s=0.4, c=0.67)
{Milk, Kellogs} → {Cheese} (s=0.4, c=1.0)
{Cheese, Kellogs} → {Milk} (s=0.4, c=0.67)
{Kellogs} → {Milk, Cheese} (s=0.4, c=0.67)
{Cheese} → {Milk, Kellogs} (s=0.4, c=0.5)
{Milk} → {Cheese, Kellogs} (s=0.4, c=0.5)

Page 13: Rule Mining by akshay rele

Observations:

• All the above rules are binary partitions of the same itemset:

{Milk, Cheese, Kellogs}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Page 14: Rule Mining by akshay rele

Mining Association Rules

• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

Page 15: Rule Mining by akshay rele

Power set

• Given a set S, the power set P is the set of all subsets of S.
• Known property of power sets:
– If S has n elements, P has N = 2^n elements.
• Examples:
– For S = {}, P = {{}}, N = 2^0 = 1
– For S = {Milk}, P = {{}, {Milk}}, N = 2^1 = 2
– For S = {Milk, Cheese}, P = {{}, {Milk}, {Cheese}, {Milk, Cheese}}, N = 2^2 = 4
– For S = {Milk, Cheese, Kellogs}, P = {{}, {Milk}, {Cheese}, {Kellogs}, {Milk, Cheese}, {Cheese, Kellogs}, {Kellogs, Milk}, {Milk, Cheese, Kellogs}}, N = 2^3 = 8
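For illustration, a small Python sketch that enumerates a power set with itertools (names are ours):

```python
from itertools import chain, combinations

def power_set(S):
    """All subsets of S, from the empty set up to S itself."""
    items = list(S)
    return list(chain.from_iterable(combinations(items, k)
                                    for k in range(len(items) + 1)))

print(power_set({"Milk", "Cheese"}))
# e.g. [(), ('Milk',), ('Cheese',), ('Milk', 'Cheese')]  -> N = 2^2 = 4
print(len(power_set({"Milk", "Cheese", "Kellogs"})))  # 2^3 = 8
```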

Page 16: Rule Mining by akshay rele

Brute Force approach to Frequent Itemset Generation

• For an itemset with 3 elements, we have 8 subsets.
– Each subset is a candidate frequent itemset, which needs to be matched against each transaction.

TID  Items
1    Bread, Milk
2    Bread, Cheese, Kellogs, Eggs
3    Milk, Cheese, Kellogs, Coke
4    Bread, Milk, Cheese, Kellogs
5    Bread, Milk, Cheese, Coke

1-itemsets:
Itemset    Count
{Milk}     4
{Cheese}   4
{Kellogs}  3

2-itemsets:
Itemset            Count
{Milk, Cheese}     3
{Cheese, Kellogs}  3
{Kellogs, Milk}    2

3-itemsets:
Itemset                  Count
{Milk, Cheese, Kellogs}  2

Important observation: counts of subsets can't be smaller than the count of an itemset!
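A minimal Python sketch of this brute-force counting over the transaction table above (names are ours):

```python
from itertools import combinations

T = [{"Bread", "Milk"},
     {"Bread", "Cheese", "Kellogs", "Eggs"},
     {"Milk", "Cheese", "Kellogs", "Coke"},
     {"Bread", "Milk", "Cheese", "Kellogs"},
     {"Bread", "Milk", "Cheese", "Coke"}]

# Match every non-empty subset of {Milk, Cheese, Kellogs} against each transaction.
items = ["Milk", "Cheese", "Kellogs"]
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(1 for t in T if set(candidate) <= t)
        print(set(candidate), count)   # reproduces the counts in the tables above
```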

Page 17: Rule Mining by akshay rele

How to reduce the number of candidates?

Use the Apriori algorithm.

Page 18: Rule Mining by akshay rele

Reducing Number of Candidates

• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent.
• The Apriori principle holds due to the following property of the support measure:
– The support of an itemset never exceeds the support of its subsets:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• Two steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
– Use frequent itemsets to generate rules.

Page 19: Rule Mining by akshay rele

Define the Problem

Given a set of transactions D, generate all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

Page 20: Rule Mining by akshay rele

Step 1: Mining all frequent itemsets

• A frequent itemset is an itemset whose support is ≥ minsup.
• Key idea: the Apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset.

[Itemset lattice figure over items A, B, C, D: 1-itemsets A, B, C, D; 2-itemsets AB, AC, AD, BC, BD, CD; 3-itemsets ABC, ABD, ACD, BCD]

Page 21: Rule Mining by akshay rele

The Algorithm

• Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on.
– In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.
• Find frequent itemsets of size 1: F1.
• For k = 2 onwards:
– Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1.
– Fk = those candidates that are actually frequent, Fk ⊆ Ck (this step needs one scan of the database).

Page 22: Rule Mining by akshay rele

Algorithm Apriori

L1 = {large 1-itemsets};                  // count item occurrences
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);               // generate new k-itemset candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);               // candidates in Ck contained in t
        forall candidates c ∈ Ct do
            c.count++;                    // find the support of all the candidates
    end;
    Lk = {c ∈ Ck | c.count ≥ minsup};     // take only those with support over minsup
end;
Answer = ∪k Lk;
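Below is a compact, runnable Python sketch of this pseudocode, assuming transactions are Python sets; all function and variable names are ours, not from the slides:

```python
from itertools import combinations

def apriori(D, minsup_count):
    """D: list of transactions (sets). Returns {frozenset: count} of all frequent itemsets."""
    # L1 = {large 1-itemsets}: count item occurrences
    counts = {}
    for t in D:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(L)
    k = 2
    while L:
        # apriori-gen: join frequent (k-1)-itemsets, then prune by downward closure
        prev = list(L)
        C = {a | b for i, a in enumerate(prev) for b in prev[i + 1:] if len(a | b) == k}
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # one scan over D: find the support of all the candidates
        counts = {c: 0 for c in C}
        for t in D:
            for c in C:
                if c <= t:
                    counts[c] += 1
        # take only those with support over minsup
        L = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(L)
        k += 1
    return answer

# The dataset of the worked example on the next slide (minsup = 0.5, i.e. count >= 2):
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(T, minsup_count=2))  # includes frozenset({2, 3, 5}): 2
```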

Page 23: Rule Mining by akshay rele

Example: Finding frequent itemsets

Dataset T (minsup = 0.5):

TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

itemset : count

1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}

Page 24: Rule Mining by akshay rele

Details: ordering of items

• The items in I are sorted in lexicographic order (which is a total order).

• The order is used throughout the algorithm in each itemset.

• {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.

Page 25: Rule Mining by akshay rele

Apriori candidate generation

• The candidate-gen function takes Fk-1 and returns a superset (called the candidates) of the set of all frequent k-itemsets. It has two steps:
– Join step: generate all possible candidate itemsets Ck of length k.
– Prune step: remove those candidates in Ck that cannot be frequent.
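A minimal Python sketch of the two steps (names are ours); itemsets are kept as sorted tuples so the lexicographic order from the earlier slide drives the join:

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """F_prev: frequent (k-1)-itemsets as sorted tuples. Returns candidate k-itemsets."""
    F = set(F_prev)
    # join step: merge two (k-1)-itemsets that share their first k-2 items
    C = [a + (b[k - 2],) for a in F_prev for b in F_prev
         if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]]
    # prune step: drop candidates that have an infrequent (k-1)-subset
    return [c for c in C if all(s in F for s in combinations(c, k - 1))]

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(candidate_gen(F3, 4))
# join gives {1,2,3,4} and {1,3,4,5}; pruning leaves [(1, 2, 3, 4)],
# matching the example on the next slide
```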

Page 26: Rule Mining by akshay rele

An example

• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• After the join step:
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• After the prune step:
– C4 = {{1, 2, 3, 4}},
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed).

Page 27: Rule Mining by akshay rele

Step 2: Generating rules from frequent itemsets

• Frequent itemsets ≠ association rules.
• One more step is needed to generate association rules.
• For each frequent itemset X, for each proper nonempty subset A of X:
– Let B = X − A.
– A → B is an association rule if confidence(A → B) ≥ minconf, where

support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
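A minimal Python sketch of this rule-generation loop (names are ours); it assumes the {itemset: count} map produced by a frequent-itemset miner such as the Apriori sketch earlier, and downward closure guarantees every subset A is in that map:

```python
from itertools import combinations

def generate_rules(freq, minconf):
    """freq: {frozenset: support_count} of frequent itemsets. Yields (A, B, conf)."""
    for X, x_count in freq.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                 # every proper nonempty subset A of X
            for A in map(frozenset, combinations(X, r)):
                B = X - A                          # B = X - A
                conf = x_count / freq[A]           # support(X) / support(A)
                if conf >= minconf:
                    yield set(A), set(B), conf

# e.g., with the earlier sketch: list(generate_rules(apriori(T, 2), 0.8))
```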

Page 28: Rule Mining by akshay rele
Page 29: Rule Mining by akshay rele

Generating rules: summary

• To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A).
• All the information required for the confidence computation has already been recorded during itemset generation; there is no need to scan the data T again.
• This step is not as time-consuming as frequent itemset generation.

Page 30: Rule Mining by akshay rele

Problems with association mining

• Single minsup: it assumes that all items in the data are of the same nature and/or have similar frequencies.
• Not true: in many applications, some items appear very frequently in the data, while others rarely appear. E.g., in a supermarket, people buy food processors and cooking pans much less frequently than they buy bread and milk.

Page 31: Rule Mining by akshay rele

Rare Item Problem

• If the frequencies of items vary a great deal, we encounter two problems:
– If minsup is set too high, rules that involve rare items will not be found.
– To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion, because the frequent items will be associated with one another in all possible ways.

Page 32: Rule Mining by akshay rele

Multiple minsups model

• The minimum support of a rule is expressed in terms of the minimum item supports (MIS) of the items that appear in the rule.
• Each item can have its own minimum item support.
• By providing different MIS values for different items, the user effectively expresses different support requirements for different rules.

Page 33: Rule Mining by akshay rele

Minsup of a rule

• Let MIS(i) be the MIS value of item i. The minsup of a rule R is the lowest MIS value of the items in the rule.
• I.e., a rule R: a1, a2, …, ak → ak+1, …, ar satisfies its minimum support if its actual support is ≥ min(MIS(a1), MIS(a2), …, MIS(ar)).

Page 34: Rule Mining by akshay rele

An Example

• Consider the following items: bread, shoes, clothes.
The user-specified MIS values are as follows:

MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%

The following rule doesn't satisfy its minsup:
clothes → bread [sup = 0.15%, conf = 70%]

The following rule satisfies its minsup:
clothes → shoes [sup = 0.15%, conf = 70%]
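A tiny Python sketch of this check, using the MIS values above (names are ours):

```python
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_sup):
    """A rule satisfies its minsup if its actual support >= the lowest MIS of its items."""
    return actual_sup >= min(MIS[i] for i in rule_items)

print(satisfies_minsup({"clothes", "bread"}, 0.0015))  # False: 0.15% < min(2%, 0.2%)
print(satisfies_minsup({"clothes", "shoes"}, 0.0015))  # True:  0.15% >= min(0.2%, 0.1%)
```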

Page 35: Rule Mining by akshay rele

To deal with the problem

• We sort all items in I according to their MIS values (making it a total order).
• The order is used throughout the algorithm in each itemset.
• Each itemset w is of the following form:
{w[1], w[2], …, w[k]}, consisting of items w[1], w[2], …, w[k], where MIS(w[1]) ≤ MIS(w[2]) ≤ … ≤ MIS(w[k]).

Page 36: Rule Mining by akshay rele

Mining class association rules (CAR)

• Normal association rule mining does not have any target.
• It finds all possible rules that exist in the data, i.e., any item can appear as a consequent or a condition of a rule.
• However, in some applications, the user is interested in specific targets.
– E.g., the user has a set of text documents from some known topics and wants to find out which words are associated or correlated with each topic.

Page 37: Rule Mining by akshay rele

Problem definition

• Let T be a transaction data set consisting of n transactions.
• Each transaction is also labeled with a class y.
• Let I be the set of all items in T, Y be the set of all class labels, and I ∩ Y = ∅.
• A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y.
• The definitions of support and confidence are the same as those for normal association rules.

Page 38: Rule Mining by akshay rele

An example

• A text document data set:

doc 1: Student, Teach, School : Education
doc 2: Student, School : Education
doc 3: Teach, School, City, Game : Education
doc 4: Baseball, Basketball : Sport
doc 5: Basketball, Player, Spectator : Sport
doc 6: Baseball, Coach, Game, Team : Sport
doc 7: Basketball, Team, City, Game : Sport

• Let minsup = 20% and minconf = 60%. The following are two examples of class association rules:

Student, School → Education [sup = 2/7, conf = 2/2]
Game → Sport [sup = 2/7, conf = 2/3]

Page 39: Rule Mining by akshay rele

Mining algorithm

• Unlike normal association rules, CARs can be mined directly in one step.
• The key operation is to find all ruleitems that have support above minsup. A ruleitem is of the form:
(condset, y)
where condset is a set of items from I (i.e., condset ⊆ I), and y ∈ Y is a class label.
• Each ruleitem basically represents a rule: condset → y.
• The Apriori algorithm can be modified to generate CARs.
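A deliberately brute-force Python sketch of ruleitem counting on the document data from the previous slide (names are ours; a real miner would proceed level-wise, as in Apriori, rather than enumerating condsets directly):

```python
from itertools import combinations

def mine_car_ruleitems(data, minsup_count, max_len=2):
    """data: list of (set_of_items, class_label). Returns {(condset, y): count}."""
    counts = {}
    for items, y in data:
        for k in range(1, max_len + 1):              # condsets up to max_len items
            for condset in combinations(sorted(items), k):
                counts[(condset, y)] = counts.get((condset, y), 0) + 1
    return {key: n for key, n in counts.items() if n >= minsup_count}

docs = [({"Student", "Teach", "School"}, "Education"),
        ({"Student", "School"}, "Education"),
        ({"Teach", "School", "City", "Game"}, "Education"),
        ({"Baseball", "Basketball"}, "Sport"),
        ({"Basketball", "Player", "Spectator"}, "Sport"),
        ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
        ({"Basketball", "Team", "City", "Game"}, "Sport")]

frequent = mine_car_ruleitems(docs, minsup_count=2)   # minsup = 2/7 ≈ 29%
print(frequent[(("Game",), "Sport")])  # 2, i.e. the rule Game -> Sport has sup = 2/7
```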

Page 40: Rule Mining by akshay rele

Multiple minimum class supports

• The multiple minimum support idea can also be applied here.
• The user can specify different minimum supports for different classes, which effectively assigns a different minimum support to the rules of each class.
• For example, for a data set with two classes, Yes and No, we may want
– rules of class Yes to have a minimum support of 5%, and
– rules of class No to have a minimum support of 10%.
• By setting the minimum class support of a class to 100% (or more), we tell the algorithm not to generate any rules of that class.
– This is a very useful trick in applications.
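A tiny sketch of this per-class check; the class names beyond Yes/No and the values are illustrative, not from the slides:

```python
# minimum class supports; > 100% effectively disables rules of the "Maybe" class
minsup_by_class = {"Yes": 0.05, "No": 0.10, "Maybe": 1.01}

def car_passes_minsup(y, sup):
    """A CAR of class y passes if its support meets that class's minimum support."""
    return sup >= minsup_by_class[y]

print(car_passes_minsup("Yes", 0.06))    # True:  6% >= 5%
print(car_passes_minsup("No", 0.06))     # False: 6% < 10%
print(car_passes_minsup("Maybe", 0.99))  # False: that class is switched off
```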

Page 41: Rule Mining by akshay rele

Summary

• Association rules are an important tool for analyzing databases.
• We have seen an algorithm that finds all association rules in a database.
• The algorithm has better time results than previous algorithms.
• The algorithm maintains its performance on large databases.

Page 42: Rule Mining by akshay rele

THANK YOU

