Mining frequent patterns through association rule mining is a fundamentally important task in the process of knowledge discovery in large databases.
The main focus of this project report is the generation of frequent patterns, which is the most important task in association rule mining.
This is done by analyzing the implementations of well-known association rule mining algorithms: Apriori, Dynamic Itemset Counting (DIC), and FP-Growth.
The experimental system is developed in Java under the Windows XP operating system. The run-time behavior of these algorithms is analyzed and compared using the Mushroom dataset.
Overview of this Project
Outline
• Introduction
• Association Rule Mining to Frequent Patterns
• Implementation
• Conclusions
• Future Enhancements
• Bibliography
Frequent pattern generation is the most important task in explaining the fundamentals of association rule mining techniques.
The well-known association rule algorithms used to mine the frequent patterns:
Apriori
Dynamic Itemset Counting (DIC)
FP-Growth
Introduction to Frequent Patterns
Association rule mining is one of the fundamental data mining tasks.
An association rule expresses a relationship among a set of objects, such as that they occur together or that one implies the other.
The goal of association rule mining is to find interesting association relationships among large sets of data items.
Each rule is assigned two factors: Support and Confidence
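The two factors can be illustrated with a minimal Java sketch. It assumes the usual definitions: support is the fraction of transactions containing an itemset, and the confidence of X -> Y is support(X and Y together) divided by support(X). The class and method names are hypothetical and are not taken from the project's code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class SupportConfidence {
    // Fraction of transactions that contain every item of the itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        if (transactions.isEmpty()) return 0.0;
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) hits++;
        }
        return (double) hits / transactions.size();
    }

    // confidence(X -> Y) = support(X union Y) / support(X)
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> y) {
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        double supportX = support(transactions, x);
        return supportX == 0.0 ? 0.0 : support(transactions, union) / supportX;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("A", "B"), Set.of("A", "C"), Set.of("A", "B", "C"), Set.of("B"));
        // 2 of the 4 transactions contain both A and B.
        System.out.println(support(db, Set.of("A", "B")));
        // 2 of the 3 transactions containing A also contain B.
        System.out.println(confidence(db, Set.of("A"), Set.of("B")));
    }
}
```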
Association Rule Mining
Generally association rule mining is performed in two steps:
• Find all frequent itemsets. The foundation of association rule algorithms is the fact that any subset of a frequent itemset must also be a frequent itemset; i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets. Frequent itemsets are found iteratively, with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate strong rules having minimum confidence.
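The second step can be sketched as follows. This is a minimal illustration, assuming the support counts of all frequent itemsets are already available from the first step (every subset of a frequent itemset is itself frequent, so its count is known). The class name, method name, and string output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, for illustration only.
class RuleGenerator {
    // Returns the rules "X -> Y" derived from one frequent itemset whose
    // confidence is at least minConf. supportCount must hold the count of
    // the itemset itself and of each of its non-empty subsets.
    static List<String> rules(Set<String> itemset,
                              Map<Set<String>, Integer> supportCount,
                              double minConf) {
        List<String> items = new ArrayList<>(new TreeSet<>(itemset));
        List<String> out = new ArrayList<>();
        int n = items.size();
        int whole = supportCount.get(itemset);
        // Enumerate every non-empty proper subset X with a bitmask.
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            Set<String> x = new TreeSet<>();
            Set<String> y = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) x.add(items.get(i));
                else y.add(items.get(i));
            }
            // confidence(X -> Y) = count(itemset) / count(X)
            double conf = (double) whole / supportCount.get(x);
            if (conf >= minConf) out.add(x + " -> " + y);
        }
        return out;
    }
}
```

The bitmask enumeration is exponential in the size of the itemset, which is acceptable for a sketch but not for long patterns.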
FP Array
• FP Array techniques greatly reduce the need to traverse FP-trees.
• FP Array techniques obtain significantly improved performance over FP-tree-based algorithms.
• FP Array is a new algorithm for finding the maximal and closed frequent itemsets.
FP Array Applications
• It generates the frequent patterns from the existing datasets.
• It applies the minimum support to the given input data.
• It measures the time complexity of searching for the frequent itemsets.
• It displays the number of records, row-wise and column-wise, from the datasets.
Rule to Mine Frequent Items
The frequent itemset mining algorithms are classified considering the following aspects:
• The type of the discovered frequent itemset
• Using candidates
• The representation of the transactions
• The itemset representation used in the algorithm
• The number of disk accesses
• The length of the maximal frequent pattern
The implemented algorithms differ as follows:
[Figure: classification of Apriori, DIC, and FP-Growth by candidate generation (with or without), search order (BFS or DFS), and use of the FP-tree]
Stages in Knowledge Discovery in Frequent Databases
Selection - selecting and segmenting the data that are relevant to given criteria.
Preprocessing - the data cleaning stage, where unnecessary information is removed.
Transformation - the data is made usable and navigable.
Data Mining - extraction of patterns from the data.
Interpretation and Evaluation - the patterns found in the data mining stage are converted into knowledge to support decision-making.
Data Visualization - examining large volumes of data and detecting the patterns visually.
Discoveries in Frequent Databases
The Apriori algorithm is the most popular association rule algorithm. Apriori uses a bottom-up search.
The Apriori algorithm works as follows:
• In the first step, the algorithm generates the candidate 1-itemsets (C1). The count of each itemset is then compared with the minimum support value to find the set L1 of frequent 1-itemsets.
• In the second step, the algorithm uses L1 to construct the set C2 of candidate 2-itemsets. The same procedure is repeated for larger itemsets, and the process finishes when there are no more candidates.
Apriori Algorithm
In each phase, all the transactions in the data set are scanned.
Finally, all frequent itemsets are returned.
Disadvantage: multiple database scans.
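The level-wise procedure described above can be sketched in Java as follows. This is a simplified illustration, not the implementation evaluated in this project: the class and method names are hypothetical, the minimum support is assumed to be given as an absolute transaction count, and candidates are counted by a plain subset test.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch class, for illustration only.
class AprioriSketch {
    // Returns every frequent itemset together with its support count.
    static Map<Set<String>, Integer> mine(List<Set<String>> db, int minCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        // C1: every distinct item is a candidate 1-itemset.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : db)
            for (String item : t) candidates.add(new TreeSet<>(Set.of(item)));
        while (!candidates.isEmpty()) {
            // One full database scan per level to count the candidates.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : db)
                for (Set<String> c : candidates)
                    if (t.containsAll(c)) counts.merge(c, 1, Integer::sum);
            // L_k: the candidates that meet the minimum support.
            Set<Set<String>> level = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minCount) {
                    level.add(e.getKey());
                    frequent.put(e.getKey(), e.getValue());
                }
            }
            candidates = join(level);
        }
        return frequent;
    }

    // C_(k+1): unions of two members of L_k that are one item larger,
    // kept only if all of their k-item subsets are in L_k.
    static Set<Set<String>> join(Set<Set<String>> level) {
        Set<Set<String>> next = new HashSet<>();
        for (Set<String> a : level)
            for (Set<String> b : level) {
                Set<String> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == a.size() + 1 && allSubsetsFrequent(union, level))
                    next.add(union);
            }
        return next;
    }

    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> level) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!level.contains(subset)) return false;
        }
        return true;
    }
}
```

The `while` loop makes the disadvantage visible: each level triggers another pass over the whole database.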
The DIC (Dynamic Itemset Counting) algorithm, which uses fewer database scans, presents a different approach to finding large itemsets.
The aim of the DIC algorithm is to improve performance by eliminating repeated database scans.
The DIC algorithm divides the database into partitions (intervals of M transactions) and uses a dynamic counting strategy. It defines stop points for itemset counting: at any such point during the database scan, it can stop counting some itemsets and start counting others.
Four symbols indicate the different states of itemsets: solid box, solid circle, dashed box, and dashed circle. A box marks an itemset whose count has reached the support threshold and a circle one whose count has not; solid means counting is finished and dashed means the itemset is still being counted.
DIC Algorithm
The algorithm is described as follows:
Step 1: The empty itemset is marked with a solid box and all the 1-itemsets are marked with dashed circles.
Step 2: After reading one interval of M transactions from the database, do the following:
• Check each itemset in a dashed circle. If its count exceeds the support threshold, change it from a dashed circle to a dashed box.
• Check each immediate superset of an itemset that has just become a dashed box. If all the subsets of that superset are in a solid box or dashed box, add the superset as a dashed circle.
• Check each itemset in a dashed circle or dashed box. If it has been counted over all the transactions, change it into a solid circle if it is in a circle or into a solid box if it is in a box.
Step 3: When the end of the transactions is reached, go back to the beginning and repeat Step 2 until no itemset remains in a dashed circle or dashed box.
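The steps above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the class, the `State` fields, and the method names are hypothetical, the support threshold is an absolute count that an itemset must reach, and the empty itemset (solid box) is left implicit by creating the 1-itemsets directly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch class, for illustration only.
class DicSketch {
    // box = count reached the threshold; solid = counted over all transactions.
    static final class State { int count; int seen; boolean box; boolean solid; }

    static Map<Set<String>, Integer> mine(List<Set<String>> db, int minCount, int m) {
        Map<Set<String>, Integer> result = new HashMap<>();
        if (db.isEmpty()) return result;
        Set<String> items = new TreeSet<>();
        for (Set<String> t : db) items.addAll(t);
        // Step 1: all 1-itemsets start as dashed circles.
        Map<Set<String>, State> states = new HashMap<>();
        for (String i : items) states.put(new TreeSet<>(Set.of(i)), new State());
        int pos = 0;
        while (states.values().stream().anyMatch(s -> !s.solid)) {
            // Step 2: read one interval of m transactions, wrapping around (Step 3).
            for (int k = 0; k < m; k++, pos = (pos + 1) % db.size()) {
                Set<String> t = db.get(pos);
                for (Map.Entry<Set<String>, State> e : states.entrySet()) {
                    State s = e.getValue();
                    if (s.seen < db.size()) {          // still being counted
                        s.seen++;
                        if (t.containsAll(e.getKey())) s.count++;
                    }
                }
            }
            // Dashed circle -> dashed box once the threshold is reached.
            List<Set<String>> promoted = new ArrayList<>();
            for (Map.Entry<Set<String>, State> e : states.entrySet()) {
                State s = e.getValue();
                if (!s.box && s.count >= minCount) { s.box = true; promoted.add(e.getKey()); }
            }
            // Add immediate supersets whose subsets are all boxes as dashed circles.
            for (Set<String> x : promoted)
                for (String i : items) {
                    if (x.contains(i)) continue;
                    Set<String> cand = new TreeSet<>(x);
                    cand.add(i);
                    if (!states.containsKey(cand) && allSubsetsBoxed(cand, states))
                        states.put(cand, new State());
                }
            // Dashed -> solid once counted over every transaction.
            for (State s : states.values()) if (s.seen == db.size()) s.solid = true;
        }
        for (Map.Entry<Set<String>, State> e : states.entrySet())
            if (e.getValue().box) result.put(e.getKey(), e.getValue().count);
        return result;
    }

    static boolean allSubsetsBoxed(Set<String> cand, Map<Set<String>, State> states) {
        for (String i : cand) {
            Set<String> sub = new TreeSet<>(cand);
            sub.remove(i);
            State s = states.get(sub);
            if (s == null || !s.box) return false;
        }
        return true;
    }
}
```

Each itemset keeps its own `seen` counter, so an itemset added in the middle of a pass is still counted over exactly one full cycle of the database before it becomes solid.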
FP-Growth
FP-Growth is an algorithm for generating frequent itemsets for association rules. The algorithm compresses a large database into a compact frequent-pattern tree (FP-tree) structure.
The FP-tree structure stores all necessary information about the frequent itemsets in a database.
A frequent-pattern tree (FP-tree for short) is defined as follows:
1. The root is labeled “null” and has a set of item nodes as its children.
2. Each node consists of three fields: item-name (holds the frequent item), count (the number of transactions that share that node), and node-link (the next node in the FP-tree carrying the same item).
3. The frequent-item header table contains two fields: item-name and head of node-link (points to the first node in the FP-tree holding the item).
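The definition above can be sketched as a small Java data structure. This is an illustration only: the class and field names are hypothetical, the `parent` and `children` fields are additions needed to build and walk the tree, and transactions are assumed to arrive already filtered to frequent items and ordered by descending frequency.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class, for illustration only.
class FpTreeSketch {
    static final class Node {
        final String item;      // item-name (null for the root)
        int count;              // transactions sharing this node
        Node nodeLink;          // next node carrying the same item
        final Node parent;      // assumption: kept for walking prefix paths
        final Map<String, Node> children = new HashMap<>();
        Node(String item, Node parent) { this.item = item; this.parent = parent; }
    }

    final Node root = new Node(null, null);
    // Header table: item-name -> first node in the tree holding the item.
    final Map<String, Node> header = new HashMap<>();
    // Last node of each node-link chain, so new nodes are appended.
    private final Map<String, Node> tails = new HashMap<>();

    // Inserts one transaction; shared prefixes reuse existing nodes.
    void insert(List<String> orderedItems) {
        Node current = root;
        for (String item : orderedItems) {
            Node child = current.children.get(item);
            if (child == null) {
                child = new Node(item, current);
                current.children.put(item, child);
                Node previousTail = tails.put(item, child);
                if (previousTail == null) header.put(item, child);
                else previousTail.nodeLink = child;
            }
            child.count++;
            current = child;
        }
    }

    // Total count of an item, found by following its node-links.
    int itemCount(String item) {
        int total = 0;
        for (Node n = header.get(item); n != null; n = n.nodeLink) total += n.count;
        return total;
    }
}
```

Inserting the transactions (A, B, C), (A, B), and (A, C) yields a single A node with count 3 under the root, while C appears in two nodes joined by a node-link.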
Use case Diagram for the proposed system
[Use case diagram. Actor: User. Use cases: Apriori, Dynamic Itemset Counting, FP-Growth. Input: Data Set File.]
Identifying Classes from the above Use cases
Architectural design
Architectural design is the division of software into subsystems and components, together with the process of deciding how these will be connected and how they will interact, including determining the interfaces.
GUI for selecting the file, support, and algorithm
Apriori
Dynamic Itemset Counting
FP-Growth
Matrix Based Association
User interface design
The user interface is designed to display and obtain the needed information in an accessible, efficient manner. The user interface can employ one or more windows. Each window should serve a clear, specific purpose.
Step 1: Select the file name
Step 2: Display the contents of the file onto the text area
Step 3: Enter valid support
Step 5: Select the algorithm
Step 6: Display the frequent patterns for Apriori
Step 7: If the selected algorithm is DIC, then enter the step length
Step 8: Display the frequent patterns for DIC
Step 9: Display the frequent patterns for FP-Growth
Step 10: Display the frequent patterns for MBA (Matrix Based Association)
The FPMiner tool is implemented in Java, and all the experiments are performed on a 1.7 GHz PC with 256 MB of memory. The operating system is Windows XP.
Experiment 1:
Execution times of the different algorithms at different support values are tabulated as follows:
RESULTS
Support   Execution time of AprioriT   Execution time of DIC   Execution time of FP-Growth
50        187 ms                       226754 ms               94 ms
60        110 ms                       184297 ms               74 ms
70        78 ms                        161265 ms               46 ms
80        47 ms                        106953 ms               32 ms
90        32 ms                        74984 ms                31 ms
Experiment 2:
The number of frequent itemsets generated using different algorithms:
Frequent itemsets generated:
Support   Apriori   MBA
50        153       153
60        51        51
70        31        31
80        23        23
90        9         9
Frequent Pattern mining is used for finding frequent itemsets among items in a given data set.
The results show that:
• Apriori does not run as efficiently as FP-Growth.
• Apriori runs slowly because the transactions in the dataset are dense.
• DIC (Dynamic Itemset Counting) is much slower than every other algorithm on this real dataset.
• MBA is better than DIC but not much better than the other two in the case of the Mushroom dataset.
CONCLUSION
There are still many interesting research issues related to the extensions of frequent pattern mining, such as mining structured patterns by further development of these approaches, mining approximate or fault-tolerant patterns in noisy environments, frequent-pattern-based clustering and classification, and so on.
FUTURE ENHANCEMENT
FP Array Techniques
FP Array techniques greatly reduce the need to traverse FP-trees.
FP Array techniques obtain significantly improved performance over FP-tree-based algorithms.
FP Array is a new algorithm for finding all maximal and closed frequent itemsets.
The FP-tree is a compact data structure based on the following properties:
- Frequent pattern mining performs one scan of the database to determine the set of frequent items.
- The method stores each item in a compact structure, so more than two database scans are unnecessary.
- Each frequent item is located in the FP-tree, and each node holds the item and its count.
- The items have to be sorted in descending order of frequency.
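The first scan and the ordering step can be sketched in Java as follows. This is a minimal illustration with hypothetical class and method names; the minimum support is assumed to be an absolute count, and ties between equally frequent items are broken alphabetically, which is an arbitrary choice made here only to keep the order deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class FrequencyOrdering {
    // First database scan: count how many transactions contain each item.
    static Map<String, Integer> countItems(List<Set<String>> db) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : db)
            for (String item : t) counts.merge(item, 1, Integer::sum);
        return counts;
    }

    // Keeps only the frequent items of one transaction and sorts them
    // in descending order of frequency, ready for insertion into an FP-tree.
    static List<String> order(Set<String> transaction,
                              Map<String, Integer> counts,
                              int minCount) {
        List<String> kept = new ArrayList<>();
        for (String item : transaction)
            if (counts.getOrDefault(item, 0) >= minCount) kept.add(item);
        kept.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a)); // descending frequency
            return byCount != 0 ? byCount : a.compareTo(b);              // ties: alphabetical
        });
        return kept;
    }
}
```

A second scan would then pass each ordered transaction to the tree-building routine, so the database is read only twice in total.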