A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming
Chuang-Kai Chiou, Judy C. R Tseng
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007
Transcript
Page 1: Chuang-Kai Chiou, Judy C. R Tseng

A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming

Chuang-Kai Chiou, Judy C. R Tseng

Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

Page 2: Chuang-Kai Chiou, Judy C. R Tseng

Outline

Introduction
Apriori Algorithm
DHP Algorithm
MPIP Algorithm
SIT Algorithm
Experiment and Evaluation
Conclusion and Future works

Page 3: Chuang-Kai Chiou, Judy C. R Tseng

Introduction

Apriori algorithm: a large number of candidate itemsets will be generated.

Several hash-based algorithms use hash functions to filter out potential-less candidate itemsets: the DHP algorithm and the MPIP algorithm.

SIT algorithm: uses the sorting, indexing, and trimming techniques to reduce the number of itemsets to be considered, combining the advantages of the Apriori and MPIP algorithms.

Page 4: Chuang-Kai Chiou, Judy C. R Tseng

Apriori Algorithm

Database D:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D -> C1:
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {4}       1
  {5}       3

L1:
  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {5}       3

C2:
  {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D -> C2:
  itemset   sup
  {1 2}     1
  {1 3}     2
  {1 5}     1
  {2 3}     2
  {2 5}     3
  {3 5}     2

L2:
  itemset   sup
  {1 3}     2
  {2 3}     2
  {2 5}     3
  {3 5}     2

C3:
  {2 3 5}

Scan D -> C3:
  {2 3 5}   2

L3:
  {2 3 5}   2
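The run above implies a minimum support of 2 transactions. As a minimal, hedged sketch of the generate-and-count loop (illustrative Python, not the authors' implementation), the following reproduces C1/L1 through C3/L3 on this example database:

```python
from itertools import combinations

# Example database from the slide; the run implies a minimum support of 2 transactions.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MIN_SUP = 2

def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db.values() if itemset <= t)

def apriori(db, min_sup):
    # C1 -> L1: frequent single items
    items = {i for t in db.values() for i in t}
    L = [{frozenset([i]) for i in items if support(frozenset([i]), db) >= min_sup}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # Candidate generation: join frequent (k-1)-itemsets, keep unions of size k
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset, then scan D to count
        C = {c for c in C if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        L.append({c for c in C if support(c, db) >= min_sup})
        k += 1
    return [lk for lk in L if lk]

for k, lk in enumerate(apriori(D, MIN_SUP), start=1):
    print(f"L{k}:", sorted(tuple(sorted(s)) for s in lk))
# L1: {1},{2},{3},{5}; L2: {1,3},{2,3},{2,5},{3,5}; L3: {2,3,5}
```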

Page 5: Chuang-Kai Chiou, Judy C. R Tseng

DHP Algorithm

(Figure: the DHP hash-table construction from the transaction database.)
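The slide presents this construction only as a figure. As a hedged illustration of the general DHP idea (the hash function, the bucket count of 7, and the table layout here are arbitrary choices for this sketch, not the paper's), the following counts 1-itemsets and, in the same scan, hashes every 2-itemset of each transaction into a bucket; a pair of frequent items is kept as a C2 candidate only if its bucket count reaches the minimum support:

```python
from itertools import combinations

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MIN_SUP = 2
NUM_BUCKETS = 7  # arbitrary table size chosen for this illustration

def dhp_first_pass(db, min_sup, num_buckets):
    item_count = {}
    bucket = [0] * num_buckets
    # One scan: count single items and hash every 2-itemset into a bucket
    for t in db.values():
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for a, b in combinations(sorted(t), 2):
            bucket[(a * 10 + b) % num_buckets] += 1  # simple illustrative hash
    L1 = {i for i, c in item_count.items() if c >= min_sup}
    # C2: pairs of frequent items whose bucket count could still reach min_sup
    C2 = {(a, b) for a, b in combinations(sorted(L1), 2)
          if bucket[(a * 10 + b) % num_buckets] >= min_sup}
    return L1, C2

L1, C2 = dhp_first_pass(D, MIN_SUP, NUM_BUCKETS)
print("L1:", sorted(L1))         # [1, 2, 3, 5]
print("pruned C2:", sorted(C2))  # [(1, 3), (2, 3), (2, 5), (3, 5)]
```

Note that the pruned C2 can still contain false candidates when different pairs collide in the same bucket; the MPIP slides that follow address exactly this collision problem.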

Page 6: Chuang-Kai Chiou, Judy C. R Tseng

MPIP Algorithm (1/2)

MPIP employs a minimal perfect hashing function for mining L1 and L2. It copes with the collision problem that occurs in DHP, so the time needed for scanning and searching data items is reduced.

It employs the Apriori algorithm for finding the frequent k-itemsets for k > 2.
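The slide does not show the hash function itself. As a hedged illustration of collision-free pair counting (a standard minimal perfect hash for unordered pairs, which may differ from the paper's exact function), the sketch below maps each of the n(n-1)/2 possible 2-itemsets to its own counter, so L1 and L2 are obtained in one scan without DHP-style collisions:

```python
from itertools import combinations

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
MIN_SUP = 2
N = 5  # items are numbered 1..N in this example

def pair_address(i, j, n):
    """Minimal perfect hash for the pair {i, j} with 1 <= i < j <= n:
    maps the n*(n-1)/2 possible pairs onto 0 .. n*(n-1)/2 - 1 with no collisions."""
    return (i - 1) * n - i * (i - 1) // 2 + (j - i - 1)

def l1_l2_via_perfect_hash(db, min_sup, n):
    item_count = [0] * (n + 1)
    pair_count = [0] * (n * (n - 1) // 2)
    # Single scan: accumulate 1-itemset and 2-itemset counts in fixed tables
    for t in db.values():
        for i in t:
            item_count[i] += 1
        for i, j in combinations(sorted(t), 2):
            pair_count[pair_address(i, j, n)] += 1
    L1 = {i for i in range(1, n + 1) if item_count[i] >= min_sup}
    L2 = {(i, j) for i, j in combinations(range(1, n + 1), 2)
          if pair_count[pair_address(i, j, n)] >= min_sup}
    return L1, L2

print(l1_l2_via_perfect_hash(D, MIN_SUP, N))
# ({1, 2, 3, 5}, {(1, 3), (2, 3), (2, 5), (3, 5)})
```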

Page 7: Chuang-Kai Chiou, Judy C. R Tseng

MPIP Algorithm (2/2)

Page 8: Chuang-Kai Chiou, Judy C. R Tseng

SIT Algorithm (1/5)

For mining association rules, we propose a revised algorithm: the Sorting-Indexing-Trimming (SIT) approach.

The SIT approach avoids generating potential-less candidate itemsets and enhances performance via sorting, indexing and trimming.

Page 9: Chuang-Kai Chiou, Judy C. R Tseng

SIT Algorithm (2/5)

Sorting
(1) Start from the original transaction database.

(2) Count the occurrence frequency of each item.

(3) Sort the items by the counts in increasing order and build a mapping table.

(4) Translate the items into mapping numbers.

(5) Re-sort the items in each transaction. (A short sketch of these steps follows.)
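A minimal sketch of steps (2)-(5) on the toy database used earlier (the mapping-table layout is an assumption for illustration, not the paper's exact data structure):

```python
from collections import Counter

D = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}

# (2) Count how often each item occurs in the database
freq = Counter(i for t in D.values() for i in t)

# (3) Sort the items by increasing frequency and build a mapping table
#     (the least frequent item gets mapping number 1, and so on)
mapping = {item: k for k, (item, _) in
           enumerate(sorted(freq.items(), key=lambda kv: (kv[1], kv[0])), start=1)}

# (4) Translate each transaction into mapping numbers,
# (5) then re-sort the items of every transaction by their new numbers
sorted_db = {tid: sorted(mapping[i] for i in t) for tid, t in D.items()}

print("mapping table:", mapping)    # {4: 1, 1: 2, 2: 3, 3: 4, 5: 5}
print("translated DB:", sorted_db)  # {100: [1, 2, 4], 200: [3, 4, 5], 300: [2, 3, 4, 5], 400: [3, 5]}
```

After this step, less frequent items carry smaller mapping numbers and therefore sit at the front of every re-sorted transaction, which the trimming step can exploit.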

Page 10: Chuang-Kai Chiou, Judy C. R Tseng

SIT Algorithm (3/5)

Indexing

(Figure: Apriori vs. indexing with the index table; comparing count = 69.)
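The slide gives only the figure. As one possible reading of how an index table cuts the comparing count (a hypothetical inverted-index layout, not necessarily the paper's exact structure), the sketch below records, for each mapped item, the transactions containing it, so a candidate's support is obtained by intersecting a few TID sets instead of comparing the candidate against every item of every transaction as plain Apriori does:

```python
# Toy database in mapped, re-sorted form (output of the sorting step above)
sorted_db = {100: [1, 2, 4], 200: [3, 4, 5], 300: [2, 3, 4, 5], 400: [3, 5]}

# Hypothetical index table: for each item, the set of transactions containing it.
index = {}
for tid, t in sorted_db.items():
    for item in t:
        index.setdefault(item, set()).add(tid)

def support_via_index(itemset):
    """Support of `itemset` as the size of the intersection of its TID sets."""
    return len(set.intersection(*(index[i] for i in itemset)))

print(support_via_index([3, 5]))  # 3 (transactions 200, 300, 400)
print(support_via_index([2, 4]))  # 2 (transactions 100, 300)
```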

Page 11: Chuang-Kai Chiou, Judy C. R Tseng

SIT Algorithm (4/5)

Trimming
If the minimum support is 3, all items with frequency less than 3 will be trimmed. To preserve the data, physical trimming is avoided: we just record the starting position and generate the hash table from that position.

(Figure: the resulting L1.)
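A hedged sketch of this logical trimming on the toy database from the sorting step (assuming, as the sorting step implies, that infrequent items occupy the smallest mapping numbers and hence the front of each re-sorted transaction):

```python
from collections import Counter

# Mapped, re-sorted database from the sorting step; minimum support = 3 as on the slide
sorted_db = {100: [1, 2, 4], 200: [3, 4, 5], 300: [2, 3, 4, 5], 400: [3, 5]}
MIN_SUP = 3

freq = Counter(i for t in sorted_db.values() for i in t)

# Logical trimming: infrequent items carry the smallest mapping numbers, so they
# sit at the front of each re-sorted transaction. Instead of deleting them, record
# the position where the frequent suffix of every transaction starts.
start_pos = {}
for tid, t in sorted_db.items():
    pos = 0
    while pos < len(t) and freq[t[pos]] < MIN_SUP:
        pos += 1
    start_pos[tid] = pos

print("item frequencies:", dict(freq))   # {1: 1, 2: 2, 4: 3, 3: 3, 5: 3}
print("starting positions:", start_pos)  # {100: 2, 200: 0, 300: 1, 400: 0}
print("trimmed view:", {tid: t[start_pos[tid]:] for tid, t in sorted_db.items()})
```

The hash table of the MPIP step can then be generated starting from the recorded position of each transaction, skipping the infrequent prefix without physically deleting it.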

Page 12: Chuang-Kai Chiou, Judy C. R Tseng

SIT Algorithm (5/5)

The process of the SIT algorithm:

For finding L1 and L2:
Apply the sorting, indexing and trimming techniques to the original database.
Employ the MPIP algorithm to find L1 and L2.

For finding the frequent k-itemsets for k > 2:
Employ the Apriori algorithm on the sorted, indexed and trimmed database, and find the frequent itemsets.

Page 13: Chuang-Kai Chiou, Judy C. R Tseng

Experiment and Evaluation (1/2)

The experiments focus on two parts:
Performance of Apriori, SI+Apriori, MPIP, and SIT.
Performance of SIT and MPIP under different transaction quantities and lengths.

(Figure: performance of Apriori, SI+Apriori, MPIP, and SIT.)

Page 14: Chuang-Kai Chiou, Judy C. R Tseng

Experiment and Evaluation (2/2)

Performance of SIT and MPIP under different transaction quantities and lengths. The time for pre-sorting and pre-indexing is taken into consideration in SIT2.

Page 15: Chuang-Kai Chiou, Judy C. R Tseng

Conclusion and Future works

SIT reduces the number of candidate itemsets and also avoids generating potential-less candidate itemsets.

The performance of SIT is better than that of Apriori, DHP and MPIP.

Some problems still need to be dealt with:
When the data sets grow, we need to sort and index again for association rule mining.
Mapping items to their corresponding index numbers is time-consuming for long transaction lengths.

