
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation...

Date post: 16-Dec-2015
Upload: adam-sullivan
Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and Salah Ahmed Authors: Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu
Transcript
Page 1: Mining Frequent Patterns in Data Streams at Multiple Time Granularities CS525 Paper Presentation Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and.

Mining Frequent Patterns in Data Streams

at Multiple Time Granularities

CS525 Paper Presentation

Presented by:

Pei Zhang, Jiahua Liu, Pengfei Geng and Salah Ahmed

Authors: Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu


Part 1

• Introduction

• Problem definition and analysis

• FP-Stream


Introduction

• Frequent pattern mining has been widely studied and used on static transaction data sets, but it is challenging to extend it to data streams.

• Why is it difficult to mine frequent patterns in data streams?

— Mining frequent itemsets is a set of join operations.


Problem definition and analysis

• Our task is to find the complete set of frequent patterns in a data stream.

• Apriori algorithm: count only those itemsets whose every proper subset is frequent.

• Problems with using an Apriori-like algorithm:

— Join is a blocking operator

— Infrequent items can become frequent later on and hence cannot be ignored.


Definition

• The frequency of an itemset I over a time period T is the number of transactions in T in which I occurs. The support of I is its frequency divided by the total number of transactions observed in T.

• I is frequent if its support is no less than min_support σ.

• I is subfrequent if its support is less than σ but no less than the maximum support error ε.

• Otherwise, I is infrequent.
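
The three categories above can be illustrated with a minimal Python sketch. The function names `frequency` and `classify`, the toy transactions, and the threshold values are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Iterable, List, Set


def frequency(itemset: Iterable[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for t in transactions if target <= frozenset(t))


def classify(itemset: Iterable[str], transactions: List[Set[str]],
             sigma: float, epsilon: float) -> str:
    """Label an itemset using min_support sigma and support error epsilon."""
    if not transactions:
        raise ValueError("the time period T contains no transactions")
    support = frequency(itemset, transactions) / len(transactions)
    if support >= sigma:
        return "frequent"
    if support >= epsilon:
        return "subfrequent"
    return "infrequent"


# Hypothetical period T with five transactions.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"d"}]

print(classify({"a"}, transactions, sigma=0.5, epsilon=0.1))       # support 3/5
print(classify({"a", "b"}, transactions, sigma=0.5, epsilon=0.1))  # support 2/5
print(classify({"c", "d"}, transactions, sigma=0.5, epsilon=0.1))  # support 0/5
```

With these toy values the three calls land in the frequent, subfrequent, and infrequent categories respectively.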


FP-Stream

• This paper proposes a time-sensitive streaming model, FP-Stream, which includes two major components:

1. A global frequent pattern tree held in main memory.

2. Tilted time windows embedded in this pattern tree.


Part 2

• Mining Time-Sensitive Frequent Patterns in Data Streams

• Maintaining Tilted-Time Windows


Natural tilted-time window

• People are often interested in recent changes.

• Recent changes are depicted at a fine granularity, but long-term changes at a coarse granularity.


Frequent patterns for tilted-time windows

• To mine a variety of frequent patterns associated with time more flexibly, a frequent pattern set can be maintained.


Pattern tree

• For each tilted-time window, one can register a window-based count for each frequent pattern.

• Each node represents a pattern, and its frequency is recorded in the node.


FP-Stream

• Usually frequent patterns do not change dramatically over time.

• Overlap may occur.

• To save space, embed the tilted-time window structure into each node.
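
One way to picture this embedding is a prefix tree whose nodes each carry their own list of per-window counts. The sketch below is only an illustration: the names `PatternNode`, `PatternTree`, `record`, and `counts` are hypothetical, and the window list simply grows by one entry per call rather than being merged the way a tilted-time window table would be.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PatternNode:
    item: str
    # Embedded window table: index 0 holds the most recent window's count.
    window_counts: List[int] = field(default_factory=list)
    children: Dict[str, "PatternNode"] = field(default_factory=dict)


class PatternTree:
    def __init__(self) -> None:
        self.root = PatternNode(item="")

    def record(self, pattern: Tuple[str, ...], count: int) -> None:
        """Store `count` as the newest window entry of `pattern`'s node."""
        node = self.root
        for item in pattern:
            node = node.children.setdefault(item, PatternNode(item))
        node.window_counts.insert(0, count)

    def counts(self, pattern: Tuple[str, ...]) -> List[int]:
        """Return the window table of `pattern`, or [] if it is not stored."""
        node = self.root
        for item in pattern:
            child = node.children.get(item)
            if child is None:
                return []
            node = child
        return list(node.window_counts)


tree = PatternTree()
tree.record(("a",), 30)      # older window
tree.record(("a", "b"), 12)
tree.record(("a",), 28)      # newer window
print(tree.counts(("a",)), tree.counts(("a", "b")))
```

Patterns that share a prefix share the nodes along that prefix, which is where the space saving comes from.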


Maintaining Tilted-Time Windows

• With the arrival of new data, the tilted-time window table keeps growing.

• In order to keep the table compact, a tilted-time window maintenance mechanism is needed.


Logarithmic Tilted-time Window

• In the natural tilted-time window, at most 59 (4+24+31) tilted windows need to be maintained for a period of one month.

• We can reduce the number of tilted-time windows by using a logarithmic tilted-time window scheme.

• According to the logarithmic tilted-time window model, with one year of data and the finest precision at a quarter of an hour, it needs $\log_2(365 \times 24 \times 4) + 1 \approx 17$ units of time instead of $366 \times 24 \times 4 = 35{,}136$ units.
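
The two figures on this slide (17 units versus 35,136 units) can be checked with a few lines of Python. Rounding the logarithm up before adding one is an assumption about how the value 17 was obtained; the variable names are illustrative only.

```python
import math

# Quarter-hour units in a 365-day year.
quarters_per_year = 365 * 24 * 4

# Logarithmic scheme: one window per doubling, plus one (rounding up is assumed).
log_units = math.ceil(math.log2(quarters_per_year)) + 1

# Flat scheme: one window for every quarter hour of a 366-day year.
flat_units = 366 * 24 * 4

print(quarters_per_year, log_units, flat_units)
```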


Logarithmic Tilted-time Window

• Break the stream of transactions into fixed-size batches B1, B2, B3, …, Bn, …

• Bn is the most recent batch; B1 is the oldest.

• For i ≥ j, let B(i, j) denote $\bigcup_{k=j}^{i} B_k$.

• Let $f_I(i, j)$ denote the frequency of I in B(i, j).

• Frequencies for itemset I with ratio 2 (the growth rate of window size): f(n, n); f(n-1, n-1); f(n-2, n-3); f(n-4, n-7); …

• Maintain intermediate buffer windows


Logarithmic Tilted-time Window Updating

• Given a new batch of transactions B:

• Replace level 0: f(n, n) with f(B).

• Shift f(n, n) back to the next finest level of time (level 1).

• Check the status of the intermediate window for level 1:

  • Not full: place f(n-1, n-1) in the intermediate window and stop the algorithm.

  • Full: f(n-1, n-1) + f(intermediate window) is shifted back to level 2.

• Continue this process until shifting stops.


Logarithmic Tilted-time Window Updating…Example

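Intermediate buffer windows are shown in square brackets.

After batch 8: f(8,8); f(7,7)[]; f(6,5)[]; f(4,1)[]

After batch 9: f(9,9); f(8,8)[f(7,7)]; f(6,5)[]; f(4,1)[]

After batch 10: f(10,10); f(9,9)[]; f(8,7)[f(6,5)]; f(4,1)[]

After batch 11: f(11,11); f(10,10)[f(9,9)]; f(8,7)[f(6,5)]; f(4,1)[]

After batch 12: f(12,12); f(11,11)[]; f(10,9)[]; f(8,5)[f(4,1)]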
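
The updating procedure can be sketched in Python so that it reproduces the example on this slide for batches 8 through 12. This is a minimal sketch for a single itemset, not the authors' implementation: the names `Level`, `LogTiltedWindow`, `add_batch`, and `describe` are hypothetical, and each window is stored as an assumed (frequency, newest batch, oldest batch) triple. Following the example, a value shifted into a level becomes that level's main window and the previous main window moves into the intermediate buffer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (frequency, newest batch i, oldest batch j), standing for f(i, j)
Window = Tuple[int, int, int]


@dataclass
class Level:
    main: Optional[Window] = None
    buffer: Optional[Window] = None  # intermediate window (unused at level 0)


class LogTiltedWindow:
    def __init__(self) -> None:
        self.levels: List[Level] = []
        self.n = 0  # index of the most recent batch

    def add_batch(self, freq: int) -> None:
        """Insert the frequency of a new batch and shift older windows back."""
        self.n += 1
        carry: Optional[Window] = (freq, self.n, self.n)
        k = 0
        while carry is not None:
            if k == len(self.levels):
                self.levels.append(Level())
            level = self.levels[k]
            old, level.main = level.main, carry
            if old is None:
                carry = None            # an empty level absorbs the shift
            elif k == 0:
                carry = old             # level 0 always shifts its old value back
            elif level.buffer is None:
                level.buffer = old      # buffer not full: park the old window, stop
                carry = None
            else:
                buf, level.buffer = level.buffer, None
                # buffer full: merge the two windows and shift them to level k + 1
                carry = (old[0] + buf[0], old[1], buf[2])
            k += 1

    def describe(self) -> str:
        parts = []
        for k, level in enumerate(self.levels):
            _, i, j = level.main
            text = f"f({i},{j})"
            if k > 0:
                inner = f"f({level.buffer[1]},{level.buffer[2]})" if level.buffer else ""
                text += f"[{inner}]"
            parts.append(text)
        return "; ".join(parts)


window = LogTiltedWindow()
for batch in range(1, 13):
    window.add_batch(1)  # pretend the itemset occurs once in every batch
    if batch >= 8:
        print(window.describe())
```

Because windows are only ever merged, never discarded, the frequencies stored across all levels always add up to the total frequency seen so far.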


Part 3

• Tail Pruning

• Type I Pruning

• Type II Pruning

• Algorithm


Tail Pruning

• Let $t_0, \ldots, t_n$ be the tilted-time windows, where $t_n$ is the oldest.

• $w_i$ is the window size of $t_i$.

• Drop tail sequences when the following condition holds,


Type I and Type II Pruning

• Type I Pruning:

  • If I is found in B but is not in the FP-stream structure, no superset is in the structure.

  • Hence, if $f_I(B) < \varepsilon |B|$, then none of the supersets need be examined.

• Type II Pruning:

  • If all of I's tilted-time window table entries are pruned (and I is dropped), then any superset will also be dropped.
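
A rough Python sketch of the two pruning rules follows. It makes several assumptions that are not in the slides: the FP-stream structure is flattened into a dictionary from itemsets to window tables instead of a tree, the function names are hypothetical, and the Type I test compares the batch frequency of I against ε times the batch size, which is inferred from the definition of ε because the condition itself is not legible on the slide.

```python
from typing import Dict, FrozenSet, List

# Assumed stand-in for the FP-stream structure: itemset -> window table.
Structure = Dict[FrozenSet[str], List[int]]


def can_skip_supersets(itemset: FrozenSet[str], batch_freq: int,
                       batch_size: int, epsilon: float,
                       structure: Structure) -> bool:
    """Type I: I is not in the structure and is below the error threshold in B."""
    return itemset not in structure and batch_freq < epsilon * batch_size


def drop_with_supersets(itemset: FrozenSet[str],
                        structure: Structure) -> List[FrozenSet[str]]:
    """Type II: once I is dropped, drop every stored superset of I as well."""
    doomed = [pattern for pattern in structure if itemset <= pattern]
    for pattern in doomed:
        del structure[pattern]
    return doomed


structure: Structure = {
    frozenset({"a"}): [5],
    frozenset({"a", "b"}): [3],
    frozenset({"b"}): [4],
}

# {"c"} occurs twice in a batch of 100 and is not stored: its supersets can be skipped.
print(can_skip_supersets(frozenset({"c"}), 2, 100, 0.05, structure))

# Dropping {"a"} also removes {"a", "b"}; {"b"} stays.
print(drop_with_supersets(frozenset({"a"}), structure))
```

In a tree-shaped structure, the Type II rule amounts to removing a node together with its whole subtree.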


An Algorithm

• FP-streaming: Incremental update of the FP-stream structure with incoming stream data

• 1. Initialize the FP-tree to empty.

• 2. Sort each incoming transaction t according to f_list, and then insert it into the FP-tree without pruning any items.

• 3. When all the transactions in Bi are accumulated, update the FP-stream as follows:

  • Mine itemsets out of the FP-tree using the FP-growth algorithm.

  • Scan the FP-stream structure.
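
Steps 1 and 2 can be sketched as a small prefix-tree insertion routine in Python. The names `FPNode` and `insert_transaction` are hypothetical, the `f_list` and toy batch are made up, and placing items that do not appear in `f_list` after the listed ones in alphabetical order is an assumption made so that no item is pruned. Step 3 (FP-growth mining and the scan of the FP-stream structure) is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class FPNode:
    item: str
    count: int = 0
    children: Dict[str, "FPNode"] = field(default_factory=dict)


def insert_transaction(root: FPNode, transaction: Iterable[str],
                       f_list: List[str]) -> None:
    """Sort a transaction by f_list order and add it to the FP-tree."""
    rank = {item: position for position, item in enumerate(f_list)}
    # Items missing from f_list are kept (no pruning) and placed last.
    ordered = sorted(set(transaction),
                     key=lambda item: (rank.get(item, len(rank)), item))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1


# Step 1: start from an empty tree.
root = FPNode(item="")

# Step 2: insert every transaction of the batch in f_list order.
f_list = ["a", "b", "c"]
batch = [["b", "a"], ["c", "a", "b"], ["a", "d"]]
for transaction in batch:
    insert_transaction(root, transaction, f_list)

a_node = root.children["a"]
print(a_node.count, a_node.children["b"].count, a_node.children["d"].count)
```

Sorting by a fixed item order is what lets transactions with common items share a path from the root, keeping the tree compact.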


Part 4

• Experimental Set-Up

• Experimental Results

• Discussion


Experimental Set-Up

• Experiments are performed using:

  • Sun UltraSPARC-IIi processors, 512 MB RAM

• Dataset generation:

  • 3 million transactions

  • 1k distinct items

  • Streams are broken into batches of 50k transactions

  • Every 5 batches, 200 random permutations are applied


FP-stream time requirements

• Item permutations cause the behavior to jump every 5 batches.

• Stability is regained quickly.

• Required time increases as the average itemset length increases.


FP-stream space requirements

• The overall space requirements are very attractive in all cases: less than 3 MB.


FP-stream average itemset length

• The average itemset length does not increase as the average transaction length increases.

• This result was also verified by Apriori running on 50k transactions.


FP-stream total number of itemsets

• The total number of itemsets increases as the average transaction length increases.

• This result was also verified by Apriori running on 50k transactions.


Discussion

• Further compression is possible.

  • If the support is stable for lots of entries, the table can be compressed.

  • If the tilted-time windows of a parent node and its child node are the same, only one tilted-time window needs to be maintained.

• It is a very nice idea to mine time sensitive frequent patterns.

• Mining and maintaining frequent patterns become realistic even with limited main memory.


Feedback

Comments and Questions


Thank You

