Date post: | 30-Dec-2015 |
Category: |
Documents |
Upload: | medge-hendrix |
View: | 26 times |
Download: | 4 times |
1
An Efficient Algorithm for Mining Frequent Itemestsover the Entire History of Data Streams
Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for publication in the Proceedings of First International Workspublication in the Proceedings of First International Workshop on Knowledge Discovery in Data Streams, to be held in hop on Knowledge Discovery in Data Streams, to be held in conjunction with the 15th European Conference on Machinconjunction with the 15th European Conference on Machine Learning (ECML 2004).e Learning (ECML 2004).
Adviser: Jia-Ling KohAdviser: Jia-Ling KohSpeaker: Shu-Ning ShinSpeaker: Shu-Ning ShinDate: 2005.3.4Date: 2005.3.4
2
IntroductionIntroduction• This paper proposes a single-pass algorithm,
called DSMFI(Data Stream Mining for Frequent Itemsets), has three major features:– single streaming data scan for counting item
sets’ frequency information.– extended prefix-tree-based compact pattern
representation.– top-down frequent itemset discovery schem
e.
3
Problem Definition (1)Problem Definition (1)
• Item: Ψ={i1, i2, …, im}• Data Stream: DS=B1, B2, …,BN
• Block: B=[tsi, T1, T2, …, Tk]• Current Length of Data Stream: CL=|B1|+|
B2|+ …+|BN|• Minimum support: • Maximal estimated support error:
)1,0(ms
),0( ms
4
Problem Definition (2)Problem Definition (2)• True Support of itemset X: Tsup(X)• Estimated Support of X: Esup(X)• X is frequent itemset if
– • X is sub-frequent itemset if
– • X is infrequent itemset if
–
CLmsXT )sup(
CLXTCLms )sup(
CLXT )sup(
5
Data Structure: IsFI-forest (1)Data Structure: IsFI-forest (1)• Item-suffix Frequent Itemset forest.• Extended prefix-tree-based summary da
ta structure.• Definition:• 1. Is-FI forest consist of:
– DHT: Dynamic Header Table– a set of CFI-trees (item-suffixes): Candidate
Frequent Itemset trees of item-suffixes.
6
Data Structure: IsFI-forest (2)Data Structure: IsFI-forest (2)• 2. Entry of DHT:
– item-id, support, block-id, head-link.• 3. Entry of CFI-tree(item-suffix):
– item-id, support, block-id, node-link.• 4. Each CFI-tree(item-suffix) has a specifi
c DHT(item-sufix)
7
Construction of IsFI-forestConstruction of IsFI-forest• 1. Read a transaction T=(x1x2..xk) from current
block BN.• 2. Item-suffix projection IsProjection(T): transa
ction T is converted into k small transactions (x1x2..xk), (x2..xk),…, (xk-1xk), (xk).
• 3. insert IsProjection(T) into IsFI-forest.
8
Algorithm: DSM-FIAlgorithm: DSM-FI• DSM-FI is composed of four steps:
– step 1 : reading a block of transactions– step 2: constructing IsFI-forest– step 3: pruning the infrequent information fr
om IsFI-forest– Step 4: top-down frequent itemset discover
y scheme
9
ExampleExample• ms=30%, ε=25% • DS={B1, B2}
– B1={(acdef), (abe), (cef), (acdf), (cef), (df)}– B2={(def), (bef), (be), (bde)}
• DSM-FI:– Step 1. Read B1 into main memory– Step 2. constructing the IsFI-forest– Step 3. pruning infrequent item– Step 4. mine frequent itemset from IsFI-forest
10
Example – step 2 (1)Example – step 2 (1)• (1) T1=acdef
– IsProjection(acdef)=acdef, cdef, def, ef, f– [CFT-tree(a), DHT(a)], [CFT-tree(c), DHT(c)], [CFT-tree(d), DHT(d)],
[CFT-tree(e), DHT(e)] [CFT-tree(f), DHT(f)]c 1 1d 1 1e 1 1f 1 1
d 1 1e 1 1f 1 1
DHT(a)
a:1:1
DHT(c)e 1 1f 1 1
f 1 1DHT(d) DHT(e) DHT(f)
c:1:1 d:1:1 e:1:1 f:1:1
a:1:1
c:1:1
d:1:1
e:1:1
f:1:1
c:1:1
d:1:1
e:1:1
f:1:1
d:1:1
e:1:1
f:1:1
e:1:1
f:1:1
f:1:1
CFT-tree(a)
CFT-tree(c)
CFT-tree(d)
CFT-tree(e)
CFT-tree(f)
11
Example – step 2 (2)Example – step 2 (2)• (2) T2=abe
– IsProjection(abe)=abe, be, e– [CFT-tree(a), DHT(a)], [CFT-tree(b), DHT(b)], [CFT-tree(e), DHT(e)]
c 1 1d 1 1e 2 1f 1 1b 1 1
d 1 1e 1 1f 1 1
DHT(a)
a:2:1
DHT(c)e 1 1f 1 1
f 1 1DHT(d) DHT(e) DHT(f)
c:1:1 d:1:1 e:2:1 f:1:1
a:2:1
c:1:1
d:1:1
e:1:1
f:1:1
c:1:1
d:1:1
e:1:1
f:1:1
d:1:1
e:1:1
f:1:1
e:2:1
f:1:1
f:1:1
b:1:1
e:1:1
e 1 1
b:1:1
e:1:1
b:1:1
DHT(b)
12
Example – step 2 (3)Example – step 2 (3)• (6) T6=df
– IsProjection(abe)=df, f– [CFT-tree(d), DHT(d)], [CFT-tree(f), DHT(f)]
c 2 1d 2 1e 2 1f 2 1b 1 1
d 2 1e 3 1f 4 1
DHT(a)
a:3:1
DHT(c)e 1 1f 3 1
f 3 1DHT(d) DHT(e) DHT(f)
c:4:1 d:3:1 e:4:1 f:5:1
a:3:1
c:2:1
d:2:1
e:1:1
f:1:1
c:4:1
d:2:1
e:1:1
f:1:1
d:3:1
e:1:1
f:1:1
e:4:1
f:3:1
f:5:1
b:1:1
e:1:1
e 1 1
b:1:1
e:1:1
b:1:1
DHT(b)
f:2:1
f:1:1
e:2:1
f:2:1
f:1:1
13
Example – step 3Example – step 3• b is a infrequent item: Esup(b)<0.25*6• delete CFI-tree(b), DHT(b) and entry b.
c 2 1d 2 1e 2 1f 2 1b 1 1
d 2 1e 3 1f 4 1
DHT(a)
a:3:1
DHT(c)e 1 1f 3 1
f 3 1DHT(d) DHT(e) DHT(f)
c:4:1 d:3:1 e:4:1 f:5:1
a:3:1
c:2:1
d:2:1
e:1:1
f:1:1
c:4:1
d:2:1
e:1:1
f:1:1
d:3:1
e:1:1
f:1:1
e:4:1
f:3:1
f:5:1
b:1:1
e:1:1
e 1 1
b:1:1
e:1:1
b:1:1
DHT(b)
f:2:1
f:1:1
e:2:1
f:2:1
f:1:1
14
Example – Step 4Example – Step 4• Start top-down frequent itemset discovery from frequent item a: (frequent: ms*CL=0.3*10=
3)
c 2 1d 2 1e 2 1f 2 1
d 2 1e 3 1f 4 1
DHT(a)
a:3:1
DHT(c)e 3 1f 4 1
f 5 1DHT(d) DHT(e) DHT(f)
c:4:1 d:5:1 e:8:1 f:7:1
a:3:1
c:2:1
d:2:1
e:1:1
f:1:1
c:4:1
d:2:1
e:1:1
f:1:1
d:5:1
e:3:1
f:2:1
e:8:1
f:5:1
f:7:1
e:1:1
e 3 2f 1 2d 1 2
b:3:2
e:2:2
b:3:2
DHT(b)
f:2:1
f:1:1
e:2:1
f:2:1
f:1:1
f:1:2
d:1:2
e:1:2
Frequent: aMaximal candidate: cef
TSup(cef)=3, frequent{c, e, f, ce, cf, ef, cef}
Maximal candidate: defTwo-candidate: de, dfMaximal candidate: be
Tsup(ef)=2 not frequentde, df are frequent
Maximal candidate: ef
be is frequent
15
Upper Bound of SpaceUpper Bound of Space• i=1, 2, …, k• Nodes of all DHT(i):
– (k2-k)/2• Nodes of the set of CFI-trees:
–
k
j
j
i
ijiC
1
2/
1
2/
16
Maximal Estimated ErrorMaximal Estimated Error• (X, X.support, X.block_id)
– X.block_id=1: • Tsup(X)=Esup(X)
– X.block_id>1:• • k: the average size of block
kidblockXXEXT )1_.()sup()sup(
17
Performance (1)Performance (1)• IBM Dataset: T10.I5.D1M, T30.I20.D1M.• 20 blocks with size 50K.• Compare execution time and memory usage of
DSM-FI: