8/4/2019 Approximate Frequency Counts Over Data Streams
Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku (Stanford)
Rajeev Motwani (Stanford)
Presented by Michal Spivak, November 2003
The Problem
Stream
Identify all elements whose current frequency exceeds support threshold s = 0.1%.
Related problem
Stream
Identify all subsets of items whose current frequency exceeds s=0.1%
Purpose of this paper
Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages:
Simple
Low memory footprint
Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items
Overview
Introduction
Frequency counting applications
Problem definition
Algorithm for Frequent Items
Algorithm for Frequent Sets of Items
Experimental results
Summary
Introduction
Motivating examples
Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.
Association Rules: require computation of frequent itemsets.
Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
What's out there today
Algorithms that compute exact results attempt to minimize the number of data passes (the best algorithms take two passes).
Problems when adapted to streams:
Only one pass is allowed.
Results are expected to be available with short response time.
They fail to provide any a priori guarantee on the quality of their output.
Why Streams?
Streams vs. stored data:
The volume of a stream over its lifetime can be huge.
Queries over streams require timely answers; response times need to be small.
As a result it is not possible to store the stream in its entirety.
Frequency counting applications
Existing applications for the following problems
Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.
Association Rules: require computation of frequent itemsets.
Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
Iceberg Queries
Identify aggregates that exceed a user-specified threshold r.
One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.*
Basic idea:
In the first pass a set of counters is maintained. Each incoming item is hashed to one of the counters, which is incremented.
These counters are then compressed to a bitmap, with a 1 denoting a large counter value.
In the second pass exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1.
This algorithm is difficult to adapt for streams because it requires two passes.
*M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.
Association Rules - Definitions
Transaction: a subset of items drawn from I, the universe of all items.
An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions.
Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s.
Confidence of a rule X => Y is the value support(X ∪ Y) / support(X).
Example - Market basket analysis
Transaction Id | Purchased Items
1 | {1, 2, 3}
2 | {1, 4}
3 | {1, 3}
4 | {2, 5, 6}
For support = 50%, confidence = 50%, we have the following rules:
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
Reduce to computing frequent itemsets
TID | Items
1 | {1, 2, 3}
2 | {1, 3}
3 | {1, 4}
4 | {2, 5, 6}
Frequent Itemset | Support
{1} | 75%
{2} | 50%
{3} | 50%
{1, 3} | 50%
For support = 50%, confidence = 50%. For the rule 1 => 3:
Support = Support({1, 3}) = 50%
Confidence = Support({1, 3}) / Support({1}) = 66%
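The support and confidence computations above can be sketched in a few lines of Python; this is a toy illustration (the function names are mine, not from the paper):

```python
# Toy market-basket data from the slide above.
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({1, 3}, transactions))       # 0.5  (rule 1 => 3 has 50% support)
print(confidence({1}, {3}, transactions))  # 0.666...  (66% confidence)
```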
Toivonen's algorithm
Based on sampling of the data stream: in the first pass, frequencies are computed for a sample of the stream, and in the second pass the validity of these items is determined.
Can be adapted for data streams.
Problems:
- false negatives occur because the error in frequency counts is two-sided
- for small values of ε, the number of samples is enormous, ~1/ε (100 million samples)
Network flow identification
Flow: a sequence of transport-layer packets that share the same source and destination addresses.
Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. The algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries.
The algorithm presented in this paper is directly applicable to the problem of network flow identification. It beats their algorithm in terms of space requirements.
Problem definition
Problem Definition
The algorithm accepts two user-specified parameters:
- support threshold s ∈ (0, 1)
- error parameter ε ∈ (0, 1)
Approximation guarantees
1. All item(set)s whose true frequency exceeds sN are output. There are no false negatives.
2. No item(set) whose true frequency is less than (s − ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.
Input Example
s = 0.1%; as a rule of thumb, ε should be set to one-tenth or one-twentieth of s, so ε = 0.01%.
As per property 1, ALL elements with frequency exceeding 0.1% will be output.
As per property 2, NO element with frequency below 0.09% will be output.
Elements between 0.09% and 0.1% may or may not be output. Those that make their way in are false positives.
As per property 3, all estimated frequencies are less than their true frequencies by at most 0.01%.
Problem Definition cont.
An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties.
Goal: to devise algorithms that maintain an ε-deficient synopsis using as little main memory as possible.
The Algorithms for Frequent Items
Sticky Sampling
Lossy Counting
Sticky Sampling Algorithm
Stream
Create counters by sampling
Notations
Data structure S: a set of entries of the form (e, f).
f estimates the frequency of an element e.
r: the sampling rate. Sampling an element with rate r means we select the element with probability 1/r.
Sticky Sampling cont.
Initially S is empty and r = 1. For each incoming element e:
if (e exists in S)
  increment the corresponding f
else {
  sample the element with rate r
  if (sampled)
    add entry (e, 1) to S
  else
    ignore it
}
The sampling rate
Let t = (1/ε) log(1/(sδ)), where δ is the probability of failure.
The first 2t elements are sampled at rate = 1
The next 2t elements at rate = 2
The next 4t elements at rate = 4
and so on
Sticky Sampling cont.
Whenever the sampling rate r changes, for each entry (e, f) in S:
repeat {
  toss an unbiased coin
  if (toss is not successful) {
    diminish f by one
    if (f == 0) {
      delete the entry from S
      break
    }
  }
} until toss is successful
Sticky Sampling cont.
The number of unsuccessful coin tosses follows a geometric distribution. Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the beginning.
When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s − ε)N.
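The steps above can be collected into a compact Python sketch (a hedged illustration with my own class and variable names, not the authors' code):

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling for support s, error eps, failure prob delta."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps  # t = (1/eps) log(1/(s*delta))
        self.r = 1                 # current sampling rate
        self.n = 0                 # stream length N seen so far
        self.limit = 2 * self.t    # elements before the next rate doubling
        self.S = {}                # entries (e, f): element -> estimated frequency

    def add(self, e):
        self.n += 1
        if self.n > self.limit:    # rate change: 2t at r=1, 2t at r=2, 4t at r=4, ...
            self.r *= 2
            self.limit += self.r * self.t
            self._adjust()
        if e in self.S:
            self.S[e] += 1
        elif random.random() < 1.0 / self.r:   # sample with probability 1/r
            self.S[e] = 1

    def _adjust(self):
        # For each entry, diminish f once per unsuccessful unbiased coin toss,
        # deleting the entry if f reaches 0.
        for e in list(self.S):
            while random.random() < 0.5:       # toss not successful
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def output(self):
        """Entries with f >= (s - eps) * N."""
        return {e for e, f in self.S.items() if f >= (self.s - self.eps) * self.n}
```

On a skewed stream a heavy hitter enters S almost immediately at rate 1 and then loses only a geometrically distributed handful of counts per rate change, which is why it survives to the output with high probability.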
Theorem 1
Sticky Sampling computes an ε-deficient synopsis with probability at least 1 − δ using at most (2/ε) log(1/(sδ)) expected number of entries.
Theorem 1 - proof
The first 2t elements find their way into S.
When r ≥ 2: N = rt + rt' with t' ∈ [1, t) => 1/r ≥ t/N.
The error in the frequency of an element e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 − 1/r)^(εN) < (1 − t/N)^(εN) < e^(−εt).
The number of elements with frequency exceeding s is no more than 1/s => the probability that the estimate for any of them is deficient by εN is at most e^(−εt)/s.
Theorem 1 proof cont.
The probability of failure should be at most δ. This yields
e^(−εt)/s ≤ δ => t ≥ (1/ε) log(1/(sδ))
Since the space requirement is 2t, the space bound follows.
Sticky Sampling summary
The algorithm is called Sticky Sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S.
The space complexity is independent of N.
The idea of maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem.
This algorithm is different in that the sampling rate r increases logarithmically to produce ALL items with frequency above s, not just the top k.
Lossy Counting
bucket 1 | bucket 2 | bucket 3
Divide the stream into buckets.
Keep exact counters for items in the buckets.
Prune entries at bucket boundaries.
Lossy Counting cont.
A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined in Section 3 using at most (1/ε) log(εN) space, where N denotes the current length of the stream.
The user specifies two parameters:
- support s
- error ε
Definitions
The incoming stream is conceptually divided into buckets of width w = ceil(1/ε).
Buckets are labeled with bucket ids, starting from 1.
Denote the current bucket id by b_current, whose value is ceil(N/w).
Denote by f_e the true frequency of an element e in the stream seen so far.
Data structure D is a set of entries of the form (e, f, Δ).
The algorithm
Initially D is empty. Receive element e:
if (e exists in D)
  increment its frequency f by 1
else
  create a new entry (e, 1, b_current − 1)
At a bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ b_current.
When the user requests a list of items with threshold s, output those entries in D where f ≥ (s − ε)N.
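The steps above translate almost line for line into Python; a minimal sketch (names are mine, not from the paper):

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting for a stream of single items."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1.0 / eps)   # bucket width w = ceil(1/eps)
        self.n = 0                      # stream length N seen so far
        self.D = {}                     # entries: e -> (f, delta)

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)
        else:
            self.D[e] = (1, b_current - 1)     # new entry (e, 1, b_current - 1)
        if self.n % self.w == 0:               # bucket boundary: prune
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > b_current}    # delete if f + d <= b_current

    def output(self, s):
        """Entries with f >= (s - eps) * N."""
        return {e for e, (f, _) in self.D.items() if f >= (s - self.eps) * self.n}
```

Note that the algorithm is deterministic: on the same stream it always produces the same synopsis, unlike Sticky Sampling.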
Some algorithm facts
For an entry (e, f, Δ), f represents the exact frequency count of e ever since it was inserted into D.
The value Δ is the maximum number of times e could have occurred in the first b_current − 1 buckets (this value is exactly b_current − 1).
Once an entry is inserted into D, its value Δ remains unchanged.
Lossy counting in action
Initially D is empty. Frequency counts are accumulated over the first bucket. At the window boundary, entries for which f + Δ ≤ b_current are removed.
Lossy counting in action cont.
Frequency counts are accumulated over the next bucket. Again, at the window boundary, entries for which f + Δ ≤ b_current are removed.
Lemma 1
Whenever deletions occur, b_current ≤ εN.
Proof: deletions occur only at bucket boundaries, where N = b_current · w. Since w = ceil(1/ε) ≥ 1/ε:
εN = ε · b_current · w ≥ b_current
Lemma 2
Whenever an entry (e, f, Δ) gets deleted, f_e ≤ b_current.
Proof by induction.
Base case: b_current = 1. (e, f, Δ) is deleted only if f = 1, thus f_e ≤ b_current (since f_e = f).
Induction step:
- Consider an entry (e, f, Δ) that gets deleted for some b_current > 1.
- This entry was inserted when bucket Δ + 1 was being processed.
- An entry for e was deleted at the latest at the time bucket Δ became full.
- By induction, the true frequency of e up to that point was no more than Δ.
- f is the true frequency of e since this entry was inserted.
- f_e ≤ f + Δ, combined with the deletion rule f + Δ ≤ b_current => f_e ≤ b_current.
Lemma 3
If e does not appear in D, then f_e ≤ εN.
Proof: If the lemma is true for an element e whenever it gets deleted, it is true for all other N as well. From Lemmas 1 and 2 we infer that f_e ≤ εN whenever e gets deleted.
Lemma 4
If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN.
Proof: If Δ = 0, then f = f_e. Otherwise, e was possibly deleted in the first Δ buckets.
From Lemma 2, f_e ≤ f + Δ, and Δ ≤ b_current − 1 ≤ εN.
Conclusion: f ≤ f_e ≤ f + εN.
Lossy Counting cont.
Lemma 3 shows that all elements whose true frequency exceeds εN have entries in D.
Lemma 4 shows that the estimated frequencies of all such elements are accurate to within εN.
=> D correctly maintains an ε-deficient synopsis.
Theorem 2
Lossy Counting computes an ε-deficient synopsis using at most (1/ε) log(εN) entries.
Theorem 2 - proof
Let B = b_current. Let d_i denote the number of entries in D whose bucket id is B − i + 1 (i ∈ [1, B]).
The element e corresponding to such an entry must occur at least i times in buckets B − i + 1 through B; otherwise it would have been deleted.
We get the following constraint:
(1)  Σ_{i=1..j} i · d_i ≤ j · w   for j = 1, 2, ..., B
Theorem 2 proof cont.
The following inequality can be proved by induction:
Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i   for j = 1, 2, ..., B
|D| = Σ_{i=1..B} d_i, so from the above inequality:
|D| ≤ Σ_{i=1..B} w/i ≤ (1/ε) log B = (1/ε) log(εN)
Sticky Sampling vs. Lossy Counting
Support s = 1%, error ε = 0.1%
[chart: number of entries vs. log10 of N (stream length)]
Sticky Sampling vs. Lossy Counting cont.
[chart: number of entries vs. N (stream length)]
Kinks in the curve for Sticky Sampling correspond to re-sampling.
Kinks in the curve for Lossy Counting correspond to bucket boundaries.
Sticky Sampling vs. Lossy Counting cont.
SS - Sticky Sampling, LC - Lossy Counting
Zipf - Zipfian distribution, Uniq - stream with no duplicates

ε      | s     | SS worst | LC worst | SS Zipf | LC Zipf | SS Uniq | LC Uniq
0.1%   | 1.0%  | 27K      | 9K       | 6K      | 419     | 27K     | 1K
0.05%  | 0.5%  | 58K      | 17K      | 11K     | 709     | 58K     | 2K
0.01%  | 0.1%  | 322K     | 69K      | 37K     | 2K      | 322K    | 10K
0.005% | 0.05% | 672K     | 124K     | 62K     | 4K      | 672K    | 20K
Sticky Sampling vs. Lossy Counting summary
Lossy Counting is superior by a large factor.
Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled.
Lossy Counting is good at pruning low-frequency elements quickly.
Comparison with alternative approaches
Toivonen: a sampling algorithm for association rules.
Sticky Sampling beats that approach by roughly a factor of 1/ε.
Comparison with alternative approaches cont.
[KPS02]: In the first pass the algorithm maintains 1/ε elements with their frequencies. If a counter exists for an element it is increased; if there is a free counter the element is inserted; otherwise all existing counters are reduced by one.
Can be used to maintain an ε-deficient synopsis with exactly 1/ε space.
If the input stream is Zipfian, Lossy Counting takes less than 1/ε space: for ε = 0.01%, roughly 2000 entries, about 20% of 1/ε.
Frequent Sets of Items
From theory to Practice
Frequent Sets of Items
Stream
Identify all subsets of items whosecurrent frequency exceeds s = 0.1%.
Frequent itemsets algorithm
Input: a stream of transactions; each transaction is a set of items from I.
N: length of the stream.
The user specifies two parameters: support s and error ε.
Challenge:
- handling variable-sized transactions
- avoiding explicit enumeration of all subsets of any transaction
Notations
Data structure D: a set of entries of the form (set, f, Δ).
Transactions are divided into buckets.
w = ceil(1/ε): number of transactions in each bucket.
b_current: current bucket id.
Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.
β: number of buckets in main memory in the current batch being processed.
Update D
UPDATE_SET: for each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ b_current, delete the entry.
NEW_SET: if a set set has frequency f ≥ β in the current batch and does not occur in D, create a new entry (set, f, b_current − β).
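One batch step can be sketched in Python. Unlike the paper's SetGen/TRIE machinery, this toy version enumerates subsets explicitly (up to max_size items), which the real implementation carefully avoids; the function and parameter names are mine:

```python
from collections import Counter
from itertools import combinations

def process_batch(D, batch, b_current, beta, max_size=3):
    """Apply UPDATE_SET and NEW_SET for one batch spanning beta buckets.

    D maps frozenset -> (f, delta); batch is the list of transactions (sets)
    of the beta buckets that end at bucket id b_current.
    """
    counts = Counter()
    for t in batch:
        for k in range(1, min(max_size, len(t)) + 1):
            for sub in combinations(sorted(t), k):
                counts[frozenset(sub)] += 1
    # UPDATE_SET: update existing entries; delete those with f + delta <= b_current.
    for st in list(D):
        f, delta = D[st]
        f += counts.pop(st, 0)
        if f + delta <= b_current:
            del D[st]
        else:
            D[st] = (f, delta)
    # NEW_SET: sets with batch frequency f >= beta get entry (set, f, b_current - beta).
    for st, f in counts.items():
        if f >= beta:
            D[st] = (f, b_current - beta)
    return D
```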
Algorithm facts
If f_set ≥ εN, then set has an entry in D.
If (set, f, Δ) ∈ D, then the true frequency f_set satisfies the inequality f ≤ f_set ≤ f + Δ.
When a user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s − ε)N.
β needs to be a large number: any subset of I that occurs β + 1 times or more contributes to D.
Three modules
BUFFER: repeatedly reads a batch of transactions into available main memory.
TRIE: maintains the data structure D.
SUBSET-GEN: operates on the current batch of transactions, generating the subsets needed to implement UPDATE_SET and NEW_SET.
Module 1 - BUFFER
bucket 1 | bucket 2 | bucket 3 | bucket 4 | bucket 5 | bucket 6 (in main memory)
Reads a batch of transactions.
Transactions are laid out one after the other in a big array.
A bitmap is used to remember transaction boundaries.
After reading in a batch, BUFFER sorts each transaction by its item-ids.
Module 2 - TRIE
[figure: a forest over the item-ids 50, 40, 30, 31, 29, 45, 32, 42, flattened into the array 50 40 30 31 29 45 32 42 - sets with frequency counts]
Module 2 - TRIE cont.
Nodes are labeled {item-id, f, Δ, level}.
Children of any node are ordered by their item-ids.
Root nodes are also ordered by their item-ids.
A node represents the itemset consisting of the item-ids in that node and all its ancestors.
TRIE is maintained as an array of entries of the form {item-id, f, Δ, level}, corresponding to a pre-order traversal of the trees. This is equivalent to a lexicographic ordering of the subsets it encodes.
There are no pointers; the levels compactly encode the underlying tree structure.
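The pointer-free encoding can be illustrated with a small Python sketch: the levels in a pre-order array are enough to recover each node's itemset. The item-ids echo the figure's node labels; the f values are hypothetical, chosen for illustration:

```python
# Pre-order array of trie nodes (item_id, f, delta, level); the levels alone
# encode the tree shape, so no child pointers are needed.
trie = [
    (30, 40, 0, 0),   # itemset {30}
    (31, 29, 0, 1),   # itemset {30, 31}
    (32, 45, 0, 1),   # itemset {30, 32}
    (40, 50, 0, 0),   # itemset {40}
    (42, 50, 0, 1),   # itemset {40, 42}
]

def itemsets(trie):
    """Decode (itemset, f, delta) triples from the pre-order array."""
    path, out = [], []
    for item, f, delta, level in trie:
        del path[level:]          # drop ancestors at this level or deeper
        path.append(item)
        out.append((tuple(path), f, delta))
    return out
```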
Module 3 - SetGen
[figure: BUFFER feeding SetGen, which emits frequency counts of subsets in lexicographic order]
SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered.
Overall Algorithm
[figure: BUFFER feeds SUBSET-GEN, which combines with the current TRIE to produce a new TRIE]
Efficient Implementations - BUFFER
If item-ids are successive integers from 1 through |I|, and I is small enough (less than 1 million), maintain exact frequency counts for singleton sets, prune away those item-ids whose frequency is less than εN, and then sort the transactions.
If |I| = 10^5, the array size is 0.4 MB.
Efficient Implementations - TRIE
Take advantage of the fact that the sets produced by SetGen are lexicographically ordered:
Maintain TRIE as a set of fairly large chunks of memory instead of one huge array.
Instead of modifying the original TRIE, create a new TRIE.
Chunks from the old TRIE are freed as soon as they are no longer required.
By the time SetGen finishes, the chunks of the original TRIE have been discarded.
Efficient Implementations - SetGen
Employs a priority queue called Heap.
Initially Heap contains pointers to the smallest item-ids of all transactions in BUFFER.
Duplicate members are maintained together and constitute a single item in the Heap; all these pointers are chained together.
The space for these chains is derived from BUFFER by replacing item-ids with pointers.
Efficient Implementations - SetGen cont.
[figure: Heap of pointers into the BUFFER array]
Efficient Implementations - SetGen cont.
Repeatedly process the smallest item-id in Heap to generate singleton sets.
If the singleton belongs to TRIE after UPDATE_SET and NEW_SET, try to generate the next set by extending the current singleton set.
This is done by invoking SetGen recursively, with a new Heap created out of the successors of the pointers to the item-ids just processed and removed.
When the recursive call returns, the smallest entry in Heap is removed and all successors of the currently smallest item-id are added.
Efficient Implementations - SetGen cont.
[figure: the Heap during a recursive SetGen step]
System issues and optimizations
BUFFER scans the incoming stream by memory-mapping the input file.
The standard qsort is used to sort transactions.
Threading SetGen and BUFFER does not help, because SetGen is significantly slower.
The rate at which tries are scanned is much smaller than the rate at which sequential disk I/O can be done, so it is possible to maintain TRIE on disk without loss of performance.
System issues and optimizations - TRIE on disk, advantages
The size of TRIE is not limited by main memory, so the algorithm can function with a low amount of main memory.
Since most available main memory can be devoted to BUFFER, the algorithm can handle smaller values of ε than other algorithms can.
Novel features of this technique
No candidate generation phase.
The compact disk-based trie is novel.
Able to compute frequent itemsets under low memory conditions.
Able to handle smaller values of the support threshold than previously possible.
Experimental results
Experimental Results
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB
What is studied
Support threshold s
Number of transactions N
Size of BUFFER B
Total time taken t
with ε set to 0.1s
Varying buffer sizes and support s
[charts: time taken in seconds vs. support threshold s, for B = 4, 16, 28, 40 MB; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs]
Decreasing s leads to increases in running time.
Varying support s and buffer size B
[charts: time taken in seconds vs. BUFFER size in MB; left: IBM test dataset T10.I4.1000K for s = 0.001, 0.002, 0.004, 0.008; right: Reuters 806K docs for s = 0.004, 0.008, 0.012, 0.016, 0.020]
Kinks occur due to a TRIE optimization on the last batch.
Varying length N and support s
[charts: time taken in seconds vs. length of stream in thousands, for s = 0.001, 0.002, 0.004; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs]
Running time is linearly proportional to the length of the stream. The curves flatten at the end because processing the last batch is faster.
Comparison with Apriori

Support | Apriori      | Our Algorithm, 4 MB Buffer | Our Algorithm, 44 MB Buffer
0.001   | 99 s, 82 MB  | 111 s, 12 MB               | 27 s, 45 MB
0.002   | 25 s, 53 MB  | 94 s, 10 MB                | 15 s, 45 MB
0.004   | 14 s, 48 MB  | 65 s, 7 MB                 | 8 s, 45 MB
0.006   | 13 s, 48 MB  | 46 s, 6 MB                 | 6 s, 45 MB
0.008   | 13 s, 48 MB  | 34 s, 5 MB                 | 4 s, 45 MB
0.010   | 14 s, 48 MB  | 26 s, 5 MB                 | 4 s, 45 MB

Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.
Comparison with Iceberg Queries
Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory.
Our single-pass algorithm: 4500 seconds with 26 MB memory.
Summary
A novel algorithm for computing approximate frequency counts over data streams.
Summary - Advantages of the algorithms presented
Require provably small main memory footprints.
Each of the motivating examples can now be solved over streaming data.
Handle smaller values of the support threshold than previously possible.
Remain practical in environments with moderate main memory.
Summary cont.
Give an a priori error guarantee.
Work for variable-sized transactions.
Optimized implementation for frequent itemsets.
For the datasets tested, the algorithm runs in one pass and produces exact results, beating previous algorithms in terms of time.
Questions?
More questions/comments can be sent to
michal.spivak@sun.com