8/4/2019 Approximate Frequency Counts Over Data Streams
Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku (Stanford)
Rajeev Motwani (Stanford)
Presented by Michal Spivak, November 2003
The Problem
Stream
Identify all elements whose current frequency exceeds support threshold s = 0.1%.
Related problem
Stream
Identify all subsets of items whose current frequency exceeds s=0.1%
Purpose of this paper
Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages:
Simple
Low memory footprint
Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items
Overview
Introduction
Frequency counting applications
Problem definition
Algorithm for Frequent Items
Algorithm for Frequent Sets of Items
Experimental results
Summary
Introduction
Motivating examples
Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.
Association Rules: require computation of frequent itemsets.
Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
What's out there today
Algorithms that compute exact results attempt to minimize the number of data passes (the best algorithms take two passes).
Problems when adapted to streams:
Only one pass is allowed.
Results are expected to be available with short response time.
They fail to provide any a priori guarantee on the quality of their output.
Why Streams?
Streams vs. stored data:
The volume of a stream over its lifetime can be huge.
Queries over streams require timely answers; response times need to be small.
As a result it is not possible to store the stream in its entirety.
Frequency counting applications
Existing applications for the following problems
Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.
Association Rules: require computation of frequent itemsets.
Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.
Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.
Iceberg Queries
Identify aggregates that exceed a user-specified threshold r.
One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.*
Basic idea:
In the first pass a set of counters is maintained. Each incoming item is hashed to one of the counters, which is incremented.
These counters are then compressed to a bitmap, with a 1 denoting a large counter value.
In the second pass exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1.
This algorithm is difficult to adapt for streams because it requires two passes.
*M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.
Association Rules - Definitions
Transaction: a subset of items drawn from I, the universe of all items.
An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions.
Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s.
Confidence of a rule X => Y is the value support(X ∪ Y) / support(X).
Example - Market basket analysis
Transaction Id | Purchased Items
1 | {1, 2, 3}
2 | {1, 4}
3 | {1, 3}
4 | {2, 5, 6}
For support = 50%, confidence = 50%, we have the following rules:
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
Reduce to computing frequent itemsets
TID | Items
1 | {1, 2, 3}
2 | {1, 3}
3 | {1, 4}
4 | {2, 5, 6}
Frequent Itemset | Support
{1} | 75%
{2} | 50%
{3} | 50%
{1, 3} | 50%
For support = 50%, confidence = 50%. For the rule 1 => 3:
Support = Support({1, 3}) = 50%
Confidence = Support({1, 3}) / Support({1}) = 66%
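The support and confidence computations above can be sketched in a few lines of Python; this is a toy illustration (the function names are mine, not from the paper):

```python
# Toy market-basket data from the slide above.
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs: support(lhs | rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({1, 3}, transactions))       # 0.5  (rule 1 => 3 has 50% support)
print(confidence({1}, {3}, transactions))  # 0.666...  (66% confidence)
```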
Toivonen's algorithm
Based on sampling of the data stream: in the first pass, frequencies are computed for a sample of the stream, and in the second pass the validity of these items is determined.
Can be adapted for data streams.
Problems:
- false negatives occur because the error in frequency counts is two-sided
- for small values of ε, the number of samples is enormous, ~1/ε (100 million samples)
Network flow identification
Flow: a sequence of transport-layer packets that share the same source and destination addresses.
Estan and Verghese proposed an algorithm for identifying flows that exceed a certain threshold. The algorithm is a combination of repeated hashing and sampling, similar to those for iceberg queries.
The algorithm presented in this paper is directly applicable to the problem of network flow identification. It beats their algorithm in terms of space requirements.
Problem definition
Problem Definition
The algorithm accepts two user-specified parameters:
- support threshold s ∈ (0, 1)
- error parameter ε ∈ (0, 1)
Approximation guarantees
1. All item(set)s whose true frequency exceeds sN are output. There are no false negatives.
2. No item(set) whose true frequency is less than (s − ε)N is output.
3. Estimated frequencies are less than the true frequencies by at most εN.
Input Example
s = 0.1%; as a rule of thumb, ε should be set to one-tenth or one-twentieth of s, so ε = 0.01%.
As per property 1, ALL elements with frequency exceeding 0.1% will be output.
As per property 2, NO element with frequency below 0.09% will be output.
Elements between 0.09% and 0.1% may or may not be output. Those that make their way in are false positives.
As per property 3, all estimated frequencies are less than their true frequencies by at most 0.01%.
Problem Definition cont.
An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties.
Goal: to devise algorithms that maintain an ε-deficient synopsis using as little main memory as possible.
The Algorithms for Frequent Items
Sticky Sampling
Lossy Counting
Sticky Sampling Algorithm
Stream
Create counters by sampling
Notations
Data structure S: a set of entries of the form (e, f).
f estimates the frequency of an element e.
r: the sampling rate. Sampling an element with rate r means we select the element with probability 1/r.
Sticky Sampling cont.
Initially S is empty and r = 1. For each incoming element e:
if (e exists in S)
  increment the corresponding f
else {
  sample the element with rate r
  if (sampled)
    add entry (e, 1) to S
  else
    ignore it
}
The sampling rate
Let t = (1/ε) log(1/(sδ)), where δ is the probability of failure.
The first 2t elements are sampled at rate = 1
The next 2t elements at rate = 2
The next 4t elements at rate = 4
and so on
Sticky Sampling cont.
Whenever the sampling rate r changes, for each entry (e, f) in S:
repeat {
  toss an unbiased coin
  if (toss is not successful) {
    diminish f by one
    if (f == 0) {
      delete the entry from S
      break
    }
  }
} until toss is successful
Sticky Sampling cont.
The number of unsuccessful coin tosses follows a geometric distribution. Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the beginning.
When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s − ε)N.
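The steps above can be collected into a compact Python sketch (a hedged illustration with my own class and variable names, not the authors' code):

```python
import math
import random

class StickySampling:
    """Sketch of Sticky Sampling for support s, error eps, failure prob delta."""

    def __init__(self, s, eps, delta):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps  # t = (1/eps) log(1/(s*delta))
        self.r = 1                 # current sampling rate
        self.n = 0                 # stream length N seen so far
        self.limit = 2 * self.t    # elements before the next rate doubling
        self.S = {}                # entries (e, f): element -> estimated frequency

    def add(self, e):
        self.n += 1
        if self.n > self.limit:    # rate change: 2t at r=1, 2t at r=2, 4t at r=4, ...
            self.r *= 2
            self.limit += self.r * self.t
            self._adjust()
        if e in self.S:
            self.S[e] += 1
        elif random.random() < 1.0 / self.r:   # sample with probability 1/r
            self.S[e] = 1

    def _adjust(self):
        # For each entry, diminish f once per unsuccessful unbiased coin toss,
        # deleting the entry if f reaches 0.
        for e in list(self.S):
            while random.random() < 0.5:       # toss not successful
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def output(self):
        """Entries with f >= (s - eps) * N."""
        return {e for e, f in self.S.items() if f >= (self.s - self.eps) * self.n}
```

On a skewed stream a heavy hitter enters S almost immediately at rate 1 and then loses only a geometrically distributed handful of counts per rate change, which is why it survives to the output with high probability.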
Theorem 1
Sticky Sampling computes an ε-deficient synopsis with probability at least 1 − δ using at most (2/ε) log(1/(sδ)) expected number of entries.
Theorem 1 - proof
The first 2t elements find their way into S.
When r ≥ 2: N = rt + rt' with t' ∈ [1, t) => 1/r ≥ t/N.
The error in the frequency of an element e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 − 1/r)^(εN) < (1 − t/N)^(εN) < e^(−εt).
The number of elements with frequency exceeding s is no more than 1/s => the probability that the estimate for any of them is deficient by εN is at most e^(−εt)/s.
Theorem 1 proof cont.
The probability of failure should be at most δ. This yields
e^(−εt)/s ≤ δ => t ≥ (1/ε) log(1/(sδ))
Since the space requirement is 2t, the space bound follows.
Sticky Sampling summary
The algorithm is called Sticky Sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S.
The space complexity is independent of N.
The idea of maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem.
This algorithm is different in that the sampling rate r increases logarithmically to produce ALL items with frequency above s, not just the top k.
Lossy Counting
bucket 1 | bucket 2 | bucket 3
Divide the stream into buckets.
Keep exact counters for items in the buckets.
Prune entries at bucket boundaries.
Lossy Counting cont.
A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined in Section 3 using at most (1/ε) log(εN) space, where N denotes the current length of the stream.
The user specifies two parameters:
- support s
- error ε
Definitions
The incoming stream is conceptually divided into buckets of width w = ceil(1/ε).
Buckets are labeled with bucket ids, starting from 1.
Denote the current bucket id by b_current, whose value is ceil(N/w).
Denote by f_e the true frequency of an element e in the stream seen so far.
Data structure D is a set of entries of the form (e, f, Δ).
The algorithm
Initially D is empty. Receive element e:
if (e exists in D)
  increment its frequency f by 1
else
  create a new entry (e, 1, b_current − 1)
At a bucket boundary, prune D by the following rule: (e, f, Δ) is deleted if f + Δ ≤ b_current.
When the user requests a list of items with threshold s, output those entries in D where f ≥ (s − ε)N.
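The steps above translate almost line for line into Python; a minimal sketch (names are mine, not from the paper):

```python
import math

class LossyCounting:
    """Sketch of Lossy Counting for a stream of single items."""

    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1.0 / eps)   # bucket width w = ceil(1/eps)
        self.n = 0                      # stream length N seen so far
        self.D = {}                     # entries: e -> (f, delta)

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            f, delta = self.D[e]
            self.D[e] = (f + 1, delta)
        else:
            self.D[e] = (1, b_current - 1)     # new entry (e, 1, b_current - 1)
        if self.n % self.w == 0:               # bucket boundary: prune
            self.D = {x: (f, d) for x, (f, d) in self.D.items()
                      if f + d > b_current}    # delete if f + d <= b_current

    def output(self, s):
        """Entries with f >= (s - eps) * N."""
        return {e for e, (f, _) in self.D.items() if f >= (s - self.eps) * self.n}
```

Note that the algorithm is deterministic: on the same stream it always produces the same synopsis, unlike Sticky Sampling.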
Some algorithm facts
For an entry (e, f, Δ), f represents the exact frequency count of e ever since it was inserted into D.
The value Δ is the maximum number of times e could have occurred in the first b_current − 1 buckets (this value is exactly b_current − 1).
Once an entry is inserted into D, its value Δ remains unchanged.
Lossy counting in action
Initially D is empty. Frequency counts are accumulated over the first bucket. At the window boundary, entries for which f + Δ ≤ b_current are removed.
Lossy counting in action cont.
Frequency counts are accumulated over the next bucket. Again, at the window boundary, entries for which f + Δ ≤ b_current are removed.
Lemma 1
Whenever deletions occur, b_current ≤ εN.
Proof: deletions occur only at bucket boundaries, where N = b_current · w. Since w = ceil(1/ε) ≥ 1/ε:
εN = ε · b_current · w ≥ b_current
Lemma 2
Whenever an entry (e, f, Δ) gets deleted, f_e ≤ b_current.
Proof by induction.
Base case: b_current = 1. (e, f, Δ) is deleted only if f = 1, thus f_e ≤ b_current (since f_e = f).
Induction step:
- Consider an entry (e, f, Δ) that gets deleted for some b_current > 1.
- This entry was inserted when bucket Δ + 1 was being processed.
- An entry for e was deleted at the latest at the time bucket Δ became full.
- By induction, the true frequency of e up to that point was no more than Δ.
- f is the true frequency of e since this entry was inserted.
- f_e ≤ f + Δ, combined with the deletion rule f + Δ ≤ b_current => f_e ≤ b_current.
Lemma 3
If e does not appear in D, then f_e ≤ εN.
Proof: If the lemma is true for an element e whenever it gets deleted, it is true for all other N as well. From Lemmas 1 and 2 we infer that f_e ≤ εN whenever e gets deleted.
Lemma 4
If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN.
Proof: If Δ = 0, then f = f_e. Otherwise, e was possibly deleted in the first Δ buckets.
From Lemma 2, f_e ≤ f + Δ, and Δ ≤ b_current − 1 ≤ εN.
Conclusion: f ≤ f_e ≤ f + εN.
Lossy Counting cont.
Lemma 3 shows that all elements whose true frequency exceeds εN have entries in D.
Lemma 4 shows that the estimated frequencies of all such elements are accurate to within εN.
=> D correctly maintains an ε-deficient synopsis.
Theorem 2
Lossy Counting computes an ε-deficient synopsis using at most (1/ε) log(εN) entries.
Theorem 2 - proof
Let B = b_current. Let d_i denote the number of entries in D whose bucket id is B − i + 1 (i ∈ [1, B]).
The element e corresponding to such an entry must occur at least i times in buckets B − i + 1 through B; otherwise it would have been deleted.
We get the following constraint:
(1)  Σ_{i=1..j} i · d_i ≤ j · w   for j = 1, 2, ..., B
Theorem 2 proof cont.
The following inequality can be proved by induction:
Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i   for j = 1, 2, ..., B
|D| = Σ_{i=1..B} d_i, so from the above inequality:
|D| ≤ Σ_{i=1..B} w/i ≤ (1/ε) log B = (1/ε) log(εN)
Sticky Sampling vs. Lossy Counting
Support s = 1%, error ε = 0.1%
[chart: number of entries vs. log10 of N (stream length)]
Sticky Sampling vs. Lossy Counting cont.
[chart: number of entries vs. N (stream length)]
Kinks in the curve for Sticky Sampling correspond to re-sampling.
Kinks in the curve for Lossy Counting correspond to bucket boundaries.
Sticky Sampling vs. Lossy Counting cont.
SS - Sticky Sampling, LC - Lossy Counting
Zipf - Zipfian distribution, Uniq - stream with no duplicates

ε      | s     | SS worst | LC worst | SS Zipf | LC Zipf | SS Uniq | LC Uniq
0.1%   | 1.0%  | 27K      | 9K       | 6K      | 419     | 27K     | 1K
0.05%  | 0.5%  | 58K      | 17K      | 11K     | 709     | 58K     | 2K
0.01%  | 0.1%  | 322K     | 69K      | 37K     | 2K      | 322K    | 10K
0.005% | 0.05% | 672K     | 124K     | 62K     | 4K      | 672K    | 20K
Sticky Sampling vs. Lossy Counting summary
Lossy Counting is superior by a large factor.
Sticky Sampling performs worse because of its tendency to remember every unique element that gets sampled.
Lossy Counting is good at pruning low-frequency elements quickly.
Comparison with alternative approaches
Toivonen: a sampling algorithm for association rules.
Sticky Sampling beats that approach by roughly a factor of 1/ε.
Comparison with alternative approaches cont.
[KPS02]: In the first pass the algorithm maintains 1/ε elements with their frequencies. If a counter exists for an element it is increased; if there is a free counter the element is inserted; otherwise all existing counters are reduced by one.
Can be used to maintain an ε-deficient synopsis with exactly 1/ε space.
If the input stream is Zipfian, Lossy Counting takes less than 1/ε space: for ε = 0.01%, roughly 2000 entries, about 20% of 1/ε.
Frequent Sets of Items
From theory to Practice
Frequent Sets of Items
Stream
Identify all subsets of items whosecurrent frequency exceeds s = 0.1%.
Frequent itemsets algorithm
Input: a stream of transactions; each transaction is a set of items from I.
N: length of the stream.
The user specifies two parameters: support s and error ε.
Challenge:
- handling variable-sized transactions
- avoiding explicit enumeration of all subsets of any transaction
Notations
Data structure D: a set of entries of the form (set, f, Δ).
Transactions are divided into buckets.
w = ceil(1/ε): number of transactions in each bucket.
b_current: current bucket id.
Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.
β: number of buckets in main memory in the current batch being processed.
Update D
UPDATE_SET: for each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ b_current, delete the entry.
NEW_SET: if a set set has frequency f ≥ β in the current batch and does not occur in D, create a new entry (set, f, b_current − β).
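One batch step can be sketched in Python. Unlike the paper's SetGen/TRIE machinery, this toy version enumerates subsets explicitly (up to max_size items), which the real implementation carefully avoids; the function and parameter names are mine:

```python
from collections import Counter
from itertools import combinations

def process_batch(D, batch, b_current, beta, max_size=3):
    """Apply UPDATE_SET and NEW_SET for one batch spanning beta buckets.

    D maps frozenset -> (f, delta); batch is the list of transactions (sets)
    of the beta buckets that end at bucket id b_current.
    """
    counts = Counter()
    for t in batch:
        for k in range(1, min(max_size, len(t)) + 1):
            for sub in combinations(sorted(t), k):
                counts[frozenset(sub)] += 1
    # UPDATE_SET: update existing entries; delete those with f + delta <= b_current.
    for st in list(D):
        f, delta = D[st]
        f += counts.pop(st, 0)
        if f + delta <= b_current:
            del D[st]
        else:
            D[st] = (f, delta)
    # NEW_SET: sets with batch frequency f >= beta get entry (set, f, b_current - beta).
    for st, f in counts.items():
        if f >= beta:
            D[st] = (f, b_current - beta)
    return D
```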
Algorithm facts
If f_set ≥ εN, then set has an entry in D.
If (set, f, Δ) ∈ D, then the true frequency f_set satisfies the inequality f ≤ f_set ≤ f + Δ.
When a user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s − ε)N.
β needs to be a large number: any subset of I that occurs β + 1 times or more contributes to D.
Three modules
BUFFER: repeatedly reads a batch of transactions into available main memory.
TRIE: maintains the data structure D.
SUBSET-GEN: operates on the current batch of transactions, generating the subsets needed to implement UPDATE_SET and NEW_SET.
Module 1 - BUFFER
bucket 1 | bucket 2 | bucket 3 | bucket 4 | bucket 5 | bucket 6 (in main memory)
Reads a batch of transactions.
Transactions are laid out one after the other in a big array.
A bitmap is used to remember transaction boundaries.
After reading in a batch, BUFFER sorts each transaction by its item-ids.
Module 2 - TRIE
[figure: a forest over the item-ids 50, 40, 30, 31, 29, 45, 32, 42, flattened into the array 50 40 30 31 29 45 32 42 - sets with frequency counts]
Module 2 - TRIE cont.
Nodes are labeled {item-id, f, Δ, level}.
Children of any node are ordered by their item-ids.
Root nodes are also ordered by their item-ids.
A node represents the itemset consisting of the item-ids in that node and all its ancestors.
TRIE is maintained as an array of entries of the form {item-id, f, Δ, level}, corresponding to a pre-order traversal of the trees. This is equivalent to a lexicographic ordering of the subsets it encodes.
There are no pointers; the levels compactly encode the underlying tree structure.
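The pointer-free encoding can be illustrated with a small Python sketch: the levels in a pre-order array are enough to recover each node's itemset. The item-ids echo the figure's node labels; the f values are hypothetical, chosen for illustration:

```python
# Pre-order array of trie nodes (item_id, f, delta, level); the levels alone
# encode the tree shape, so no child pointers are needed.
trie = [
    (30, 40, 0, 0),   # itemset {30}
    (31, 29, 0, 1),   # itemset {30, 31}
    (32, 45, 0, 1),   # itemset {30, 32}
    (40, 50, 0, 0),   # itemset {40}
    (42, 50, 0, 1),   # itemset {40, 42}
]

def itemsets(trie):
    """Decode (itemset, f, delta) triples from the pre-order array."""
    path, out = [], []
    for item, f, delta, level in trie:
        del path[level:]          # drop ancestors at this level or deeper
        path.append(item)
        out.append((tuple(path), f, delta))
    return out
```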
Module 3 - SetGen
[figure: BUFFER feeding SetGen, which emits frequency counts of subsets in lexicographic order]
SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered.
Overall Algorithm
[figure: BUFFER feeds SUBSET-GEN, which combines with the current TRIE to produce a new TRIE]
Efficient Implementations - BUFFER
If item-ids are successive integers from 1 through |I|, and I is small enough (less than 1 million), maintain exact frequency counts for singleton sets, prune away those item-ids whose frequency is less than εN, and then sort the transactions.
If |I| = 10^5, the array size is 0.4 MB.
Efficient Implementations - TRIE
Take advantage of the fact that the sets produced by SetGen are lexicographically ordered:
Maintain TRIE as a set of fairly large chunks of memory instead of one huge array.
Instead of modifying the original TRIE, create a new TRIE.
Chunks from the old TRIE are freed as soon as they are no longer required.
By the time SetGen finishes, the chunks of the original TRIE have been discarded.
Efficient Implementations - SetGen
Employs a priority queue called Heap.
Initially Heap contains pointers to the smallest item-ids of all transactions in BUFFER.
Duplicate members are maintained together and constitute a single item in the Heap; all these pointers are chained together.
The space for these chains is derived from BUFFER by replacing item-ids with pointers.
Efficient Implementations - SetGen cont.
[figure: Heap of pointers into the BUFFER array]
Efficient Implementations - SetGen cont.
Repeatedly process the smallest item-id in Heap to generate singleton sets.
If the singleton belongs to TRIE after UPDATE_SET and NEW_SET, try to generate the next set by extending the current singleton set.
This is done by invoking SetGen recursively, with a new Heap created out of the successors of the pointers to the item-ids just processed and removed.
When the recursive call returns, the smallest entry in Heap is removed and all successors of the currently smallest item-id are added.
Efficient Implementations - SetGen cont.
[figure: the Heap during a recursive SetGen step]
System issues and optimizations
BUFFER scans the incoming stream by memory-mapping the input file.
The standard qsort is used to sort transactions.
Threading SetGen and BUFFER does not help, because SetGen is significantly slower.
The rate at which tries are scanned is much smaller than the rate at which sequential disk I/O can be done, so it is possible to maintain TRIE on disk without loss of performance.
System issues and optimizations - TRIE on disk, advantages
The size of TRIE is not limited by main memory, so the algorithm can function with a low amount of main memory.
Since most available main memory can be devoted to BUFFER, the algorithm can handle smaller values of ε than other algorithms can.
Novel features of this technique
No candidate generation phase.
The compact disk-based trie is novel.
Able to compute frequent itemsets under low memory conditions.
Able to handle smaller values of the support threshold than previously possible.
Experimental results
Experimental Results
IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB
IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB
Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB
Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB
What is studied
Support threshold s
Number of transactions N
Size of BUFFER B
Total time taken t
with ε set to 0.1s
Varying buffer sizes and support s
[charts: time taken in seconds vs. support threshold s, for B = 4, 16, 28, 40 MB; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs]
Decreasing s leads to increases in running time.
Varying support s and buffer size B
[charts: time taken in seconds vs. BUFFER size in MB; left: IBM test dataset T10.I4.1000K for s = 0.001, 0.002, 0.004, 0.008; right: Reuters 806K docs for s = 0.004, 0.008, 0.012, 0.016, 0.020]
Kinks occur due to a TRIE optimization on the last batch.
Varying length N and support s
[charts: time taken in seconds vs. length of stream in thousands, for s = 0.001, 0.002, 0.004; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs]
Running time is linearly proportional to the length of the stream. The curves flatten at the end because processing the last batch is faster.
Comparison with Apriori

Support | Apriori      | Our Algorithm, 4 MB Buffer | Our Algorithm, 44 MB Buffer
0.001   | 99 s, 82 MB  | 111 s, 12 MB               | 27 s, 45 MB
0.002   | 25 s, 53 MB  | 94 s, 10 MB                | 15 s, 45 MB
0.004   | 14 s, 48 MB  | 65 s, 7 MB                 | 8 s, 45 MB
0.006   | 13 s, 48 MB  | 46 s, 6 MB                 | 6 s, 45 MB
0.008   | 13 s, 48 MB  | 34 s, 5 MB                 | 4 s, 45 MB
0.010   | 14 s, 48 MB  | 26 s, 5 MB                 | 4 s, 45 MB

Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.
Comparison with Iceberg Queries
Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.
[FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory.
Our single-pass algorithm: 4500 seconds with 26 MB memory.
Summary
A novel algorithm for computing approximate frequency counts over data streams.
Summary - Advantages of the algorithms presented
Require provably small main memory footprints.
Each of the motivating examples can now be solved over streaming data.
Handle smaller values of the support threshold than previously possible.
Remain practical in environments with moderate main memory.
Summary cont.
Give an a priori error guarantee.
Work for variable-sized transactions.
Optimized implementation for frequent itemsets.
For the datasets tested, the algorithm runs in one pass and produces exact results, beating previous algorithms in terms of time.
Questions?
More questions/comments can be sent to
michal.spivak@sun.com