Wavelet decomposition of data streams
by Dragana Veljkovic
Motivation
• Continuous data streams arise naturally in:
  • telecommunication and internet traffic
  • retail and banking transactions
  • web server log records, etc.
• Many applications need this data to be processed around the clock (24/7) in a single pass
Motivation cont.
• Usually this data is accumulated and archived for later use, but not always (e.g. network security)
• The ability to make decisions and interpret interesting patterns online can be crucial and has real dollar value for large corporations (e.g. fraud detection)
Our motivation
• We are currently working on data collected from 100 electrodes recording the electrical potential of a monkey's brain over long periods of time
• We want to look at this data in real time and seek patterns, trends and surprises
Outline
• Background
  • streams
  • wavelets
  • sketches
  • error analysis
• Results
  • Implementation details
  • Strengths and weaknesses of this approach
Data streams
• An unbounded sequence of real-time data that arrives at a high rate and can be read only once by an application
• Problems:
  • Unbounded memory requirements
  • High data rate
Underlying signal
• The signal is a one-dimensional function a: [0, …, N-1] → Z+
• Each data item that arrives over time is an ordered pair: <domain, value>
Example: voting results <Texas, 60>
Example: phone call records <210-748, 12>
Data model
Two different data models used for rendering the underlying signal:
• Cash register
• Aggregate
Example: cash register model
<210-748, 10>, <210-689, 13>, <210-748, 20>, <210-740, 5>, <210-748, 2>, <210-740, 30>, …
where the underlying signal is
<210-748, 32>, <210-689, 13>, <210-740, 35>
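The example above can be reproduced with a few lines of Python. This is only a sketch of the cash-register idea, assuming each arriving pair adds its value to the running total for that domain item; the function name `aggregate` is made up for illustration.

```python
from collections import defaultdict

def aggregate(stream):
    """Sum cash-register updates per domain item (first-seen order is kept)."""
    signal = defaultdict(int)
    for domain, value in stream:
        signal[domain] += value      # each update adds to the current value
    return dict(signal)

# The stream from the example above
stream = [("210-748", 10), ("210-689", 13), ("210-748", 20),
          ("210-740", 5), ("210-748", 2), ("210-740", 30)]
signal = aggregate(stream)
print(signal)
```

In an aggregate stream each domain item would appear only once, already carrying its final value, so no summation would be needed.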
Stream format
Two distinct formats for the stream:
– Ordered
– Unordered
Example: Aggregate ordered stream – any time series
Example: Unordered cash-register stream – phone call records
An ordered cash-register stream is trivial to convert to an ordered aggregate stream
Wavelets
• Basis functions of limited duration and average value of zero
• Basis functions are shifted and scaled versions of the original wavelet
Discrete wavelet transform
• Uses only fixed values for wavelet scales, based on powers of two
• Wavelet positions are also fixed and non-overlapping
• Wavelets form a set of wavelet basis vectors of length N
Example: Haar wavelets on signal of length N = 8
• j = 1, …, log N levels
• k = 0, …, 2^(j-1) - 1 spaces for each level
Haar wavelets for signal of size 8
Wavelet decomposition
• Wavelet decomposition can be regarded as projection of the signal on the set of wavelet basis vectors
• Each wavelet coefficient can be computed as the dot product of the signal with the corresponding basis vector
Example:
Table 1. from Gilbert et al. 2003.
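A small sketch of the projection view, assuming an orthonormal Haar basis for a signal whose length is a power of two. The level indexing (level j has 2^(j-1) wavelets), the sign convention (plus on the first half of the support, minus on the second), the function names and the sample signal are all assumptions made for illustration, not values taken from the paper.

```python
def haar_basis(n):
    """Return n orthonormal Haar basis vectors for length n (a power of two)."""
    basis = [[n ** -0.5] * n]                  # scaling vector (normalized average)
    levels = n.bit_length() - 1                # log2(n)
    for j in range(1, levels + 1):
        support = n // 2 ** (j - 1)            # width of each wavelet at level j
        half = support // 2
        norm = support ** -0.5
        for k in range(2 ** (j - 1)):          # non-overlapping positions
            v = [0.0] * n
            start = k * support
            for i in range(start, start + half):
                v[i] = norm
            for i in range(start + half, start + support):
                v[i] = -norm
            basis.append(v)
    return basis

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def decompose(signal):
    """Each coefficient is the dot product of the signal with one basis vector."""
    return [dot(signal, psi) for psi in haar_basis(len(signal))]

def reconstruct(coeffs):
    """Rebuild the signal as the coefficient-weighted sum of basis vectors."""
    basis = haar_basis(len(coeffs))
    return [sum(c * psi[i] for c, psi in zip(coeffs, basis))
            for i in range(len(coeffs))]

signal = [2, 2, 0, 2, 3, 5, 4, 4]              # illustrative values only
coeffs = decompose(signal)
print([round(c, 3) for c in coeffs])
```

Because the basis is orthonormal, `reconstruct(coeffs)` returns the original signal and the sum of squared coefficients equals the energy of the signal.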
Best B-term decomposition
• The signal can be fully recovered from the wavelet decomposition
• Best B-term decomposition uses only a small number of coefficients, B, that carry the highest energy
• The signal reconstructed using the B-term coefficients and the corresponding vectors is called the best B-term approximation
• Most signals that occur in nature can be well approximated using only a small number of coefficients (5-10).
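The idea can be tried offline on a small signal. The sketch below assumes an orthonormal Haar transform computed by repeated pairwise sums and differences, and a signal length that is a power of two; the function names and the test signal are illustrative.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar transform via repeated pairwise sums and differences."""
    out = [float(x) for x in signal]
    n = len(out)
    while n > 1:
        half = n // 2
        sums = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = sums + diffs
        n = half
    return out

def inverse_haar(coeffs):
    """Invert haar_transform by undoing the levels from coarsest to finest."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for s, d in zip(out[:n], out[n:2 * n]):
            merged += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        out[:2 * n] = merged
        n *= 2
    return out

def best_b_term(signal, b):
    """Keep the b largest-magnitude coefficients, zero the rest, reconstruct."""
    coeffs = haar_transform(signal)
    keep = set(sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                      reverse=True)[:b])
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return inverse_haar(kept), kept

def sse(signal, approx):
    return sum((x - y) ** 2 for x, y in zip(signal, approx))

sig = [4, 4, 4, 4, 1, 1, 1, 1]
approx, kept = best_b_term(sig, 2)
print(approx, sse(sig, approx))
```

For this step-shaped signal two coefficients are enough to recover it; with an orthonormal basis the error of a B-term approximation equals the energy of the dropped coefficients.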
Computing best B-term decomposition in runtime
For the ordered aggregate model
• Maintain two sets of items
  • Highest B wavelet basis coefficients for the signal seen so far
  • log N straddling coefficients, one for each level
• When a data item is read, the affected straddling coefficients are updated.
• If a coefficient is no longer straddling, it is compared to the existing highest B coefficients and the set is updated if necessary. A new straddling coefficient is initialized.
• Takes O(B + logN) storage and time for the ordered aggregate model
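The steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes N is known in advance and is a power of two, uses orthonormal Haar wavelets, numbers levels from 0 (coarsest), and keeps the highest B coefficients in a min-heap. The class name `StreamingTopB` and the heap are choices made here.

```python
import heapq
import math

class StreamingTopB:
    def __init__(self, n, b):
        self.n, self.b = n, b
        self.levels = n.bit_length() - 1          # log2(n)
        self.straddling = [0.0] * self.levels     # one partial coefficient per level
        self.average = 0.0                        # scaling coefficient, done at the end
        self.top = []                             # min-heap of (|coeff|, level, k, coeff)
        self.next_index = 0

    def _offer(self, level, k, coeff):
        entry = (abs(coeff), level, k, coeff)
        if len(self.top) < self.b:
            heapq.heappush(self.top, entry)
        elif entry[0] > self.top[0][0]:           # beats the smallest kept coefficient
            heapq.heapreplace(self.top, entry)

    def update(self, value):
        """Consume the next value a(i) of an ordered aggregate stream."""
        i = self.next_index
        self.next_index += 1
        self.average += value / math.sqrt(self.n)
        for level in range(self.levels):
            support = self.n >> level             # level 0 spans the whole signal
            offset = i % support
            sign = 1.0 if offset < support // 2 else -1.0
            self.straddling[level] += sign * value / math.sqrt(support)
            if offset == support - 1:             # coefficient no longer straddles
                self._offer(level, i // support, self.straddling[level])
                self.straddling[level] = 0.0      # initialize the next one
        if self.next_index == self.n:
            self._offer(-1, 0, self.average)      # level -1 marks the average

tracker = StreamingTopB(n=8, b=2)
for value in [4, 4, 4, 4, 1, 1, 1, 1]:
    tracker.update(value)
for magnitude, level, k, coeff in sorted(tracker.top, reverse=True):
    print(level, k, round(coeff, 3))
```

Storage is the B heap entries plus one straddling value per level, which matches the O(B + log N) storage figure above.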
Sketches
• A sketch is made by projecting a signal onto several different low-dimensional spaces chosen at random
• Many properties of the signal, such as histograms, can be accurately estimated by looking at the sketch
Definition of a sketch
• An atomic sketch of signal a is the dot product <a, r>, where r is a random vector of ±1-valued random variables
• A sketch of a signal is k independent atomic sketches, each with a different random vector r_j
• Sketch size is small compared to the signal size
Sketches
• Maintaining the sketch is easy as we are receiving the data
• If element <i, a(i)> arrives, add a(i)*r_j^i to the sketch corresponding to random vector r_j, where r_j^i is the i-th entry of r_j
Example: In cash-register receive <5, 10>, need to add 10*r_j^5 to each atomic sketch corresponding to the random vector r_j
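A minimal sketch of this update rule. For illustration only, the random ±1 vectors are stored explicitly, which is feasible just because N is tiny here; the class name `Sketch`, the seed and the sample updates are assumptions.

```python
import random

class Sketch:
    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # k random ±1 vectors of length n (explicit storage is an illustration shortcut)
        self.r = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        self.atomic = [0] * k                    # atomic sketches <a, r_j>

    def update(self, i, value):
        """Process one cash-register item <i, value>."""
        for j, r_j in enumerate(self.r):
            self.atomic[j] += value * r_j[i]

sk = Sketch(n=8, k=4, seed=1)
a = [0] * 8                                      # the underlying signal, kept only to check
for i, value in [(5, 10), (2, 7), (5, 3)]:
    sk.update(i, value)
    a[i] += value

# Each atomic sketch equals the dot product of the accumulated signal with r_j
print(sk.atomic == [sum(x * s for x, s in zip(a, r_j)) for r_j in sk.r])
```

Because every update is a plain addition, the same code handles ordered and unordered streams.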
Error metrics
• SSE (sum squared error) – if R is a representation of the signal a then SSE is defined as ||a - R||_2^2 = Σ_i (a(i) - R(i))^2
• Pseudoenergy of the representation R is computed as
Query processing
• Batched – queries are posed at certain periodic intervals
• Ad hoc – a query may be posed at any time
Batch query using best B-term approximation for day 0 of call records
Figure 2. from Gilbert et al. 2003.
Batch query using best B-term approximation for all 7 days of call records
Figure 3. from Gilbert et al. 2003.
Estimating a point query
Answer to point query i is a(i)
• Direct point estimate – directly estimating a(i) using the sketch
• Direct wavelet estimate – use the sketch to estimate the wavelet coefficients whose support intersects i and reconstruct a(i) using these coefficients
• Another way is to compute a(i) using only the high wavelet coefficients (like the known B-term approximation) whose support intersects i
Using sketches to estimate dot product
• The following parameters characterize how well the sketch does
  • ε – distortion parameter
  • δ – failure probability
  • η – failure threshold
• A sketch of a signal is k independent atomic sketches, each with a different random vector
• If the cosine between vectors a and b is greater than η, we estimate the dot product within a factor of (1±ε) with probability at least 1-δ
Sketches and random vectors
• If element <i, a(i)> arrives, add a(i)*r_j^i to the sketch corresponding to random vector r_j
• In order to use the sketches we need to get the elements of r_j quickly.
• r_j is of size N, so it cannot be stored explicitly
Generating random vectors
• The paper shows that r_j^i can be generated by a pseudorandom number generator using a seed s_j of size log^O(1) N
• Generator G is based on second order Reed-Muller codes
• The generator G takes s_j and i and outputs r_j^i = G(s_j, i) quickly
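To show only the interface G(s_j, i), here is a stand-in that derives each ±1 entry from a hash of the seed and the index. It is not the Reed-Muller construction from the paper and does not carry its independence guarantees; the name `random_sign` and the use of BLAKE2 are assumptions made for illustration.

```python
import hashlib

def random_sign(seed, i):
    """Return a reproducible ±1 entry for position i without storing the vector."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1            # lowest bit picks the sign

# Entries are recomputed on demand, even for very large N
print(random_sign(7, 5), random_sign(7, 2 ** 40))
```

The point is that only the small seed is stored: any entry of any random vector can be regenerated when an item for position i arrives.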
Estimation of dot products using sketches
Lemma: Let X be an O(log N/δ)-wise median of O(1/ε^2)-wise means of independent copies of
then we have with probability of 1-δ
Note: use b=a to estimate energy of a using this lemma
Example: We want to estimate the dot product of vectors a and b with no more than 30% error with probability of 80%, assuming the cosine between these two vectors is greater than 0.25
That is, ε = 0.3, η = 0.25 and δ = 0.2, and for a signal of size N=1024 we would need about 30 atomic sketches
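A median-of-means estimator can be sketched as below. It assumes that each independent copy is the product <a, r><b, r> of two atomic sketches built with the same random vector r, which is the usual construction but is an assumption here. The group counts are free parameters of this sketch, not the constants from the lemma, and the random vectors are drawn fresh instead of coming from a seeded generator.

```python
import random
import statistics

def estimate_dot(a, b, groups, per_group, seed=0):
    """Median over `groups` of the mean of `per_group` copies of <a, r><b, r>."""
    rng = random.Random(seed)
    means = []
    for _ in range(groups):
        total = 0
        for _ in range(per_group):
            r = [rng.choice((-1, 1)) for _ in a]     # one random ±1 vector
            sketch_a = sum(x * s for x, s in zip(a, r))
            sketch_b = sum(y * s for y, s in zip(b, r))
            total += sketch_a * sketch_b             # unbiased estimate of <a, b>
        means.append(total / per_group)
    return statistics.median(means)

a = [5, 1, 0, 2, 7, 3, 0, 1]
b = [4, 0, 1, 2, 6, 3, 1, 0]
print(estimate_dot(a, b, groups=5, per_group=6), sum(x * y for x, y in zip(a, b)))
# With b = a the same routine estimates the energy of a
print(estimate_dot(a, a, groups=5, per_group=6), sum(x * x for x in a))
```

Averaging reduces the variance of each group and the median guards against the occasional bad group, which is why the lemma combines the two.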
Theorem
There is a streaming algorithm, A, such that, given a signal a[1, …, N] with energy ||a||_2^2, if there is a B-term representation with energy at least η*||a||_2^2, then, with probability at least (1-δ), A finds a representation of at most B terms with pseudoenergy at least (1-ε)η*||a||_2^2. If there is no such B-term representation with energy η*||a||_2^2, A reports “no good representation”. In any case A uses
space and per item time while processing the stream. This holds with both aggregate and cash-register models
Example: take η=0.3, δ=0.2, ε=0.3 and B=10. Then if there exists a 10-term representation of the signal that captures at least 30% of the signal’s energy, the algorithm will output a representation of at most 10 terms with pseudoenergy at least (1-0.3)*0.3 = 21% of the signal’s energy, with 80% probability
Strengths and weaknesses
• Good example of how to work with cash-register models
• Shows several ways to estimate the signal using a sketch
• Time requirements seem higher than the paper claims
• On-line algorithms do not seem as promising as batch algorithms
References
1. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "One-pass wavelet decomposition of data streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003.
2. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Surfing wavelets on streams: one-pass summaries for approximate aggregate queries," Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
3. A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Fast, small-space algorithms for approximate histogram maintenance," STOC ’02, May 19-21, 2002, Montreal, Quebec, Canada.
Answering queries on-line
Comparison of SSE/energy of top-B wavelets against direct estimates
Table 1. from Gilbert et al. 2003.
Table 2. from Gilbert et al. 2003.
Direct estimates for the top 10 heavy hitters
Figure 6. from Gilbert et al. 2003.
Direct estimates for the top 10 heavy hitters using the greedy algorithm
Figure 7. from Gilbert et al. 2003.
Adaptive greedy pursuit for heavy hitters
• Obtain a very accurate estimate for the first heavy hitter
• Get a new sketch by subtracting this value from the original sketch. This can be done because sketches are linear
• New sketch is a good estimation of the residual distribution in which the second heavy hitter is the peak value
• Use the new sketch to estimate the second heavy hitter
• Repeat procedure for more heavy hitters
• Each estimate introduces an error and after many iterations the errors tend to overwhelm the benefits
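The procedure above can be sketched on a toy signal. Several simplifications are assumed here: the ±1 vectors are stored explicitly, the point estimate of a(i) is a median of means of (atomic sketch j) * r_j^i, and the heavy hitter is found by scanning every position, which is only practical for a small N. All function names and the sample signal are made up for illustration.

```python
import random
import statistics

def make_vectors(n, k, seed=0):
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def sketch_of(signal, vectors):
    return [sum(x * s for x, s in zip(signal, r)) for r in vectors]

def point_estimate(sketch, vectors, i, per_group):
    """Median of means of sketch_j * r_j[i]; each copy is an unbiased estimate of a(i)."""
    copies = [s * r[i] for s, r in zip(sketch, vectors)]
    groups = [copies[g:g + per_group] for g in range(0, len(copies), per_group)]
    return statistics.median(sum(g) / len(g) for g in groups)

def greedy_heavy_hitters(sketch, vectors, n, count, per_group):
    residual = list(sketch)
    found = []
    for _ in range(count):
        estimates = [point_estimate(residual, vectors, i, per_group) for i in range(n)]
        i_best = max(range(n), key=lambda i: estimates[i])
        value = estimates[i_best]
        found.append((i_best, value))
        # Linearity: sketch(a - value * e_i) = sketch(a) - value * sketch(e_i)
        residual = [s - value * r[i_best] for s, r in zip(residual, vectors)]
    return found, residual

n = 16
vectors = make_vectors(n, k=40, seed=1)
a = [1, 0, 2, 90, 1, 0, 3, 1, 0, 60, 2, 1, 0, 1, 2, 0]
found, residual = greedy_heavy_hitters(sketch_of(a, vectors), vectors, n,
                                       count=2, per_group=8)
print(found)
```

Each subtracted value is itself only an estimate, so the residual sketch carries that estimation error into the next round, which is the accumulation effect noted in the last bullet.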