CMU SCS
15-721 DB Sys. Design & Impl.
Sensors & wavelets
Christos Faloutsos
www.cs.cmu.edu/~christos
15-721 C. Faloutsos 2
Roadmap
1) Roots: System R and Ingres
2) Implementation: buffering, indexing, q-opt
<...>
7) Data Analysis - data mining
...
sensors, time series, indexing and wavelets
sensors and forecasting
8) Benchmarks
9) vision statements
extras (streams/sensors, graphs, multimedia, web, fractals)
Citation
C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996
– chapter on GEMINI
– chapter on FastMap
– Appendices on DFT and wavelets
Outline
• Motivation
• Similarity Search and Indexing
• DSP (Digital Signal Processing)
• Linear Forecasting
• Bursty traffic - fractals and multifractals
• Non-linear forecasting
• Conclusions
Problem definition
• Given: one or more sequences
x1, x2, …, xt, …
(y1, y2, …, yt, …)
• Find
– similar sequences; forecasts
– patterns; clusters; outliers
Motivation - Applications
• Financial, sales, economic series
• Medical
– ECGs +; blood pressure etc monitoring
– reactions to new drugs
– elderly care
Motivation - Applications (cont’d)
• ‘Smart house’
– sensors monitor temperature, humidity, air quality
• video surveillance
• civil/automobile infrastructure
– bridge vibrations [Oppenheim+02]
– road conditions / traffic monitoring
Stream Data: automobile traffic
[Figure: automobile traffic; # cars vs. time]
Motivation - Applications (cont’d)
• Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
Stream Data: Sunspots
[Figure: sunspot data; #sunspots per month (0 to 300) vs. time]
Motivation - Applications (cont’d)
• Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
Stream Data: Disk accesses
[Figure: disk traffic; #bytes vs. time]
Settings & Applications
• One or more sensors, collecting time-series data
Each sensor collects data (x1, x2, …, xt, …)
Some sensors ‘report’ to others or to the central site
Goal #1: Finding patterns in a single time sequence
Goal #2: Finding patterns in many time sequences
Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress
[Figure: count vs. year; lynx caught per year (also: packets per day; temperature per day)]
Problem #2: Forecast
Given xt, xt-1, …, forecast xt+1
[Figure: number of packets sent vs. time tick (1 to 11); the next value is marked ‘??’]
Problem #2’: Similarity search
E.g., find a 3-tick pattern similar to the last one
[Figure: number of packets sent vs. time tick (1 to 11); the next value is marked ‘??’]
Problem #3:
• Given: a set of correlated time sequences
• Forecast ‘Sent(t)’
[Figure: number of packets (sent, lost, repeated) vs. time tick (1 to 11)]
Differences from DSP/Stat
• Semi-infinite streams
– we need on-line, ‘any-time’ algorithms
• Cannot afford human intervention
– need automatic methods
• sensors have limited memory / processing / transmitting power
– need for (lossy) compression
Important observations
Patterns, rules, forecasting and similarity indexing are closely related:
• To do forecasting, we need
– to find patterns/rules
– to find similar settings in the past
• to find outliers, we need to have forecasts
– (outlier = too far away from our forecast)
Important topics NOT in this tutorial:
• Continuous queries
– [Babu+Widom] [Gehrke+] [Madden+03]
• Categorical data streams
– [Hatonen+96]
• Outlier detection (discontinuities)
– [Breunig+00]
Outline
• Motivation
• Similarity Search and Indexing
– distance functions: Euclidean; Time-warping
– indexing
– feature extraction
• DSP
• ...
Importance of distance functions
Subtle, but absolutely necessary:
• A ‘must’ for similarity indexing (-> forecasting)
• A ‘must’ for clustering
Two major families
– Euclidean and Lp norms
– Time warping and variations
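Euclidean and Lp
$$D(\vec{x}, \vec{y}) = \sum_{i=1}^{n} (x_i - y_i)^2$$
$$L_p(\vec{x}, \vec{y}) = \sum_{i=1}^{n} |x_i - y_i|^p$$
(written here without the square root or p-th root; taking the root does not change which sequences are closest)
[Figure: two sequences x(t) and y(t)]
• L1: city-block = Manhattan
• L2 = Euclidean
• L∞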
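A minimal Python sketch of the Lp family may help here. It assumes equal-length numeric sequences and applies the p-th root, so L2 is the usual Euclidean distance; the function name `lp_distance` and the sample values are illustrative, not from the slides.

```python
import math

def lp_distance(x, y, p=2.0):
    """L_p distance between two equal-length sequences (with the p-th root)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if math.isinf(p):
        # L-infinity: the largest coordinate-wise difference
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
l1 = lp_distance(x, y, p=1)            # city-block = Manhattan
l2 = lp_distance(x, y, p=2)            # Euclidean
linf = lp_distance(x, y, p=math.inf)   # maximum difference
```

Dropping the root, as in the sums above, leaves nearest-neighbor rankings unchanged, so either form can serve for similarity search.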
Observation #1
• Time sequence -> n-d vector
[Figure: axes Day-1, Day-2, ..., Day-n]
Observation #2
Euclidean distance is closely related to
– cosine similarity
– dot product
– ‘cross-correlation’ function
[Figure: axes Day-1, Day-2, ..., Day-n]
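One way to see the connection is the short derivation below, using the squared Euclidean distance. The second line assumes both sequences are normalized to unit length, which is an added assumption and not stated on the slide.

```latex
\begin{align*}
\|\vec{x}-\vec{y}\|^2 &= \sum_{i=1}^{n}(x_i-y_i)^2
  = \|\vec{x}\|^2 + \|\vec{y}\|^2 - 2\,\vec{x}\cdot\vec{y} \\
\text{if } \|\vec{x}\|=\|\vec{y}\|=1:\quad
\|\vec{x}-\vec{y}\|^2 &= 2\,\bigl(1-\cos(\vec{x},\vec{y})\bigr)
\end{align*}
```

So for normalized sequences, a small Euclidean distance and a large dot product (or cosine similarity) say the same thing.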
Time Warping
• allow accelerations - decelerations
– (with or w/o penalty)
• THEN compute the (Euclidean) distance (+penalty)
• related to the string-editing distance
‘stutters’: [Figure: two sequences matched with stutters]
Time warping
Q: how to compute it?
A: dynamic programming
D(i, j) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y
Time warping
Thus, with no penalty for stutter, for sequences x1, x2, …, xi; y1, y2, …, yj:
$$D(i, j) = \| x[i] - y[j] \| + \min \left\{ \begin{array}{ll} D(i-1, j-1) & \text{no stutter} \\ D(i, j-1) & \text{x-stutter} \\ D(i-1, j) & \text{y-stutter} \end{array} \right.$$
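The recurrence can be sketched in a few lines of Python. This is a minimal illustration, assuming scalar-valued sequences, absolute difference as the per-element cost, and no stutter penalty; the function name `time_warping_distance` is hypothetical.

```python
def time_warping_distance(x, y):
    """Dynamic-programming time-warping cost with no penalty for stutters."""
    inf = float("inf")
    n, m = len(x), len(y)
    # D[i][j] = cost to match the first i elements of x with the first j of y
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1],  # no stutter
                                 D[i][j - 1],      # x-stutter
                                 D[i - 1][j])      # y-stutter
    return D[n][m]

# A sequence and a "stuttered" copy of it match at zero cost.
d_same_shape = time_warping_distance([1, 2, 3], [1, 2, 2, 3])
d_shifted = time_warping_distance([1, 2, 3], [2, 3, 4])
```

The two nested loops fill an (n+1) by (m+1) table, which is where the quadratic cost comes from.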
Time warping
• Complexity: O(M*N) - quadratic on the length of the strings
• Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
• popular in voice processing [Rabiner+Juang]
Other Distance functions
• piece-wise linear/flat approx.; compare pieces [Keogh+01] [Faloutsos+97]
• ‘cepstrum’ (for voice [Rabiner+Juang])
– do DFT; take log of amplitude; do DFT again!
• Allow for small gaps [Agrawal+95]
See tutorial by [Gunopulos+Das, SIGMOD01]
Conclusions
Prevailing distances:
– Euclidean and
– time-warping
Indexing
Problem:
• given a set of time sequences,
• find the ones similar to a desirable query sequence
[Figure: three plots of $price vs. day (1 to 365)]
distance function: by expert
Idea: ‘GEMINI’
E.g., ‘find stocks similar to MSFT’
Seq. scanning: too slow
How to accelerate the search?
‘GEMINI’ - Pictorially
[Figure: sequences S1, ..., Sn (day 1 to 365) mapped to points F(S1), ..., F(Sn) in feature space; example features: avg, std]
GEMINI
Solution: ‘Quick-and-dirty’ filter:
• extract n features (numbers, e.g., avg., etc.)
• map into a point in n-d feature space
• organize points with off-the-shelf spatial access method (‘SAM’)
• discard false alarms
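The filter-and-refine steps above can be sketched as follows. This is a hedged illustration, not the GEMINI implementation: it uses two features (average and standard deviation, each scaled by sqrt(n)) and a linear scan over the feature points stands in for the spatial access method. With this particular scaling the feature-space distance never exceeds the true Euclidean distance, so the filter produces false alarms but no false dismissals. All names and data are hypothetical.

```python
import math

def features(seq):
    """Map a sequence to a 2-d point: (sqrt(n)*mean, sqrt(n)*std)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)  # population std
    return (math.sqrt(n) * mean, math.sqrt(n) * std)

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def range_query(database, query, eps):
    """Return indices of sequences within eps of the query."""
    fq = features(query)
    # Filter step: a linear scan over feature points stands in for the SAM.
    candidates = [i for i, s in enumerate(database)
                  if euclidean(features(s), fq) <= eps]
    # Refine step: discard false alarms using the true distance.
    return [i for i in candidates if euclidean(database[i], query) <= eps]

db = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],      # same avg and std as the query: a false alarm
    [10.0, 10.0, 10.0, 10.0],
]
hits = range_query(db, [1.0, 2.0, 3.0, 4.0], eps=0.5)
```

The third sequence passes the filter because its features equal the query's, and is then removed in the refine step.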
Examples of GEMINI
• Time sequences: DFT (up to 100 times faster) [SIGMOD94];
• [Kanellakis+], [Mendelzon+]
Even on other-than-sequence data:
• Images (QBIC) [JIIS94]
• tumor-like shapes [VLDB96]
• video [Informedia + S-R-trees]
• automobile part shapes [Kriegel+97]
Indexing - SAMs
Q: How do Spatial Access Methods (SAMs) work?
A: they group nearby points (or regions) together, on nearby disk pages, and answer spatial queries quickly (‘range queries’, ‘nearest neighbor’ queries, etc.)
For example:
R-trees
• [Guttman84] e.g., w/ fanout 4: group nearby rectangles to parent MBRs; each group -> disk page
[Figure: rectangles A through J in the plane]
[Figure: rectangles A through J grouped into parent MBRs P1 to P4; the leaf groups are {A, B, C}, {D, E}, {F, G}, {H, I, J}]
[Figure: the resulting R-tree; the root page holds P1, P2, P3, P4, each pointing to the leaf page of its group]
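The pruning idea behind the figures can be sketched with a two-level structure: each parent holds the minimum bounding rectangle (MBR) of its children, and a range query skips any page whose MBR misses the query window. This is a simplified illustration, not Guttman's insertion or split algorithm; the coordinates and the grouping below are made up for the example.

```python
def mbr(rects):
    """Minimum bounding rectangle of rectangles given as (xlo, ylo, xhi, yhi)."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(pages, window):
    """pages: dict parent-name -> {child-name: rect}. Returns (pages read, answers)."""
    visited, answers = [], []
    for parent, children in pages.items():
        if not intersects(mbr(list(children.values())), window):
            continue  # prune: no child of this page can qualify
        visited.append(parent)
        answers += [name for name, r in children.items() if intersects(r, window)]
    return visited, answers

# Hypothetical coordinates; only the grouping idea follows the slides.
pages = {
    "P1": {"A": (0, 0, 1, 1), "B": (1, 2, 2, 3), "C": (0, 3, 1, 4)},
    "P2": {"D": (5, 0, 6, 1), "E": (6, 1, 7, 2)},
}
visited, answers = range_query(pages, (0.5, 0.5, 1.5, 2.5))
```

Only page P1 is read for this window; P2 is skipped after a single MBR test.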
Conclusions
• Fast indexing: through GEMINI
– feature extraction and
– (off the shelf) Spatial Access Methods [Gaede+98]
Outline
• Motivation
• Similarity Search and Indexing
– distance functions
– indexing
– feature extraction
• DFT, DWT, DCT (data independent)
• SVD, etc (data dependent)
• MDS, FastMap
DFT and cousins
• very good for compressing real signals
• more details on DFT/DCT/DWT: later
DFT and stocks
• Dow Jones Industrial index, 6/18/2001-12/21/2001
• just 3 DFT coefficients give very good approximation
[Figure: actual index vs. Fourier approximation, trading days 1 to 121]
[Figure: amplitude spectrum, Log(ampl) vs. freq]
SVD
• THE optimal method for dimensionality reduction
– (under the Euclidean metric)
Singular Value Decomposition (SVD)
• SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)
LSI: S. Dumais; M. Berry
KL: e.g., Duda+Hart
PCA: e.g., Jolliffe
Details: [Press+], [Faloutsos96]
[Figure: points plotted on axes day1 and day2]
SVD
• Extremely useful tool
– (also behind PageRank/Google and Kleinberg’s algorithm for hubs and authorities)
• But may be slow: O(N * M * M) if N>M
• any approximate, faster method?
SVD shortcuts
• random projections (Johnson-Lindenstrauss thm [Papadimitriou+ PODS98])
Random projections
• pick ‘enough’ random directions (will be ~orthogonal, in high-d!!)
• distances are preserved probabilistically, within epsilon
• (also, use as a pre-processing step for SVD [Papadimitriou+ PODS98])
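A small Python sketch of the idea, under stated assumptions: directions are drawn with i.i.d. Gaussian entries and the result is scaled by 1/sqrt(k), a common construction for Johnson-Lindenstrauss style projections. The dimensions (1000 down to 200) and the data are arbitrary choices for illustration.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points onto k random Gaussian directions."""
    rng = random.Random(seed)
    d = len(points[0])
    # Each direction has i.i.d. N(0, 1) entries; scaling by 1/sqrt(k)
    # keeps squared distances unchanged in expectation.
    dirs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(r * v for r, v in zip(row, p)) for row in dirs]
            for p in points]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
pts = [[rng.random() for _ in range(1000)] for _ in range(5)]
low = random_projection(pts, k=200)
# Ratio of projected to original distance; expected to be close to 1.
ratio = dist(low[0], low[1]) / dist(pts[0], pts[1])
```

How many directions count as ‘enough’ depends on the number of points and the tolerated epsilon; the sketch does not try to pick k automatically.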
Feature extraction - w/ fractals
• Main idea: drop those attributes that don’t affect the intrinsic (‘fractal’) dimensionality [Traina+, SBBD 2000]
• i.e., drop attributes that depend on others (linearly or non-linearly!)
[Figure: three 2-d datasets on x-y axes (0 to 1): (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1]
MDS / FastMap
• but, what if we have NO points to start with? (e.g., time-warping distance)
• A: Multi-dimensional Scaling (MDS); FastMap
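Distance matrix:
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the objects as points; distances within a group are ~1, distances between the two groups are ~100]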
MDS
Multi Dimensional Scaling
FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• FastMap [Faloutsos+95] takes O(N) time
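As a rough illustration, the sketch below computes only the first FastMap coordinate for the five objects O1..O5 of the MDS/FastMap distance matrix: it picks two far-apart pivot objects and projects every object on the line through them with the cosine law. The pivot heuristic is simplified, and the full method repeats the step on residual distances to obtain further coordinates, which is omitted here. The function name is hypothetical.

```python
def fastmap_coordinate(dist, n):
    """First FastMap coordinate for objects 0..n-1, given a distance function."""
    # Pivot heuristic: start at object 0, jump to the object farthest from it,
    # then to the object farthest from that one.
    b = max(range(n), key=lambda i: dist(0, i))
    a = max(range(n), key=lambda i: dist(b, i))
    dab = dist(a, b)
    if dab == 0:
        return a, b, [0.0] * n
    # Cosine law: projection of object i on the line through pivots a and b.
    coords = [(dist(a, i) ** 2 + dab ** 2 - dist(b, i) ** 2) / (2.0 * dab)
              for i in range(n)]
    return a, b, coords

# Distance matrix of the five objects O1..O5 (indices 0..4).
M = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
a, b, xs = fastmap_coordinate(lambda i, j: M[i][j], len(M))
```

With O1 and O4 as pivots, O1, O2, O3 land near 0 and O4, O5 near 100, so a single coordinate already separates the two groups. Each coordinate needs only a constant number of distance evaluations per object.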
FastMap: Application
VideoTrails [Kobla+97]
scene-cut detection (about 10% errors)
Conclusions - Practitioner’s guide
Similarity search in time sequences
1) establish/choose distance (Euclidean, time-warping, …)
2) extract features (SVD, DWT, MDS), and use an SAM (R-tree/variant) or a Metric Tree (M-tree)
2’) for high intrinsic dimensionalities, consider sequential scan (it might win…)
Books
• William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for SVD)
• C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to SVD, and GEMINI)
References
• Agrawal, R., K.-I. Lin, et al. (Sept. 1995). Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. of VLDB, Zurich, Switzerland.
• Babu, S. and J. Widom (2001). “Continuous Queries over Data Streams.” SIGMOD Record 30(3): 109-120.
• Breunig, M. M., H.-P. Kriegel, et al. (2000). LOF: Identifying Density-Based Local Outliers. SIGMOD Conference, Dallas, TX.
• Berry, Michael: http://www.cs.utk.edu/~lsi/
• Ciaccia, P., M. Patella, et al. (1997). M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB.
• Foltz, P. W. and S. T. Dumais (Dec. 1992). “Personalized Information Delivery: An Analysis of Information Filtering Methods.” Comm. of ACM (CACM) 35(12): 51-60.
• Guttman, A. (June 1984). R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. ACM SIGMOD, Boston, Mass.
• Gaede, V. and O. Guenther (1998). “Multidimensional Access Methods.” Computing Surveys 30(2): 170-231.
• Gehrke, J. E., F. Korn, et al. (May 2001). On Computing Correlated Aggregates Over Continual Data Streams. ACM SIGMOD, Santa Barbara, California.
• Gunopulos, D. and G. Das (2001). Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference, Santa Barbara, CA.
• Hatonen, K., M. Klemettinen, et al. (1996). Knowledge Discovery from Telecommunication Network Alarm Databases. ICDE, New Orleans, Louisiana.
• Jolliffe, I. T. (1986). Principal Component Analysis, Springer Verlag.
• Keogh, E. J., K. Chakrabarti, et al. (2001). Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. SIGMOD Conference, Santa Barbara, CA.
• Kobla, V., D. S. Doermann, et al. (Nov. 1997). VideoTrails: Representing and Visualizing Structure in Video Sequences. ACM Multimedia 97, Seattle, WA.
• Oppenheim, I. J., A. Jain, et al. (March 2002). A MEMS Ultrasonic Transducer for Resident Monitoring of Steel Structures. SPIE Smart Structures Conference SS05, San Diego.
• Papadimitriou, C. H., P. Raghavan, et al. (1998). Latent Semantic Indexing: A Probabilistic Analysis. PODS, Seattle, WA.
• Rabiner, L. and B.-H. Juang (1993). Fundamentals of Speech Recognition, Prentice Hall.
• Traina, C., A. Traina, et al. (October 2000). Fast feature selection using the fractal dimension. XV Brazilian Symposium on Databases (SBBD), Paraiba, Brazil.
Outline
• Motivation
• Similarity Search and Indexing
• DSP (DFT, DWT)
• Linear Forecasting
• Bursty traffic - fractals and multifractals
• Non-linear forecasting
• Conclusions
Outline
• DFT
– Definition of DFT and properties
– how to read the DFT spectrum
• DWT
– Definition of DWT and properties
– how to read the DWT scalogram
Introduction - Problem#1
Goal: given a signal (e.g., packets over time)
Find: patterns and/or compress
[Figure: count vs. year; lynx caught per year (also: packets per day; automobiles per hour)]
What does DFT do?
A: highlights the periodicities
DFT: definition
• For a sequence x0, x1, … xn-1
• the (n-point) Discrete Fourier Transform is
• X0, X1, … Xn-1 :
$$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \exp(-j 2 \pi t f / n), \quad f = 0, \ldots, n-1$$
$$(j = \sqrt{-1})$$
$$x_t = \frac{1}{\sqrt{n}} \sum_{f=0}^{n-1} X_f \exp(+j 2 \pi t f / n) \quad \text{(inverse DFT)}$$
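A direct, O(n^2) Python transcription of the definition may make it concrete. It assumes the 1/sqrt(n) normalization on both the forward and inverse transform and a minus sign in the forward exponent; other packages use other conventions, so amplitudes can differ by a constant factor. The function names are illustrative.

```python
import cmath
import math

def dft(x):
    """n-point DFT with 1/sqrt(n) normalization (naive O(n^2) version)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

def inverse_dft(X):
    """Inverse transform: same formula with the opposite sign in the exponent."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * math.pi * t * f / n) for f in range(n))
            / math.sqrt(n) for t in range(n)]

x = [1, 2, 1, 2]
X = dft(x)
amplitudes = [abs(c) for c in X]          # the amplitude spectrum
x_back = [c.real for c in inverse_dft(X)] # round trip recovers the input
```

With this normalization the transform preserves energy: the sum of squared values equals the sum of squared amplitudes.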
• Good news: Available in all symbolic math packages, e.g., in ‘mathematica’
x = {1, 2, 1, 2};
X = Fourier[x];
ListPlot[Abs[X]]
DFT: Amplitude spectrum
[Figure: count vs. year; curves: actual, mean, mean+freq12]
[Figure: amplitude spectrum, Ampl. vs. Freq., with freq=0 and freq=12 marked]
Amplitude: $A_f^2 = \mathrm{Re}^2(X_f) + \mathrm{Im}^2(X_f)$
DFT: examples
flat
[Figure: flat signal vs. time (ticks 0 to 15), and its amplitude spectrum (Amplitude vs. freq)]
Low frequency sinusoid
[Figure: low-frequency sinusoid vs. time (ticks 0 to 15), and its amplitude spectrum vs. freq]
• Sinusoid - symmetry property: $X_f = X^*_{n-f}$
[Figure: sinusoid vs. time (ticks 0 to 15), and its amplitude spectrum vs. freq]
• Higher freq. sinusoid
[Figure: higher-frequency sinusoid vs. time (ticks 0 to 15), and its amplitude spectrum vs. freq]
[Figure: a signal shown as the sum of three component signals, each over ticks 0 to 15]
[Figure: the same four signals, with an amplitude spectrum (Ampl. vs. Freq.)]
DFT: Amplitude spectrum
• excellent approximation, with only 2 frequencies!
• so what?
• A1: (lossy) compression
• A2: pattern discovery
[Figure: count vs. year; curves: actual, mean, mean+freq12]
DFT - Conclusions
• It spots periodicities (with the ‘amplitude spectrum’)
• can be quickly computed (O(n log n)), thanks to the FFT algorithm.
• standard tool in signal processing (speech, image etc signals)
• (closely related to DCT and JPEG)
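To illustrate the lossy-compression use, the sketch below keeps only the k largest-amplitude DFT coefficients and reconstructs the signal from them. It uses a naive O(n^2) transform rather than the FFT to stay self-contained, and a synthetic signal (a constant plus one cosine) chosen for the example; because of the symmetry property, one real sinusoid occupies two coefficients, so "2 frequencies" means 3 coefficients here.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

def inverse_dft(X):
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * math.pi * t * f / n) for f in range(n))
            / math.sqrt(n) for t in range(n)]

def compress(x, k):
    """Keep the k largest-amplitude DFT coefficients, zero the rest, reconstruct."""
    X = dft(x)
    keep = set(sorted(range(len(X)), key=lambda f: abs(X[f]), reverse=True)[:k])
    kept = [X[f] if f in keep else 0 for f in range(len(X))]
    return [c.real for c in inverse_dft(kept)]

n = 64
signal = [10 + 3 * math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
approx = compress(signal, k=3)   # freq 0, freq 4 and its mirror at n-4
rmse = math.sqrt(sum((s - a) ** 2 for s, a in zip(signal, approx)) / n)
```

For this idealized signal the reconstruction is exact up to rounding; real signals leave a residual that shrinks as more coefficients are kept.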
Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress
[Figure: count vs. year; lynx caught per year (also: packets per day; virus infections per month)]
Wavelets - DWT
• DFT is great - but, how about compressing a spike?
• A: Terrible - all DFT coefficients needed!
[Figure: a spike (value vs. time, ticks 1 to 16) and its amplitude spectrum (Ampl. vs. Freq.)]
• Similarly, DFT suffers on short-duration waves (e.g., baritone, silence, soprano)
[Figure: the signal, value vs. time]
• Solution #1: Short window Fourier transform (SWFT)
• But: how short should the window be?
[Figure: the signal (value vs. time) and the SWFT tiling of the time-freq plane]
• Answer: multiple window sizes! -> DWT
[Figure: tilings of the time-freq plane for: time domain, DFT, SWFT, DWT]
Haar Wavelets
• subtract sum of left half from right half
• repeat recursively for quarters, eighths, ...
Wavelets - construction
x0 x1 x2 x3 x4 x5 x6 x7
level 1: each pair gives a smooth (‘+’) and a detail (‘-’) coefficient: s1,0 and d1,0 from (x0, x1); s1,1 and d1,1 from (x2, x3); .......
level 2: s2,0 and d2,0 from (s1,0, s1,1); .......
etc ...
Q: map each coefficient on the time-freq. plane
[Figure: the t-f plane]
Haar wavelets - code
#!/usr/bin/perl5
# expects a file with numbers
# and prints the dwt transform
# The number of time-ticks should be a power of 2
# USAGE
#    haar.pl <fname>

my @vals=();
my @smooth; # the smooth component of the signal
my @diff;   # the high-freq. component

# collect the values into the array @vals
while(<>){
    @vals = ( @vals , split );
}

my $len = scalar(@vals);
my $half = int($len/2);
while($half >= 1 ){
    for(my $i=0; $i< $half; $i++){
        $diff[$i] = ($vals[2*$i] - $vals[2*$i + 1] )/ sqrt(2);
        print "\t", $diff[$i];
        $smooth[$i] = ($vals[2*$i] + $vals[2*$i + 1] )/ sqrt(2);
    }
    print "\n";
    @vals = @smooth;
    $half = int($half/2);
}
print "\t", $vals[0], "\n" ; # the final, smooth component
Wavelets - construction
Observation 1:
‘+’ can be some weighted addition
‘-’ is the corresponding weighted difference (‘Quadrature mirror filters’)
Observation 2: unlike DFT/DCT, there are *many* wavelet bases: Haar, Daubechies-4, Daubechies-6, Coifman, Morlet, Gabor, ...
Wavelets - what do they look like?
• E.g., Daubechies-4
[Figure: shape of the Daubechies-4 wavelet]
Wavelets - Drill#1:
• Q: baritone/silence/soprano - DWT?
[Figure: the signal (value vs. time) and the t-f plane]
Wavelets - Drill#2:
• Q: spike - DWT?
[Figure: a spike over time ticks 1 to 8, and the t-f plane]
A: coefficients on the t-f plane, from highest to lowest frequency:
level 1: 0.00 0.00 0.71 0.00
level 2: 0.00 0.50
level 3: -0.35
smooth: 0.35
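These numbers can be reproduced with a short Python version of the Haar transform, following the same difference/sum-over-sqrt(2) steps as the Perl script. The spike position (fifth of eight ticks, height 1) is an assumption chosen because it yields the listed coefficients; the slide's figure itself is not recoverable. The function name is illustrative.

```python
import math

def haar_dwt(values):
    """Haar transform: detail coefficients per level (finest first), then the final smooth."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    vals = list(values)
    levels = []
    while len(vals) > 1:
        half = len(vals) // 2
        diff = [(vals[2 * i] - vals[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        smooth = [(vals[2 * i] + vals[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        levels.append(diff)   # high-frequency part at this level
        vals = smooth         # recurse on the smooth part
    return levels, vals[0]

spike = [0, 0, 0, 0, 1, 0, 0, 0]   # assumed: unit spike at the fifth tick
levels, smooth = haar_dwt(spike)
rounded = [[round(d, 2) for d in lvl] for lvl in levels]
```

Only one coefficient per level is non-zero, which is why a spike compresses well under the DWT.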
Wavelets - Drill#3:
• Q: weekly + daily periodicity, + spike - DWT?
[Figure: the t-f plane, filled in step by step]
• Q: DFT?
[Figure: side-by-side t-f planes for DWT and DFT]
Advantages of Wavelets
• Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
• fast to compute (usually: O(n)!)
• very good for ‘spikes’
• mammalian eye and ear: Gabor wavelets
Overall Conclusions
• DFT, DCT spot periodicities
• DWT: multi-resolution - matches processing of mammalian ear/eye better
• All three: powerful tools for compression, pattern detection in real signals
• All three: included in math packages
– (matlab, ‘R’, mathematica, … - often in spreadsheets!)
• DWT: very suitable for self-similar traffic
• DWT: used for summarization of streams [Gilbert+01], db histograms etc
Resources - software and urls
• http://www.dsptutor.freeuk.com/jsanalyser/FFTSpectrumAnalyser.html : Nice java applets for FFT
• http://www.relisoft.com/freeware/freq.html : voice frequency analyzer (needs microphone)
• xwpl: open source wavelet package from Yale, with excellent GUI
• http://monet.me.ic.ac.uk/people/gavin/java/waveletDemos.html : wavelets and scalograms
Books
• William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
• C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)
Additional Reading
• [Gilbert+01] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan and Martin Strauss, Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries, VLDB 2001