Indexing and Data Mining in Multimedia Databases
Christos Faloutsos
CMU www.cs.cmu.edu/~christos
USC 2001 C. Faloutsos 2
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns
Sample queries
• Similarity search
– Find pairs of branches with similar sales patterns
– Find medical cases similar to Smith's
– Find pairs of sensor series that move in sync
– Find shapes like a spark-plug
Sample queries – cont’d
• Rule discovery
– Clusters (of branches; of sensor data; ...)
– Forecasting (total sales for next year?)
– Outliers (e.g., unexpected part failures; fraud detection)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources
Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query object
[Figure: three example time series, $price vs. day (1 to 365)]
distance function: by expert
‘GEMINI’ - Pictorially
[Figure: each series S1 ... Sn (price vs. day, 1 to 365) is mapped to a point F(S1) ... F(Sn) in feature space; example features: avg, std]
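The picture can be sketched in code as a filter-and-refine range query. This is a minimal sketch, not the GEMINI implementation: the function names and the linear scan over feature points are assumptions made for illustration (a real system would store the feature points in a spatial access method), and the actual distance is assumed to be Euclidean. With the (avg, std) features, sqrt(n) times the feature-space distance never exceeds the Euclidean distance between two length-n series, so the filter step discards no true answer.

```python
import numpy as np

def features(series):
    """Map a series to the 2-d feature point (avg, std) from the slide."""
    s = np.asarray(series, dtype=float)
    return np.array([s.mean(), s.std()])  # population std (ddof=0)

def range_query(collection, query, eps):
    """Indices of series within Euclidean distance eps of the query.

    Filter: sqrt(n) * ||F(s) - F(q)|| is a lower bound of ||s - q||, so a
    series whose bound exceeds eps can be skipped safely.
    Refine: the surviving candidates are checked with the true distance.
    """
    q = np.asarray(query, dtype=float)
    fq = features(q)
    scale = np.sqrt(len(q))
    hits = []
    for i, s in enumerate(collection):
        s = np.asarray(s, dtype=float)
        if scale * np.linalg.norm(features(s) - fq) > eps:
            continue  # pruned in feature space, no full comparison needed
        if np.linalg.norm(s - q) <= eps:
            hits.append(i)
    return hits
```

The result is the same as a brute-force scan; the saving comes from skipping the full-length comparison for every series pruned by the two-number feature test.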
Remaining issues
• how to extract features automatically?
• how to merge similarity scores from different media?
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: FastMap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the objects drawn as points, with distances ~1 and ~100 marked; where to place each point (??)]
FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• We want a linear algorithm: FastMap [SIGMOD95]
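A minimal sketch of the FastMap projection step, assuming only a distance function `dist(i, j)` between object indices. The function name, the `seed` parameter and the two-hop pivot heuristic as written here are illustrative choices, not the published code. Each output coordinate comes from the cosine law applied to two far-apart pivot objects, and needs a number of distance evaluations linear in N.

```python
import numpy as np

def fastmap(dist, n, k, seed=0):
    """Embed n objects in k dimensions using only pairwise distances."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, k))

    def d2(i, j, col):
        # Squared distance left over after the first `col` coordinates.
        r = dist(i, j) ** 2 - np.sum((X[i, :col] - X[j, :col]) ** 2)
        return max(r, 0.0)

    for col in range(k):
        # Pivot heuristic: start anywhere, hop to the farthest object, twice.
        b = int(rng.integers(n))
        a = max(range(n), key=lambda i: d2(i, b, col))
        b = max(range(n), key=lambda i: d2(i, a, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero; nothing left to embed
        for i in range(n):
            # Cosine law: projection of object i on the line through a and b.
            X[i, col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2.0 * np.sqrt(dab2))
    return X

# Distance matrix from the FastMap slide: O1..O3 and O4..O5 form two groups.
D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
X = fastmap(lambda i, j: D[i][j], n=5, k=2)
```

On this matrix the first coordinate alone already separates the two groups by about 100, which is the picture the slide asks for.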
Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for HKD, JPY, DEM]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: FastMap plot with points USD(t), USD(t-5), FRF, GBP, JPY, HKD]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: FastMap plot with points USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]
Application: VideoTrails
[ACM MM97]
VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs action)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: FastMap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
Merging similarity scores
• e.g., video: text, color, motion, audio
– weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
– and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
– but: how about disjunctive queries?
‘FALCON’
[Figure: example patterns labeled ‘Inverted Vs’ and ‘Vs’]
Trader wants only ‘unstable’ stocks
“Single query point” methods
[Figure: Rocchio; the positive examples (+) are summarized by a single query point (x)]
“Single query point” methods
[Figure: three panels (Rocchio, MARS, MindReader); in each, the positive examples (+) are summarized by a single query point (x)]
The averaging effect in action...
Main idea: FALCON Contours
[Figure: positive examples (+) in separate groups, with contours, in the plane of feature1 (e.g., temperature) and feature2 (e.g., frequency)]
[Wu+, vldb2000]
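A small sketch of why a single query point struggles with disjunctive queries, and of one way around it. The first function scores a candidate by its distance to the centroid of the good examples, as a stand-in for the single-query-point methods. The second aggregates the distances to all good examples with a negative exponent, so that being close to any one example is enough. The function names and the default `alpha = -5` are assumptions for illustration, not a statement of FALCON's exact settings.

```python
import numpy as np

def centroid_distance(x, good):
    """Single query point: distance from x to the average good example."""
    return float(np.linalg.norm(x - good.mean(axis=0)))

def aggregate_distance(x, good, alpha=-5.0):
    """Soft 'OR' over the good examples.

    With a negative exponent the smallest distance dominates, so the score
    is low near ANY good example; level sets of this score form contours
    around each group of examples.
    """
    d = np.linalg.norm(good - x, axis=1)
    if np.any(d == 0):
        return 0.0  # x coincides with a good example
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

# Two well-separated groups of good examples (a disjunctive query).
good = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 10.0], [10.1, 9.9]])
midpoint = good.mean(axis=0)      # lies between the groups, far from both
near_group = np.array([0.1, 0.0])  # lies inside the first group
```

The centroid score ranks `midpoint` as the best possible answer although no good example is anywhere near it; the aggregate score prefers `near_group`.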
Conclusions for indexing + visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies
(stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
Problem#2: dim. reduction
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
(Q: why?
A: to avoid the ‘dimensionality curse’)
Answer:
• Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area;
infinite length!
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)
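The short version of the ‘long story’, as a worked calculation. The symbols N, s, k, A_0 and P_0 are introduced here for illustration: the triangle consists of N = 3 copies of itself, each scaled by s = 1/2, and each construction step keeps 3 of the 4 sub-triangles.

```latex
% Similarity dimension: N copies, each scaled by factor s
D = \frac{\log N}{\log (1/s)} = \frac{\log 3}{\log 2} \approx 1.585

% After k construction steps, starting from area A_0 and perimeter P_0:
A_k = \left(\tfrac{3}{4}\right)^{k} A_0 \;\longrightarrow\; 0,
\qquad
P_k = \left(\tfrac{3}{2}\right)^{k} P_0 \;\longrightarrow\; \infty
```

The area shrinks by 3/4 per step while the total edge length grows by 3/2, which is the paradox on the slide; the dimension lands strictly between 1 and 2.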
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
x y
5 1
4 2
3 3
2 4
E.g.: #cylinders; miles / gallon
• A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd == slope of (log(nn) vs log(r))
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); a straight line with slope 1.58]
== ‘correlation integral’
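The slope can be estimated numerically. The sketch below is an illustration, not the FracDim package: it generates Sierpinski points with the chaos game, counts pairs within each radius by brute force (quadratic in the number of points), and fits a line in log-log scale. The function names, the burn-in length, the sample size and the radius range in the usage lines are all assumptions.

```python
import numpy as np

def sierpinski(n, seed=0, burn_in=20):
    """Sample n points on the Sierpinski triangle with the chaos game."""
    rng = np.random.default_rng(seed)
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.25, 0.25])
    pts = np.empty((n, 2))
    for i in range(-burn_in, n):
        # Jump halfway towards a randomly chosen corner.
        p = (p + corners[rng.integers(3)]) / 2.0
        if i >= 0:
            pts[i] = p
    return pts

def correlation_dimension(pts, radii):
    """Slope of log(#pairs within <= r) vs. log(r)."""
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    d = d[np.triu_indices(len(pts), k=1)]  # each unordered pair once
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

radii = np.geomspace(0.02, 0.2, 8)
fd_triangle = correlation_dimension(sierpinski(1500), radii)
```

The estimate depends on the sample and on the radius range, so it only approximates log(3)/log(2); running the same function on points along a line gives a slope near 1.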
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
Solution #1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r); curves: spi-spi, spi-ell, ell-ell]
- 1.8 slope
- plateau!
- repulsion!
[w/ Seeger, Traina, Traina, SIGMOD00]
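The curves in such a plot can be approximated by counting pairs directly. The sketch below uses brute-force pair counting rather than the box-counting behind the ‘BOPS’ plot, so it is a swapped-in simplification; the function name and the array layout (one row per point) are assumptions.

```python
import numpy as np

def pair_counts(a, b, radii):
    """For each r, the number of pairs (p in a, q in b) with distance <= r.

    pair_counts(a, a, ...) gives a self-curve such as spi-spi; it includes
    the zero-distance pair of each point with itself and counts every other
    pair twice, which shifts the curve but not its slope at larger r.
    pair_counts(a, b, ...) gives a cross-curve such as spi-ell.
    """
    diff = a[:, None, :] - b[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1)).ravel()
    return np.array([(d <= r).sum() for r in radii])
```

Plotting the logarithm of each count against log(r) gives one curve per pair of classes: a cross-curve that stays flat (a plateau) at small radii means the two classes are rarely found close together.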
spatial d.m.
[Figure: two characteristic radii, r1 and r2, marked on the data and on the plot]
Heuristic on choosing # of clusters
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r); curves: spi-spi, spi-ell, ell-ell]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates
Problem #2: Dim. reduction
[Figure: three 2-d datasets, y vs. x on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]
Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ PFD
• dfn: PFD of attribute set A is the f.d. of the projected cloud of points
[w/ Traina, Traina, Wu, SBBD00]
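A minimal sketch of the idea, under stated assumptions: it is a greedy forward pass (not necessarily the procedure of the cited paper), the f.d. is estimated with a brute-force correlation integral, each attribute is rescaled to [0, 1], and the `min_gain` threshold of 0.3 is an arbitrary illustrative value.

```python
import numpy as np

def fractal_dim(pts, radii):
    """Slope of log(#pairs within <= r) vs. log(r) for a point cloud."""
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))[np.triu_indices(len(pts), k=1)]
    counts = np.array([(d <= r).sum() for r in radii])
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

def select_attributes(data, radii, min_gain=0.3):
    """Keep an attribute only if it raises the partial f.d. of the kept set."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    unit = (data - lo) / (hi - lo)  # assumes no constant attribute
    kept, current = [], 0.0
    for j in range(unit.shape[1]):
        pfd = fractal_dim(unit[:, kept + [j]], radii)  # PFD of the projection
        if pfd - current > min_gain:
            kept.append(j)
            current = pfd
    return kept

# Hypothetical data: y is a function of x, z is independent of both.
rng = np.random.default_rng(0)
x = rng.random(800)
z = rng.random(800)
data = np.column_stack((x, 2 * x + 1, z))
kept = select_attributes(data, np.geomspace(0.05, 0.2, 6))
```

Here the second attribute adds almost nothing to the partial f.d. once x is kept, so it is dropped, while the independent attribute z raises it towards 2 and is kept. Because the test relies on the f.d. rather than on linear correlation, the same reasoning applies when y is a non-linear function of x.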
Problem #2: dim. reduction
[Figure: three 2-d datasets, y vs. x on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]
Problem #2: dim. reduction
[Figure: three 2-d datasets, y vs. x on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD=1, global FD=1, PFD=1, PFD=0, PFD=1]
Notice: ‘max variance’ would fail here
Problem #2: dim. reduction
[Figure: three 2-d datasets, y vs. x on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]
Notice: SVD would fail here
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]
traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time; the interval is split in two halves that receive 20% and 80% of the bytes]
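The 80-20 ‘law’ can be turned into a trace generator by applying it recursively: split the interval in half, give a fraction b of the bytes to one half (chosen at random) and the rest to the other, then repeat inside each half. The sketch below assumes this recursive reading; the function name, its parameters and the default b = 0.8 are illustrative.

```python
import numpy as np

def b_model(total, levels, b=0.8, seed=0):
    """Bursty synthetic trace with 2**levels time bins summing to `total`."""
    rng = np.random.default_rng(seed)
    trace = np.array([float(total)])
    for _ in range(levels):
        # For each bin, decide at random which half gets the larger share b.
        left_share = np.where(rng.random(len(trace)) < 0.5, b, 1.0 - b)
        halves = np.column_stack((trace * left_share, trace * (1.0 - left_share)))
        trace = halves.ravel()  # interleave: left0, right0, left1, right1, ...
    return trace

trace = b_model(total=1000.0, levels=10)
```

With b = 0.5 the trace is perfectly uniform; as b grows the same total volume concentrates in fewer and fewer bins, which produces bursts at every time scale.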
Traffic
Many other time-sequences are bursty/clustered: (such as?)
Tape accesses
[Figure: accesses over time for Tape#1 ... Tape#N]
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
Tape accesses
[Figure: accesses over time for Tape#1 ... Tape#N]
[Figure: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]
More apps: Brain scans
• Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]
Cross-roads of Montgomery county:
• any rules?
[Figure: GIS points]
GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg#neighbors(<= r) = r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]
Examples: LB county
• Long Beach county of CA (road end-points)
More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
• Coastlines: 1.2-1.58 (?)
[Figure: stock price plots over 1 year and over 2 years]
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Bible
RANK-FREQUENCY plot (in log-log scales)
Zipf’s (first) Law:
[Figure: Zipf’s law; log(freq) vs. log(rank), with the top-ranked words “the” and “and” marked]
Zipf’s law
• similarly for first names (slope ~-1)
• last names (~ -0.7)
• etc
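The slopes quoted above come from a rank-frequency plot, which is simple to compute. A minimal sketch, assuming the input is already a list of tokens; the function name is hypothetical, and an ordinary least-squares line in log-log scale is only one of several ways to estimate the exponent.

```python
from collections import Counter

import numpy as np

def zipf_slope(words):
    """Slope of log(freq) vs. log(rank) for the given tokens."""
    freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope
```

A slope close to -1 means the r-th most frequent item occurs about 1/r as often as the most frequent one.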
More power laws
• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: log(count) vs. magnitude; amplitude vs. day]
Web Site Traffic
Clickstream data: <url, u-id, ....>
[Figure: log(count) vs. log(freq); Zipf]
Lotka’s law
• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); marked point: J. Ullman]
Korcak’s law
Scandinavian lakes area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
More power laws: Korcak
Japan islands; area vs cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
(Korcak’s law: Aegean islands)
Olympic medals:
[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458; labeled points: USA, China, Russia]
SALES data – store#96
[Figure: count of products vs. # units sold]
TELCO data
[Figure: count of customers vs. # of service units]
More power laws on the Internet
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]
Even more power laws:
• Income distribution (Pareto’s law);
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]
Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
• Indexing: feature extraction (‘GEMINI’)
– automatic feature extraction: FastMap
– Relevance feedback: FALCON
Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
– appear everywhere
– lead to skewed distributions (not Gaussian, Poisson, uniformity, independence)
– ‘correlation integral’ for separability/cluster detection
– PFD for dimensionality reduction
Resources:
• Software and papers:
– www.cs.cmu.edu/~christos
– Fractal dimension (FracDim)
– Separability (sigmod 2000, kdd2001)
– Relevance feedback for query by content (FALCON – vldb 2000)
Resources
• Manfred Schroeder “Fractals, Chaos, Power Laws”