Indexing and Data Mining in Multimedia Databases

Page 1: Indexing and Data Mining in Multimedia Databases

Indexing and Data Mining in Multimedia Databases

Christos Faloutsos

CMU www.cs.cmu.edu/~christos

Page 2: Indexing and Data Mining in Multimedia Databases

USC 2001 C. Faloutsos 2

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources

Page 3: Indexing and Data Mining in Multimedia Databases

Problem

Given a large collection of (multimedia) records, find similar/interesting things, i.e.:

• Allow fast, approximate queries, and

• Find rules/patterns

Page 4: Indexing and Data Mining in Multimedia Databases

Sample queries

• Similarity search
  – Find pairs of branches with similar sales patterns
  – Find medical cases similar to Smith's
  – Find pairs of sensor series that move in sync
  – Find shapes like a spark-plug

Page 5: Indexing and Data Mining in Multimedia Databases

Sample queries – cont’d

• Rule discovery
  – Clusters (of branches; of sensor data; ...)
  – Forecasting (total sales for next year?)
  – Outliers (e.g., unexpected part failures; fraud detection)

Page 6: Indexing and Data Mining in Multimedia Databases

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• related projects @ CMU and resources

Page 7: Indexing and Data Mining in Multimedia Databases

Indexing - Multimedia

Problem:

• given a set of (multimedia) objects,

• find the ones similar to a desirable query object

Page 8: Indexing and Data Mining in Multimedia Databases

[Figure: three example time sequences, each plotting $price against day (1 to 365)]

distance function: by expert

Page 9: Indexing and Data Mining in Multimedia Databases

‘GEMINI’ - Pictorially

[Figure: each sequence S1, ..., Sn (days 1 to 365) is mapped to a feature point F(S1), ..., F(Sn); example features: avg and std]
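
The picture above can be sketched in code as a filter-and-refine range query. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, a linear scan stands in for a real spatial access method, and the sqrt(n) scaling assumes equal-length sequences compared with Euclidean distance (under that assumption the scaled (avg, std) distance never exceeds the true distance, so the filter drops no true match).

```python
import numpy as np

def features(series):
    """Map one sequence to a 2-d feature point (avg, std), as on the slide."""
    s = np.asarray(series, dtype=float)
    return np.array([s.mean(), s.std()])

def range_query(db, query, eps):
    """Indices of sequences within Euclidean distance eps of the query."""
    db = np.asarray(db, dtype=float)
    q = np.asarray(query, dtype=float)
    n = db.shape[1]
    feats = np.array([features(s) for s in db])
    # Filter step (cheap): scaled feature distance lower-bounds the true one.
    # A real system would probe an index here instead of scanning.
    feat_dist = np.sqrt(n) * np.linalg.norm(feats - features(q), axis=1)
    candidates = np.flatnonzero(feat_dist <= eps)
    # Refine step (expensive): check the actual distance on survivors only.
    return [int(i) for i in candidates if np.linalg.norm(db[i] - q) <= eps]
```

The filter may let false alarms through; the refine step removes them, so the answer matches a full scan.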

Page 10: Indexing and Data Mining in Multimedia Databases

Remaining issues

• how to extract features automatically?

• how to merge similarity scores from different media?

Page 11: Indexing and Data Mining in Multimedia Databases

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions

Page 12: Indexing and Data Mining in Multimedia Databases

FastMap

      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0

[Figure labels: ~100, ~1, ??]

Page 13: Indexing and Data Mining in Multimedia Databases

FastMap

• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time

• We want a linear algorithm: FastMap [SIGMOD95]
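
A simplified sketch of the FastMap projection, run on the five-object distance matrix from the previous slide. It is an assumption-laden illustration: the pivot search always starts from object 0 and runs one round (the published heuristic starts from a random object), and a full distance matrix is passed in for convenience, whereas the real algorithm only evaluates the distances it needs.

```python
import numpy as np

def fastmap(D, k):
    """Embed n objects in k dimensions from their pairwise distances."""
    D2 = np.asarray(D, dtype=float) ** 2           # work with squared distances
    n = D2.shape[0]
    X = np.zeros((n, k))
    for col in range(k):
        b = int(np.argmax(D2[0]))                  # far from object 0
        a = int(np.argmax(D2[b]))                  # far from b: pivot pair (a, b)
        dab2 = D2[a, b]
        if dab2 == 0:                              # nothing left to explain
            break
        # Cosine-law projection of every object onto the line a-b.
        x = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        X[:, col] = x
        # Residual distances on the hyperplane perpendicular to a-b.
        D2 = np.clip(D2 - (x[:, None] - x[None, :]) ** 2, 0.0, None)
    return X

D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
coords = fastmap(D, 1)   # O1-O3 land near one end of the axis, O4-O5 near the other
```

Each coordinate costs one pass over the objects, which is where the linear running time comes from.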

Page 14: Indexing and Data Mining in Multimedia Databases

Applications: time sequences

• given n co-evolving time sequences

• visualize them + find rules [ICDE00]

[Figure: exchange rate vs. time for HKD, JPY, DEM]

Page 15: Indexing and Data Mining in Multimedia Databases

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD(t), USD(t-5), FRF, GBP, JPY, HKD]

Page 16: Indexing and Data Mining in Multimedia Databases

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]

Page 17: Indexing and Data Mining in Multimedia Databases

Application: VideoTrails

[ACM MM97]

Page 18: Indexing and Data Mining in Multimedia Databases

VideoTrails - usage

• scene-cut detection (about 10% errors)

• scene classification (e.g., dialogue vs action)

Page 19: Indexing and Data Mining in Multimedia Databases

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search
  – Visualization: FastMap
  – Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions

Page 20: Indexing and Data Mining in Multimedia Databases

Merging similarity scores

• e.g., video: text, color, motion, audio
  – weights change with the query!

• solution 1: user specifies weights

• solution 2: user gives examples
  – and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
  – but: how about disjunctive queries?

Page 21: Indexing and Data Mining in Multimedia Databases

‘FALCON’

[Figure labels: Inverted Vs; Vs]

Trader wants only ‘unstable’ stocks

Page 22: Indexing and Data Mining in Multimedia Databases

“Single query point” methods

[Figure: Rocchio, with ‘+’ and ‘x’ point markers]

Page 23: Indexing and Data Mining in Multimedia Databases

“Single query point” methods

[Figure: three panels (Rocchio, MindReader, MARS), each with ‘+’ and ‘x’ point markers]

The averaging effect in action...

Page 24: Indexing and Data Mining in Multimedia Databases

Main idea: FALCON Contours

[Figure: contours around groups of ‘+’ points; axes: feature1 (e.g., temperature) and feature2 (e.g., frequency)]

[Wu+, vldb2000]
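
A small sketch of why a single query point struggles with disjunctive queries and how a contour-style score avoids it. The aggregate formula (a generalized mean of distances to the ‘good’ examples with a negative exponent) and the default alpha = -5 are stated here from memory of the FALCON paper and should be read as assumptions; the function names are hypothetical.

```python
import numpy as np

def centroid_distance(x, good):
    """Single-query-point style: distance to the average of the good examples."""
    return float(np.linalg.norm(np.asarray(x, float) - np.mean(good, axis=0)))

def aggregate_distance(x, good, alpha=-5.0):
    """Contour style: generalized mean of the distances to each good example.

    With a negative alpha the score is dominated by the nearest good example,
    so being close to ANY one of them is enough.
    """
    d = np.linalg.norm(np.asarray(good, float) - np.asarray(x, float), axis=1)
    if np.any(d == 0):
        return 0.0
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

good = [(0.0, 0.0), (10.0, 10.0)]      # two separate groups the user liked
near, middle = (0.1, 0.0), (5.0, 5.0)  # next to one group vs. halfway between
```

Here the centroid score prefers `middle`, which resembles neither group (the averaging effect), while the aggregate score prefers `near`.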

Page 25: Indexing and Data Mining in Multimedia Databases

Conclusions for indexing + visualization

• GEMINI: fast indexing, exploiting off-the-shelf SAMs

• FastMap: automatic feature extraction in O(N) time

• FALCON: relevance feedback for disjunctive queries

Page 26: Indexing and Data Mining in Multimedia Databases

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources

Page 27: Indexing and Data Mining in Multimedia Databases

Data mining & fractals – Road map

• Motivation – problems / case study

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

Page 28: Indexing and Data Mining in Multimedia Databases

Problem #1 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies

(stores & households; mpg & MTBF...)

- patterns? (not Gaussian; not uniform)

- attraction/repulsion?

- separability??

Page 29: Indexing and Data Mining in Multimedia Databases

Problem#2: dim. reduction

• given attributes x1, ... xn

– possibly, non-linearly correlated

• drop the useless ones

(Q: why?

A: to avoid the ‘dimensionality curse’)

Page 30: Indexing and Data Mining in Multimedia Databases

Answer:

• Fractals / self-similarities / power laws

Page 31: Indexing and Data Mining in Multimedia Databases

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area;

infinite length!

Page 32: Indexing and Data Mining in Multimedia Databases

Definitions (cont’d)

• Paradox: Infinite perimeter ; Zero area!

• ‘dimensionality’: between 1 and 2

• actually: Log(3)/Log(2) = 1.58… (long story)

Page 33: Indexing and Data Mining in Multimedia Databases

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

E.g.: #cylinders; miles / gallon

x  y
5  1
4  2
3  3
2  4

Page 34: Indexing and Data Mining in Multimedia Databases

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)

Page 35: Indexing and Data Mining in Multimedia Databases

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a)

• Q: fd of a plane?

• A: nn ( <= r ) ~ r^2

fd == slope of ( log(nn) vs log(r) )

Page 36: Indexing and Data Mining in Multimedia Databases

Sierpinski triangle

[Figure: log(#pairs within <= r) vs. log( r ); the points fall on a line of slope 1.58]

== ‘correlation integral’
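
A rough sketch of how such a plot can be produced: sample points on a Sierpinski triangle, count pairs within each radius, and fit the log-log slope. The sampling method (the ‘chaos game’), the sample size and the radius range are illustrative choices, not taken from the slides; with a finite sample the fitted slope should come out near log(3)/log(2) but not exactly on it.

```python
import numpy as np

def sierpinski_points(n, seed=0):
    """Sample n points on a Sierpinski triangle via the chaos game."""
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.25, 0.25])
    pts = np.empty((n, 2))
    for i in range(n + 20):                 # first 20 steps are burn-in
        p = (p + vertices[rng.integers(3)]) / 2
        if i >= 20:
            pts[i - 20] = p
    return pts

def pair_counts(points, radii):
    """Number of point pairs within distance <= r, for each r (O(N^2) scan)."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(pts), k=1)]    # each pair once
    return np.array([(d <= r).sum() for r in radii])

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) against log(r)."""
    counts = pair_counts(points, radii)
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)
```

For example, `correlation_dimension(sierpinski_points(1500), np.geomspace(0.02, 0.3, 10))` estimates the slope on a 1500-point sample; the quadratic pair scan is fine at this size but would need a faster counting method for large data sets.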

Page 37: Indexing and Data Mining in Multimedia Databases

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

• Conclusions

Page 38: Indexing and Data Mining in Multimedia Databases

Solution#1: spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])

• clusters?

• separable?

• attraction/repulsion?

• data ‘scrubbing’ – duplicates?

Page 39: Indexing and Data Mining in Multimedia Databases

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

Page 40: Indexing and Data Mining in Multimedia Databases

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

[w/ Seeger, Traina, Traina, SIGMOD00]
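
The three curves in the plot can be thought of as pair counts within one set (spi-spi, ell-ell) and across two sets (spi-ell). A minimal sketch of the cross-set count, using a brute-force scan and a hypothetical function name; the toy coordinates below are made up for illustration.

```python
import numpy as np

def cross_pair_counts(a, b, radii):
    """For each r, count pairs (p in a, q in b) with distance(p, q) <= r."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return np.array([int((dist <= r).sum()) for r in radii])

# Two toy groups that keep away from each other.
group_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
group_b = [[10.0, 0.0], [10.1, 0.0]]
counts = cross_pair_counts(group_a, group_b, [0.5, 20.0])
```

On this toy data there are no cross pairs at small radii and all of them at large radii, which is the kind of gap the slide reads as repulsion. Passing the same set twice also counts each point with itself at distance 0, so exact duplicates would show up as extra pairs at the smallest radii.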

Page 41: Indexing and Data Mining in Multimedia Databases

spatial d.m.

[Figure: radii r1 and r2 marked on the plot]

Heuristic on choosing # of clusters

Page 42: Indexing and Data Mining in Multimedia Databases

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!

Page 43: Indexing and Data Mining in Multimedia Databases

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves spi-spi, spi-ell, ell-ell]

- 1.8 slope

- plateau!

- repulsion!!

- duplicates

Page 44: Indexing and Data Mining in Multimedia Databases

Problem #2: Dim. reduction

[Figure: three y-vs-x panels on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]

Page 45: Indexing and Data Mining in Multimedia Databases

Solution:

• drop the attributes that don’t increase the ‘partial f.d.’ PFD

• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
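
A sketch of the idea under simplifying assumptions: the fractal dimension is estimated as a correlation-integral slope, and attributes are added one at a time (forward selection) whenever they raise the partial fractal dimension by more than a threshold. The cited paper's actual procedure may differ; `min_gain`, the radius range and the function names are illustrative.

```python
import numpy as np

def fractal_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs log(r) for a cloud of points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(pts), k=1)]
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)

def select_attributes(data, radii, min_gain=0.3):
    """Keep column j only if it raises the PFD of the kept set by > min_gain."""
    data = np.asarray(data, dtype=float)
    kept, current = [], 0.0
    for j in range(data.shape[1]):
        pfd = fractal_dimension(data[:, kept + [j]], radii)
        if pfd - current > min_gain:
            kept.append(j)
            current = pfd
    return kept

rng = np.random.default_rng(1)
x, z = rng.random(800), rng.random(800)
data = np.column_stack([x, x ** 2, z])   # column 1 is a non-linear function of column 0
kept = select_attributes(data, np.geomspace(0.02, 0.2, 8))
```

In this synthetic example the second column adds no new ‘degrees of freedom’, so it is dropped even though its correlation with the first is non-linear.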

Page 46: Indexing and Data Mining in Multimedia Databases

Problem #2: dim. reduction

[Figure: three y-vs-x panels on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]

Page 47: Indexing and Data Mining in Multimedia Databases

Problem #2: dim. reduction

[Figure: three y-vs-x panels on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD=1, global FD=1, PFD=1, PFD=0, PFD=1]

Notice: ‘max variance’ would fail here

Page 48: Indexing and Data Mining in Multimedia Databases

Problem #2: dim. reduction

[Figure: three y-vs-x panels on axes from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]

Notice: SVD would fail here

Page 49: Indexing and Data Mining in Multimedia Databases

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples
  – fractals
  – power laws

• Conclusions

Page 50: Indexing and Data Mining in Multimedia Databases

disk traffic

• Not Poisson, not(?) iid - BUT: self-similar

• How to model it?

[Figure: #bytes vs. time]

Page 51: Indexing and Data Mining in Multimedia Databases

traffic

• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])

[Figure: #bytes vs. time, with a 20% / 80% split]
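
One way to read the 80-20 ‘law’ is as a recursive split: at every level, each time interval hands 80% of its bytes to one half and 20% to the other. The sketch below generates such a trace; the function name, the random choice of which half gets the larger share, and the parameter defaults are assumptions made for illustration.

```python
import numpy as np

def b_model(total, levels, b=0.8, seed=0):
    """Split `total` bytes over 2**levels time slots with a recursive b / (1-b) rule."""
    rng = np.random.default_rng(seed)
    trace = np.array([float(total)])
    for _ in range(levels):
        # For each interval, pick at random which half receives the share b.
        left_share = np.where(rng.random(trace.size) < 0.5, b, 1.0 - b)
        halves = np.column_stack([trace * left_share, trace * (1.0 - left_share)])
        trace = halves.ravel()          # interleave: left0, right0, left1, right1, ...
    return trace

trace = b_model(1_000_000, 10)          # 1024 slots, very uneven (bursty) volumes
```

Setting b = 0.5 gives a perfectly flat trace, while values closer to 1 give burstier ones; the total volume is preserved at every level.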

Page 52: Indexing and Data Mining in Multimedia Databases

Traffic

Many other time-sequences are bursty/clustered: (such as?)

Page 53: Indexing and Data Mining in Multimedia Databases

Tape accesses

[Figure: accesses over time on Tape#1 ... Tape#N]

# tapes needed, to retrieve n records?

(# days down, due to failures / hurricanes / communication noise...)

Page 54: Indexing and Data Mining in Multimedia Databases

Tape accesses

[Figure: # tapes retrieved vs. # qual. records, with curves labeled ‘50-50 = Poisson’ and ‘real’; inset: accesses over time on Tape#1 ... Tape#N]

Page 55: Indexing and Data Mining in Multimedia Databases

More apps: Brain scans

• Oct-trees; brain-scans

[Figure: Log(#octants) vs. octree levels; 2.63 = fd]

Page 56: Indexing and Data Mining in Multimedia Databases

Cross-roads of Montgomery county:

• any rules?

[Figure: GIS points]

Page 57: Indexing and Data Mining in Multimedia Databases

GIS

A: self-similarity:

• intrinsic dim. = 1.51

• avg#neighbors(<= r ) = r^D

[Figure: log(#pairs(within <= r)) vs. log( r ); slope 1.51]

Page 58: Indexing and Data Mining in Multimedia Databases

Examples: LB county

• Long Beach county of CA (road end-points)

Page 59: Indexing and Data Mining in Multimedia Databases

More fractals:

• cardiovascular system: 3 (!)

• stock prices (LYCOS) - random walks: 1.5

• Coastlines: 1.2-1.58 (?)

[Figure labels: 1 year, 2 years]

Page 60: Indexing and Data Mining in Multimedia Databases

Page 61: Indexing and Data Mining in Multimedia Databases

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples
  – fractals
  – power laws

• Conclusions

Page 62: Indexing and Data Mining in Multimedia Databases

Fractals <-> Power laws

self-similarity ->

• <=> fractals

• <=> scale-free

• <=> power-laws (y=x^a, F=C*r^(-2))

[Figure: log(#pairs within <= r) vs. log( r ); slope 1.58]

Page 63: Indexing and Data Mining in Multimedia Databases

Zipf’s law

Zipf’s (first) Law:

Bible: RANK-FREQUENCY plot (in log-log scales)

[Figure: log(freq) vs. log(rank); labeled points: “the”, “and”]

Page 64: Indexing and Data Mining in Multimedia Databases

Zipf’s law

• similarly for first names (slope ~-1)

• last names (~ -0.7)

• etc
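
A short sketch of how a rank-frequency slope like the ones quoted above could be measured: count each distinct token, sort the counts, and fit a least-squares line through log(freq) vs. log(rank). The function name and the plain least-squares fit are illustrative choices; the synthetic vocabulary is made up so that frequency is roughly 1000 / rank.

```python
from collections import Counter
import math

def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic 'text': word i appears about 1000 / i times.
tokens = []
for i in range(1, 51):
    tokens += [f"w{i}"] * round(1000 / i)
slope = rank_frequency_slope(tokens)    # close to -1 for this synthetic input
```

On real text the input would be the word list of the corpus (or a list of first or last names), and the fit is usually restricted to the range where the plot looks straight.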

Page 65: Indexing and Data Mining in Multimedia Databases

More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figure: axis labels log(count), magnitude, day, amplitude]

Page 66: Indexing and Data Mining in Multimedia Databases

Web Site Traffic

Clickstream data: <url, u-id, ....>

[Figure: log(count) vs. log(freq); labeled ‘Zipf’]

Page 67: Indexing and Data Mining in Multimedia Databases

Lotka’s law

• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)

[Figure: log(count) vs. log(#citations); labeled point: J. Ullman]

Page 68: Indexing and Data Mining in Multimedia Databases

Korcak’s law

Scandinavian lakes: area vs complementary cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]
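
The plotted quantity, count( >= area), is a complementary cumulative count. A minimal sketch of computing it for a list of areas; the function name is hypothetical and the sample values are made up.

```python
import numpy as np

def ccdf_counts(areas):
    """For each distinct area a, the number of items with area >= a."""
    v = np.sort(np.asarray(areas, dtype=float))
    distinct = np.unique(v)
    # Items at or after the first occurrence of a value are >= that value.
    counts = len(v) - np.searchsorted(v, distinct, side="left")
    return distinct, counts

areas, counts = ccdf_counts([1, 2, 2, 8])   # areas: 1, 2, 8; counts: 4, 3, 1
```

Plotting log(counts) against log(areas) and fitting a line then gives the exponent; a straight line on these axes is what the slide calls Korcak’s law.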

Page 69: Indexing and Data Mining in Multimedia Databases

More power laws: Korcak

Japan islands: area vs cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]

Page 70: Indexing and Data Mining in Multimedia Databases

(Korcak’s law: Aegean islands)

Page 71: Indexing and Data Mining in Multimedia Databases

Olympic medals:

[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R2 = 0.9458; labeled points: USA, China, Russia]

Page 72: Indexing and Data Mining in Multimedia Databases

SALES data – store#96

[Figure: count of products vs. # units sold]

Page 73: Indexing and Data Mining in Multimedia Databases

TELCO data

[Figure: count of customers vs. # of service units]

Page 74: Indexing and Data Mining in Multimedia Databases

More power laws on the Internet

degree vs rank, for Internet domains (log-log) [sigcomm99]

[Figure: log(degree) vs. log(rank); slope -0.82]

Page 75: Indexing and Data Mining in Multimedia Databases

Even more power laws:

• Income distribution (Pareto’s law);

• duration of UNIX jobs [Harchol-Balter]

• Distribution of UNIX file sizes

• Web graph [CLEVER-IBM; Barabasi]

Page 76: Indexing and Data Mining in Multimedia Databases

Overall Conclusions:

‘Find similar/interesting things’ in multimedia databases

• Indexing: feature extraction (‘GEMINI’)
  – automatic feature extraction: FastMap
  – Relevance feedback: FALCON

Page 77: Indexing and Data Mining in Multimedia Databases

Conclusions - cont’d

• New tools for Data Mining: Fractals/power laws:
  – appear everywhere
  – lead to skewed distributions (not Gaussian, Poisson, uniform, or independent)
  – ‘correlation integral’ for separability/cluster detection
  – PFD for dimensionality reduction

Page 78: Indexing and Data Mining in Multimedia Databases

Resources:

• Software and papers:
  – www.cs.cmu.edu/~christos
  – Fractal dimension (FracDim)
  – Separability (sigmod 2000, kdd2001)
  – Relevance feedback for query by content (FALCON – vldb 2000)

Page 79: Indexing and Data Mining in Multimedia Databases

Resources

• Manfred Schroeder “Chaos, Fractals and Power Laws”

