Indexing and Data Mining in Multimedia Databases

Post on 27-Jan-2015


Christos Faloutsos

CMU www.cs.cmu.edu/~christos

USC 2001 C. Faloutsos 2

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources


Problem

Given a large collection of (multimedia) records, find similar/interesting things, i.e.:

• Allow fast, approximate queries, and

• Find rules/patterns


Sample queries

• Similarity search
– Find pairs of branches with similar sales patterns
– Find medical cases similar to Smith's
– Find pairs of sensor series that move in sync
– Find shapes like a spark-plug


Sample queries –cont’d

• Rule discovery
– Clusters (of branches; of sensor data; ...)
– Forecasting (total sales for next year?)
– Outliers (e.g., unexpected part failures; fraud detection)


Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Related projects @ CMU and resources


Indexing - Multimedia

Problem:

• given a set of (multimedia) objects,

• find the ones similar to a desirable query object

[Figure: three time sequences, $price vs. day (day 1 to 365)]

distance function: by expert

‘GEMINI’ - Pictorially

[Figure: sequences S1, ..., Sn (day 1 to 365) are mapped to feature points F(S1), ..., F(Sn); feature axes: e.g., avg and e.g., std]
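The picture above can be sketched in code. This is a minimal illustration and not the GEMINI implementation: it assumes equal-length sequences, Euclidean distance as the expert's distance function, and the two example features from the slide (avg and std). The function names and the sample data are hypothetical. Scaling both features by sqrt(n) makes the feature-space distance a lower bound of the true distance, so the quick filter step cannot dismiss a true answer; it can only let through false alarms, which the refinement step removes.

```python
import math

def features(seq):
    """Map a sequence to (avg, std), scaled by sqrt(n) so that the
    feature distance lower-bounds the Euclidean distance."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)  # population std
    scale = math.sqrt(n)
    return (scale * avg, scale * std)

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(sequences, query, eps):
    """Indices of sequences within eps of the query: filter, then refine."""
    fq = features(query)
    # Filter in the 2-d feature space; a spatial access method could
    # replace this linear scan.
    candidates = [i for i, s in enumerate(sequences)
                  if euclid(features(s), fq) <= eps]
    # Refine: drop false alarms using the full distance.
    return [i for i in candidates if euclid(sequences[i], query) <= eps]

def brute_force(sequences, query, eps):
    """Reference answer: compare the query against every sequence."""
    return [i for i, s in enumerate(sequences) if euclid(s, query) <= eps]

# Hypothetical sample data (not from the talk).
series = [
    [10.0, 11.0, 12.0, 11.0],
    [10.5, 11.5, 12.5, 11.5],
    [50.0, 20.0, 80.0, 10.0],
    [11.0, 11.0, 11.0, 11.0],
]
query = [10.0, 11.0, 12.0, 11.5]
print(range_query(series, query, eps=1.2))
```

With only two features many false alarms may survive the filter on real data; more (or better) features tighten the bound at the cost of a higher-dimensional index.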


Remaining issues

• how to extract features automatically?

• how to merge similarity scores from different media?


Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions


FastMap

Distance matrix:

      O1    O2    O3    O4    O5
O1     0     1     1   100   100
O2     1     0     1   100   100
O3     1     1     0   100   100
O4   100   100   100     0     1
O5   100   100   100     1     0

[Figure: the five objects drawn as points; labels: ~100, ~1, ??]


FastMap

• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time

• We want a linear algorithm: FastMap [SIGMOD95]
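A compact sketch of the FastMap idea, applied to the five-object distance matrix shown earlier. It is a simplified reading of the algorithm, not the reference code: pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances. The pivot heuristic used here (farthest object from object 0, then farthest from that one) and the use of a full in-memory matrix are assumptions made for brevity; each dimension only looks up O(N) distances.

```python
import math

def fastmap(dist, k):
    """Map N objects to k-d points, given a symmetric distance matrix."""
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared residual distance after removing the first `dim` coordinates.
        val = dist[i][j] ** 2
        for c in range(dim):
            val -= (coords[i][c] - coords[j][c]) ** 2
        return max(val, 0.0)

    for dim in range(k):
        # Pivot heuristic (an assumption): farthest from object 0,
        # then farthest from that object.
        b = max(range(n), key=lambda j: d2(0, j, dim))
        a = max(range(n), key=lambda j: d2(b, j, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line a -> b.
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords

# The distance matrix for objects O1..O5.
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
points = fastmap(D, 2)
for name, p in zip(["O1", "O2", "O3", "O4", "O5"], points):
    print(name, [round(v, 3) for v in p])
```

On this input the first coordinate alone already separates {O1, O2, O3} from {O4, O5} by roughly 100, which is the structure the matrix encodes.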


Applications: time sequences

• given n co-evolving time sequences

• visualize them + find rules [ICDE00]

[Figure: rate vs. time for HKD, JPY, and DEM]

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD(t), USD(t-5), FRF, GBP, JPY, HKD]

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]

Application: VideoTrails

[ACM MM97]


VideoTrails - usage

• scene-cut detection (about 10% errors)

• scene classification (e.g., dialogue vs action)

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions


Merging similarity scores

• e.g., video: text, color, motion, audio
– weights change with the query!

• solution 1: user specifies weights

• solution 2: user gives examples
– and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
– but: how about disjunctive queries?

‘FALCON’

[Figure: example patterns labeled ‘Inverted Vs’ and ‘Vs’]

Trader wants only ‘unstable’ stocks


“Single query point” methods

[Figure: Rocchio; good examples (+) and the single query point (x)]

“Single query point” methods

[Figure: three panels (Rocchio, MARS, MindReader), each with good examples (+) and a single query point (x)]

The averaging effect in action...

Main idea: FALCON Contours

[Figure: good examples (+) with contours; axes: feature1 (e.g., temperature) and feature2 (e.g., frequency)]

[Wu+, vldb2000]
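To see why averaging hurts disjunctive queries, here is a small sketch that contrasts a single averaged query point with a score that looks at every good example separately. The second score is a power mean with a negative exponent, which acts like a soft minimum; treat that formula, the value alpha = -5, and all names and data below as illustrative assumptions and check [Wu+, vldb2000] for the actual definition.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_score(good, x):
    """'Single query point' style: distance to the average of the good examples."""
    dims = len(good[0])
    center = [sum(g[d] for g in good) / len(good) for d in range(dims)]
    return dist(center, x)

def aggregate_score(good, x, alpha=-5.0):
    """Power mean of the distances to every good example.
    With alpha < 0 it behaves like a soft minimum (being close to any one
    example is enough). The form and alpha are assumptions, see lead-in."""
    ds = [dist(g, x) for g in good]
    if any(d == 0.0 for d in ds):
        return 0.0
    return (sum(d ** alpha for d in ds) / len(ds)) ** (1.0 / alpha)

# Hypothetical feedback: the user liked two separate groups of items.
good = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
near_group = (0.0, 0.5)   # sits inside the first group
in_between = (5.0, 5.5)   # sits at the average of all good examples

for name, x in [("near_group", near_group), ("in_between", in_between)]:
    print(name, round(centroid_score(good, x), 2), round(aggregate_score(good, x), 2))
```

The averaged query point ranks the in-between item best even though it resembles neither group, while the soft-minimum score prefers the item that lies inside one of the groups.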


Conclusions for indexing + visualization

• GEMINI: fast indexing, exploiting off-the-shelf SAMs

• FastMap: automatic feature extraction in O(N) time

• FALCON: relevance feedback for disjunctive queries


Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources


Data mining & fractals – Road map

• Motivation – problems / case study

• Definition of fractals and power laws

• Solutions to posed problems

• More examples


Problem #1 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies

(stores & households ; mpg & MTBF...)

- patterns? (not Gaussian; not uniform)

- attraction/repulsion?

- separability??

Problem #2: dim. reduction

• given attributes x1, ... xn

– possibly, non-linearly correlated

• drop the useless ones

(Q: why?

A: to avoid the ‘dimensionality curse’)


Answer:

• Fractals / self-similarities / power laws


What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area;

infinite length!


Definitions (cont’d)

• Paradox: Infinite perimeter ; Zero area!

• ‘dimensionality’: between 1 and 2

• actually: Log(3)/Log(2) = 1.58… (long story)


Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

x   y
5   1
4   2
3   3
2   4

E.g.: #cylinders; miles / gallon


Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1 (‘power law’: y = x^a)

• Q: fd of a plane?

• A: nn ( <= r ) ~ r^2

fd == slope of ( log(nn) vs log(r) )

Sierpinski triangle

[Figure: log(#pairs within <= r) vs. log(r), == ‘correlation integral’; slope 1.58]
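The slope can be checked numerically. The sketch below is a brute-force estimate, written for illustration rather than speed: it samples the Sierpinski triangle with the ‘chaos game’, counts the pairs within each radius, and fits the slope of log(#pairs) against log(r) by least squares. The sample size, the radii, and the function names are arbitrary choices of this sketch; a set of points on a straight line is included as a sanity check (expected slope near 1).

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    pts = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2   # jump halfway to a random corner
        if i >= 20:                          # skip the first iterates (burn-in)
            pts.append((x, y))
    return pts

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by least squares."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            d = math.hypot(xi - points[j][0], yi - points[j][1])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    if min(counts) == 0:
        raise ValueError("a radius is too small for this sample")
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

radii = [2.0 ** -k for k in range(3, 7)]   # 1/8, 1/16, 1/32, 1/64
rng = random.Random(1)
line = [(rng.random(), 0.0) for _ in range(800)]
fd_line = correlation_dimension(line, radii)
fd_sierpinski = correlation_dimension(sierpinski_points(800), radii)
print("line:", round(fd_line, 2), "Sierpinski:", round(fd_sierpinski, 2))
```

The estimate depends on the sample and on the range of radii: very large radii saturate (every pair is counted) and very small ones leave too few pairs, so the slope should be read off the straight middle part of the plot.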


Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

• Conclusions


Solution #1: spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])

• clusters?

• separable?

• attraction/repulsion?

• data ‘scrubbing’ – duplicates?

Solution #1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau, repulsion]


[w/ Seeger, Traina, Traina, SIGMOD00]
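The quantity on the vertical axis of these plots can be computed by brute force, as in the sketch below. This is not the ‘BOPS’ method itself, only a direct pair count used to illustrate what is being plotted; the function name and the toy data are hypothetical. Pairs are ordered, and when a set is compared with itself each point is also paired with itself.

```python
import math

def cross_pair_counts(set_a, set_b, radii):
    """For each r: number of pairs (a, b), a in set_a and b in set_b,
    with distance <= r."""
    counts = [0] * len(radii)
    for ax, ay in set_a:
        for bx, by in set_b:
            d = math.hypot(ax - bx, ay - by)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

# Toy data: two 5x5 grids of points separated by a gap along x.
group_a = [(float(x), float(y)) for x in range(5) for y in range(5)]
group_b = [(float(x + 10), float(y)) for x in range(5) for y in range(5)]
radii = [1.0, 2.0, 4.0, 8.0, 16.0]

self_counts = cross_pair_counts(group_a, group_a, radii)
cross_counts = cross_pair_counts(group_a, group_b, radii)
print("a-a:", self_counts)
print("a-b:", cross_counts)
```

In this toy example the two groups keep away from each other, so the a-b count stays at zero until r reaches the gap while the a-a count grows from the start. Comparing self-pair and cross-pair curves in this way is my reading of how the plot is used to spot attraction or repulsion between the two kinds of galaxies.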


spatial d.m.

[Figure: clustered points with characteristic radii r1 and r2, marked on both the point set and the plot]

Heuristic on choosing # of clusters


Solution #1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r) for spi-spi, spi-ell, and ell-ell pairs; annotations: 1.8 slope, plateau, repulsion, duplicates]

Problem #2: Dim. reduction

[Figure: three 2-d datasets on axes x and y, each ranging from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]

Solution:

• drop the attributes that don’t increase the ‘partial f.d.’ PFD

• definition: the PFD of attribute set A is the f.d. of the cloud of points projected onto A [w/ Traina, Traina, Wu, SBBD00]
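A small sketch of how the PFD can drive attribute selection. It estimates the fractal dimension as the slope of log(#pairs within <= r) vs. log(r) and then applies a simple forward pass: an attribute is kept only if adding it raises the PFD by more than a tolerance. The forward pass, the tolerance of 0.1, the radii, and the two synthetic datasets (a ‘line’ and a ‘spike’) are assumptions of this sketch; the procedure in [SBBD00] may differ.

```python
import math
import random

def fractal_dim(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by least squares."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]   # assumes every count is > 0
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

def pfd(data, attrs, radii):
    """Partial fractal dimension: f.d. of the data projected onto attrs."""
    return fractal_dim([tuple(row[a] for a in attrs) for row in data], radii)

def select_attributes(data, radii, tol=0.1):
    """Keep an attribute only if it increases the PFD by more than tol."""
    kept, current = [], 0.0
    for a in range(len(data[0])):
        candidate = pfd(data, kept + [a], radii)
        if candidate - current > tol:
            kept.append(a)
            current = candidate
    return kept

rng = random.Random(1)
radii = [0.01, 0.02, 0.04, 0.08]
line = [(t, t) for t in (rng.random() for _ in range(400))]   # y = x
spike = [(0.5, rng.random()) for _ in range(400)]             # x is constant

print("line keeps:", select_attributes(line, radii))
print("spike keeps:", select_attributes(spike, radii))
```

For the line, y adds nothing once x is known, so only one attribute survives; for the spike, the constant x has PFD 0 and is dropped. The order in which attributes are tried matters in this greedy version.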


Problem #2: dim. reduction

[Figure: (a) Quarter-circle, (b) Line, (c) Spike, on axes x and y from 0 to 1, annotated with per-attribute PFD values (PFD~1, PFD=1, PFD=0) and global FD=1]


Notice: ‘max variance’ would fail here


Notice: SVD would fail here


Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples
– fractals
– power laws

• Conclusions


disk traffic

• Not Poisson, not(?) iid - BUT: self-similar

• How to model it?

[Figure: #bytes vs. time]

traffic

• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])

[Figure: #bytes vs. time, with a 20% / 80% split marked]
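One way to generate such a trace is to apply the 80-20 split recursively: give 80% of the volume to one half of the time interval and 20% to the other, then repeat inside each half. The sketch below does this under stated assumptions: the bias of 0.8 comes from the slide, while choosing the heavy half at random, the number of levels, and the function name are choices of this sketch and are not taken from [ICDE’02].

```python
import random

def bursty_trace(total, levels, bias=0.8, seed=0):
    """Split `total` bytes over 2**levels time slots by recursive 80-20 splits."""
    rng = random.Random(seed)
    trace = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in trace:
            heavy, light = volume * bias, volume * (1 - bias)
            # Assumption: the heavy share lands in a randomly chosen half.
            nxt.extend((heavy, light) if rng.random() < 0.5 else (light, heavy))
        trace = nxt
    return trace

trace = bursty_trace(total=1000.0, levels=10)
print(len(trace), "slots; busiest slot holds",
      round(100 * max(trace) / sum(trace), 1), "% of the bytes")
```

With bias = 0.5 every slot receives the same volume, so the single bias parameter controls how bursty the trace is.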


Traffic

Many other time-sequences are bursty/clustered: (such as?)


Tape accesses

[Figure: accesses over time on tapes Tape#1 ... Tape#N]

# tapes needed, to retrieve n records?

(# days down, due to failures / hurricanes / communication noise...)


Tape accesses

[Figure: accesses over time on tapes Tape#1 ... Tape#N; plot of # tapes retrieved vs. # qual. records, with curves ‘50-50 = Poisson’ and ‘real’]

More apps: Brain scans

• Oct-trees; brain-scans

[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]

Cross-roads of Montgomery county:

• any rules?

[Figure: GIS points]

GIS

A: self-similarity:

• intrinsic dim. = 1.51

• avg#neighbors(<= r) = r^D

[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]

Examples: LB county

• Long Beach county of CA (road end-points)


More fractals:

• cardiovascular system: 3 (!)

• stock prices (LYCOS) - random walks: 1.5

• Coastlines: 1.2-1.58 (?)

[Figure: stock price plots over 1 year and 2 years]

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples
– fractals
– power laws

• Conclusions


Fractals <-> Power laws

self-similarity ->

• <=> fractals

• <=> scale-free

• <=> power-laws (y=x^a, F=C*r^(-2))

[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Zipf’s law

Bible: RANK-FREQUENCY plot (in log-log scales)

Zipf’s (first) Law:

[Figure: log(freq) vs. log(rank); the top-ranked words are “the” and “and”]

Zipf’s law

• similarly for first names (slope ~-1)

• last names (~ -0.7)

• etc
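The slopes quoted above come from a straight-line fit on the log-log rank-frequency plot. A minimal sketch of that fit follows; the function name and the synthetic token list are hypothetical, and the list is built so that the item of rank i occurs about 1200 / i times, which should give a slope close to -1.

```python
import math
from collections import Counter

def rank_frequency_slope(items):
    """Least-squares slope of log(freq) vs. log(rank)."""
    freqs = sorted(Counter(items).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# Synthetic data: token "w<i>" appears about 1200 / i times.
tokens = [f"w{rank}" for rank in range(1, 61) for _ in range(1200 // rank)]
slope = rank_frequency_slope(tokens)
print(round(slope, 2))
```

On real text the tail of rare words (many items with frequency 1 or 2) can dominate an unweighted fit, so in practice the fit is often restricted to the top ranks.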


More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figures: log(count) vs. magnitude; amplitude vs. day]

<url, u-id, ....>

Web Site Traffic

Clickstream data

[Figure: log-log plot with axes log(freq) and log(count); annotation: Zipf]

Lotka’s law

• library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001)

[Figure: log(count) vs. log(#citations); labeled point: J. Ullman]

Korcak’s law

Scandinavian lakes area vs complementary cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]

More power laws: Korcak

Japan islands; area vs cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]

(Korcak’s law: Aegean islands)


Olympic medals:

[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458; labeled points: USA, China, Russia]

SALES data – store#96

[Figure: count of products vs. # units sold]

TELCO data

[Figure: count of customers vs. # of service units]

More power laws on the Internet

degree vs rank, for Internet domains (log-log) [sigcomm99]

[Figure: log(degree) vs. log(rank); slope -0.82]

Even more power laws:

• Income distribution (Pareto’s law);

• duration of UNIX jobs [Harchol-Balter]

• Distribution of UNIX file sizes

• Web graph [CLEVER-IBM; Barabasi]

Overall Conclusions:

‘Find similar/interesting things’ in multimedia databases

• Indexing: feature extraction (‘GEMINI’)
– automatic feature extraction: FastMap
– Relevance feedback: FALCON

Conclusions - cont’d

• New tools for Data Mining: Fractals/power laws:
– appear everywhere
– lead to skewed distributions (rather than Gaussian, Poisson, uniformity, independence)
– ‘correlation integral’ for separability/cluster detection
– PFD for dimensionality reduction

Resources:

• Software and papers:
– www.cs.cmu.edu/~christos
– Fractal dimension (FracDim)
– Separability (sigmod 2000, kdd2001)
– Relevance feedback for query by content (FALCON – vldb 2000)

Resources

• Manfred Schroeder “Chaos, Fractals and Power Laws”