Advanced topics in databases
V. Megalooikonomou
Generic Multimedia Indexing
(slides are based on notes by C. Faloutsos)
General Overview
Multimedia Indexing
Spatial Access Methods (SAMs)
    k-d trees
    Point Quadtrees
    MX-Quadtree
    z-ordering
    R-trees
Generic Multimedia Indexing
Multimedia Indexing – Detailed outline
Generic Multimedia Indexing
    Problem definition
    Distance function
    Similarity queries – Types
    Requirements (ideal method)
    Basic idea, Lower-bounding
    GEMINI approach
    Applications
        1-D Time sequences
        2-D Color images
Generic Multimedia Indexing – problem
Given a database of multimedia objects
Design fast search algorithms that locate objects that match a query object, exactly or approximately
Objects:
    1-d time sequences
    Digitized voice or music
    2-d color images
    2-d or 3-d gray scale medical images
    Video clips
E.g.: “Find companies whose stock prices move similarly”
Generic Multimedia Indexing – problem
1st step: provide a measure for the distance between two objects
Distance function D():
    Given two objects OA, OB, the distance (= dis-similarity) of the two objects is denoted by D(OA, OB)
    E.g., Euclidean distance (square root of the sum of squared differences) of two equal-length time series
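As a minimal sketch of such a distance function, the Euclidean distance between two equal-length time series can be computed as the square root of the sum of squared differences. The function name `euclidean` and the sample values are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical 4-day price sequences.
s1 = [10.0, 11.0, 12.0, 13.0]
s2 = [10.0, 11.0, 15.0, 17.0]
print(euclidean(s1, s2))  # sqrt(0 + 0 + 9 + 16) = 5.0
```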
Types of Similarity Queries
Similarity queries are classified into:
Whole match queries:
    Given a collection of N objects O1, …, ON and a query object Q, find the data objects that are within distance ε from Q
Sub-pattern match:
    Given a collection of N objects O1, …, ON, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q
[Figure: time sequences S1, …, Sn (value vs. day, days 1–365) mapped to points F(S1), …, F(Sn) in a feature space with axes avg and std]
Types of Similarity Queries
Additional types of queries:
K-Nearest Neighbor queries:
    Given a collection of N objects O1, …, ON and a query object Q, find the K data objects most similar to Q
All pairs queries (or ‘spatial joins’):
    Given a collection of N objects O1, …, ON, find all pairs of objects that are within distance ε from each other
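The query types can be made concrete with a sequential-scan baseline. The sketch below is only an illustration: it assumes objects are equal-length numeric vectors compared with the Euclidean distance, and the names `range_query`, `knn_query`, and `all_pairs` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(objects, q, eps):
    """Whole match: indices of objects within distance eps of q."""
    return [i for i, o in enumerate(objects) if dist(o, q) <= eps]

def knn_query(objects, q, k):
    """K nearest neighbors: indices of the k objects closest to q."""
    order = sorted(range(len(objects)), key=lambda i: dist(objects[i], q))
    return order[:k]

def all_pairs(objects, eps):
    """All pairs ('spatial join'): index pairs (i, j), i < j, within eps."""
    n = len(objects)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(objects[i], objects[j]) <= eps]

objects = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
q = [0.0, 0.0]
print(range_query(objects, q, 1.5))  # [0, 1]
print(knn_query(objects, q, 3))      # [0, 1, 3]
print(all_pairs(objects, 1.5))       # [(0, 1)]
```

Every function here computes the distance to every object (or every pair), which is the cost an index is meant to avoid.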
Ideal method – requirements
Fast: sequential scanning and distance calculation with each and every object is too slow for large databases
“Correct”: no false dismissals. False alarms are acceptable. Why?
Small space overhead
Dynamic: easy to insert, delete, and update objects
Approach Outline
Use k feature extraction functions to map objects into points in k-dimensional space (applying a mapping F())
Use highly fine-tuned database SAMs (Spatial Access Methods) like R-trees to accelerate the search (by pruning out large portions of the database that are not promising)…
Basic idea
Focus on ‘whole match’ queries
Given a collection of N objects O1, …, ON, a distance/dis-similarity function D(Oi, Oj), and a query object Q, find the data objects that are within distance ε from Q
Sequential scanning? May be too slow, for the following reasons:
    Distance computation is expensive (e.g., edit distance in DNA strings)
    The database size N may be huge
Faster alternative?
Basic idea
Faster alternative:
    Step 1: a ‘quick and dirty’ test to quickly discard the vast majority of non-qualifying objects
    Step 2: use of SAMs to achieve faster-than-sequential searching
Example:
    Database of yearly stock price movements
    Euclidean distance function:
    $$D(S, Q) = \left( \sum_{i=1}^{n} (S[i] - Q[i])^2 \right)^{1/2}$$
    where n is the sequence length (e.g., n = 365 daily values)
    Characterize with a single number (‘feature’)
    Or use two or more features
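Basic idea – illustration
A query with tolerance ε becomes a sphere with radius ε
[Figure: time sequences S1, …, Sn (value vs. day, days 1–365) mapped to points F(S1), …, F(Sn) in a 2-d feature space with axes Feature1 and Feature2]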
Basic idea – caution!
The mapping F() from objects to k-d points should not distort the distances
    D(): distance of two objects
    Df(): distance of their corresponding feature vectors
Ideally, perfect preservation of distances
In practice, a guarantee of no false dismissals
How? If the distance in f-space matches or underestimates the distance between the two objects in the original space
Basic idea – Lower bounding
Let O1, O2 be two objects with distance function D(), and let F(O1), F(O2) be their feature vectors with distance function Df().
To guarantee no false dismissals for whole match queries, the feature extraction function F() should satisfy:
    Df(F(O1), F(O2)) ≤ D(O1, O2)
for every pair of objects O1, O2
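A small sketch can make the lower-bounding condition and the resulting filter-and-refine search concrete. It assumes Euclidean distance on equal-length sequences and a single hypothetical feature, sqrt(n) times the average. By the Cauchy–Schwarz inequality this feature satisfies |F(S) − F(Q)| ≤ D(S, Q). The linear scan over features stands in for a SAM, and all names are illustrative.

```python
import math

def dist(a, b):
    """Actual distance D(): Euclidean distance in the original space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature(s):
    """Hypothetical 1-d feature: sqrt(n) * average of the sequence."""
    n = len(s)
    return math.sqrt(n) * (sum(s) / n)

def range_query(db, q, eps):
    """Return (qualifying sequences, number of candidates after filtering)."""
    fq = feature(q)
    # Filter step: Df = |F(S) - F(Q)| never exceeds D(S, Q), so no
    # qualifying sequence is discarded here (false alarms may survive).
    candidates = [s for s in db if abs(feature(s) - fq) <= eps]
    # Refine step: discard the false alarms using the actual distance.
    return [s for s in candidates if dist(s, q) <= eps], len(candidates)

q = [1.0, 1.0, 1.0, 1.0]
db = [
    [1.0, 1.0, 1.0, 2.0],    # D = 1, qualifies
    [5.0, 5.0, 5.0, 5.0],    # D = 8, pruned by the filter
    [3.0, -1.0, 3.0, -1.0],  # D = 4, same average as q: a false alarm
    [1.0, 2.0, 1.0, 1.0],    # D = 1, qualifies
]
results, n_candidates = range_query(db, q, 1.5)
print(results)        # [[1.0, 1.0, 1.0, 2.0], [1.0, 2.0, 1.0, 1.0]]
print(n_candidates)   # 3
```

The third sequence passes the filter but is rejected by the refine step, which is why false alarms are tolerable while false dismissals are not.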
Lower bounding – Proof
Let Q be the query object, O be the qualifying object, and ε be the tolerance.
Prove: if object O qualifies, it will be retrieved by a range query in the f-space
Or, D(Q, O) ≤ ε ⇒ Df(F(Q), F(O)) ≤ ε
However, Df(F(Q), F(O)) ≤ D(Q, O) ≤ ε
What about ‘all-pairs’? (‘spatial join’ on f-space)
What about ‘nearest-neighbor’ queries?
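One commonly used way to handle nearest-neighbor queries under a lower-bounding feature distance is sketched below. It is an assumption about how the question might be answered, not something the slides spell out: find the k nearest neighbors in feature space, use their largest actual distance as a radius ε, run a range query in feature space with that radius, and rank the candidates by actual distance. The feature (sqrt(n) times the average) and all names are hypothetical.

```python
import math

def dist(a, b):
    """Actual distance D(): Euclidean distance in the original space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature(s):
    """Hypothetical lower-bounding feature: sqrt(n) * average."""
    n = len(s)
    return math.sqrt(n) * (sum(s) / n)

def knn(db, q, k):
    """Indices of the k nearest neighbors of q by actual distance."""
    fq = feature(q)
    fdist = [abs(feature(s) - fq) for s in db]
    # Step 1: k nearest neighbors in feature space.
    first = sorted(range(len(db)), key=lambda i: fdist[i])[:k]
    # Step 2: their largest actual distance is a safe search radius,
    # because at least k objects lie within it.
    eps = max(dist(db[i], q) for i in first)
    # Step 3: range query in feature space with radius eps.
    candidates = [i for i in range(len(db)) if fdist[i] <= eps]
    # Step 4: rank the candidates by actual distance.
    return sorted(candidates, key=lambda i: dist(db[i], q))[:k]

q = [1.0, 1.0, 1.0, 1.0]
db = [
    [3.0, -1.0, 3.0, -1.0],  # D = 4, but identical feature value to q
    [1.0, 1.0, 1.0, 2.0],    # D = 1
    [5.0, 5.0, 5.0, 5.0],    # D = 8
    [1.0, 3.0, 1.0, 1.0],    # D = 2
]
print(knn(db, q, 2))  # [1, 3]
```

Taking the k nearest points in feature space alone would return indices 0 and 1 here, which is wrong; the second pass with radius ε recovers index 3.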
GEneric Multimedia object INdexIng
GEMINI approach:
    1. Determine the distance function D()
    2. Find one or more numerical feature-extraction functions (to provide a ‘quick and dirty’ test)
    3. Prove that Df() lower-bounds D(), to guarantee no false dismissals
    4. Use a SAM (e.g., R-tree) to store and retrieve the k-d feature vectors
!!! The methodology focuses on the speed of search only, not on the quality of the results, which relies on the distance function
Generic Multimedia Object Indexing
Applications:
    1-d time sequences
    2-d color images
Problems to solve:
    How to apply the lower-bounding lemma
    ‘Curse of Dimensionality’ (time sequences)
    ‘Cross-talk’ of features (color images)
1-D Time Sequences
Distance function: Euclidean distance
Find features that:
    Preserve/lower-bound the distance
    Carry as much information as possible (reduce false alarms)
If we are allowed to use only one feature, what would this be? The average.
… extending it…
    The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
    Coefficients of the Fourier transform (DFT), wavelet transform, etc.
1-D Time Sequences
Show that the distance in feature space lower-bounds the actual distance
What about DFT?
    Parseval’s Theorem: DFT (with the 1/√n normalization) preserves the energy of the signal as well as the distances between two signals.
    D(x, y) = D(X, Y), where X and Y are the Fourier transforms of x and y
    If we keep the first k ≤ n coefficients of DFT we lower-bound the actual distance:
    $$D_f(F(x), F(y))^2 = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 = D(x, y)^2$$
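The lower-bounding property of truncated DFT features can be checked numerically. The sketch below assumes the orthonormal DFT (scaled by 1/√n, so that Parseval's theorem holds without extra constants) and uses a direct O(n²) implementation to stay dependency-free. The sample sequences are made up for illustration.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT: X_f = (1/sqrt(n)) * sum_i x_i * exp(-2*pi*j*f*i/n)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * f * i / n) for i in range(n))
            / math.sqrt(n) for f in range(n)]

def dist(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

def features(x, k):
    """Feature vector: the first k DFT coefficients."""
    return dft(x)[:k]

x = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 4.0, 2.0]
y = [2.0, 2.0, 3.0, 5.0, 4.0, 7.0, 5.0, 1.0]
d_actual = dist(x, y)
for k in (1, 2, 4, 8):
    d_feat = dist(features(x, k), features(y, k))
    # The feature distance never exceeds the actual distance
    # (a small tolerance absorbs floating-point rounding).
    print(k, d_feat <= d_actual + 1e-9)
```

With k = n the two distances coincide up to rounding, which is the statement D(x, y) = D(X, Y).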
1-D Time Sequences
Response time improves as the transform concentrates more of the energy of the signal in the first few coefficients
DFT concentrates the energy for a large class of signals, the colored noises
Colored noises: skewed energy spectrum that drops as O(f^(-b))
Energy spectrum or power spectrum of a signal is the square of the amplitude |Xf| as a function of the frequency f
    b = 2: random walks or brown noise (very predictable)
    b > 2: black noises
    b = 1: pink noise
    b = 0: white noise (completely unpredictable)
Colored noises appear even in images (photographs)
2-D color images
Image features for Content Based Image Retrieval (CBIR):
Low Level:
    Color – color histograms
    Texture – directionality, granularity, contrast
    Shape – turning angle, moments of inertia, pattern spectrum
    Position – 2D strings method
    …etc
Object Level:
    Regions
2-D color images – Color histograms
Each color image – a 2-d array of pixels
Each pixel – 3 color components (R, G, B)
h colors – each color denoting a point in 3-d color space (as high as 2^24 colors)
For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color
The histogram of image I is defined as:
    For a color Ci, Hci(I) represents the number of pixels of color Ci in image I
    OR: For any pixel in image I, Hci(I) represents the probability of that pixel having color Ci.
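A minimal sketch of computing an h-element color histogram follows. It assumes the image is a list of rows of (R, G, B) tuples with components in 0..255 and uses uniform quantization of each channel into `levels` ranges as a simple stand-in for choosing representative colors. The function names and the quantization scheme are illustrative assumptions.

```python
def quantize(pixel, levels=2):
    """Map an (R, G, B) pixel with 0..255 components to a bin index.

    Each channel is cut into `levels` equal ranges, giving levels**3 bins.
    """
    r, g, b = (min(c * levels // 256, levels - 1) for c in pixel)
    return (r * levels + g) * levels + b

def color_histogram(image, levels=2):
    """Fraction of pixels falling into each of the h = levels**3 color bins."""
    h = levels ** 3
    counts = [0] * h
    pixels = [p for row in image for p in row]
    for p in pixels:
        counts[quantize(p, levels)] += 1
    return [c / len(pixels) for c in counts]

# A hypothetical 2x2 image: two reddish pixels, one blue, one black.
image = [
    [(255, 0, 0), (250, 10, 5)],
    [(0, 0, 255), (0, 0, 0)],
]
print(color_histogram(image))
# [0.25, 0.25, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
```

Both reddish pixels land in the same bin, which is the effect of grouping similar colors under one representative.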
2-D color images – Color histograms
Usually cluster similar colors together and choose one representative color for each ‘color bin’
Most commercial CBIR systems include color histogram as one of the features (e.g., QBIC of IBM)
No spatial information
Color histograms – distance
One method to measure the distance between two histograms x and y is:
$$d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$$
where the color-to-color similarity matrix A has entries aij that describe the similarity between color i and color j
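The quadratic-form histogram distance can be sketched directly from its definition. The 3-bin similarity matrix below is a made-up example (red and orange are treated as half-similar, blue as unrelated), and `quadratic_distance_sq` is a hypothetical name.

```python
def quadratic_distance_sq(x, y, a):
    """d_h^2(x, y) = (x - y)^t A (x - y) for h-element histograms x, y."""
    z = [xi - yi for xi, yi in zip(x, y)]
    h = len(z)
    # The double sum includes every cross term a[i][j] * z[i] * z[j].
    return sum(a[i][j] * z[i] * z[j] for i in range(h) for j in range(h))

# Hypothetical similarity matrix for 3 bins: red, orange, blue.
A = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
all_red = [1.0, 0.0, 0.0]
all_orange = [0.0, 1.0, 0.0]
all_blue = [0.0, 0.0, 1.0]
print(quadratic_distance_sq(all_red, all_orange, A))  # 1.0
print(quadratic_distance_sq(all_red, all_blue, A))    # 2.0
```

With the identity matrix in place of A both distances would be 2.0; the off-diagonal entries are what make an all-red image closer to an all-orange one than to an all-blue one.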
Color histograms – lower bounding
Two obstacles for using color histograms as feature vectors in GEMINI:
    ‘Dimensionality curse’ (h is large: 64, 128)
    Distance function is quadratic
        It involves all cross terms (‘cross-talk’ among features)
        - expensive to compute
        - precludes the use of SAMs
[Figure: histograms x and q over, e.g., 64 colors; labeled bins: bright red, pink, orange]
Color histograms – lower bounding
1st step: define the distance function between two color images: D() = dh()
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds dh()
If we are allowed to use one numerical feature to describe the color image, what should it be?
Avg. amount of each color component (R, G, B):
$$\bar{x} = (R_{avg}, G_{avg}, B_{avg})^t$$
where
$$R_{avg} = (1/P) \sum_{p=1}^{P} R(p)$$
and similarly for G and B,
where P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel
Color histograms – lower bounding
Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define davg() as the Euclidean distance between the 3-d average color vectors:
$$d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2$$
3rd step: prove that the feature distance davg() lower-bounds the actual distance dh()
Main idea of approach: first a filtering using the average (R, G, B) color, then a more accurate matching using the full h-element histogram
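The filtering step on average color might look like the sketch below. It assumes images are lists of rows of (R, G, B) tuples. The threshold `eps` applies to davg() directly; relating it to a tolerance on dh() depends on the lower-bounding result of the 3rd step, which is not worked out here. All names and sample images are hypothetical.

```python
import math

def average_color(image):
    """3-d feature vector (R_avg, G_avg, B_avg) of an image."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def d_avg(x_bar, y_bar):
    """Euclidean distance between two average color vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_bar, y_bar)))

def filter_candidates(images, query, eps):
    """Filter step: names of images whose average color is within eps."""
    q_bar = average_color(query)
    return [name for name, img in images.items()
            if d_avg(average_color(img), q_bar) <= eps]

images = {
    "reddish": [[(200, 0, 0), (220, 20, 0)]],
    "bluish": [[(0, 0, 200), (10, 0, 250)]],
}
query = [[(210, 10, 0), (210, 10, 0)]]
print(average_color(images["reddish"]))      # (210.0, 10.0, 0.0)
print(filter_candidates(images, query, 30))  # ['reddish']
```

Only the surviving candidates would then be compared with the full h-element histogram distance.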
Color auto-correlogram
Pick any pixel p1 of color Ci in the image I; at distance k away from p1 pick another pixel p2. What is the probability that p2 is also of color Ci?
[Figure: image I with pixel P1 (red) and pixel P2 at distance k from it; is P2 also red?]
Color auto-correlogram
The auto-correlogram of image I for color Ci, distance k:
$$\gamma_{C_i}^{(k)}(I) \equiv \Pr[\, |p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \,]$$
Integrates both color information and spatial information.
Implementations
Pixel distance measures
    Use D8 distance (also called chessboard distance):
    $$D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$$
    Choose distance k = 1, 3, 5, 7
Computation complexity:
    Histogram: O(n^2)
    Correlogram: O(134 * n^2)
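A brute-force sketch of the auto-correlogram under the D8 distance is given below. It assumes an image is a list of rows of color labels and that pairs whose second pixel falls outside the image are simply skipped; this border handling and the function names are assumptions, and the code is written for clarity rather than speed.

```python
def d8(p, q):
    """Chessboard (D8) distance between pixels p = (x, y) and q = (x, y)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def auto_correlogram(image, color, k):
    """Estimate Pr[p2 has `color` | p1 has `color`, D8(p1, p2) = k].

    Pairs whose second pixel falls outside the image are skipped
    (a border-handling assumption).
    """
    rows, cols = len(image), len(image[0])
    same = total = 0
    for y1 in range(rows):
        for x1 in range(cols):
            if image[y1][x1] != color:
                continue
            # Visit every in-bounds pixel at D8 distance exactly k.
            for y2 in range(max(0, y1 - k), min(rows, y1 + k + 1)):
                for x2 in range(max(0, x1 - k), min(cols, x1 + k + 1)):
                    if d8((x1, y1), (x2, y2)) == k:
                        total += 1
                        same += image[y2][x2] == color
    return same / total if total else 0.0

# Two hypothetical 4x4 images with identical histograms (8 R, 8 B pixels).
blocks = ["RRBB", "RRBB", "RRBB", "RRBB"]
checker = ["RBRB", "BRBR", "RBRB", "BRBR"]
print(auto_correlogram(blocks, "R", 1))   # 32/42, about 0.76
print(auto_correlogram(checker, "R", 1))  # 18/42, about 0.43
```

A color histogram cannot tell these two images apart, while the auto-correlogram at k = 1 separates them clearly.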
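Implementations
Feature distance measures:
    If D(f(I1) − f(I2)) is small, I1 and I2 are considered similar.
    Example: f(a) = 1000, f(a’) = 1050; f(b) = 100, f(b’) = 150
    For histogram:
    $$|I - I'|_h = \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')}$$
    For correlogram:
    $$|I - I'|_\gamma = \sum_{i \in [m],\, k \in [d]} \frac{|\gamma_{C_i}^{(k)}(I) - \gamma_{C_i}^{(k)}(I')|}{1 + \gamma_{C_i}^{(k)}(I) + \gamma_{C_i}^{(k)}(I')}$$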
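The relative distance used for both feature types can be sketched as one small function applied to flat feature vectors (histogram bins, or correlogram entries over all colors and distances). The name `relative_l1` is hypothetical; the numbers are the example values given in the text.

```python
def relative_l1(f, g):
    """Sum over features of |f_i - g_i| / (1 + f_i + g_i)."""
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f, g))

# Equal absolute differences (50), but very different relative differences.
print(relative_l1([1000], [1050]))  # 50 / 2051, about 0.024
print(relative_l1([100], [150]))    # 50 / 251, about 0.199
```

Dividing by 1 + f_i + g_i makes a difference of 50 count for much more when the feature values themselves are small.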
Color Histogram vs Correlogram
If there is no difference between the query and the target images, both methods have good performance.
[Figure: query image (512 colors) and the top five results (1st to 5th) retrieved by the correlogram method and by the histogram method]
Color Histogram vs Correlogram
The correlogram method is more stable to color change than the histogram method.
[Figure: query and target images. Rank of the target: correlogram method 1st, histogram method 48th]
Color Histogram vs Correlogram
The correlogram method is more stable to large appearance change than the histogram method.
[Figure: query and target images. Rank of the target: correlogram method 1st, histogram method 31st]
Color Histogram vs Correlogram
The correlogram method is more stable to contrast & brightness change than the histogram method.
[Figure: target image and Queries 1 to 4. Reported ranks (C = correlogram, H = histogram): C 178th / H 230th; C 1st / H 1st; C 1st / H 3rd; C 5th / H 18th]
Color Histogram vs Correlogram
The color correlogram describes the global distribution of local spatial correlations of colors.
It’s easy to compute
It’s more stable than the color histogram method
Multimedia Indexing – Conclusions
GEMINI is a popular method
Whole matching problem
Should pay attention to:
    Distance functions
    Feature extraction functions
    Lower bounding
    Particular application
Sub-pattern matching?