Similarity-based Ranking and Query Processing
in Multimedia Databases
K. Selcuk Candan Wen-Syan Li
Computer Science and Engineering Dept. C&C Research Laboratories
Arizona State University, Box 875406 NEC USA, Inc., MS/SJ10
Tempe, AZ 85287-5406, USA. San Jose, CA 95134,USA.
[email protected] [email protected]
M. Lakshmi Priya
Computer Science and Engineering Dept.
Arizona State University, Box 875406
Tempe, AZ 85287-5406, USA.
Abstract
Since media-based evaluation yields similarity values, the result of a multimedia database query Q(Y_1, ..., Y_n)
is defined as an ordered list S_Q of n-tuples of the form ⟨X_1, ..., X_n⟩. The query Q itself is composed of a set
of fuzzy and crisp predicates, constants, variables, and conjunction, disjunction, and negation operators. Since
many multimedia applications require partial matches, S_Q includes results which do not satisfy all predicates.
Due to the ranking and partial match requirements, traditional query processing techniques do not apply to
multimedia databases. In this paper, we first focus on the problem of “given a multimedia query which consists
of multiple fuzzy and crisp predicates, providing the user with a meaningful final ranking.” More specifically,
we study the problem of merging similarity values in queries with multiple fuzzy predicates. We describe the
essential multimedia retrieval semantics, compare these with the known approaches, and propose a semantics
which captures the requirements of the multimedia retrieval problem. We then build on these results in answering
the related problem of “given a multimedia query which consists of multiple fuzzy and crisp predicates, finding
an efficient way to process the query.” We develop an algorithm to efficiently process queries with unordered
fuzzy predicates (sub-queries). Although this algorithm can work with different fuzzy semantics, it benefits
from the statistical properties of the semantics proposed in this paper. We also present experimental results
for evaluating the proposed algorithm in terms of quality of results and search space reduction.
[Figure 1: a video frame, frame20, with candidate object semantics and confidence values: man 0.87, woman 0.43, computer 0.84, television 0.75]
Figure 1: Fuzzy media modeling example
1 Introduction
Multimedia data includes image and video data which are very complex in terms of their visual and semantic
contents. Depending on the application, multimedia objects are modeled and indexed using their (1) visual prop-
erties (or a set of relevant visual features), (2) semantic properties, and/or (3) the spatial/temporal relationships
of subobjects.
Example 1.1 For instance, Figure 1 gives an example where multimedia data is modeled using both visual and
semantic features.
Figure 1(a) shows an image, Boy bike.gif, whose structure is viewed as a hierarchy with two image com-
ponents (i.e. boy and bicycle). These components are identified based on color/shape region division (visual
interpretation) and their real world meanings (semantic interpretation). The relationship between an image and
its components is contains. The spatial relationships can be described in 2D-String representation [1]. Objects,
Obj1 and Obj2, have both visual and semantic properties that can be used in retrieval.
Figure 1 shows a video frame whose structure can be viewed as a hierarchy with two image components
identified based on color/shape region division (visual interpretation) and their real world meanings (semantic
interpretation). The spatial relationships of these components can be described using various techniques [1, 2].
These components, or objects, have both visual and semantic properties that can be used in retrieval. In this
example, objects have multiple candidate semantics, each with an associated confidence value (smaller than 1.0
due to the limitations of the recognition engine). □
Therefore, retrieval in multimedia databases is inherently fuzzy, for several reasons:

- similarity of media features, such as the correlation between color (red vs. orange) or shape (circle vs. ellipse) features,
- imperfections in the feature extraction algorithms, such as the high error rate in motion estimation due to the multitude of factors involved, including camera and object speed, and camera effects,
- imperfections in the query formulation methods, such as the Query by Example (QBE) method, where the user provides an example but is not aware of which features will be used for retrieval,
- partial match requirements, where objects in the database fail to satisfy all requirements in the query, and
- imperfections in the available index structures, such as low precision or recall rates due to imperfections in clustering algorithms.
In many multimedia applications, more than one of these reasons coexist and, consequently, the system must
take each of them into consideration. This requires quantifying the different sources of fuzziness and merging
them into a single combined value for the user's reference. The following example describes this requirement in greater
detail.
Example 1.2 A query for retrieving images containing Fuji Mountain and a lake can be specified with an SQL3-like query statement [3, 4, 5] as follows:
select image P, object object1, object object2
where P contains object1
and P contains object2
and object1.semantical property s like ”mountain”
and object1.image property image match ”Fuji mountain.gif”
and object2.semantical property is ”lake”
and object2.image property image match ”lake image sample.gif”
and object1.position is above object2.position
The above query contains two crisp query predicates: contains and is¹. It also contains a set of fuzzy query
predicates:
- s like (i.e., semantically similar), which evaluates the degree of semantic similarity between two terms. It helps resolve correlations between semantic features, and imperfections in the semantics extraction algorithms, in the index structures, and in the user queries;
- image match (i.e., visually like), which evaluates the visual similarity between two images. It helps resolve correlations between visual features and imperfections in the index structures; and
- is above (a spatial condition), which compares the spatial positions of two objects. It helps resolve correlations between spatial features, imperfections in the spatial information extraction algorithms, imperfections in the index structures, and imperfections in the user queries.
This query returns a set of 3-tuples of the form ⟨P, object1, object2⟩ that satisfy all crisp conditions and that
have a combined fuzzy score above a given threshold. If users desire, the results may be sorted based on their
overall scores for quicker access to relevant results.
Figure 2(Query) shows the conceptual representation of the above query. Figures 2(a), (b), (c), and (d) show
examples of candidate images that may match this query. The numbers next to the objects in these candidate
images denote the similarity values for the object level matching. As explained earlier, in this example, the
comparisons on spatial relationships are also fuzzy to account for correlations between spatial features.
¹The keyword where, used in the query, does not correspond to a predicate.
[Figure 2: the query (Fuji Mountain above a Lake) and candidate images (a)-(d), annotated with object-level similarity scores]
Figure 2: Partial matches
The candidate image in Figure 2(a) satisfies the object matching conditions, but its layout does not match the user
specification. Figures 2(b) and (d) satisfy the image layout condition, but their objects do not perfectly match the
specification. Figure 2(c) has structural and object matching with low scores. Note that in Figure 2(a) the spatial
predicate, and in Figure 2(c) the image similarity predicate for the lake, completely fail (i.e., the match is 0.0).
A multimedia database engine must consider all four images as candidates and must rank them according to
a certain unified criterion. □
In this paper, we first address the problem of “given a query which consists of multiple fuzzy and crisp
predicates, how to provide a meaningful final ranking to the users.” We propose an alternative scoring approach
which captures the multimedia semantics well and which handles partial matches without the problems described
above. Although it is not based on weighting, the proposed approach can be used along with weighting strategies,
if weighting is requested by the user.
We then focus on the problem of “given a query which consists of multiple fuzzy and crisp predicates and a
scoring function, how to efficiently process the query.” It is clear that current database engines are not designed
to answer the needs of such queries. Recently, there have been attempts to address the challenges associated
with processing queries of the above kind. In [9], Adalı et al. propose an algebra for similarity-based queries.
In [10, 11], Fagin proposes a set of efficient query execution algorithms for databases with fuzzy (similarity-based)
queries. The algorithms proposed by Fagin assume that (1) individual sources can progressively (in
decreasing order of score) output results, and (2) users are interested in the best k matches to the query. This,
for instance, would require both of the s like and image match predicates used in the above example to return
ordered results. These assumptions, however, may be invalid due to the limited binding and processing capabilities
of the sources; for instance, the binding rules imposed by a predicate may prevent it from returning results in
score order.
Example 1.3 (Motivating Example) Consider the following SQL-like query which asks for 10 pairs of visually
similar images, such that the two images in a given pair contain at least one object each, a "mountain" and a "tree",
respectively:
select 10 P1, P2
where semantically like(P1.semantical property, “mountain”)
and semantically like (P2.semantical property, “tree”)
and image match (P1.image property, P2.image property).
This query contains three fuzzy conditions: two semantically like predicates and one image match
predicate. Let us assume that the image match predicate is implemented as an external function which can
be invoked only by providing two input images (i.e., both arguments have to be bound). The image match
predicate then returns a score denoting the visual similarity of its inputs. Hence, the predicate is fuzzy, but it
cannot generate results in the order of score; i.e., it is a non-progressive fuzzy predicate. Let us also assume that
the semantically like predicate is implemented as an external function which, when invoked with the
second argument bound, can return matching images (using an index structure) in order of decreasing score.

Consequently, in this example, we have two sources (the semantically like predicates) which can output
images progressively through database access and one source (image match) which cannot, and the results from
all these sources have to be merged to get the final set of results. Unfortunately, due to the existence of a non-progressive
predicate, finding and returning the 10 best matching pairs of pictures would require a complete scan
of the database, which is clearly undesirable. □
In this paper, we propose a query processing algorithm which, given a query Q, uses the available score
distribution function estimates/statistics [12] of the individual predicates and the statistical properties of the
corresponding score merging function, μ_Q, for computing approximate top-k results when some of the predicates
in Q cannot return ordered results. More specifically, we provide a solution to the following problem: "Given a
query Q, a positive number k, and an error threshold Δ, return a set R of k results such that each result r ∈ R
is most probably (prob(r ∈ R_k) > 1 − Δ) in the top-k results, R_k."
The paper is structured as follows: In Section 2, we provide an overview of the multimedia retrieval se-
mantics. We show the similarities between multimedia queries and fuzzy logic statements. Then, in Section 3,
we provide an overview of the popular fuzzy logic semantics and compare them with respect to the essential
requirements of the multimedia retrieval problem. In Section 4, we propose an algorithm for generating approximate
results for queries with non-progressive fuzzy predicates. In Section 5, we investigate the statistical properties of
popular fuzzy logic semantics and we describe a generic function that approximates score distributions used in
the algorithm, when such distributions are not readily available. In Section 6 we experimentally evaluate the pro-
posed algorithm. In Section 7 we compare our approach with existing work. Finally, we present our concluding
remarks.
Figure 3: Clustering error which results in imperfections in the index structures: the squares denote the matching
objects, circles denote the non-matching objects, and the dashed rectangle denotes the cluster used by the index
for efficient storage and retrieval
2 Multimedia Retrieval Semantics
In this section, we first provide an overview of the multimedia retrieval semantics; i.e., we describe what we
mean by retrieval of multimedia data. We then review some of the approaches to deal with multimedia queries
that involve multiple fuzzy predicates.
2.1 Fuzziness in Multimedia Retrieval
It is possible to classify the fuzziness in multimedia queries into three categories: precision-related, recall-related,
and partiality-related fuzziness. The precision-related class captures fuzziness due to similarity of features,
imperfections in the feature extraction algorithms, imperfections in the query formulation methods, and the precision
rate of the utilized index structures (Figure 3). The recall- and partiality-related classes are self-explanatory.
Note that, in information retrieval, the precision/recall values are mainly used in evaluating the effectiveness
of a given retrieval operation or the effectiveness of a given index structure. Here, we are using these terms
more as statistics which can be utilized to estimate the quality of query results. We have used this approach in
the SEMCOG image retrieval system [4, 3, 5] to provide pre- and post-query feedback to users. The two examples
given below show why precision and recall are important in processing multimedia queries.
Example 2.1 (Handling precision-related fuzziness) Let us assume that we are given a query of the form

  Q(X) ← s like(man, X.semantic property) ∧ image match(X.image property, "a.gif").

Let, for a given object I, the corresponding semantic property be woman and the image property be "im.gif".
Let us assume that the semantic precision of s like(man, woman) is 0.8 and the image matching precision of
image match("im.gif", "a.gif") is 0.6. This means that the index structure and semantic clusters used to
implement the predicate s like guarantee that 80% of the returned results are semantically similar to man. These
semantic similarities can be evaluated using various algorithms [14, 15]. Similarly, the predicate image match
guarantees that 60% of the returned results are visually similar to "a.gif". Then, assuming that the two predicates
are not correlated, Q(I) should be 0.8 × 0.6 = 0.48: by replacing X in s like(man, X) with woman, we
maintain 80% precision; then, by replacing the X in image match(X, "a.gif") with I, we maintain 60% of the
remaining precision. The final precision (or confidence) is 0.48. □
Example 2.2 (Handling recall-related fuzziness) The recall rate is the ratio of the number of relevant results
returned to the number of all relevant results. Let us assume that we have the query used in the earlier example. Let
us also assume that s like(man, X) returns 60% and image match(X, "a.gif") returns 50% of all applicable
results in the database. Let us further assume that both functions work perfectly when both variables are bound.
Then (1) a left-to-right query execution plan would return 0.6 × 1.0 = 0.6, (2) a right-to-left query execution
plan would return 1.0 × 0.5 = 0.5, and (3) a parallel execution of the predicates followed by a join would return
0.6 × 0.5 = 0.3 of all the applicable results. Consequently, given a query execution plan and assuming that the
predicates are independent, the recall rate can be found by multiplying the recall rates of the predicates. □
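Under the independence assumptions of Examples 2.1 and 2.2, the combined precision and the plan recall are both obtained by multiplying the per-predicate rates. A minimal Python sketch (the function names are ours, for illustration only):

```python
def combined_precision(precisions):
    """Combined precision of uncorrelated predicates: the product of
    the individual precision rates (as in Example 2.1)."""
    result = 1.0
    for p in precisions:
        result *= p
    return result

def plan_recall(recalls):
    """Recall of a query execution plan over independent predicates:
    the product of the predicate recall rates (as in Example 2.2)."""
    result = 1.0
    for r in recalls:
        result *= r
    return result

# Example 2.1: s_like precision 0.8, image_match precision 0.6
print(combined_precision([0.8, 0.6]))   # approximately 0.48
# Example 2.2: parallel plan, recalls 0.6 and 0.5
print(plan_recall([0.6, 0.5]))          # approximately 0.3
```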
2.2 Query Semantics in the Presence of Imperfections and Similarities
Traditional query languages are based on boolean logic, where a predicate is treated as a propositional function,
which returns one of two values: true or false. However, due to the stated imperfections, predicates related
to visual or semantic features do not correspond to propositional functions, but to functions which return values
between 0.0 and 1.0. Consequently, consider a query of the form

  Q(Y_1, ..., Y_n) ← Φ(p_1(Y_1, ..., Y_n), ..., p_m(Y_1, ..., Y_n)),

where the p_i's are fuzzy or crisp predicates, Φ is a logic formula, and the Y_j's are free variables. The solution
is defined as an ordered set S_Q of n-tuples of the form ⟨X_1, X_2, ..., X_n⟩, where (1) n is the number of
variables in query Q, (2) each X_i corresponds to a variable Y_i in Q, and (3) each X_i satisfies the type constraints
of the corresponding predicates in Q. The order of the set S_Q denotes the relevance ranking of the solutions.
The first, trivial, way to process multimedia queries is to transform the similarity functions into propositional
functions by choosing a cutoff point r_true and mapping all values in [0.0, r_true) to false and all values
in [r_true, 1.0] to true. Intuitively, such a cutoff point corresponds to the similarity degree below which objects
are regarded as dissimilar. The main advantage of this approach is that predicates can refute (conjunctive queries)
or validate (disjunctive queries) solutions as soon as they are evaluated, so that optimizations can be employed. In [16],
Chaudhuri and Gravano discuss cost-based query optimization techniques for such filter queries². If users only
consider full matching, this method is preferred because it allows query optimization by early pruning of the
search space. However, when partial matches are also acceptable, this approach fails to produce appropriate
solutions. For instance, in Example 1.2, the candidate images shown in Figures 2(a) and (c) would not be considered.
²Their techniques assume that queries do not contain negation.
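The cutoff-based transformation above can be sketched as follows (a minimal illustration; the function names and the example scores are our own assumptions):

```python
def to_crisp(score, r_true):
    """Map a similarity score in [0.0, 1.0] to a truth value:
    [0.0, r_true) -> False, [r_true, 1.0] -> True."""
    return score >= r_true

def conjunctive_filter(candidate_scores, r_true):
    """A conjunctive filter query refutes a candidate as soon as any
    predicate falls below the cutoff, enabling early pruning."""
    return all(to_crisp(s, r_true) for s in candidate_scores)

# A partial match in the spirit of Figure 2(a): the spatial predicate
# scores 0.0, so the candidate is dropped even though the other
# predicates match well.
print(conjunctive_filter([0.98, 0.98, 0.0], r_true=0.4))  # False
```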
The second way to process multimedia queries is to leave the final decision not to the constituent predicates
but to the n-tuple as a whole. This can be done by defining a scoring function μ_Q which maps a given object to
a value between 0.0 and 1.0. In this method, a candidate object o is returned as a solution if μ_Q(o) ≥ sol_true,
where sol_true is the solution acceptance threshold. Since determining an appropriate threshold is not always
possible, an alternative approach is to rank all candidate objects according to their scores and return the first k
candidates, where k is a user-defined parameter. Related problems are studied within the domains of fuzzy set
theory [17], fuzzy relational databases [18], and probabilistic databases [19].
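The ranking alternative can be sketched as follows (illustrative only; the scoring function μ_Q is assumed to be given, and the candidate names and scores below are hypothetical):

```python
def top_k(candidates, mu_q, k):
    """Rank candidate tuples by their combined score mu_q and return
    the k best, avoiding the need for an acceptance threshold."""
    return sorted(candidates, key=mu_q, reverse=True)[:k]

# Hypothetical candidates with precomputed combined scores.
scored = {"img1": 0.9, "img2": 0.4, "img3": 0.7}
print(top_k(scored, mu_q=scored.get, k=2))  # ['img1', 'img3']
```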
If users are also interested in partial matches, the second method is more suitable. The techniques proposed
in [16] do not address queries with partial matches, since, for a given object, if any filter condition fails, the object
is omitted from the set of results. Furthermore, the proposed algorithms are tailored towards the min semantics,
which, as discussed later in this section, despite its many proven advantages, is not suitable for multimedia
applications. In this paper, we focus on the second approach. As discussed above, multimedia predicates, by their
nature, have associated scoring functions. Fuzzy sets and the corresponding fuzzy predicates also have similar
scoring or membership functions. Consequently, we will examine the use of fuzzy logic for multimedia retrieval.
The major challenge with the second approach is that early filtering cannot always be applied. As
mentioned earlier, Fagin [11, 10] has proposed algorithms for efficient query processing when all fuzzy predicates
are ordered. In contrast, our aim is to develop an algorithm that uses statistics to efficiently process such queries
even when not all of the fuzzy predicates are ordered.
3 Application of Fuzzy Logic for Multimedia Databases
In this section, we provide an overview of fuzzy logic and then introduce the properties of different fuzzy logic
operators. A fuzzy set F with domain D can be defined using a membership function μ_F : D → [0, 1]. A crisp
(or conventional) set C, on the other hand, has a membership function of the form μ_C : D → {0, 1}. When, for
an element d ∈ D, μ_C(d) = 1, we say that d is in C (d ∈ C); otherwise, we say that d is not in C (d ∉ C). Note
that a crisp set is a special case of a fuzzy set.
A fuzzy predicate is defined as a predicate which corresponds to a fuzzy set. Instead of returning true (1)
or false (0) values, as propositional functions (or conventional predicates, which correspond to crisp sets) do, fuzzy
predicates return the corresponding membership values. Binary logical operators (∧, ∨) take two truth values and
return a new truth value; the unary logical operator ¬ takes one truth value and returns another
truth value. Similarly, binary fuzzy logical operators take two values between 0.0 and 1.0 and return a third
value between 0.0 and 1.0, and a unary fuzzy logical operator takes one value between 0.0 and
1.0 and returns another value between 0.0 and 1.0.
Min semantics:
  μ_{Pi∧Pj}(x) = min{μ_i(x), μ_j(x)}
  μ_{Pi∨Pj}(x) = max{μ_i(x), μ_j(x)}
  μ_{¬Pi}(x) = 1 − μ_i(x)

Product semantics (α ∈ [0, 1]):
  μ_{Pi∧Pj}(x) = (μ_i(x) · μ_j(x)) / max{μ_i(x), μ_j(x), α}
  μ_{Pi∨Pj}(x) = (μ_i(x) + μ_j(x) − μ_i(x) · μ_j(x) − min{μ_i(x), μ_j(x), 1 − α}) / max{1 − μ_i(x), 1 − μ_j(x), α}
  μ_{¬Pi}(x) = 1 − μ_i(x)

Table 1: Min and product semantics for fuzzy logical operators
T-norm binary function N (for ∧):
  Boundary conditions: N(0, 0) = 0, N(x, 1) = N(1, x) = x
  Commutativity: N(x, y) = N(y, x)
  Monotonicity: x ≤ x′, y ≤ y′ → N(x, y) ≤ N(x′, y′)
  Associativity: N(x, N(y, z)) = N(N(x, y), z)

T-conorm binary function C (for ∨):
  Boundary conditions: C(1, 1) = 1, C(x, 0) = C(0, x) = x
  Commutativity: C(x, y) = C(y, x)
  Monotonicity: x ≤ x′, y ≤ y′ → C(x, y) ≤ C(x′, y′)
  Associativity: C(x, C(y, z)) = C(C(x, y), z)

Table 2: Properties of triangular-norm and triangular-conorm functions
3.1 Relevant Fuzzy Logic Operator Semantics
There are a multitude of functions [20, 21, 22], each useful in a different application domain, proposed as semantics
for fuzzy logic operators (∧, ∨, ¬). In this section, we introduce the popular scoring functions, discuss their
properties, and show why these semantics may not be suitable for multimedia retrieval.
Two of the most popular scoring functions are the min and product semantics of fuzzy logical operators.
Given a set P = {P_1, ..., P_m} of fuzzy sets and the set F = {μ_1(x), ..., μ_m(x)} of corresponding membership
functions, Table 1 shows the min and product semantics.
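The two semantics of Table 1 can be sketched in Python as follows (a minimal illustration; with the default α = 1 used here, the product conjunction reduces to μ_i(x) · μ_j(x) and the product disjunction to the probabilistic sum; choosing α > 0 also avoids a zero denominator):

```python
def min_and(a, b):
    return min(a, b)

def min_or(a, b):
    return max(a, b)

def product_and(a, b, alpha=1.0):
    # Table 1: mu_i * mu_j / max{mu_i, mu_j, alpha}
    return (a * b) / max(a, b, alpha)

def product_or(a, b, alpha=1.0):
    # Table 1: (mu_i + mu_j - mu_i*mu_j - min{mu_i, mu_j, 1 - alpha})
    #          / max{1 - mu_i, 1 - mu_j, alpha}
    return (a + b - a * b - min(a, b, 1 - alpha)) / max(1 - a, 1 - b, alpha)

def fuzzy_not(a):
    # Negation is the same under both semantics.
    return 1.0 - a

print(min_and(0.8, 0.5))      # 0.5
print(product_and(0.8, 0.5))  # 0.4 (with alpha = 1)
```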
Defined as such, the binary conjunction and disjunction operators of these two semantics (along with some
others) are triangular norms (t-norms) and triangular conorms (t-conorms), respectively. Table 2
shows the properties of t-norm and t-conorm functions. Intuitively, t-norm functions reflect the properties of the
crisp conjunction operation, and t-conorm functions reflect those of the crisp disjunction operation.
Although the property of capturing crisp semantics is desirable in many cases, for multimedia applications
this is not always true. For instance, the partial match requirements invalidate the boundary conditions. In
addition, monotonicity is too weak a condition for multimedia applications: an increase in the score of a single
query criterion should increase the combined score, whereas the monotonicity condition dictates such a combined
increase only if the scores for all of the query criteria increase simultaneously. A stronger condition (N(x, y)
increases even if only x or only y increases) is called the strictly increasing property³. Clearly, min(x, y) is not
strictly increasing. Another desirable property for fuzzy conjunction and disjunction operators is distributivity.
The min semantics is known [23, 10, 24] to be the only semantics for conjunction and disjunction that preserves
logical equivalence (in the absence of negation) and is monotone at the same time. This property of the min
³This is not the same definition of strictness used in [10].
Arithmetic average semantics:
  μ_{Pi1∧...∧Pin}(x) = (μ_{i1}(x) + ... + μ_{in}(x)) / n
  μ_{¬Pi}(x) = 1 − μ_i(x)
  μ_{Pi1∨...∨Pin}(x) = 1 − ((1 − μ_{i1}(x)) + ... + (1 − μ_{in}(x))) / n

Geometric average semantics:
  μ_{Pi1∧...∧Pin}(x) = (μ_{i1}(x) × ... × μ_{in}(x))^{1/n}
  μ_{¬Pi}(x) = 1 − μ_i(x)
  μ_{Pi1∨...∨Pin}(x) = 1 − ((1 − μ_{i1}(x)) × ... × (1 − μ_{in}(x)))^{1/n}

Table 3: N-ary arithmetic average and geometric average semantics
semantics makes it the preferred fuzzy semantics in most cases. Furthermore, in addition to satisfying the
properties of being a t-norm and a t-conorm, the min semantics also has the property of being idempotent.

Although it has nice features, the min semantics is not suitable for multimedia applications. As discussed
in Example 1.2, according to the min semantics, the scores of the candidate images given in Figures 2(a) and (c)
would be 0.0, although they partially match the query. Furthermore, the scores of the images in Figures 2(b) and (d)
would both be 0.5, although Figure 2(d) intuitively deserves a higher score.
The product semantics [21] satisfies idempotency only if α = 0. When α = 1, on the other hand, it has the
property of being strictly increasing (when x or y is different from 1) and Archimedean
(N(x, x) < x and C(x, x) > x). The Archimedean property is weaker than idempotency, yet it provides an
upper bound on the combined score, allowing for optimizations.
3.2 N-ary Operator Semantics
In information retrieval research (which also shows the characteristics of multimedia applications), other fuzzy
semantics, including the arithmetic mean [26], have been suggested. The arithmetic mean semantics (Table 3) provides
an n-ary scoring function (|{P_{i1}, ..., P_{in}}| = n). Note that the binary version of the arithmetic mean does not satisfy
the requirements of being a t-norm: it does not satisfy the boundary conditions and it is not associative. Hence, it
does not subsume the crisp semantics. On the other hand, it is idempotent and strictly increasing.
The arithmetic average semantics emulates the behavior of the dot-product-based similarity calculation popular
in information retrieval: effectively, each predicate is treated like an independent dimension in an n-dimensional
space (where n is the number of predicates), and the merged score is defined as the dot-product distance between
the complete truth, ⟨1, 1, ..., 1⟩, and the given values of the predicates, ⟨μ_1(x), ..., μ_n(x)⟩. Although this
approach is shown to be suitable for many information retrieval applications, it does not capture the semantics of
multimedia retrieval applications, introduced in Section 2, which are multiplicative in nature. Therefore, we
use the n-ary geometric average semantics instead.
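The two n-ary semantics of Table 3 can be sketched as follows (an illustrative sketch; math.prod requires Python 3.8+):

```python
import math

def arithmetic_and(scores):
    return sum(scores) / len(scores)

def arithmetic_or(scores):
    return 1.0 - sum(1.0 - s for s in scores) / len(scores)

def geometric_and(scores):
    return math.prod(scores) ** (1.0 / len(scores))

def geometric_or(scores):
    return 1.0 - math.prod(1.0 - s for s in scores) ** (1.0 / len(scores))

# The multiplicative geometric mean is nullified by a single 0 score,
# while the arithmetic mean is not:
print(geometric_and([0.9, 0.8, 0.0]))    # 0.0
print(arithmetic_and([0.9, 0.8, 0.0]))   # approximately 0.567
```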
Note that, as was the case for the original product semantics, the geometric average semantics is also
not distributive. Therefore, if Q(Y_1, ..., Y_n) is a query which consists of a set of fuzzy predicates, variables,
constants, and conjunction, disjunction, and negation operators, and if Q_∨(Y_1, ..., Y_n) is the disjunctive normal
form of Q, then we define the normal fuzzy semantics of Q(Y_1, ..., Y_n) as the fuzzy semantics of
Q_∨(Y_1, ..., Y_n). In general, μ_Q ≠ μ_{Q_∨}. The former semantics would be used when logical equivalence of
queries is not expected; the latter, on the other hand, would be used when logical equivalence is required.
3.3 Accounting for Partial Matches
Both the min and the geometric mean functions have weaknesses in supporting partial matches. When one of
the involved predicates returns zero, both of these functions return 0 as the combined score. However, in
multimedia retrieval, partial matches are required (see Section 1); in such cases, having a few terms
with a 0 score in a conjunction should not eliminate the whole conjunctive term from consideration.
One proposed way [6] to deal with the partial match requirement is to weight the different query criteria in such a
way that criteria that are not important to the user are effectively omitted. For instance, in the query given
in Figure 2(Query), if the user knows that spatial information is not important, then the user can choose to assign
a lower weight to the spatial constraints. Consequently, using a weighting technique, the image given in Figure 2(a)
can be retained although the spatial condition is not satisfied. This approach, however, presupposes that
users can identify and weight the different query criteria. This assumption may not hold in many situations,
including databases for naive users or retrieval by QBE (query by example). Furthermore, it is always possible
that, for each feature or criterion in the query, there is a set of images in the database that fails it. In such a
case, no weighting scheme will be able to handle the partial match requirement for all images.
Example 3.1 Let us assume that a user wants to find all images in the database that are similar to an image I_example.
Let us also assume that the database uses three features, color, shape, and edge distribution, to compare images,
and that the database contains three images, I_1, I_2, and I_3. Finally, let us assume that the following table gives
the matching degrees of the images in the database for each feature:

  Image   Shape   Color   Edge
  I_1     0.0     0.9     0.8
  I_2     0.8     0.0     0.9
  I_3     0.9     0.8     0.0

According to this table, it is clear that if the user does not specify a priority among the three features, the
system should treat all three candidates equally. On the other hand, since for each of the three features there is
a different image which fails it completely, even if we have a priori knowledge regarding the feature distribution
of the data, we cannot use feature weighting to eliminate low-scoring features. □
To account for partial matches, we need to modify the semantics of the n-ary logical operators and eliminate
the undesirable nullifying effect of 0 [3]. Note that a similar modification can also be done for the min semantics.
Given a set P = {P_1, ..., P_m} of fuzzy sets and the set F = {μ_1(x), ..., μ_m(x)} of corresponding scoring
functions, the semantics of the n-ary fuzzy conjunction operator is as follows:

  μ_{(Pi1∧...∧Pin)}(t, r_true, ε) = [ ( ∏_{μ_k(t) ≥ r_true} μ_k(t) × ∏_{μ_k(t) < r_true} ε )^{1/n} − ε ] / (1 − ε),

where

- n is the number of predicates in Q,
- r_true is the truth cutoff point, i.e., it is the minimum valid score,
- ε is an offset value greater than 0.0 and less than r_true; it corresponds to the fuzzy value of false and it prevents the combined score from being 0 when one of the predicates has a 0 score, and
- μ_k(t) is the score of predicate P_k ∈ {P_{i1}, ..., P_{in}} for the n-tuple t.
The term ( Π_{μ_k(t) ≥ r_true} μ_k(t) · Π_{μ_k(t) < r_true} α )^(1/n) returns a value between α and 1.0. Subtraction of α from it and the subsequent division by 1 − α normalizes the result to values between 0.0 and 1.0. An additional improvement could be to allow the conjunction to give extra importance to the predicates which are above the truth cutoff value, r_true. To achieve this, one can modify the scoring function as follows:
μ_(P_{i_1} ∧ ... ∧ P_{i_n})(t, r_true, ω_1, ω_2, α) = ω_1 · [ ( Π_{μ_k(t) ≥ r_true} μ_k(t) · Π_{μ_k(t) < r_true} α )^(1/n) − α ] / (1 − α) + ω_2 · ( Σ_{μ_k(t) ≥ r_true} 1 ) / n,

where ω_1 and ω_2 are values between 0.0 and 1.0 such that ω_1 + ω_2 = 1.0.
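To make the two scoring functions above concrete, the following Python sketch computes the partial-match conjunction score with truth cutoff r_true and offset α, and its weighted variant. The function names and the parameter values in the usage example are ours, chosen for illustration; they are not from the paper.

```python
def conj_score(scores, r_true, alpha):
    """Geometric average with truth cutoff: scores below r_true are
    replaced by the offset alpha (0 < alpha < r_true); the result is
    then shifted and scaled from [alpha, 1] to [0, 1]."""
    n = len(scores)
    prod = 1.0
    for s in scores:
        prod *= s if s >= r_true else alpha
    return (prod ** (1.0 / n) - alpha) / (1.0 - alpha)

def weighted_conj_score(scores, r_true, w1, w2, alpha):
    """Weighted variant: also rewards the fraction of predicates whose
    scores clear the truth cutoff (w1 + w2 must equal 1.0)."""
    above = sum(1 for s in scores if s >= r_true)
    return w1 * conj_score(scores, r_true, alpha) + w2 * above / len(scores)

# The three images of the example above each fail exactly one feature, so
# all three receive the same non-zero score, unlike product or min semantics:
s1 = conj_score([0.0, 0.9, 0.8], 0.4, 0.1)
s2 = conj_score([0.8, 0.0, 0.9], 0.4, 0.1)
s3 = conj_score([0.9, 0.8, 0.0], 0.4, 0.1)
```

With r_true = 0.4 and α = 0.1 each image scores about 0.35, so no candidate is nullified by its single failing feature.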
The following is an example in which various semantics for the logical operators are compared. Among other things, this example clearly shows that any semantics which simply captures the crisp semantics is not suitable for multimedia retrieval.
Example 3.2 In this example, we use the query which was presented in the introduction section to compare scores corresponding to different approaches. Figure 4 shows a set of candidate images and the associated scores computed by different methods. Numbers next to the objects in the candidate images denote the similarity values for the object-level matching. The figure shows the scores of the candidate images as well as their relative ranks. The cutoff parameters used in this example are r_true = 0.4 and α = 0.4, and the structural weights are ω_1 = 0.8 and ω_2 = 0.2. □
4 Evaluation of Queries with Unordered Fuzzy Predicates
In the earlier sections, we have investigated characteristics of multimedia retrieval and semantic properties of
different fuzzy retrieval options. In this section, we focus on the query processing requirements for multimedia
retrieval and provide an efficient algorithm. Although the algorithm is independent of the chosen semantics of the
fuzzy logic operators described above, it uses their statistical properties to deal with unknown system parameters.
4.1 Essentials of Multimedia Query Processing
Recently, many researchers studied query optimization (mostly through algebraic manipulations) in non-traditional
forms of databases. [28, 29, 30, 31, 32] provide overviews of techniques used for query processing and retrieval
[Figure content: the query image (containing Fuji, a Mountain, and a Lake) and four candidate images, each annotated with object-level similarity values; for example, Candidate 1 matches the Mountain and Lake with 0.98 but Fuji with 0.0, and Candidate 3 contains a Forest instead of a Lake.]

Semantics                        Cand. 1       Cand. 2       Cand. 3       Cand. 4
                                 Score  Rank   Score  Rank   Score  Rank   Score  Rank
min                              0.50   1-2    0.00   3-4    0.50   1-2    0.00   3-4
product                          0.40   1      0.00   3-4    0.25   2      0.00   3-4
arithmetic average               0.76   1      0.65   3      0.66   2      0.43   4
geometric average                0.74   1      0.00   3-4    0.63   2      0.00   3-4
geometric average with cutoff    0.56   1      0.55   2      0.38   3      0.24   4
geometric average with weights   0.65   1      0.57   2      0.51   3      0.32   4

Figure 4: Comparison of different scoring mechanisms
in such databases. Solutions in such non-traditional databases vary from the use of database statistics and domain knowledge to facilitate query rewriting and intelligent use of cached information [33] to the use of domain knowledge to discover and prune redundant queries [34]. Li et al. also used off-line feedback mechanisms to
prevent users from asking redundant or irrelevant queries [35]. In [16] Chaudhuri and Gravano discuss query
optimization issues and ranking in multimedia databases. In [36], Chaudhuri and Shim discuss approaches for
query processing in the presence of external predicates, or user defined functions which are very common in
multimedia systems.
The above work and others in the literature collectively point to the following essential requirements for multimedia query processing:

- As discussed in earlier sections, fuzziness is inherent in multimedia retrieval for many reasons, including similarity of features, imperfections in the feature extraction algorithms, imperfections in the query formulation methods, partial match requirements, and imperfections in the available index structures.
- Users are usually not interested in a single result, but in k ≥ 1 ranked results, where k is provided by the user. Due to the inherent fuzziness, users want to have more alternatives from which to choose what they are interested in.
- We would prefer to generate the kth result after we generate the (k−1)th result, as progressively as possible.
- Since the solution space is large, we cannot perform any processing which would require us to touch or enumerate all solutions.
In [10, 11], Fagin proposes a set of efficient query execution algorithms for databases with fuzzy queries.
These algorithms assume that

- the query has a monotone increasing combined scoring function,
- individual sources can progressively (in decreasing order of score) output results, and
- the user is interested in the best k matches to the query.
If all these conditions hold, then these algorithms can be used to progressively find the best k matches to the
given query.
Note that, if the min semantics for conjunction is used and if the query does not contain negation, then the scoring functions of queries are guaranteed to be monotone increasing. Similarly, for the arithmetic average, product, and geometric average semantics (if the query does not contain negation), the combined scoring function will be monotone. Consequently, the algorithms proposed by Fagin can be applied.
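As a concrete illustration, the following Python sketch implements the core of Fagin's original algorithm under the three assumptions listed above. The interfaces (a list per predicate for sorted access, a grade table for random access) are our own simplification, not the paper's notation, and the combining function must be monotone.

```python
def fagin_topk(lists, grades, combine, k):
    """Fagin's algorithm (sketch): do sorted access on all lists in
    parallel until at least k objects have been seen in every list;
    then random-access the remaining grades and rank by the monotone
    combining function."""
    seen = [set() for _ in lists]          # objects seen per list
    pos = 0
    while len(set.intersection(*seen)) < k:
        for p, lst in enumerate(lists):    # one sorted access per list
            seen[p].add(lst[pos])
        pos += 1
    candidates = set.union(*seen)          # random access for all seen
    scored = sorted(
        ((combine([grades[p][o] for p in range(len(lists))]), o)
         for o in candidates),
        reverse=True)
    return [o for _, o in scored[:k]]
```

For example, with two predicates and min as the combining function, the algorithm stops sorted access as soon as k objects appear in both ranked lists.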
4.2 Negation
If the query contains negation, on the other hand, then the scoring function may not be monotone increasing,
invalidating one of the assumptions.
This, however, can be taken care of if we can assume that some of the sources (the negated ones) can also output results in increasing order of score. Since a multimedia predicate is more likely to return lower scores, the execution cost of such a query is expected to be higher. Although the algorithm we introduce in this paper can take negated goals into account when such an index is available, its actual focus is to deal with unordered subgoals.
4.3 Unordered Subgoals
The second assumption can also be invalid for various reasons, including the binding rules imposed by the
predicates.
Example 4.1 For example, consider the following query, which is aimed at retrieving all pairs of images, each containing at least one object, a "mountain" and a "tree", respectively, such that the two images are visually similar to each other:

select image P1, P2
where P1.semantical property s like "mountain"
and P2.semantical property s like "tree"
and P1.image property image match P2.image property
The above query contains three fuzzy conditions (two s like predicates and one image match predicate). Let
us assume that the image match predicate is implemented as an external function, which can be invoked only by
providing two input images. The image match predicate then returns a score denoting the visual similarity of
its inputs. In this case, we have two sources (the s like predicates) which can output images progressively through database access and one source (image match) which cannot. □
[Figure content: two merge scenarios for the query P(X) and Q(Y) and R(X,Y). In (a) all three predicate streams are ordered, producing, e.g., <x1,0.9>, <x2,0.9>, <x3,0.9> and <y1,0.8>, <y2,0.7>, <y3,0.6>, merged into ranked results such as <x1,y1,0.78>; in (b) the stream for R(X,Y) is unordered, so only the two ordered streams can be merged in order.]
Figure 5: The effect of having non-progressive fuzzy predicates (numbers in the figure denote scores of the re-
sults): In (a) all predicates are progressive (results are ordered); hence, the results can be merged using algorithms
presented in [11] to find the k top ranking results. In (b) only two of the three fuzzy predicates are progressive;
hence the merging can be done on two predicates only. However, in this case, the ranking generated by merging
two predicates may not be equal to the ranking that would be generated by merging three of the scores.
As shown in Figure 5, because of such non-progressive fuzzy predicates, finding and returning the k best
matching results may require a complete scan of the database. Consequently, in order to avoid the complete scan
of the database, our algorithm uses the score distribution estimates/statistics of the individual predicates and the
statistical properties of the score merging function to compute an approximate set of top-k results.
In [16] Chaudhuri and Gravano discuss query optimization issues and ranking in multimedia databases. In their framework, they differentiate between top search (access through an index) and probe (testing the predicate for a given object). Top search and probe can be seen as corresponding to the sorted access and random access described in [11].
In the next subsection, we provide an algorithm that has a similar search/probe structure. However, unlike the techniques proposed in [16], the algorithm we propose is aimed at dealing with queries with partial matches, is not tied to the min semantics, and can deal with negation as long as the (negated) sources can also output results in increasing order of score. In order to deal with the non-progressive predicates, the algorithm we propose takes a probability threshold, Δ, as input and it returns a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Δ) in the top k results, R_k.

Note that the algorithm uses score distribution function estimates/statistics [33, 35] and the statistical properties of the score merging functions for computing approximate top-k results. Note also that the algorithm is flexible in the sense that both strict (such as product) and monotone (such as min) semantics of fuzzy logic operators are acceptable.
Algorithm 4.1 Query Evaluation Algorithm
Input:

- A query, Q,
- A set of ordered predicates, P_O, a set of non-ordered predicates, P_U, and a set of crisp predicates, P_C,
- A positive integer, k,
- A threshold, Δ.

1. Identify a scoring function, μ_Q. Find the sign (sign_i ∈ {+, −}) of the slope of μ_Q with respect to each predicate P_i ∈ P.
2. Visited = ∅. solNum = 0. Marked = ∅.
3. while solNum < k do
   (a) Use the decreasingly (if sign_i = +) or increasingly (if sign_i = −) ordered outputs of predicates P_i ∈ P_O to construct a set, S, of at least 1 n-tuple satisfying all crisp predicates, P_C.
   (b) Order the resulting tuples with respect to μ_Q, which encapsulates predicates in P_O as well as those in P_U. Let μ_{i,j} denote the score of the n-tuple s_i ∈ S with respect to the predicate P_j ∈ P_U.
   (c) Visited = Visited ∪ S.
   (d) For every n-tuple s_i ∈ S, find P_∃(i) = prob(∃ j > i such that μ_Q(s_i) < μ_Q(s_j)) (assuming that the predicates in Q are independent).
   (e) Put the n-tuples in S such that P_∃(i) ≤ Δ into Marked.
   (f) Let the number of n-tuples put into Marked in the previous step be k′. solNum = solNum + k′.
4. Let R be the best k solutions in Visited; output R.

Figure 6: Query evaluation algorithm.
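A compact Python rendering of the control flow of the algorithm may help; the stream, probe, and statistics interfaces below are hypothetical placeholders for the components described in the text (the ordered-predicate merge of step 3a, the unordered-predicate probes of step 3b, and the score-distribution estimate of step 3d), not APIs from the paper.

```python
import heapq

def evaluate_query(ordered_stream, probe_unordered, mu_q, p_exists, k, delta):
    """Sketch of the evaluation loop of Figure 6:
    - ordered_stream yields candidate n-tuples in the (Fagin-style)
      order induced by the ordered predicates (step 3a),
    - probe_unordered(t) returns the non-ordered predicate scores of t
      (step 3b),
    - mu_q(t, u_scores) is the combined scoring function,
    - p_exists(score) estimates, from score-distribution statistics,
      the probability that an unseen tuple beats `score` (step 3d)."""
    visited = []                       # all generated tuples (step 3c)
    sol_num = 0
    for t in ordered_stream:
        score = mu_q(t, probe_unordered(t))
        visited.append((score, t))
        if p_exists(score) <= delta:   # condition of step 3e
            sol_num += 1
        if sol_num >= k:               # loop guard of step 3
            break
    # step 4: return the best k solutions among everything visited
    return [t for _, t in heapq.nlargest(k, visited, key=lambda st: st[0])]
```

Note that the final `nlargest` pass over `visited` is what allows a tuple that was never marked to still appear in the output, exactly as in step 4 of the algorithm.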
4.4 Query Evaluation Algorithm
Let Q(Y_1, ..., Y_n) be a query and let P = P_O ∪ P_U be the set of all fuzzy predicates in Q, such that predicates in P_O can output ordered results and those in P_U cannot output ordered results. Let the query also contain a set, P_C, of crisp (non-fuzzy) predicates, including those which check the equalities of variables. Let us also assume that the user is interested in a set R of k results, such that each result r ∈ R is most probably (prob(r ∈ R_k) > 1 − Δ) in the top k results, R_k. The proposed query execution algorithm is given in Figure 6.

The main inputs to this algorithm are

- a query, Q,
- a set of ordered predicates, P_O, a set of non-ordered predicates, P_U, and a set of crisp predicates, P_C,
- a positive integer, k,
- an error threshold, Δ,

and the output is a set, R, of k n-tuples. Below, we provide a detailed description of the algorithm.
The first step of the algorithm calculates the combined scoring function for the query. If this combined scoring function is not monotone (i.e., there are some negated subgoals), the algorithm identifies the predicates which are negated. For all negated predicates which have an inversely ordered index (if such an index is available), the algorithm will use the results in increasing order of score. For all negated predicates which do not have such an inverse index (which is more likely), the algorithm will treat them as non-progressive predicates.

The second step of the algorithm initializes certain temporary data structures. solNum keeps track of the number of candidate solutions generated, Visited is the set of all tuples generated (candidate solution or not), and Marked is the set of all candidate solutions.
In steps 3(a) through 3(d), the algorithm uses a procedure similar to the one presented in [11, 10] to merge results using the ordered predicates (see Figure 5(b)). The algorithm stops when it finds k n-tuples which satisfy the condition in step 3e. Note that an n-tuple, s_i, satisfies this condition if and only if

- the probability of having another n-tuple, s_j, with a better score is less than or equal to Δ, i.e., when P_∃(i) = prob(∃ j > i such that μ_Q(s_i) < μ_Q(s_j)) ≤ Δ.

Let us refer to this condition as F(i,Δ) = (P_∃(i) ≤ Δ). Intuitively, if F(i,Δ) is true, then the probability of having another n-tuple, s_j, with a better combined score than the score of s_i, is at most Δ.
At the end of the third step, Marked contains k n-tuples which satisfy the condition. However, in the fourth step, the algorithm revisits all the n-tuples generated and put into Visited earlier, to see whether there are any better solutions among those that were visited. The best k n-tuples generated during this process are returned as the output, R.

Intuitively, the algorithm uses a technique similar to the ones presented in [11, 10] to generate a sequence of results that are ranked with respect to the ordered predicates. As mentioned above, this order does not necessarily correspond to the final order of the results, because it does not take the unordered scores into account. Therefore, for each tuple generated in the first stage, using the database statistics and the statistical properties of μ_Q, the algorithm estimates the probability of having a better result in the remainder of the database. If, for a given tuple, this probability is below a certain level, then this tuple is said to be a candidate to be in the top k results.
Note that, for the algorithm to work, we need to be able to calculate F(i,Δ) for the fuzzy semantics chosen for the query. For each different semantics, F(i,Δ) must be calculated in a different way. The following two examples show how to calculate F(i,Δ) for the product and min semantics we covered in Section 3.
Example 4.2 Let us assume that μ_Q is a scoring function that has the product semantics. Let μ_{Q,o} be the combined score of the ordered predicates and let μ_{Q,u} be the combined score of the unordered predicates. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F_×(i,Δ) is equal to

F_×(i,Δ) = ( prob(∃ j > i : μ_{Q,o}(s_j) · μ_{Q,u}(s_j) > μ_{Q,o}(s_i) · μ_{Q,u}(s_i)) ≤ Δ )
         = ( prob(∀ j > i : μ_{Q,o}(s_j) · μ_{Q,u}(s_j) ≤ μ_{Q,o}(s_i) · μ_{Q,u}(s_i)) ≥ 1 − Δ )
         = ( prob(∀ j > i : μ_{Q,u}(s_j) ≤ (μ_{Q,o}(s_i) · μ_{Q,u}(s_i)) / μ_{Q,o}(s_j)) ≥ 1 − Δ )

□
Example 4.3 Let us assume that μ_Q is a scoring function that has the min semantics. Then, given the ith tuple, s_i, ranked with respect to the ordered predicates, F_min(i,Δ) = (P_∃(i) ≤ Δ) is equal to

F_min(i,Δ) = ( prob(∃ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} > min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≤ Δ )
           = ( prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ min{μ_{Q,o}(s_i), μ_{Q,u}(s_i)}) ≥ 1 − Δ )

Since μ_{Q,o} is a non-increasing function of the rank, μ_{Q,o}(s_j) is smaller than or equal to μ_{Q,o}(s_i). Consequently, if μ_{Q,o}(s_i) ≤ μ_{Q,u}(s_i), then

F_min(i,Δ) = true

else

F_min(i,Δ) = ( prob(∀ j > i : min{μ_{Q,o}(s_j), μ_{Q,u}(s_j)} ≤ μ_{Q,u}(s_i)) ≥ 1 − Δ ).

□
As seen in the above examples, in order to find the value of F(i,Δ), we need to know the distributions of the scoring functions μ_{Q,o} and μ_{Q,u}. In Section 5.1, we will show how these statistical values can be calculated
for different fuzzy semantics. However, in some cases, such a score distribution function may not be readily
available. In such cases, we need to approximate the score distribution using the database statistics. Obviously,
such an approximation is likely to cause deviations from the expected results obtained using the algorithm. In
Section 5, we describe a method for approximating score distributions.
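When no closed form is convenient, P_∃(i) can also be estimated numerically. The following Python sketch illustrates this for the product semantics of Example 4.2; the sampler for μ_{Q,u} is a hypothetical stand-in for the approximated score distribution discussed in Section 5, and the function name is ours.

```python
def p_exists_product(mu_o_i, mu_u_i, later_mu_o, sample_mu_u, trials=1000):
    """Monte Carlo estimate of P_exists(i) under product semantics:
    the probability that some tuple s_j with j > i satisfies
    mu_o(s_j) * mu_u(s_j) > mu_o(s_i) * mu_u(s_i).
    later_mu_o holds the ordered-predicate scores of the unseen tuples
    (known, non-increasing); sample_mu_u() draws an unordered score
    from its approximated distribution."""
    target = mu_o_i * mu_u_i
    hits = sum(
        1 for _ in range(trials)
        if any(o * sample_mu_u() > target for o in later_mu_o))
    return hits / trials
```

In the algorithm, a tuple s_i would be marked as a candidate whenever this estimate falls at or below Δ.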
4.5 Correctness of the Algorithm
In this subsection, we show that the expected ratio of the relevant results, within the k top results returned by the
algorithm, is within the error bounds; i.e., we prove the correctness of the algorithm.
Theorem 4.1 Given an n-tuple, r_i, which is in the top k n-tuples returned in R, r_i is most probably (prob(r_i ∈ R_k) > 1 − Δ) in the set, R_k, of top k results. □

Proof 4.1 Given the ith n-tuple, r_i (i ≤ k) in R, r_i is either in Marked or not.

If r_i ∈ Marked, then it satisfies the condition in step 3e, and the probability that there is another tuple with a better combined score is less than or equal to Δ. Since i ≤ k, we can conclude that the probability that the subsequent searches will yield k − i + 1 tuples with a better combined score is also less than or equal to Δ. In other words, the probability that the subsequent searches will push r_i out of R_k is less than or equal to Δ. Consequently, the probability that r_i is in R_k is greater than 1 − Δ.

If r_i ∉ Marked, then there is an m_i ∈ Marked such that r_i > m_i and m_i satisfies the condition in Step 3e. Consequently, the probability that r_i is in R_k is again greater than 1 − Δ.
Theorem 4.2 The expected number of n-tuples that are in R that are also in R_k is greater than (1 − Δ) × k. □

Proof 4.2 Given a set, R, of k n-tuples such that each one is in R_k with a worst-case probability 1 − Δ (Theorem 4.1), the number, j, of the elements of R that are also in R_k is a random variable that has a binomial distribution with parameters p > 1 − Δ and q < Δ. Consequently, the expected value of j is greater than (1 − Δ) × k.
4.6 Complexity of the Algorithm
Note that the main loop of the proposed algorithm can be iterated many times during which no tuples are added to the Marked set. The following theorem states the effect of this on the complexity of the algorithm.

Theorem 4.3 If N is the database size (the number of all possible n-tuples) and m is |P_O|, and if prob(F(i,Δ) = true | 1 ≤ i ≤ N) is geometrically distributed with parameter F, and if the predicates are independent, then the expected running time of the algorithm, with arbitrarily high probability, is O(N^((m−1)/m) · (k/F) + (k/F) · log(k/F)). □
Proof 4.3 If we assume that the probability of having one n-tuple satisfying the condition in step 3e is geometrically distributed with parameter F, the expected number of n-tuples to be tested until the first suitable n-tuple is 1/F. In reality, F is not a constant and it tends to be higher for lower values of i. Consequently, the actual number of visited tuples is lower than the value provided by this assumption. See the results in Section 6 for details. Consequently, for k matches, the expected number of times the loop will be repeated is k/F. Step 3a of the algorithm takes O(N^((m−1)/m)) time (according to Fagin [10, 11]), with arbitrarily high probability, where N is the database size, i.e., the number of all possible n-tuples. Consequently, the expected running time of the while loop (or the expected number of tuples to be evaluated), with arbitrarily high probability, is O(N^((m−1)/m) · (k/F)).

The final selection of the k best solutions among the candidates then takes O((k/F) · log(k/F)) time. Consequently, the expected running time of the algorithm, with arbitrarily high probability, is O(N^((m−1)/m) · (k/F) + (k/F) · log(k/F)). □
Note that when N^((m−1)/m) · (k/F) ≪ N, the proposed algorithm accomplishes its task: it visits a very small portion of the database, yet generates approximately good results. This means that if k/F ≪ N^(1/m), then the algorithm will work most efficiently. In Section 6, we will show experimental results that verify these expectations.

Note that the theorem assumes that prob(F(i,Δ) = true | 1 ≤ i ≤ N) is geometrically distributed. The results presented in Section 6 will show that, if we replace the geometric distribution assumption with a more skewed distribution, the complexity of the algorithm relative to the database size will be less than predicted by the above theorem. Note also that, similar to the case in [11], although positive correlations between predicates may help reduce the complexity predicted above, negative correlations between predicates may result in higher complexities.
5 Approximating the Score Distribution using Statistics
In the previous section, we have seen that in order to use the proposed query evaluation algorithm, we need to calculate the score distribution for the combined scoring functions, μ_{Q,o} and μ_{Q,u}. In this section, we discuss the
Figure 7: The effect of (a) geometric average, (b) arithmetic average, and (c) minimum function with two pred-
icates. Horizontal axes correspond to the values of the two input predicates and the vertical axis corresponds to
the value of the conjunct according to the respective function.
differences in score distributions for various fuzzy semantics, followed by our proposed method for approximat-
ing the score distribution using statistics.
5.1 Statistical Properties of Fuzzy Semantics
We first investigate the score distributions and statistical properties of different fuzzy semantics. Figure 7 depicts
three mechanisms to evaluate conjunction. Figure 7(a) depicts the geometric averaging method (which is product
followed by a root operation), (b) depicts the arithmetic averaging mechanism used by other researchers [26],
and (c) the minimum function as described by Zadeh [17] and Fagin [10, 11]. In this section, we compare various
statistical properties of these semantics. These properties describe the shape of the combined score distribution
histograms.
5.2 Relative Importance of Query Criteria
An important advantage of the geometric average over the arithmetic average and the min functions is that, although it shows a linear behavior when the similarity values of the predicates are close to each other (where μ(x,y) = √(x·y) denotes the geometric average of two scores):

(x = y) → dμ(x,y)/dx = 1/(2·√(x·x)) · (x + x) = 1/(2x) · (2x) = 1,

it shows a non-linear behavior when one of the predicates has a lower similarity than the others:

(x ≪ y) → ∂μ(x,y)/∂x = √y / (2·√x) = ∞,    (x ≫ y) → ∂μ(x,y)/∂x = √y / (2·√x) = 0.
Example 5.1 The first item below shows the linear increase in the score of the geometric average when the input values are close to each other. The second item, on the other hand, shows the non-linearity of the increase when the input values are different:

- (0.5 × 0.5 × 0.5)^(1/3) = 0.5;  (0.6 × 0.6 × 0.6)^(1/3) = 0.6;  and (0.7 × 0.7 × 0.7)^(1/3) = 0.7.
Arithmetic average:  ( ∫₀¹ ∫₀¹ (x+y)/2 dy dx ) / ( ∫₀¹ ∫₀¹ dy dx ) = 1/2
Min:                 ( ∫₀¹ ∫₀¹ min{x,y} dy dx ) / ( ∫₀¹ ∫₀¹ dy dx ) = 1/3
Geometric average:   ( ∫₀¹ ∫₀¹ √(x·y) dy dx ) / ( ∫₀¹ ∫₀¹ dy dx ) = 4/9

Table 4: Average score of various scoring semantics
Arithmetic average:  1 − (8·α³)/3 (for α ≤ 0.5);  4/3 − 4·α² + (8·α³)/3 (for α ≥ 0.5)
Min:                 1 − 3·α² + 2·α³
Geometric average:   1 − α³ + 3·α³·ln(α)

Table 5: Score distribution (relative cardinality of a strong α-cut) of various scoring semantics
- (1.0 × 1.0 × 0.5)^(1/3) = 0.79;  (1.0 × 1.0 × 0.6)^(1/3) = 0.85;  and (1.0 × 1.0 × 0.7)^(1/3) = 0.88. □
It has been claimed that, for real-world and artificial nearest-neighbor workloads, only the highest-scoring predicates are interesting and the rest are not [27]. This implies that the min semantics, which gives the highest importance to the lowest-scoring predicate, may not be suitable for real workloads. The geometric average semantics, unlike the min semantics, does not suffer from this behavior.
Furthermore, the effect of an increase in the score of a sub-query (due to a modification/relaxation on the
query by the user or the system) with a small score value is larger than an equivalent increase in the score of a
sub-query with a large score value. This implies that, although the sub-queries with a high score have a larger
role in determining the final score, relaxing a non-satisfied sub-query may have a significant impact on improving
the final score. This makes sense as an increase in a low scoring sub-query increases the interestingness of the
sub-query itself.
Average score: The first statistical property that we consider in this section is the average score, which measures, assuming a uniform distribution of input scores, the average output score. The average score, or the relative cardinality, of a fuzzy set with respect to its discourse (or domain) is defined as the cardinality of the set divided by the cardinality of its discourse. We can define the relative cardinality of a fuzzy set S with a scoring function μ(x), where x ranges between 0 and 1, as ( ∫₀¹ μ(x) dx ) / ( ∫₀¹ dx ). Consequently, the average score of the conjunction semantics can be computed as shown in Table 4. Note that, if analogously defined, the relative cardinality of the crisp conjunction is

( μ_(false∧false) + μ_(false∧true) + μ_(true∧false) + μ_(true∧true) ) / |{(false ∧ false), (false ∧ true), (true ∧ false), (true ∧ true)}| = (0 + 0 + 0 + 1) / 4 = 1/4.
If no score distribution information is available, the average score (assuming a uniform distribution of inputs)
can be used to calculate a crude estimate of the ratio of the inputs that have a score greater than a given value.
However, a better choice is obviously to look at the expected score distributions of the merge functions.
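The averages in Table 4 are easy to confirm numerically; the following midpoint-rule integration over the unit square (our own check, not from the paper) reproduces 1/2, 1/3, and 4/9 for the three merge functions.

```python
import math

def avg_score(merge, grid=200):
    """Integrate a two-input merge function over [0,1]^2 with the
    midpoint rule and return the average output score."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) * h
        for j in range(grid):
            y = (j + 0.5) * h
            total += merge(x, y)
    return total * h * h

avg_arith = avg_score(lambda x, y: (x + y) / 2)       # close to 1/2
avg_min   = avg_score(min)                            # close to 1/3
avg_geo   = avg_score(lambda x, y: math.sqrt(x * y))  # close to 4/9
```

The ordering avg_min < avg_geo < avg_arith also matches Figure 8: min concentrates mass at low scores, the geometric average sits in between, and the arithmetic average is the most generous.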
Figure 8: Score distribution of various score merging semantics (lower curve = min, middle curve = geometric
average, and upper curve = arithmetic average)
Figure 9: Various score distributions: (a) Uniform distribution, (b) Zipf’s distribution, (c) vector-space distribu-
tion
Score Distribution: The second property we investigate is the score distribution. The study of the score distribution of fuzzy algebraic operators is essential in creating histograms that can be used in the algorithm we proposed. The strong α-cut of a fuzzy set is defined as the set of elements of the discourse that have score values equal to or larger than α. The relative cardinality of a strong α-cut (μ_{x∧y} ≥ α) of a conjunction with respect to its overall cardinality (μ_{x∧y} ≥ 0) describes the concentration of scores above a threshold α. Table 5 and Figure 8 show the score distributions of the various scoring semantics (assuming a uniform distribution of inputs). Note that when α is close to 1, the relative cardinality of the strong α-cut of the geometric average behaves like that of the arithmetic average.
5.3 Approximation of Score Distributions
In the previous subsection, we studied the score distribution of various merge functions. These distributions were
generated assuming a uniform distribution of input values and can be used when there is no other information. An
alternative approach, on the other hand, is to approximate the score distribution function using database statistics
or domain knowledge.
5.3.1 Score Distributions
There are various ways in which the scoring function of a predicate can behave. The following is a selection of
possible distributions:
Figure 10: The equi-distance spheres enveloping a query point in a three-dimensional space
Uniform distribution of scores: In this case, the probability that a fuzzy predicate will return a score within the range [a/k, (a+1)/k), where 0 ≤ a < k, is 1/k. Figure 9(a) shows this behavior.

Zipf's distribution of scores: It is generally the case that multimedia predicates have a small number of high-scoring and a very high number of low-scoring inputs. For instance, according to Zipf's distribution [37, 38], the probability that, for a given data element, a fuzzy predicate will return a score within the range [a/k, (a+1)/k), where 0 ≤ a < k, is 1/((a+1) · log(1.78 · k)) (Figure 9(b)).
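A quick sanity check on the Zipfian model: since log(1.78 · k) approximates the harmonic number H_k = Σ_{a=0}^{k−1} 1/(a+1), the bin probabilities sum to roughly 1. The small Python check below is our own illustration.

```python
import math

def zipf_score_probs(k):
    """Probability that a score falls in bin [a/k, (a+1)/k) under the
    Zipfian model: 1 / ((a+1) * log(1.78 * k)), for a = 0..k-1."""
    return [1.0 / ((a + 1) * math.log(1.78 * k)) for a in range(k)]

probs = zipf_score_probs(10)
# probs[0] (the lowest-score bin) is the largest, the list decreases
# with a, and the probabilities sum to approximately 1.
```

This matches the stated intuition: the low-score bins carry most of the mass, and the high-score bins very little.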
Vector-space distribution of scores: In information retrieval or multimedia databases, documents or media objects are generally represented as points in an n-dimensional space. This is called the vector-model representation of the data [28, ?]. The predicates use the distance (Euclidean, city-block, etc.) between the points in this space to evaluate the dissimilarities of the objects. Therefore, the further apart the objects are from each other, the more dissimilar they are. Hence, given two objects, o_1 and o_2, in the vector space, their similarity can be measured as

sim(o_1, o_2) = 1 − δ(o_1, o_2) / max_δ,

where δ(o_1, o_2) is the distance between o_1 and o_2 and max_δ is the maximum distance between any two points in the database. Note that sim(o_1, o_1) is equal to 1.0 and the minimum possible score is 0.0.
Given a nearest-neighbor query q (a very common type of query, which asks for objects that fall near a given query point), we can divide the space into k slices using k spheres, each of which is max_δ/k apart from the next. Figure 10 shows three spheres enveloping a query point in a three-dimensional space. In this example, each of these spheres is max_δ/10 units apart from the next; i.e., k is equal to 10.

Note that in an n-dimensional space, the volume of a sphere of radius r is C · r^n, where C is some constant (for example, in a two-dimensional space, the area of a circle is π·r², and in a three-dimensional space, the volume of a sphere is (4/3)·π·r³). Consequently, in an n-dimensional space, the volume between two consecutive, ith and
Figure 11: Two example queries to an image database
(i−1)th, spheres can be calculated as

C · (i · max_δ/k)^n − C · ((i−1) · max_δ/k)^n = C′ · i^(n−1) + o(i^(n−1)).

Hence, assuming that the points are uniformly distributed in the space, the expected ratio of the points that fall into this volume to the points that fall into the next larger slice is approximately (i/(i+1))^(n−1). Therefore, if the number of points in the innermost sphere is I, then the number of points in

- the second slice is O(I · 2^(n−1)),
- the third slice is O(I · 3^(n−1)),
- the fourth slice is O(I · 4^(n−1)), and so on.

Hence, assuming a uniform distribution of points in the space, the probability that, for a given data element, a fuzzy predicate will return a score within the range [a/k, (a+1)/k), where 0 ≤ a < k, is M · (k−a)^(n−1), where M is a positive constant and n is the number of dimensions in the vector space. Figure 9(c) shows this behavior. Note that, when the number of dimensions is higher, the curve becomes steeper.
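This behavior can be observed empirically. The sketch below is our own illustration, not from the paper: it places a query point at a corner of the unit hypercube, scatters points uniformly, and builds the score histogram of a distance-based predicate.

```python
import math
import random

def score_histogram(dim=3, n_points=20000, k=10, seed=0):
    """Empirical score distribution of sim = 1 - dist/max_dist for
    points uniform in the unit hypercube, queried from a corner.
    Returns counts for the k score bins [a/k, (a+1)/k)."""
    rng = random.Random(seed)
    query = [0.0] * dim
    max_dist = math.sqrt(dim)        # diameter of the unit hypercube
    bins = [0] * k
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        s = 1.0 - math.dist(query, p) / max_dist
        bins[min(int(s * k), k - 1)] += 1
    return bins
```

High-score bins (points near the query) collect far fewer points than mid-range bins, as the sliced-sphere argument predicts; in a bounded hypercube the very lowest bins are also small, since distances close to max_dist occur only near the opposite corner.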
Clustered Score Distributions: In the real world, the assumption that points are uniformly distributed across the
vector space does not always hold. Figure 11 provides two example score distributions from an image database,
ImageRoadMap [?]. This particular database contains 25000 images. The retrieval predicate omits those images that are beyond a certain threshold distance from the query point and then appropriately scales the scores to cover the range [0, 1]. Therefore, although not shown in the figure, the bin corresponding to the score 0 is, in actuality, very large.

Figures 11(a) and (b) both show a rapid increase in the score distribution, as predicted by the Zipfian and vector-based models. However, both figures also show another phenomenon not predicted by these models: the distribution starts decreasing after a point instead of continuously increasing. This is due to the fact that the points in the vector space are not uniformly distributed; instead, they tend to form clusters. This is especially apparent in Figure 11(b), where there are not one, but two local maxima, corresponding to two different clusters.
Note, however, that the fact that clusters exist in the vector space does not mean that the vector space model
described earlier cannot be used to reason about score distributions: since the volume between slices in the vector
Figure 12: A clustered distribution and the enveloping curve
space increases as we get further away from the center, slices farther from the query point are likely to contain more
clusters than the closer ones (assuming that the clusters themselves are uniformly distributed). Therefore, a zipfian
or vector-based model can be used to model an envelope curve for a large vector space with clusters of points
(Figure 12).
Merged Score Distributions: When cached results and materialized views are used for answering queries or sub-
queries, a single predicate in a given query may correspond to a cached combination of multiple sub-predicates
put together (in advance) using a merge function. As we have seen in Section 5.1, even when their inputs
are uniformly distributed, the score distributions corresponding to such score merge functions show a skewed
behavior: the number of low-scoring inputs is much larger than the number of high-scoring inputs.
Summary: Since, in most cases, the number of low-scoring inputs is much larger than the number of high-scoring
inputs, in the rest of the paper we focus our attention on this kind of scoring function. However, this does
not mean that scoring functions with other behaviors cannot exist.
5.3.2 Selection of an Appropriate Scoring Function
Since in this section we concentrate on score distributions where the number of low-scoring inputs is much larger
than the number of high-scoring inputs, in order to model and approximate $\mu_{Q,o}$ and $\mu_{Q,u}$, we need to find a
generic scoring function which has a similar behavior. Clearly, various functions can be used for approximating
the scoring function. We see that two good candidates are
- $f^{a}_{\alpha,\beta,\phi}(rank) = \frac{\alpha}{rank+\beta} + \phi$ and
- $f^{b}_{\alpha,\beta,\phi}(rank) = \alpha \cdot \left(\frac{max\_rank - rank}{max\_rank}\right)^{\beta} + \phi$.
Intuitively, in both cases, $f(i)$ gives the $i$th highest score of the predicate. Depending on the values of the $\alpha$, $\beta$, and
$\phi$ parameters, these functions can describe rapidly and slowly decreasing scoring functions. The advantage of
the first function over the second one is that it does not need the maximum rank information ($max\_rank$) and it
can be evaluated more easily. Note also that the first function can describe a large set of curvatures, both concave
and convex (Figure 13(a,b,c,d)). Therefore, we use $f^{a}$ in the rest of this paper.
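As a quick numerical sketch, the first candidate can be evaluated directly; the parameter values below are the ones reported for Figure 13(a), and the function decreases monotonically in rank as expected:

```python
def f_a(rank, alpha, beta, phi):
    """First approximation function: estimated score of the rank-th
    best match, f^a(rank) = alpha / (rank + beta) + phi."""
    return alpha / (rank + beta) + phi

# Parameters from Figure 13(a): alpha = 207, beta = 197, phi = 0
scores = [f_a(r, 207.0, 197.0, 0.0) for r in (1, 100, 1000)]
# scores decrease with rank: high ranks map to low similarity values
```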
Figure 13: The first score approximation function: two functions (a and c, with parameters $\alpha=207, \beta=197, \phi=0$ and $\alpha=-1333.9, \beta=447.3, \phi=1.34$, respectively) and the corresponding score distributions (b and d); (e) approximation of the combined score of a query, $P(X) \wedge Q(Y)$
Figure 13(e) shows two possible approximations, each with different parameters, for the combined score of
a query of the form $P(X) \wedge Q(Y)$, where $P(X)$ and $Q(Y)$ are ordered and both satisfy Zipf's distribution.
This figure shows that the function not only approximates the predicate scores, but can also approximate
combined scores for a query. Therefore, we can conclude that the proposed approximation function (1) is not limited
to predicates with a Zipfian distribution and (2) can handle merged scores with non-uniformly distributed
inputs.
5.3.3 Calculating $F(i, \Delta)$ using the Approximate Scoring Function

Note that an approximate scoring function, $f_{\alpha,\beta,\phi}(i)$, for the $i$th highest score, is not enough for the algorithm
proposed in Section 4.4. Instead, the algorithm needs the value of $F(i, \Delta)$,$^4$ assuming that the behaviors of
$\mu_{Q,o}$ and $\mu_{Q,u}$ are approximated with functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$.
Let us assume that $\mu_Q$ is a scoring function that has the product semantics. Let $\mu_{Q,o}$ be the combined score of
the ordered predicates and let $\mu_{Q,u}$ be the combined score of the unordered predicates. Let $\mu_{Q,o}(x)$ and $\mu_{Q,u}(x)$
be approximated with functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$. Then, given the $i$th tuple, $s_i$, ranked with respect
to the ordered predicates in $Q$, $F(i, \Delta) = (P_{\exists}(i) \le \Delta)$ is equal to
$$(prob(\forall j > i:\; \mu_{Q,o}(s_j) \times \mu_{Q,u}(s_j) \le \mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)) \ge 1-\Delta);$$
$^4$ Reminder: If $F(i, \Delta)$ is true, then given a tuple $s_i$, the probability of having another tuple, $s_j$, with a better combined score than the
score of $s_i$, is less than $\Delta$.
Figure 14: The ratio of the unordered tuples that could satisfy the constraint is $\frac{N - l_j}{N}$
$$(prob(\forall j > i:\; \mu_{Q,u}(s_j) \le \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)}{\mu_{Q,o}(s_j)}) \ge 1-\Delta);$$
$$(prob(\forall j > i:\; \mu_{Q,u}(s_j) \le \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i)}{\frac{\alpha_o}{j+\beta_o} + \phi_o}) \ge 1-\Delta);$$
$$(prob(\forall j > i:\; \mu_{Q,u}(s_j) \le \frac{\mu_{Q,o}(s_i) \times \mu_{Q,u}(s_i) \times (j + \beta_o)}{\alpha_o + \phi_o \times (j + \beta_o)}) \ge 1-\Delta);$$
or, in other words,
$$(\prod_{j=i+1}^{N} prob(\mu_{Q,u}(s_j) \le \frac{a_i \cdot j + b_i}{c \cdot j + d}) \ge 1-\Delta).$$
Note that $a_i$, $b_i$, $c$, and $d$ are used as shorthands. To simplify the calculations, let us split the above inequality into
two parts as follows:
$$F(i, \Delta) = (S_i \ge 1-\Delta) \quad \text{where} \quad S_i = \prod_{j=i+1}^{N} prob(\mu_{Q,u}(s_j) \le \frac{a_i \cdot j + b_i}{c \cdot j + d}).$$
Since $\mu_{Q,u}(s_j)$ is the combined score of the unordered predicates, given a rank $j$ calculated with respect to the
ordered predicates, we cannot estimate the value of $\mu_{Q,u}(s_j)$. However, using the function $f_{\alpha_u,\beta_u,\phi_u}$, we can
estimate the implicit rank $l_j$ at which $f_{\alpha_u,\beta_u,\phi_u}(l_j)$ is equal to $\kappa_{i,j} = \frac{a_i \cdot j + b_i}{c \cdot j + d}$ as follows:
$$f_{\alpha_u,\beta_u,\phi_u}(l_j) = \frac{a_i \cdot j + b_i}{c \cdot j + d};$$
$$\frac{\alpha_u}{l_j + \beta_u} + \phi_u = \frac{a_i \cdot j + b_i}{c \cdot j + d}, \quad \text{which gives us} \quad l_j = \frac{a'_i \cdot j + b'_i}{c'_i \cdot j + d'_i}.$$
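The inversion step above is simple enough to verify numerically. The sketch below uses hypothetical values for the shorthands $a_i$, $b_i$, $c$, $d$ and the unordered-predicate parameters; it solves $f_{\alpha_u,\beta_u,\phi_u}(l_j) = \kappa_{i,j}$ for $l_j$ and checks the solution by substitution:

```python
def kappa(j, a, b, c, d):
    """Score threshold kappa_{i,j} = (a*j + b)/(c*j + d)."""
    return (a * j + b) / (c * j + d)

def implicit_rank(j, a, b, c, d, alpha_u, beta_u, phi_u):
    """Solve alpha_u/(l_j + beta_u) + phi_u = kappa_{i,j} for l_j."""
    return alpha_u / (kappa(j, a, b, c, d) - phi_u) - beta_u

# Hypothetical shorthands: a=1, b=2, c=3, d=4; f_u parameters: 10, 1, 0
l_j = implicit_rank(2, 1.0, 2.0, 3.0, 4.0, 10.0, 1.0, 0.0)
# substituting back: 10/(l_j + 1) + 0 equals kappa(2, ...) = 0.4
```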
Note, on the other hand, that, as shown by the shaded region in Figure 14, the ratio of all tuples, $s_j$, such that
$\mu_{Q,u}(s_j) \le \kappa_{i,j}$ is $\frac{N - l_j}{N}$. Hence, we have
$$S_i = \prod_{j=i+1}^{N} prob(\mu_{Q,u}(s_j) \le \kappa_{i,j}) = \prod_{j=i+1}^{N} \frac{N - l_j}{N}.$$
Since this ratio corresponds to a probability, we must make sure that $0.0 \le \frac{N - l_j}{N} \le 1.0$. In other words,
$$0.0 \le 1 - \frac{1}{N} \cdot \left(\frac{a'_i \cdot j + b'_i}{c'_i \cdot j + d'_i}\right) \le 1.0,$$
which means that we can find two limits such that
$$l^{\perp}_i \le j \le l^{\top}_i.$$
Consequently,
- if $i + 1 \le l^{\perp}_i$, then there will be at least one $i + 1 \le j \le N$ such that $\frac{N - l_j}{N}$ will be $0.0$. Hence
$S_i = 0.0 < 1-\Delta$; and, assuming that the error ratio, $\Delta$, is less than $1.0$, the condition $F(i, \Delta) = false$.
- If, on the other hand, $i + 1 \ge l^{\top}_i$, for all $i + 1 \le j \le N$, $\frac{N - l_j}{N}$ will be $1.0$. Hence, $S_i = 1.0 \ge 1-\Delta$ and
$F(i, \Delta) = true$.
- Finally, if neither is the case, if $l^{\top}_i < N$, then since for all $j \ge l^{\top}_i$ we have $\frac{N - l_j}{N} = 1.0$, we get $S_i =
\prod_{j=i+1}^{N} \frac{N - l_j}{N} = \prod_{j=i+1}^{l^{\top}_i} \frac{N - l_j}{N}$. Therefore, $S_i$ can be rewritten as $\prod_{j=i+1}^{lim_i} \frac{N - l_j}{N}$, where $lim_i = \min\{l^{\top}_i, N\}$.
Note that we can further rewrite $S_i$ as
$$\prod_{j=i+1}^{lim_i} \frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i},$$
where $a''_i$, $b''_i$, $c''_i$, and $d''_i$ are used as shorthands.$^5$
Since we have guaranteed that the ratio $\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}$ is always between $0.0$ and $1.0$, we can convert the product
into a summation by taking the logarithm of both sides of the equation:
$$\ln(S_i) = \sum_{j=i+1}^{lim_i} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right).$$
Note that, since it is not straightforward to solve the above summation, we can instead replace the equality
with two inequalities that bound the value of $\ln(S_i)$ from above and below. These two inequalities are
$$\ln(S_i) \le \int_{i+1}^{lim_i} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right)dj \quad \text{and} \quad \ln(S_i) \ge \int_{i}^{lim_i - 1} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right)dj.$$
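A quick numerical sketch (with a hypothetical ratio $(j + 50)/(j + 100)$, which stays within $(0,1)$) illustrates how the shifted integrals sandwich the log-sum; the midpoint rule stands in for the closed-form antiderivative:

```python
import math

def g(j):
    """Hypothetical per-rank ratio (a''*j + b'')/(c''*j + d'') in (0, 1)."""
    return (j + 50.0) / (j + 100.0)

def log_sum(i, lim):
    """ln(S_i) computed exactly as a sum of logarithms."""
    return sum(math.log(g(j)) for j in range(i + 1, lim + 1))

def log_integral(lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of ln(g) over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(math.log(g(lo + (s + 0.5) * h)) for s in range(steps))

i, lim = 10, 200
exact = log_sum(i, lim)
upper = log_integral(i + 1, lim)   # bounds ln(S_i) from above
lower = log_integral(i, lim - 1)   # bounds ln(S_i) from below
```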
$^5$ $S_i$ can also be calculated in closed form as
$$\left(\frac{a''_i}{c''_i}\right)^{lim_i - i} \cdot \frac{\Gamma(lim_i + 1 + \frac{b''_i}{a''_i}) \cdot \Gamma(i + 1 + \frac{d''_i}{c''_i})}{\Gamma(i + 1 + \frac{b''_i}{a''_i}) \cdot \Gamma(lim_i + 1 + \frac{d''_i}{c''_i})},$$
where $\Gamma(x) = (x-1)!$. But we will provide an alternative formulation.
Using the second of the two inequalities, we can get
$$S_i \ge e^{\int_{i}^{lim_i - 1} \ln\left(\frac{a''_i \cdot j + b''_i}{c''_i \cdot j + d''_i}\right)dj} = T_i.$$
Since $F(i, \Delta) = (S_i \ge 1-\Delta)$, we can conclude that
$$F(i, \Delta) = (T_i \ge 1-\Delta).$$
Consequently, putting together the different cases encountered so far, we have
- if $i + 1 \le l^{\perp}_i$, then $F(i, \Delta) = false$;
- else if $i + 1 \ge l^{\top}_i$, then $F(i, \Delta) = true$;
- else $F(i, \Delta) = (T_i \ge 1-\Delta)$.
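The three-case test can be sketched as a small decision routine; the rank limits $l^{\perp}_i$, $l^{\top}_i$ and the bound $T_i$ are assumed to have been computed already, as in the derivation above:

```python
def f_delta(i, l_bot, l_top, T_i, delta):
    """Three-case evaluation of F(i, Delta): l_bot and l_top are the
    rank limits between which (N - l_j)/N is a proper probability."""
    if i + 1 <= l_bot:           # some factor of S_i is 0.0
        return False
    if i + 1 >= l_top:           # every factor of S_i is 1.0
        return True
    return T_i >= 1.0 - delta    # fall back on the integral bound T_i

# e.g. with l_bot = 10, l_top = 100, delta = 0.2:
# i = 5   -> False (a later tuple is nearly certain to score higher)
# i = 150 -> True  (no later tuple can score higher)
```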
Note, therefore, that a given tuple $s_i$, ranked $i$th with respect to the ordered predicates, which satisfies the boundary
conditions, is most probably in the result ($prob(s_i \in R_k) > 1-\Delta$) if
$$T_i \ge 1-\Delta.$$
Final note: If the approximate score function overestimates $\mu_{Q,u}$, then the value of $lim_i$ (the limit rank
below which the probability of finding a larger combined score goes to 0) is also overestimated. An important
consequence of this overestimation is that $F(i, \Delta)$ may evaluate to false in more situations than necessary.
Therefore, in order to balance such overestimations, we can introduce a new parameter, $win$, and set $lim_i$
to $i + win$. However, since the approximation function given in the previous section is flexible enough to fit
various situations, we believe that this parameter will not be necessary in most cases. Nevertheless, in the
experiments section, we experimented with different values of the $win$ parameter as well.
6 Experimental Evaluation
We have conducted a set of experiments to evaluate the algorithm we proposed to reduce the number of tuples
searched when looking for the best k matches to a fuzzy query. The primary goal of the experiments was to see
whether (a) the proposed algorithm returns the expected percentage of the results by (b) exploring only a small
Figure 15: The distribution of $R(X,Y)$ and its overestimating approximation ($\alpha = 197315$, $\beta = 207807$, $\phi = 0$)
fraction of the entire search space. These experiments were also aimed at investigating whether (c) the proposed
approach of approximating predicates through their statistical properties affects the solutions obtained by the
algorithm negatively. In this section, we report on the observations we obtained through simulations.
6.1 Experiment Setup
In our experiments, we have varied (1) the number of top results required (k), (2) the database size (or the
search space size), and (3) the error threshold. The query that we used for the simulations contains two ordered
predicates, P(X) and Q(Y), and one unordered predicate, R(X,Y). We varied the sizes of predicates P
and Q between 100 and 1,000. Note that this means that there are up to 1,000,000 tuples in the database $[X, Y]$.
This is similar to the case when there are 1,000,000 images in the database and we are using features with 1,000
different values for retrieval.
We ran the experiments on a Linux platform with a 300MHz Pentium machine with 128 MB main memory.
Each experiment was run 20 times and the results were averaged.
We generated the fuzzy values for the predicate scores according to Zipf's distribution [37, 38]. This
distribution guarantees that the number of high scores returned by the algorithm is less than the number of low
scores, fitting the profile of many multimedia predicates. More specifically, we set the probability that, for
a given data element, a fuzzy predicate will return a score within the range $[\frac{a}{10}, \frac{a+1}{10})$, where $0 \le a < 10$, to
$\frac{1}{(a+1) \cdot \log(1.78 \cdot 10)}$.
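This score generation step can be sketched as follows (a minimal version, assuming the bucket is drawn with probability proportional to $1/(a+1)$ and the score is then placed uniformly inside the bucket):

```python
import random

def zipf_score(k=10, rng=random):
    """Draw a score bucket a in [0, k) with probability proportional to
    1/(a+1) (Zipf), then return a uniform score inside [a/k, (a+1)/k)."""
    weights = [1.0 / (a + 1) for a in range(k)]
    a = rng.choices(range(k), weights=weights)[0]
    return (a + rng.random()) / k

random.seed(0)
scores = [zipf_score() for _ in range(10000)]
# low scores dominate: bucket [0, 0.1) is roughly ten times as
# frequent as bucket [0.9, 1.0), matching the 1/(a+1) weights
```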
We used product semantics as the fuzzy semantics for retrieval. We approximated $\mu_{Q,o}$ and $\mu_{Q,u}$ with
functions $f_{\alpha_o,\beta_o,\phi_o}(x)$ and $f_{\alpha_u,\beta_u,\phi_u}(x)$. For instance, for the case when the size of the database is 1,000,000,
the corresponding approximation parameters are chosen as $\langle\alpha_o = 7913.20, \beta_o = 8000.39, \phi_o = 0.0\rangle$ (for
$P(X) \wedge Q(Y)$, as shown in Figure 13(e)) and $\langle\alpha_u = 197315.00, \beta_u = 207807.59, \phi_u = 0.0\rangle$ (for $R(X,Y)$). Note
that the parameters are chosen such that the approximation for $R(X,Y)$ overestimates the scores (Figure 15). To
account for this overestimation, we varied $win$ between 1 and 5.
For comparison purposes we have also implemented a brute force algorithm which explores all the search
Figure 16: (a)-(b) Percentage of real top-k results among the tuples returned by the algorithm (a: exp=20%, win=2; b: exp=80%, win=4) and (c)-(d) the number of tuples enumerated (c: exp=20%, win=2; d: exp=80%, win=4); figures are plotted to the square-root scale for database size
space and returns the actual rank of each element in the database with respect to a given query. In the next section,
we describe our observations.
6.2 Experiment Results
The initial set of experiments showed that, as expected, due to the overestimation of $\mu_{Q,u}$ (Figure 15), the
algorithm found all the data elements in the top k; however, it performed significantly more comparisons than required and, in
the worst case, degenerated to checking all tuples. This condition, however, gave us a reasonable framework
in which to study the effect of correcting imperfect approximations through the use of the $win$ parameter,
which limits the look-ahead.
Figures 16(a) and (b) show the percentage of real top-k results among the tuples returned by the algorithm
for two different expected percentages: (a) 20% and (b) 80% (or 0.8 and 0.2 error thresholds, respectively). For
the first case we used win = 2, and for the second case we used win = 5. The reason why, in this particular
figure, we are using a larger win value for the smaller error threshold is that, when we want smaller errors, the
algorithm must make its estimations using more information. Results for the other parameter assignments are
provided in the Appendix. Both of these figures show that, as the value of k increases, the algorithm performs
better, in the sense that it returns closer to the expected ratio of real top-k results. Note, on the other hand, that
for the 20% case, when the database is small, the effect is a higher number of top-k results (which means that the
algorithm visits more tuples than it should for the percentage provided by the user). Note that in both cases, when
k and the database size are sufficiently large (50 and 5,000, respectively), the algorithm provides at least the expected
ratio of top-k results.
Figures 16(c) and (d) show the number of tuples visited by the algorithm. According to these figures, the
proposed algorithm works very efficiently. For instance, when the database size is 1,000,000, k is 100, and the
expected ratio of top-k matches is 80%, the algorithm visits around 2,000 tuples; i.e., 0.2% of the database.

Note that Theorem 4.3 implies that the number of tuples visited should be $O(\frac{N^{\frac{m-1}{m}} \cdot k}{F})$, where $N$ is the
database size, $m$ is the number of ordered predicates (2 in our case), $k$ is the number of results, and $F =
prob(F(i, \Delta) = true \mid 1 \le i \le N)$. According to this, the number of tuples visited must increase linearly with
k. This expectation is confirmed by both figures. Again according to the theorem, since $m = 2$, for a given k and
assuming that $F$ is a constant, the number of visited tuples must be $O(\sqrt{N})$. In the simulation, we actually
observed an even smaller increase in the number of tuples with increasing database size (note that the figures are
plotted to the square-root scale for database size to provide a better visualization of the results). This is because the
theorem gives an upper bound on the number of tuples visited: the proof assumes that n-tuples satisfying the
condition $F(i, \Delta)$ are geometrically distributed with parameter $F$. In the simulations, however, since we used
Zipf's distribution for the predicates, the actual distribution of $F$ is not geometric. Consequently, the algorithm
finds the matches much earlier than the upper bound suggests.
A complete set of experiment results with various parameter settings is given in the Appendix. The results
presented in the Appendix mimic the sample of results presented in this section.
7 Comparison with the Related Work
In addition to the various related works we mention along with the presentation of our approach, a notable recent work
by Donjerkovic and Ramakrishnan [?] aims at optimizing a "top k results" type of query probabilistically by
viewing the selectivity estimates as probability distributions. However, unlike our approach, it does not address
fuzzy queries, but exact-match queries on data that have an inherent order (such as salaries). Also, instead of
allowing a limited error on the results themselves, it uses the selectivity estimates in optimizing the cost of exact
retrieval of the first k results by providing probabilistic guarantees for the optimization cost. Also recently, Chaudhuri
and Gravano [?] considered the problem of processing top-k selection queries; however, their work has a different focus
from ours. It focuses on providing a way to convert top-k queries into regular database queries. A more relevant
work is by Acharya et al. [?], which proposes algorithms aimed at providing approximate answers to warehouse
queries using statistics about database content. Unlike our approach, it does not provide mechanisms to deal with
similarity-based query processing, and it does not address the problem of finding "top k results" incrementally.
8 Conclusion
In this paper, we first presented the difference between the general fuzzy query and multimedia query
evaluation problems. More specifically, we pointed to the multimedia precision/recall semantics, the partial
match requirement, and the unavoidable necessity of fuzzy, but non-progressive, predicates. Next, we presented
an approximate query evaluation algorithm that builds on [10, 11] to address the existence of non-progressive
fuzzy predicates. The proposed algorithm returns a set R of k results, such that each result $r \in R$ is most
probably in the set, $R_k$, of top k results of the query. This algorithm uses the statistical properties of the fuzzy
predicates as well as the merge functions used to combine the fuzzy values returned by individual predicates.
It minimizes unnecessary accesses to non-progressive predicates, while providing error bounds on the top-k
retrieval results. Since the performance of the algorithm depends on the accuracy of the database statistics, we
also discussed techniques to generate and maintain relevant statistics. Finally, we presented simulation results
evaluating the proposed algorithm in terms of quality of results and search space reduction.
Acknowledgements
We thank Dr. Golshani and Y.-C. Park for providing us with data distributions, which we used for evaluating the
score approximations, from their image database, ImageRoadMap.
References
[1] S. Y. Lee, M. K. Shan, and W. P. Yang. Similarity Retrieval of ICONIC Image Databases Systems. Pattern
Recognition, 22(6):675–682, 1989.
[2] A. Prasad Sistla, Clement Yu, Chengwen Liu, and King Liu. Similarity based Retrieval of Pictures Us-
ing indices on Spatial Relationships. In Proceedings of the 1995 VLDB Conference, Zurich, Switzerland,
September 23-25, 1995.
[3] Wen-Syan Li and K. Selcuk Candan. SEMCOG: A Hybrid Object-based Image Database System and Its
Modeling, Language, and Query Processing. In Proceedings of the 14th International Conference on Data
Engineering, Orlando, Florida, USA, February 1998.
[4] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Facilitating Multimedia Database
Exploration through Visual Interfaces and Perpetual Query Reformulations. In Proceedings of the 23rd
International Conference on Very Large Data Bases, pages 538–547, Athens, Greece, August 1997. VLDB.
[5] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Hierarchical Image Modeling for
Object-based Media Retrieval. Data & Knowledge Engineering, 27(2):139–176, July 1998.
[6] R. Fagin and E. L. Wimmers. Incorporating user preferences in multimedia queries. In F. Afrati and
P. Koliatis, editors, Database Theory – ICDT ’97, volume 1186 of LNCS, pages 247–261, Berlin, Germany,
1997. Springer Verlag.
[7] R. Fagin and Y. S. Maarek. Allowing users to weight search terms. Technical Report RJ10108, IBM
Almaden Research Center, San Jose, CA, 1998.
[8] S. Y. Sung. A linear transform scheme for combining weights into scores. Technical Report TR98-327,
Rice University, Houston, TX, 1998.
[9] S. Adali, Piero A. Bonatti, Maria Luisa Sapino, and V. S. Subrahmanian. A Multi-Similarity Algebra. In
Proceedings of the 1998 ACM SIGMOD Conference, pages 402–413, Seattle, WA, USA, June 1998.
[10] Ronald Fagin. Fuzzy Queries in Multimedia Database Systems. In 17th ACM Symposium on Principles of
Database Systems, pages 1–10, June 1998.
[11] Ronald Fagin. Combining Fuzzy Information from Multiple Systems. In 15th ACM Symposium on Princi-
ples of Database Systems, pages 216–226, 1996.
[12] Wen-Syan Li, K. Selcuk Candan, Kyoji Hirata, and Yoshinori Hara. Facilitating Multimedia Database
Exploration through Visual Interfaces and Perpetual Query Reformulations. In Proceedings of the 23rd
International Conference on Very Large Data Bases, pages 538–547, Athens, Greece, August 1997. VLDB.
[13] Christos Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, Boston,
1996.
[14] R. Richardson, Alan Smeaton, and John Murphy. Using Wordnet as a Knowledge base for Measuring
Conceptual Similarity between Words. In Proceedings of Artificial Intelligence and Cognitive Science
Conference, Trinity College, Dublin, 1994.
[15] Weining Zhang, Clement Yu, Bryan Reagan, and Hiroshi Nakajima. Context-Dependent Interpretations of
Linguistic Terms in Fuzzy Relational Databases. In Proceedings of the 11th International Conference on
Data Engineering, Taipei, Taiwan, March 1995. IEEE.
[16] Surajit Chaudhuri and Luis Gravano. Optimizing Queries over Multimedia Repositories. In Proceedings of
the 1996 ACM SIGMOD Conference, pages 91–102, Montreal, Canada, June 1996.
[17] L. Zadeh. Fuzzy Sets. Information and Control, pages 338–353, 1965.
[18] H. Nakajima. Development of Efficient Fuzzy SQL for Large Scale Fuzzy Relational Database. In Pro-
ceedings of the 5th International Fuzzy Systems Association World Conference, pages 517–520, 1993.
[19] V.S. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A Flexible Probabilistic Database
System. ACM Transactions on Database Systems, 22(3):419–469, September 1997.
[20] H. Bandermer and S. Gottwald. Fuzzy Sets, Fuzzy Logic, Fuzzy Methods with Applications. John Wiley
and Sons Ltd., England, 1995.
[21] U. Thole, H.-J. Zimmerman, and P. Zysno. On the Suitability of Minimum and Product operators for the
Intersection of Fuzzy Sets. Fuzzy Sets and systems, pages 167–180, 1979.
[22] J. Yen. Fuzzy logic–a modern perspective. IEEE Transactions on Knowledge and Data Engineering,
11(1):153–165, January 1999.
[23] D. Dubois and H. Prade. Criteria Aggregation and Ranking of Alternatives in the Framework of Fuzzy Set
Theory. Fuzzy Sets and Decision Analysis, TIMS Studies in Management Sciences, 20:209–240, 1984.
[24] R.R. Yager. Some Procedures for Selecting Fuzzy Set-Theoretic Operations. International Journal of General
Systems, pages 115–124, 1965.
[25] S. Adali, K.S. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query Caching and Optimization in
Distributed Mediator Systems. In Proceedings of the 1996 ACM SIGMOD Conference, pages 137–147,
Montreal, Canada, June 1996.
[26] Y. Alp Aslandogan, Chuck Thier, Clement Yu, Chengwen Liu, and Krishnakumar R. Nair. Design, Im-
plementation and Evaluation of SCORE. In Proceedings of the 11th International Conference on Data
Engineering, Taipei, Taiwan, March 1995. IEEE.
[27] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In
Database Theory – ICDT ’99, Berlin, Germany, 1999. Springer Verlag. to appear.
[28] C. Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Publishers, 1996.
[29] C.T. Yu and W. Meng. Principles of Database Query Processing for Advanced Applications. Morgan
Kaufmann Publishers, 1998.
[30] A. Yoshitaka and T. Ichikawa. A survey on content-based retrieval for multimedia databases. IEEE Trans-
actions on Knowledge and Data Engineering, 11(1):81–93, January 1999.
[31] F. Idris and S. Panchanathan. Review of image and video indexing techniques. Journal of Visual Commu-
nication and Image Representation-Special Issue on Indexing, Storage and Retrieval of Images and Video -
Part II, 8(2):146–166, June 1997.
[32] Y.A. Aslandogan and C.T. Yu. Techniques and systems for image and video retrieval. IEEE Transactions
on Knowledge and Data Engineering, 11(1):56–63, 1999.
[33] S. Adalı, K.S. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query Caching and Optimization in
Distributed Mediator System. In Proc. of the 1996 ACM SIGMOD Conference, Montreal, Canada, June
1996.
[34] O. Etzioni, K. Golden, and D. Weld. Sound and efficient closed-world reasoning for planning. Artificial
Intelligence, 89(1-2):113–148, 1997.
[35] W.-S. Li, K.S. Candan, K. Hirata, and Y. Hara. Facilitating multimedia database exploration through visual
interfaces and perpetual query reformulations. In Proceedings of the 23rd International Conference on
VLDB, pages 538–547, Athens, Greece, August 1997.
[36] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. In VLDB’96, pages 87–98,
1996.
[37] G. K. Zipf. Relative Frequency as a Determinant of Phonetic Change. Harvard Studies in Classical Philology,
1929.
[38] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. On the implications of Zipf's law for
web caching. In INFOCOM '99, New York, USA, March 1999.
Appendix
Figure 17: Number of tuples enumerated by the algorithm (exp=20%, win=1-5; exp=50%, win=1)
Figure 18: Number of tuples enumerated by the algorithm (exp=50%, win=2-5; exp=80%, win=1-2)
Figure 19: Number of tuples enumerated by the algorithm (exp=80%, win=3-5)
Figure 20: Percentage of real top-k results among the tuples returned by the algorithm (exp=20%, win=1-5; exp=50%, win=1)
Figure 21: Percentage of real top-k results among the tuples returned by the algorithm (exp=50%, win=2-5; exp=80%, win=1-2)
Figure 22: Percentage of real top-k results among the tuples returned by the algorithm (exp=80%, win=3-5)