ORIGINAL PAPER
An Effective and Fast Hybrid Framework for Color Image Retrieval
Ekta Walia • Sulaiman Vesal • Aman Pal
Received: 22 November 2013 / Revised: 15 April 2014
© Springer Science+Business Media New York 2014
Abstract This paper presents a novel, fast and effective hybrid framework for
color image retrieval through combination of all the low level features, which gives
higher retrieval accuracy than other such systems. The color moment (CMs),
angular radial transform descriptor and edge histogram descriptor (EHD) features
are exploited to capture color, shape and texture information respectively. A mul-
tistage framework is designed to imitate human perception so that in the first stage,
images are retrieved based on their CMs and then the shape and texture descriptors
are utilized to identify the closest matches in the second stage. The scheme employs
division of images into non-overlapping regions for effective computation of CMs
and EHD features. To demonstrate the efficacy of this framework, experiments are
conducted on Wang’s, VisTex and OT-Scene databases. In spite of its multistage
design, the system is observed to be faster than other hybrid approaches.
Keywords Color moments · Angular radial transform · Edge histogram
descriptor · Content based image retrieval · Similarity measures · Time complexity
1 Introduction
Nowadays, with developments in the field of digital image processing and the
increased use of digital images, it has become essential to find an effective
E. Walia (✉) · A. Pal
Department of Computer Science, South Asian University, New Delhi, India
e-mail: [email protected]
A. Pal
e-mail: [email protected]
S. Vesal
Information Extend Technology, Kabul, Afghanistan
e-mail: [email protected]
Sens Imaging (2014) 15:93
DOI 10.1007/s11220-014-0093-9
and efficient method for searching and indexing of images from large image
collections. Conventional annotation approaches rely on mapping images to some
text, keywords or descriptions. Such methods can hardly describe the content
diversity of images. Due to this lack of discrimination power in conventional
methods, content based image retrieval (CBIR) [1] has become an active research
topic over the last few decades. Currently, CBIR techniques work on a combination
of low level and high level features [1]. Color, texture and shape are the prominent
low level features used in CBIR.
Color is one of the most expressive and distinguishing visual features used in
CBIR and object recognition. Many color descriptors have been proposed for
image retrieval. These include the color histogram, which indicates the occurrence
of colors in the image [2]; the dominant color descriptor [3], which describes the
salient distribution of color in a region of interest; and the color correlogram [4],
which describes the probability of finding a pair of colors at a particular distance.
Lu and Chang [5] proposed a new technique for more effective image retrieval
which uses color distributions to represent the global
characteristics of image. Color moments (CMs) are known to yield better retrieval
accuracy as compared to the conventional color based features. In this paper, we use
CMs for extraction of color features. In order to increase the discrimination power
of CMs, we use non-overlapping image tiles and then extract first three moments of
every tile which yields a powerful descriptor.
Texture is another significant visual feature; it can distinguish different objects
without any other information and describes the structural arrangement of a region
of an object. Many statistical techniques have been proposed to describe the texture
features of an image, such as the gray-level co-occurrence matrix, Markov random
fields, Gabor wavelets and the edge histogram descriptor (EHD). These features
have shown better performance and effectiveness in comparison to some of the
traditional features [6, 7].
Amongst the various visual features, shape based features are the most relevant,
because human perception is based on the shape of an object: humans can recognize
objects solely from their shapes. Shape descriptors can be classified into
contour-based descriptors, which extract features from the outer boundary, and
region-based descriptors, which extract features from the entire region [8]. The
important region-based shape descriptors are Zernike moments (ZMs), the angular
radial transform (ART), geometric moments, moment invariants, etc. [9]. ZMs
possess certain desirable properties such as rotation invariance, robustness to noise,
minimum information redundancy and fast computation for each moment order [8],
but they still require complex computation when all moments up to a certain order
are needed. ART is a moment-based method adopted by MPEG-7, and its
robustness and efficiency have been demonstrated by many researchers. Therefore,
in our proposed framework, we use the ART descriptor, which provides almost the
same retrieval accuracy as ZMs with the extra advantage of being quite efficient in
its computation time [10].
Over the past years, most of the studies on CBIR have used only a single feature
amongst various visual features. However, it is hard to achieve satisfactory retrieval
results using a single feature because usually an image contains various
characteristics and diverse contents. Therefore, it is essential to combine visual
features in order to gain satisfactory retrieval results. Since shape, texture and color
features are complementary to each other, their combination is expected to yield
improved retrieval performance. In order to improve retrieval efficiency, the authors
of [16, 19] proposed color CBIR systems that utilize a combination of color and
texture features only. Wang et al. [6] proposed a
color image retrieval framework by using all the low level features. They used color
quantization algorithm with clusters merging to obtain the small number of
dominant colors and their percentages in the image. The texture features are
extracted using steerable filter decomposition and the pseudo-ZMs are used as shape
descriptor. ElAlami [16] proposed a model for image retrieval which depends on
most relevant features (color and texture) according to a Genetic Algorithm based
feature selection technique. Kang and Zhang [17] presented a color image retrieval
scheme combining all three features, i.e. color, texture and shape, which achieved
higher retrieval efficiency; they integrated the color histogram, the gray-level
co-occurrence matrix and ZMs. Hiremath and
Pujari [18] also presented a framework by combining all low level features to
achieve higher retrieval efficiency. In [19], Huang et al. proposed an image retrieval
approach based on color and texture features. They used CMs as color features and
Gabor filter to capture texture features whose similarity results are finally combined
for color image retrieval. Yue et al. [20] have used the fusion of color histogram and
texture features based on a co-occurrence matrix for image retrieval. Banerjee et al.
[21] have presented a content based retrieval system based on significant point
features extracted using a fuzzy set theoretic approach. Jalab [22] has used a
combination of color layout and Gabor texture descriptors for the implementation of
image retrieval system. Liu and Yang [23] presented color difference histogram, a
new image feature representation approach for image retrieval. This descriptor
encodes color, edge orientation and perceptually uniform color difference. Wang
et al. [29] proposed a new CBIR scheme using color and texture information; they
use Zernike chromaticity distribution moments to capture color content, while
texture features are obtained in the contourlet domain.
In this paper, we propose a hybrid framework for color image retrieval which
gives better accuracy than the conventional methods. It is different from other
hybrid approaches because it is based on the human perception and recognition
process. As humans perceive objects initially by their colors and later make sense of
them using shape and texture, our framework imitates this recognition process by
retrieving relevant images in two stages. The work in [30] presents data on different
visual attributes arranged in decreasing order of their perceptivity, showing that
color attributes are more perceivable than texture and shape.
Therefore, we use color features in the first stage and shape as well as texture
features in the second stage. Thus, it is a unique approach as the retrieval is
performed in a multistage fashion. To extract the color features in the first stage, we
use CMs which are more discriminative and give detailed color information of
image. In the second stage, texture and shape features of the images returned by the
first stage are extracted. EHD is used to capture texture features of color images; it
creates an 80-bin histogram representing the distribution of edges in an image.
ART is used for shape feature extraction. It is chosen over other moment based
shape descriptors because of small size of its descriptor. Further, the distances of the
ART and EHD features of the query image and the images belonging to the dataset
obtained in the first stage are combined using certain weights and finally 20 top
ranked images are retrieved for Wang’s database.
To study the retrieval performance of the proposed framework, various
experiments are performed on different databases such as Wang’s database, OT-
Scene and VisTex databases using Euclidean distance as a similarity measure. The
image retrieval performance of the proposed framework is also compared with other
hybrid approaches proposed in this domain. From the experiments, it is observed
that the performance of the proposed technique is better than others. A thorough
analysis of time complexity of the proposed technique is also given.
The rest of the paper is organized as follows. Section 2
describes the various descriptors used i.e. CMs, EHD and ART. Section 3 explains
the methodology employed in the proposed approach. Section 4 presents the
experimental results and analysis of retrieval performance on various databases. It
also gives comparison of the retrieval performance of the proposed scheme with
other such systems. Further, an analysis of time complexity is also presented here.
Section 5 presents conclusions with future directions.
2 Descriptors Used
2.1 Color Moments
The objective of color features is to retrieve all the images whose color composition
is similar to that of the query image. Experiments have shown that the color
histogram does not capture the spatial relationship of color regions, so many
researchers have focused on color indexing approaches based on the global
distribution of color in an image. The most widely used such techniques are the
color correlogram, the color coherence vector and CMs, which perform better than
the traditional techniques. In this paper, we use CMs because their primitives are
more robust for describing color images and they lead to a faster implementation in
comparison to other methods that need expensive computation.
For color images, the RGB model is well known. However, RGB is not a uniform
color model and is less suited to describing colors according to human
interpretation. The HSV (hue, saturation and value) color space is more closely
related to human perception. Therefore, we convert images from RGB to the HSV
model and extract the primitive moments of each component.
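As a concrete illustration, this RGB to HSV conversion can be done with Python's standard `colorsys` module; in this sketch each output component lies in [0, 1], which is the normalization the distance computation in Sect. 3.2 relies on. The function name is ours, not from the paper:

```python
import colorsys

# Convert one RGB pixel (8-bit integer channels) to HSV, with each of the
# hue, saturation and value components normalized to [0, 1].
def rgb_pixel_to_hsv(r, g, b):
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

# Pure red maps to hue 0, full saturation, full value.
h, s, v = rgb_pixel_to_hsv(255, 0, 0)
```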
In [11], to improve the discriminative power of CM technique, the authors
divided the image horizontally into three non-overlapping regions and from each of
the regions, they extracted the first three moments for each color channel and made
a descriptor of 27 floating points. Along similar lines, we divide an image into
sub-images for the computation of CMs; this is described in detail in Sect. 2.2 for EHD.
The first three CMs of an image are given by mean, standard deviation and
skewness of colors. These are depicted in Eq. (1) through Eq. (3).
$$E_{r,j} = \frac{1}{N}\sum_{i=1}^{N} I_{i,j}, \quad j = H, S, V \quad (1)$$

$$\sigma_{r,j} = \left(\frac{1}{N}\sum_{i=1}^{N} \left(I_{i,j} - E_{r,j}\right)^2\right)^{1/2}, \quad j = H, S, V \quad (2)$$

$$S_{r,j} = \left(\frac{1}{N}\sum_{i=1}^{N} \left(I_{i,j} - E_{r,j}\right)^3\right)^{1/3}, \quad j = H, S, V \quad (3)$$
where $I_{i,j}$ represents the intensity of the jth color channel at the ith location in the
image tile, and N represents the number of pixels in each region/tile.
The descriptor size for CMs is thus stated as:

$$\text{size}(F_{CM}) = r \times n_c \times n_{CM} \quad (4)$$
where the feature vector $F_{CM}$ is given as:

$$F_{CM} = \begin{bmatrix} E_{1,H} & E_{1,S} & E_{1,V} \\ \sigma_{1,H} & \sigma_{1,S} & \sigma_{1,V} \\ S_{1,H} & S_{1,S} & S_{1,V} \\ \vdots & \vdots & \vdots \\ E_{r,H} & E_{r,S} & E_{r,V} \\ \sigma_{r,H} & \sigma_{r,S} & \sigma_{r,V} \\ S_{r,H} & S_{r,S} & S_{r,V} \end{bmatrix}, \quad \text{where } r = 16 \quad (5)$$
and $n_c$ is the number of color channels and $n_{CM}$ is the number of CMs. There are
works that describe the direct computation of moment invariants for a color model
like RGB [24, 25]. However, in order to increase retrieval accuracy, we prefer to
compute the moments of each color channel separately. Thus, we divide the image
into r = 16 non-overlapping sub-images of 25 × 25 pixels and extract the primitive
moments of each color channel from every sub-image, obtaining a descriptor of
48 × 3 floating points. Refer to Sect. 4.1 for a detailed analysis.
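A minimal sketch of Eqs. (1)–(3) for a single tile and a single channel follows; the tile is given as a flat list of intensities in [0, 1], and the signed cube root is our implementation choice for handling a negative third moment, a detail the equations leave implicit:

```python
import math

# First three color moments of one tile for one channel, per Eqs. (1)-(3):
# mean, standard deviation, and (signed) skewness of the intensities.
def color_moments(tile):
    n = len(tile)
    mean = sum(tile) / n                                 # Eq. (1)
    var = sum((p - mean) ** 2 for p in tile) / n
    std = math.sqrt(var)                                 # Eq. (2)
    m3 = sum((p - mean) ** 3 for p in tile) / n
    skew = math.copysign(abs(m3) ** (1.0 / 3.0), m3)     # Eq. (3), signed cube root
    return mean, std, skew
```

Repeating this for each of the 16 tiles and 3 HSV channels yields the 48 × 3 descriptor of Eq. (5).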
2.2 Edge Histogram Descriptor (EHD)
Texture is an important visual feature of an image. The texture descriptors provide
measures of properties such as smoothness, coarseness, and regularity [7]. The
distribution of edges is a good texture signature that is useful for image to image
matching even when the underlying texture is not homogeneous [26].
EHD is one of the widely used statistical techniques to capture texture features of
an image. This descriptor proposed for MPEG-7 [7, 26], determines the local edge
distribution in an image.
EHD is obtained by partitioning the entire image (scaled down to 16 × 16 pixels)
into 16 non-overlapping sub-images of 4 × 4 pixels, as shown in Fig. 1.
The edges in EHD are classified into five types: four directional edges (vertical,
horizontal, and diagonal edges at 45° and 135°) and one non-directional edge. If an
image block does not have any directionality, it is counted as a non-directional
edge. After the edges have been extracted from the image blocks, we count the
total number of edges in every sub-image. As there are five different types of
edges, each sub-image has a histogram of five bins; since there are 16 sub-images,
we have a total of 16 × 5 = 80 histogram bins for each image. Every sub-image is
further divided into 4 image blocks (of 2 × 2 pixels) in order to obtain the above
histogram. The semantics of the bins of the resulting histogram are defined in
Table 1. We employ the 2 × 2 filters shown in Fig. 2 to compute the corresponding
edge intensity values of each sub-image. If the intensity value of an edge exceeds a
given threshold, the corresponding image block is considered to be an edge block
[7]. In this paper, we have taken the threshold value as 11, as per [7].
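The edge classification for a single 2 × 2 block can be sketched as follows; the filter coefficients are the MPEG-7 style ones commonly quoted for EHD and are an assumption here, since Fig. 2 is not reproduced:

```python
import math

# 2x2 edge filters (assumed MPEG-7 EHD coefficients), applied to the four
# block sub-averages [top-left, top-right, bottom-left, bottom-right].
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def classify_block(a, threshold=11):
    # Edge strength = absolute filter response; the strongest filter wins.
    strengths = {name: abs(sum(c * v for c, v in zip(f, a)))
                 for name, f in FILTERS.items()}
    best = max(strengths, key=strengths.get)
    # Blocks whose strongest response stays below the threshold (11, per [7])
    # carry no edge and do not contribute to the histogram.
    return best if strengths[best] >= threshold else None

# A bright-left / dark-right block is classified as a vertical edge.
edge = classify_block([200, 10, 200, 10])
```

Counting the winning edge type over the 4 blocks of each of the 16 sub-images fills the 80-bin histogram.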
2.3 Angular Radial Transform (ART)
ART is a moment-based image description method adopted in MPEG-7 as a 2D
region-based shape descriptor. This descriptor has many desirable characteristics
such as compact size, robustness to noise and scaling, invariance to rotation and
ability to describe complex objects. Significant characteristics are its small size and
speed, which ensure a fast image retrieval process. ART is a complex orthogonal
unitary transform defined on a unit disk, based on complex orthogonal sinusoidal
basis functions in polar co-ordinates [9, 10, 12]. The ART coefficients,
$F_{nm}$, of order n and repetition m, are defined by:

$$F_{nm} = \int_0^{2\pi} \int_0^1 V_{n,m}^*(r, \theta)\, f(r, \theta)\, r\, dr\, d\theta \quad (6)$$

where $f(r, \theta)$ is the image intensity function in polar co-ordinates and $V_{n,m}^*(r, \theta)$
is the ART basis function, the complex conjugate of $V_{n,m}(r, \theta)$, which is separable
along the angular and radial directions:

$$V_{n,m}(r, \theta) = R_n(r)\, A_m(\theta) \quad (7)$$
with

$$A_m(\theta) = \frac{1}{2\pi} e^{jm\theta} \quad (8)$$

and

$$R_n(r) = \begin{cases} 1 & (n = 0) \\ 2\cos(\pi n r) & (n > 0) \end{cases} \quad (9)$$
where n and m represent the order and repetition of ART, respectively. For a
discrete image $f(x, y)$ of size $N \times N$ pixels, ART is approximately computed
using Eq. (10), where the integrals of Eq. (6) are replaced by summations:

$$F_{nm} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(x_i, y_j)\, V_{nm}^*(x_i, y_j)\, \Delta x_i\, \Delta y_j, \quad \text{where } x_i^2 + y_j^2 \le 1 \quad (10)$$
Fig. 1 The illustration of sub-images and image blocks
Table 1 The semantics of local edge bins

Histogram bins    Semantics
FEHD[1]           Vertical edge of sub-image at (0,0)
FEHD[2]           Horizontal edge of sub-image at (0,0)
FEHD[3]           45° edge of sub-image at (0,0)
FEHD[4]           135° edge of sub-image at (0,0)
FEHD[5]           Non-directional edge of sub-image at (0,0)
FEHD[6]           Vertical edge of sub-image at (0,1)
...               ...
FEHD[75]          Non-directional edge of sub-image at (3,2)
FEHD[76]          Vertical edge of sub-image at (3,3)
FEHD[77]          Horizontal edge of sub-image at (3,3)
FEHD[78]          45° edge of sub-image at (3,3)
FEHD[79]          135° edge of sub-image at (3,3)
FEHD[80]          Non-directional edge of sub-image at (3,3)
The coordinates $(x_i, y_j)$ in the unit disk are given by

$$x_i = \frac{2i + 1 - N}{D}, \quad y_j = \frac{2j + 1 - N}{D} \quad (11)$$

where i, j = 0, 1, 2, …, N − 1, and

$$D = \begin{cases} N & \text{for the inner circular disk contained in the square image} \\ N\sqrt{2} & \text{for the outer circular disk containing the whole square image} \end{cases} \quad (12)$$

and

$$\Delta x_i = \Delta y_j = \frac{2}{D} \quad (13)$$
It has been observed in [12] that ART with the outer circular disk gives better
results than with the inner circular disk. Therefore, in this paper, we use the outer
circular disk; we also compare the experimental results of both approaches in
Sect. 4.1.1. We use ART coefficients of order (n < 3, m < 12), which gives a feature
vector containing 36 moments. Rotation invariance is achieved by using the
magnitudes of the coefficients. The ART shape descriptor is given by:

$$F_{ART} = (F_{0,0}, F_{0,1}, \ldots, F_{0,11}, F_{1,0}, F_{1,1}, \ldots, F_{1,11}, F_{2,0}, F_{2,1}, \ldots, F_{2,11}) \quad (14)$$
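A direct (unoptimized) sketch of Eq. (10) with the outer circular disk follows; with D = N√2 every pixel center already satisfies x² + y² ≤ 1, so no masking is needed. This is our reading of Eqs. (8)–(13), not the authors' code:

```python
import numpy as np

# ART coefficients of an N x N grayscale image, per Eq. (10), using the
# outer circular disk D = N*sqrt(2) of Eq. (12).
def art_coefficients(img, n_max=3, m_max=12):
    N = img.shape[0]
    D = N * np.sqrt(2)
    coords = (2 * np.arange(N) + 1 - N) / D        # Eq. (11)
    X, Y = np.meshgrid(coords, coords, indexing="ij")
    r = np.sqrt(X ** 2 + Y ** 2)
    theta = np.arctan2(Y, X)
    dxdy = (2 / D) ** 2                            # Eq. (13)
    F = np.zeros((n_max, m_max), dtype=complex)
    for n in range(n_max):
        R = np.ones_like(r) if n == 0 else 2 * np.cos(np.pi * n * r)   # Eq. (9)
        for m in range(m_max):
            A = np.exp(1j * m * theta) / (2 * np.pi)                   # Eq. (8)
            F[n, m] = np.sum(img * np.conj(R * A)) * dxdy              # Eq. (10)
    return F
```

The 36 magnitudes |F_nm| then form the rotation-invariant descriptor of Eq. (14).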
3 Proposed Algorithm
3.1 Framework of Proposed System
In this paper, we propose a novel multistage image retrieval approach which first
classifies the images on the basis of the CMs feature vector, which encompasses a
number of useful properties as described in Sect. 2.1. The second-stage features
are obtained using the ART and EHD descriptors. The performance of the
proposed framework is analyzed against some major available image retrieval
systems. The steps involved in the proposed framework are described
diagrammatically in Fig. 3. It can be observed from the diagram that various tasks
are performed in the two stages of the proposed framework. The following steps
describe the entire approach:
Fig. 2 Filters for edge detection
First stage: As a first step, the color feature vectors are extracted by using CMs.
The procedure is as follows:
1) The query and database images are resized to 100 × 100 pixels from their
original size.
2) We convert the images to the HSV (hue, saturation, value) color space because
RGB is not well suited to describing colors in terms that are practical for
human interpretation [17].
3) Each image (query as well as database), expressed in three HSV color channels,
is then divided into a number of tiles (sub-images). Here, we divide every HSV
component into 16 non-overlapping tiles of 25 × 25 pixels.
4) The CMs for each tile (r) are calculated using Eq. (1) through Eq. (3), and a
descriptor FCM with 48 × 3 features is generated as per Eq. (5).
Fig. 3 Multistage color image retrieval using CMs, ART and EHD
5) The color feature vector of the query image is compared to those of the database
images using the Euclidean distance, as described in Eq. (15) and Eq. (16).
6) At the end of this stage, the top k (e.g. k = 30 for Wang’s database) most
relevant database images are retrieved, and the output of this stage serves as the
set of database images for the next stage.
Second Stage: In this stage, shape and texture features of the query image and the
images in the database generated from previous stage are extracted by using ART
and EHD descriptors. Since ART extracts shape information globally and EHD
extracts texture information locally on the non-overlapping sub-images, the two
types of features are complementary to each other and are used together in this
stage. The steps are as follows:
1) The ART feature vectors are computed for the query image and the subset of
database images, by resizing them to 100 × 100 pixels and converting them to
grayscale form.
2) Further, the coefficients of ART are computed for all orders such that n < 3,
m < 12.
3) The Euclidean distance is then computed to evaluate the similarity between the
feature vectors FART of the query image and the subset of database images, as
per Eq. (17). Each of these distances is further normalized using Eq. (19).
4) As described in Sect. 2.2, EHD feature vectors are also extracted for the query
image and for the subset of database images generated from the previous stage.
Every image is therefore represented by an 80-bin descriptor which shows the
distribution of its edges.
5) The Euclidean distance is again computed to evaluate the similarity between
the feature vectors FEHD of the query image and the subset of database images,
as per Eq. (18). Each of these distances is further normalized using Eq. (20).
6) The distances obtained in steps 3 and 5 are then combined using certain
weights. In the proposed framework, the weights given to the ART and EHD
feature distances are computed adaptively. This combination is based on our
observation that the ART descriptor is more effective in capturing global
details than the EHD descriptor, which captures local edge details. Thus, the
distance between the query image and the images belonging to the subset
retrieved from the first stage is a weighted combination of the distances
between their ART and EHD features. More details on how these distances are
combined are given in Eq. (21) of Sect. 3.2.
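The two-stage flow above can be sketched as follows; `d_color` and `d_art_ehd` are hypothetical callables standing in for the distances of Eq. (15) and Eq. (21), and the images are opaque identifiers:

```python
# Two-stage retrieval skeleton: rank by color-moment distance, keep the k
# closest candidates, then re-rank those by the weighted ART + EHD distance.
def retrieve(query, database, d_color, d_art_ehd, k=30, top=20):
    stage1 = sorted(database, key=lambda img: d_color(query, img))[:k]
    stage2 = sorted(stage1, key=lambda img: d_art_ehd(query, img))
    return stage2[:top]
```

Because the second stage only touches the k first-stage survivors, the expensive ART and EHD computations run on a small subset of the database, which is where the speed advantage of the multistage design comes from.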
3.2 Similarity Matching
Image similarity measures typically assess a distance between sets of image
features; a shorter distance corresponds to higher similarity, and the choice of
metric depends on the type of feature vectors. In order to compute the distance between query
image Q and training images T, we use the Euclidean distance (L2-norm) for CMs.
We use a weighted color channel scheme, wherein we assign different weights to
the CMs distance of each color channel. After many experiments, we established
that saturation has more impact in the HSV color space; therefore we assign more
weight to the S component, i.e. w2 is chosen to be higher than w1 and w3, which
are both equal. The CM similarity measure is therefore defined as follows:
$$D_{COL}(F_{COL}^Q, F_{COL}^T) = w_1 \cdot d_H + w_2 \cdot d_S + w_3 \cdot d_V \quad (15)$$

where $w_1 = 0.25$, $w_2 = 0.50$, $w_3 = 0.25$, and

$$d_j = \sum_{i=1}^{r} \left( \sqrt{\left(E_{i,j}^Q - E_{i,j}^T\right)^2} + \sqrt{\left(\sigma_{i,j}^Q - \sigma_{i,j}^T\right)^2} + \sqrt{\left(S_{i,j}^Q - S_{i,j}^T\right)^2} \right), \quad j = H, S, V \quad (16)$$
where $F_{COL}^Q$ is the color feature vector of the query image and $F_{COL}^T$ is the color
feature vector of a database image. Also, r represents a region/tile of the image.
$E_{i,j}^Q, E_{i,j}^T$ are the region-wise means of intensities of the query and database
images, computed for each color channel j. Similarly, $\sigma_{i,j}^Q, \sigma_{i,j}^T, S_{i,j}^Q, S_{i,j}^T$ are the
region-wise standard deviations and skewness of intensities (of the query and
database images), belonging to each color channel. All three individual distances
$d_H$, $d_S$ and $d_V$ are already normalized because the RGB to HSV conversion
causes each constituent component’s value (i.e. hue, saturation and value) to lie
between 0 and 1.
It is worth mentioning that we have also tested the retrieval performance of CMs
obtained through the technique described in [24] for the RGB and HSV models
directly, and found that their retrieval results are inferior to those of the weighted
color channel scheme proposed here.
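Note that each √((·)²) term in Eq. (16) reduces to an absolute difference, so the per-channel distance is a sum of absolute moment differences over the tiles. A hypothetical sketch (the feature layout is ours):

```python
# Weighted color-moment distance of Eqs. (15)-(16). `q` and `t` map each
# channel ('H', 'S', 'V') to a list of (mean, std, skew) triples, one per tile.
def cm_distance(q, t, weights=None):
    weights = weights or {"H": 0.25, "S": 0.50, "V": 0.25}   # w1, w2, w3
    total = 0.0
    for ch, w in weights.items():
        # Eq. (16): sqrt((x)^2) is just |x|, summed over the r tiles.
        d = sum(abs(eq - et) + abs(sq - st) + abs(kq - kt)
                for (eq, sq, kq), (et, st, kt) in zip(q[ch], t[ch]))
        total += w * d                                       # Eq. (15)
    return total
```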
The similarity distances between the shape features described by ART and the
texture features described by EHD are calculated using the L2-norm, defined as
follows:
$$D_{ART}(F_{ART}^Q, F_{ART}^T) = \sqrt{\sum_{i=1}^{36} \left(F_{ART,i}^Q - F_{ART,i}^T\right)^2} \quad (17)$$

$$D_{EHD}(F_{EHD}^Q, F_{EHD}^T) = \sqrt{\sum_{i=1}^{80} \left(F_{EHD,i}^Q - F_{EHD,i}^T\right)^2} \quad (18)$$
After computing the similarity distances for the ART and EHD features, the two
distances may have very different ranges. Therefore, we use a normalization
method in order to make all the texture and shape feature distances fall in the same
range. The min–max normalization method [27] is
employed to achieve this. It performs a linear transformation on the original data.
Suppose that min_A and max_A are the minimum and maximum values of the
feature vector A; the min–max normalization maps a value v of A to v′ in the range [0, 1].
Thus, the individual distances of the ART and EHD feature vectors i.e. DART and
DEHD are normalized using Eq. (19) and Eq. (20) given below:
$$D_{ART}^n = \frac{D_{ART} - \min\{D_{ART}\}}{\max\{D_{ART}\} - \min\{D_{ART}\}}, \quad \forall\, T \text{ and } Q \quad (19)$$

$$D_{EHD}^n = \frac{D_{EHD} - \min\{D_{EHD}\}}{\max\{D_{EHD}\} - \min\{D_{EHD}\}}, \quad \forall\, T \text{ and } Q \quad (20)$$
Finally, the combination takes place as follows:
$$D_{ART+EHD} = w_4 D_{ART}^n + w_5 D_{EHD}^n \quad (21)$$

where $D_{ART}^n$ is the normalized distance between ART features and $D_{EHD}^n$ is the
normalized distance between EHD features, computed through Eq. (19) and Eq. (20).
The weights $w_4$ and $w_5$ are computed as:

$$w_4 = \frac{P_{ART}}{P_{ART} + P_{EHD}}, \quad w_5 = \frac{P_{EHD}}{P_{ART} + P_{EHD}} \quad (22)$$

where $P_{ART}$ and $P_{EHD}$ are the average precisions (defined in Sect. 3.3) of the ART
and EHD descriptors, respectively. In our experiments, we obtain $P_{ART}$ as 40.70 and
$P_{EHD}$ as 37.0 (for Wang’s database); therefore, $w_4$ is 0.52 and $w_5$ is 0.48.
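A sketch of Eqs. (19)–(22), using the average precisions the paper reports for Wang's database as default weights:

```python
# Min-max normalization of a list of distances, per Eqs. (19)-(20).
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Precision-weighted combination of the normalized ART and EHD distances,
# per Eqs. (21)-(22). Defaults are the reported P_ART and P_EHD values.
def combine(d_art, d_ehd, p_art=40.70, p_ehd=37.0):
    w4 = p_art / (p_art + p_ehd)   # about 0.52
    w5 = p_ehd / (p_art + p_ehd)   # about 0.48
    na, ne = min_max(d_art), min_max(d_ehd)
    return [w4 * a + w5 * e for a, e in zip(na, ne)]
```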
3.3 Parameters to Measure the Retrieval Performance
The performance of a retrieval system can be measured in terms of its precision–
recall (P–R) ratio. Precision measures the ability of the system to retrieve only
images that are relevant, while recall measures the ability of the system to retrieve
all images that are relevant. The P–R ratio is measured using Eq. (23):

$$\text{Precision} = \frac{\text{No. of relevant images retrieved}}{\text{No. of retrieved images}}, \quad \text{Recall} = \frac{\text{No. of relevant images retrieved}}{\text{Total no. of relevant images in the database}} \quad (23)$$
This ratio is used when there is an equal number of images in each class.
Bull’s eye performance (BEP) is used to measure retrieval performance when the
number of images is not equal in each class; it has been applied in various studies
[13, 14]. If the number of images in the database relevant to a query image Q is N,
and out of the 2N retrieved images only X are correct, then BEP is computed as per
Eq. (24):

$$\text{BEP} = \frac{X}{N} \quad (24)$$
We also use the average retrieval rate (ARR) [28], a robust metric for comparing
image retrieval methods. It is computed using Eq. (25):

$$\text{ARR} = \frac{1}{N_Q} \sum_{q=1}^{N_Q} RR(q) \quad (25)$$

where $N_Q$ is the number of queries used to evaluate the descriptor on a dataset, and
$RR(q)$ is the retrieval rate of a single query, computed as per Eq. (26):

$$RR(q) = \frac{n_k}{n_q} \le 1 \quad (26)$$

where $n_k$ is the number of correct retrievals and $n_q$ is the total number of images (in
the database) relevant to the query.
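The evaluation measures of Eqs. (23)–(26) can be sketched as set operations over image identifiers; the function and argument names are ours:

```python
# Precision and recall, per Eq. (23).
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

# Bull's eye performance, per Eq. (24): correct hits X among the top 2N
# retrievals, divided by the class size N.
def bep(retrieved_2n, relevant):
    return len(retrieved_2n & relevant) / len(relevant)

# Average retrieval rate, per Eq. (25): mean of per-query rates RR(q) <= 1.
def arr(retrieval_rates):
    return sum(retrieval_rates) / len(retrieval_rates)
```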
4 Experimental Results and Analysis
To evaluate the retrieval performance of the proposed system on different
databases, experiments are performed in MATLAB version 8.1 on a machine with a
3.40 GHz CPU and 16 GB RAM under a 64-bit Microsoft Windows operating system. The
proposed hybrid framework is evaluated through three sets of experiments. The first
set of experiments is performed on Wang’s database and the second set of
experiments is performed on OT-Scene dataset. The third is conducted on VisTex
database. The P–R ratio is considered for measuring the retrieval performance of
proposed framework on Wang’s database, whereas ARR is used for VisTex
database, and BEP measure is used for OT-Scene database.
Wang’s color image database This database, provided by Wang et al. [15], is a
subset of the COREL image database. It contains 1,000 images, equally divided
into ten categories: African people, beach, building, bus, dinosaur, elephant,
flower, horse, mountain and food. Every database image is of size 256 × 384 or
384 × 256 pixels. Figure 4a shows sample images from each category of this database.
Oliva and Torralba database We also evaluate the proposed framework on scene
classification, for which the database was downloaded from
http://cvcl.mit.edu/database.htm. The dataset from Oliva and Torralba, denoted the
OT-Scene dataset, consists of 2,688 color images from eight scene categories: coast
(360 samples), forest (328 samples), mountain (374 samples), open country (410
samples), highway (260 samples), inside city (308 samples), tall building (356
samples) and street (292 samples). Figure 4b shows sample images from each
category of this database.
VisTex database The VisTex database is a collection of color texture images
created by the MIT Media Lab, with the intention of providing a large set of
high-quality textures for computer vision applications. Each texture image is a
square image of 512 × 512 pixels. The dataset has two main components:
Reference Textures (homogeneous textures in frontal and oblique perspectives)
and Texture Scenes (images containing multiple textures, i.e. “real-world”
scenes). Figure 4c shows sample images from each class of this database.
4.1 Retrieval Performance
4.1.1 Retrieval Performance on Wang’s Database
In the first set of experiments, we randomly select 50 images (five per class) from
Wang’s database as query images, and each time retrieve the top 20 images as the
retrieval results. Further, we calculate the average precision and average recall for
each class.
To validate our approach for computing CMs, we conduct experiments with direct
evaluation of moment invariants for the RGB and HSV color models. With these,
we obtain only 39.40 and 40.53 % precision respectively on Wang’s database.
However, with our method of computing CMs separately for each color channel,
we achieve 62.53 % precision on Wang’s database.
Fig. 4 a Sample images from Wang’s database. b Sample images from OT-Scene database. c Sample images from VisTex database
To determine the optimal size of the non-overlapping sub-images for the
computation of CMs and EHD, we conduct experiments by varying the sub-image
size. Figure 5 shows graphs indicating that 4 × 4 pixels and 25 × 25 pixels are the
optimal sub-image sizes for the EHD and CM descriptors respectively; with these
sizes, we obtain the maximum average precision on Wang’s database. Therefore,
in all the experiments, we use sub-images of the optimal size only.
It is observed from the results that our proposed framework performs better than
the individual methods under consideration. Table 2 indicates the performance of
individual techniques on each class of Wang’s database. It is important to note that
the proposed framework enhances the average retrieval performance of CMs by
13.27 %, as the average precision (computed over ten classes) obtained through
CMs alone is only 62.53 %.
Further, the experimental results with average precision and average recall are
presented in Tables 3 and 4. The proposed results are compared with other image
retrieval approaches reported in the literature [17, 19–23]. It is inferred from the
comparison shown in Table 3 that the proposed framework achieves good results on
Wang’s database with average precision of 75.80 %. In all these works compared here,
50 or 80 images are selected randomly as query and 20 images are retrieved. We have
implemented the Color Difference Histogram technique of Liu and Yang [23] for
comparison purposes. For comparing our retrieval results with that of Hiremath and
Pujari [18], we also make every database image as query and obtain average precision
of 63.22 % for Wang’s database. This is higher than the 54.90 % achieved in [18].
In addition to the comparison given in Table 3, we observed that the retrieval
performance (in terms of average precision) of our proposed method is better than
the performance of hybrid scheme proposed in [6] by Wang et al. It is pertinent to
mention here that we averaged the precision given by them (when 50 random
images are used as query) for different classes to observe that they achieve
approximately 59 % precision (on average) for Wang’s database.
Table 4 shows the average recall of proposed method in comparison with the
method of Kang and Zhang [17]. Figure 6 pictorially depicts the average precision
computed on Wang’s database through the proposed framework and other hybrid
methods. Figure 7 depicts the result of image retrieval for a randomly selected
image (of a horse) as the query. The top 20 images retrieved by the proposed framework
are shown in order of their rank, from left to right, following the query image,
which appears in the top left corner.
To establish the efficacy of our proposed framework, we also conduct
experiments by swapping the descriptors used in the two stages, i.e. we use
ART and EHD in the first stage and CMs in the second. With this, we notice a
significant drop in precision, indicating that our choice of using CMs in the first
stage is correct.
4.1.2 Retrieval Performance on OT-Scene Database
In the second set of experiments performed on OT-Scene database, we randomly
select 40 images from this database, five per class as queries. We calculate the BEP
parameter as follows:
Fig. 5 Average precision of a EHD (region sizes 4 × 4, 8 × 8 and no region) and b CMs (region sizes 12.5 × 12.5, 25 × 25, 50 × 50 and no region) on Wang’s database
Table 2 Average precision of the individual descriptors and the proposed framework on Wang’s database

Class name         Color moments   EHD     ART (in outer disk framework)   Proposed framework
African            56.00           16.00   22.00                           68.00
Sea                64.00           42.00   31.00                           60.00
Building           38.00           18.00   17.00                           63.00
Bus                49.33           23.00   29.00                           88.00
Dinosaur           100.00          97.00   98.00                           100.00
Elephant           62.00           58.00   72.00                           74.00
Flower             70.67           33.00   37.00                           82.00
Horse              93.33           42.00   57.00                           99.00
Mountain           43.33           26.00   25.00                           51.00
Food               48.67           15.00   19.00                           73.00
Average precision  62.53           37.00   40.70                           75.80
Table 3 Average precision comparison of the proposed framework with other methods on Wang’s database

Semantic name      Method [17] (%)  Method [19] (%)  Method [20] (%)  Method [21] (%)  Method [22] (%)  Method [23] (%)  Proposed, ART in outer disk framework (%)  Proposed, ART in inner disk framework (%)
Africa people      69.07            54.00            58.75            61.00            32.00            54.00            68.00                                      69.00
Beach              55.32            42.00            41.19            56.00            61.00            51.00            60.00                                      60.00
Building           56.45            16.00            42.35            63.00            39.00            38.00            63.00                                      59.00
Bus                89.36            67.00            71.69            72.00            40.00            46.00            88.00                                      87.00
Dinosaurs          93.27            99.00            74.53            95.00            100.00           100.00           100.00                                     100.00
Elephants          70.84            40.00            65.08            77.00            56.00            63.00            74.00                                      74.00
Flowers            88.47            97.00            83.24            83.00            89.00            90.00            82.00                                      78.00
Horses             81.37            96.00            69.30            95.00            65.00            93.00            99.00                                      99.00
Mountains          64.58            46.00            44.86            68.00            56.00            48.00            51.00                                      50.00
Food               69.83            79.00            44.54            57.00            44.00            39.00            73.00                                      72.00
Average precision  73.86            63.60            59.55            72.70            58.20            62.20            75.80                                      74.80
1) In the first stage, for each query image, we retrieve 900 images (double the
maximum number of images in any class of this database) using CMs and
applying similarity measures between query image and database images.
2) The ART and EHD feature vectors are computed for the query image and
subset of filtered images.
3) The L2-norm similarity measure is again used to compute the distance between
the respective feature vectors of ART and EHD for the query and the images
filtered in step 1.
4) We then combine these distances (of the ART and EHD feature vectors) using
weights as described in Sect. 3.2.
5) According to Eq. (24) for computing BEP, we retrieve 2N images in this step,
where N is the number of images in a particular class to which the query image
belongs.
6) Further, we count the number of relevant retrieved images and compute BEP
according to Eq. (24).
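Steps 1–4 above can be sketched as follows; the plain L2 distances, the equal fusion weights, and the min-max normalization are assumptions standing in for Eqs. (15)–(16) and the weighting of Sect. 3.2.

```python
import numpy as np

def two_stage_retrieve(q_cm, db_cm, q_art, db_art, q_ehd, db_ehd,
                       n_filter=900, n_final=20, w_art=0.5, w_ehd=0.5):
    """Sketch of the multistage retrieval: filter the database by
    color-moment distance, then re-rank the survivors with a weighted
    fusion of min-max normalized ART and EHD distances."""
    d_cm = np.linalg.norm(db_cm - q_cm, axis=1)            # stage 1 distances
    keep = np.argsort(d_cm)[:n_filter]                     # filtered subset
    d_art = np.linalg.norm(db_art[keep] - q_art, axis=1)   # stage 2 distances
    d_ehd = np.linalg.norm(db_ehd[keep] - q_ehd, axis=1)
    norm = lambda d: (d - d.min()) / (d.max() - d.min() + 1e-12)
    fused = w_art * norm(d_art) + w_ehd * norm(d_ehd)      # score-level fusion
    return keep[np.argsort(fused)[:n_final]]               # final ranked indices
```

For the BEP experiments, `n_final` is set to 2N (N being the class size) before counting the relevant images among the results.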
The results of average BEP are shown in Table 5. It is observed that the retrieval
accuracy for all the eight classes is quite satisfactory.
4.1.3 Retrieval Performance on VisTex Database
We further establish the effectiveness of the proposed framework by conducting
experiments on VisTex database. Here, each texture image is divided into 16 non-
overlapping sub-images of size 128 × 128 pixels. Thus, a total of 640 texture sub-
images, categorized into 40 different classes, are used in our experiments.
We obtain ARR [using Eqs. (25) and (26)] of 61.29 % for the proposed
framework. In the first stage, we filter 25 images from the database of 640 images.
In the second stage, we rank these images based on the fusion of EHD and ART. It
Table 4 Average recall comparison of the proposed framework with method [17] on Wang’s database

Semantic name   Method [17]   Proposed method using ART (in outer disk framework)   Proposed method using ART (in inner disk framework)
Africa people 0.147 0.136 0.138
Beach 0.180 0.120 0.120
Building 0.180 0.126 0.118
Bus 0.138 0.176 0.174
Dinosaurs 0.112 0.200 0.200
Elephants 0.163 0.148 0.148
Flowers 0.127 0.164 0.156
Horses 0.121 0.198 0.198
Mountains 0.190 0.102 0.100
Food 0.157 0.146 0.144
Average recall (%) 15.15 15.16 14.96
is pertinent to mention here that the individual descriptors (i.e. CMs, EHD and
ART) are able to attain an ARR of 55.43, 56.23 and 28.19 % respectively. Clearly,
the proposed framework is able to achieve 5.86, 5.06 and 33.10 % improvement
over CMs, EHD and ART respectively.
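Since Eqs. (25) and (26) are not reproduced here, the sketch below uses a common ARR convention — the fraction of same-class sub-images among the top 16 results, averaged over all queries — as an assumption about the exact formula.

```python
def average_retrieval_rate(ranked_classes_per_query, query_classes, class_size=16):
    """ARR as a percentage: for each query, count same-class images among
    the top `class_size` results, then average over all queries."""
    total = sum(
        sum(1 for c in ranked[:class_size] if c == qc) / class_size
        for ranked, qc in zip(ranked_classes_per_query, query_classes)
    )
    return 100.0 * total / len(query_classes)

# One query whose top 16 contains 8 same-class sub-images gives ARR = 50.0.
```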
4.2 Time Complexity of Proposed Method
The time complexity of the proposed method is based on the complexity of feature
extraction and image retrieval. Since this is a multistage approach, we analyze the
time complexity of both stages separately.
Fig. 6 Average precision comparison of the proposed framework (using ART in the outer and inner disk frameworks) with methods [17] and [19]–[23] on Wang’s database
Fig. 7 The color image retrieval results using the proposed framework for a randomly selected query image (on Wang’s database)
4.2.1 Time Complexity of First Stage
The time complexity of feature extraction for the proposed method in the first stage is
derived from the time taken to compute CMs. The time complexity of computing
CMs is O(M²r), where r is the number of regions into which an M × M image is
divided for the computation of CMs. Further, the distances d_j, for j = H, S, V, are
computed using Eq. (16) with time complexity O(1), and then a comparison of the
distances D_COL computed in Eq. (15) is performed through a sorting algorithm
whose complexity is O(n_d log n_d), where n_d is the number of distances to be
sorted. In the first stage, n_d is the size of the database itself, whereas in the second
stage, it is the number of images extracted in the first stage for a given query. Thus,
the time complexity of the first stage is approximately O(M²r) + O(n_d log n_d). In
this stage, the computation of CMs is the costliest step.
4.2.2 Time Complexity of Second Stage
In the second stage, the time complexity of computing EHD and ART is O(M²r) and
O(M²nm) respectively for an image of size M × M, where n and m are the order and
repetition of ART respectively. Another part of the algorithm is the normalization of
the distances D_ART+EHD computed in this stage; this is a trivial operation with
time complexity O(1). Thus, the time complexity of the second stage is
O(M²r) + O(M²nm) + O(n_d log n_d). Clearly, the time complexity of this stage is
dominated by the ART feature extraction time. This is because we use 16 regions for
computing EHD, whereas we compute 36 ART moments, i.e. n < 3, m < 12. But as
these features are computed for a very small set of images retrieved from the database
(e.g. of the order of 30 for Wang’s database), the time spent in the second stage is
considerably less than the time spent in the first stage.
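The ART computation whose O(M²nm) cost dominates this stage can be sketched as below. The basis functions follow the MPEG-7 ART definition; the Cartesian sampling of the unit disk and the function signature are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def art_moments(img, n_max=3, m_max=12):
    """MPEG-7 style Angular Radial Transform: |F_nm| for n < n_max and
    m < m_max (36 moments for 3 x 12, as used in the text). Basis:
    V_nm(rho, theta) = R_n(rho) * exp(j*m*theta) / (2*pi), with
    R_0(rho) = 1 and R_n(rho) = 2*cos(pi*n*rho) for n > 0."""
    M = img.shape[0]
    ys, xs = np.mgrid[0:M, 0:M]
    # map the pixel grid onto the unit disk centered on the image
    x = (2 * xs - M + 1) / (M - 1.0)
    y = (2 * ys - M + 1) / (M - 1.0)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    feats = np.empty((n_max, m_max))
    for n in range(n_max):
        Rn = np.ones_like(rho) if n == 0 else 2 * np.cos(np.pi * n * rho)
        for m in range(m_max):
            basis = Rn * np.exp(-1j * m * theta) / (2 * np.pi)
            feats[n, m] = abs((img * basis)[inside].sum())
    return feats.ravel()  # 36-dimensional ART descriptor
```

Each of the nm moments requires a pass over all M² pixels, which is exactly the O(M²nm) cost noted above.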
4.2.3 Analysis
We further present the actual CPU time elapsed for feature extraction and image
retrieval in Table 6.
Table 5 Average BEP of the proposed framework on OT-Scene database

Class name   Average BEP (%)
Coast 36.78
Forest 63.11
Highway 60.62
Inside city 51.49
Mountain 38.77
Open country 50.39
Street 61.03
Tall building 42.53
Total average BEP 50.59
It can be verified from this table that the computation of ART features is the
costliest in terms of time. By comparing the feature extraction time computed for
single image, we observe that ART features are computed in almost double the time
required to compute the CMs and EHD features. However, the ART features are
computed in the second stage and the extraction time spent in the second stage is
considerably less (almost one-fourth) than the time spent for computing CMs in the
first stage. This is possible because very few images (e.g. 30 for Wang’s database)
are returned from the first stage, and the computation-intensive ART features are
computed only on this small subset. Thus, the proposed method tactfully avoids the
constraint of high time complexity of computing the ART moments. It can be
observed that the total feature extraction time is dominated by the feature extraction
time of the first stage.
Table 6 shows that the CPU time taken for image retrieval is 4.0145 and 3.9249 s
for Wang’s and VisTex databases respectively. It also reports the retrieval time
elapsed in each stage, which includes the feature extraction time for the entire
database. As far as retrieval time is concerned, since the comparison effort is
greater in the first stage than in the second, the time consumed in first-stage
retrieval is greater than that in the second stage for both databases. Further, we observe
that the time spent in performing the entire retrieval process on Wang’s database is
4.0145 s, which is less than the average retrieval time (4.12 s) reported in [6] by
Wang et al. We have estimated their retrieval time by averaging the time given for
different classes of Wang’s database.
Also, with our proposed framework the average CPU time required for extracting
the features is approximately 3.9859 s (on Wang’s database), which is better than
the feature extraction time reported in [21]. The average CPU time required for
computing the feature vector is approximately 10 s in [21], wherein a new image
retrieval scheme using visually significant point features is proposed. They also find
invariant CMs at the significant image points in RGB domain.
Table 6 CPU time elapsed in feature extraction and retrieval for two different databases
Wang’s database VisTex database
Feature extraction time (in s) for single image
Color moments 0.0445 0.0410
EHD 0.0468 0.0453
ART 0.0867 0.0851
Feature extraction time (in s) for full database
First stage 2.9912 2.7912
Second stage 0.9947 0.9043
Total time 3.9859 3.6955
Retrieval time (in s) for full database
First stage 3.0056 2.9927
Second stage 1.0089 0.9322
Total time 4.0145 3.9249
5 Conclusions and Future Directions
In this paper, we present a novel framework using combination of low level features
in a multistage manner to improve the retrieval accuracy of image retrieval system.
Firstly, we retrieve images using CMs and then apply ART and EHD on the images
filtered in the first stage. Experimental results prove that the proposed scheme
performs exceptionally well and is robust in comparison to conventional hybrid
frameworks in terms of retrieval accuracy computed on Wang’s, VisTex and OT-
Scene databases. Based on the performance analysis, the following conclusions can be
drawn:
1. The retrieval accuracy computed in terms of average precision is 75.80 % on
Wang’s database which is better than many existing hybrid frameworks.
2. In spite of its multistage retrieval, the proposed framework is also observed to be
efficient in terms of time complexity.
3. For all the different databases, the performance of proposed framework is better
than that of the individual descriptors i.e. CMs, ART and EHD alone.
In the future, we would like to extend this framework through the use of a variety
of fusion methods (e.g. feature level fusion) and distance measures in order to
further enhance the accuracy of image retrieval.
Acknowledgments Two of the authors are thankful to South Asian University, New Delhi for financial
support during their research work. We are also extremely grateful to the anonymous reviewers for their
valuable comments that helped us to enormously improve the quality of the paper.
References
1. Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the
new age. ACM Computing Surveys, 40, 1–60.
2. Brunelli, R., & Mich, O. (2008). Histograms analysis for image retrieval. Pattern Recognition, 34,
1625–1637.
3. Rasheed, W., An, Y., Pan, S., Jeong, I., Park, J., & Kang, J. (2008). Image retrieval using maximum
frequency of local histogram based color correlogram. In Second Asia international conference on
modeling & simulation (pp. 322–326).
4. Huang, J., Kumar, S. R., Mitra, M., Zhu, W.-J., & Zabih, R. (1997). Image indexing using color
correlograms, In Proceedings of IEEE conference on computer vision and pattern recognition (pp.
762–768).
5. Lu, T.-C., & Chang, C.-C. (2007). Color image retrieval technique based on color features and
image bitmap. Information Processing and Management, 43, 461–472.
6. Wang, X.-Y., Yu, Y.-J., & Yang, H.-Y. (2011). An effective image retrieval scheme using color,
texture and shape features. Computer Standards & Interfaces, 33, 59–68.
7. Park, D. K., Jeon, Y. S., & Won, C. S. (2000). Efficient use of local edge histogram descriptor, In
Proceedings of the 2000 ACM workshops on multimedia (pp. 51–54).
8. Kim, W. Y., & Kim, Y. S. (2000). A region based shape descriptor using Zernike moments. Journal
of Signal Processing: Image Communication, 16, 95–102.
9. Amanatiadis, A., Kaburlasos, V. G., Gasteratos, A., & Papadakis, S. E. (2011). Evaluation of shape
descriptors for shape-based image retrieval. Image Processing, 5, 493–499.
10. Pooja, C. S. (2012). An effective image retrieval system using region and contour based features. In
IJCA proceedings on international conference on recent advances and future trends in information
technology (pp. 7–12).
11. Singh, S. M., & Hemachandran, K. (2012). Content-based image retrieval using color moment and
gabor texture feature. IJCSI International Journal of Computer Science, 9, 299–309.
12. Pooja, C. S. (2012). An effective image retrieval using the fusion of global and local transforms based
features. Optics & Laser Technology, 44, 2249–2259.
13. Goyal, A., & Walia, E. (2012). An analysis of shape based image retrieval using variants of Zernike
moments as features. International Journal of Imaging and Robotics, 7, 44–69.
14. Zhang, D., & Lu, G. (2002). Shape-based image retrieval using generic Fourier descriptor. Signal
Processing: Image Communication, 17, 825–848.
15. Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching
for picture libraries. IEEE Transaction on Pattern Analysis and Machine Intelligence, 23, 947–963.
16. ElAlami, M. E. (2011). A novel image retrieval model based on the most relevant features.
Knowledge-Based Systems, 24, 23–32.
17. Kang, J., & Zhang, W. (2012). A framework for image retrieval with hybrid features. In 24th Chinese
control and decision conference (CCDC) (pp. 1326–1330).
18. Hiremath, P. S., & Pujari, J. (2007). Content based image retrieval using color, texture and shape
features. In International conference on advanced computing and communications (pp. 780–784).
19. Huang, Z.-C., Chan, P. P. K., Ng, W. W. Y., & Yeung, D. S. (2010). Content-based image retrieval
using color moment and Gabor texture feature. In International conference on machine learning and
cybernetics (pp. 719–724).
20. Yue, J., Li, Z., Liu, L., & Fu, Z. (2011). Content-based image retrieval using color and texture fused
features. Mathematical and Computer Modeling, 54, 1121–1127.
21. Banerjee, M., Kundu, M. K., & Maji, P. (2009). Content-based image retrieval using visually sig-
nificant point features. Fuzzy Sets and Systems, 160, 3323–3341.
22. Jalab, H. A. (2011). Image retrieval system based on color layout descriptor and Gabor filters. In
IEEE conference on open systems (ICOS) (pp. 32–36).
23. Liu, G.-H., & Yang, J.-Y. (2013). Content-based image retrieval using color difference histogram.
Pattern Recognition, 46, 188–198.
24. Gong, M., Li, H., & Cao, W. (2013). Moment invariants to affine transformation of colors. Pattern
Recognition Letters, 34, 1240–1251.
25. Mindru, F., Tuytelaars, T., Gool, L. V., & Moons, T. (2004). Moment invariants for recognition
under changing viewpoint and illumination. Computer Vision and Image Understanding, 94, 3–27.
26. Manjunath, B. S., Ohm, J. R., & Vasudevan, V. V. (2001). Color and texture descriptors. IEEE
Transactions on Circuits and Systems for Video Technology, 11, 703–715.
27. Jain, A., Nandakumar, K., & Ross, A. (2005). Score normalization in multimodal biometric systems.
Pattern Recognition, 38, 2270–2285.
28. Guo, J. M., Prasetyo, H., & Su, H. S. (2013). Image indexing using the color and bit pattern feature
fusion. Visual Communication and Image Representation, 24, 1360–1379.
29. Wang, X.-Y., Yang, H.-Y., & Li, D.-M. (2013). A new content-based image retrieval technique using
color and texture information. Computers & Electrical Engineering, 39(3), 746–761.
30. Alexandre D. S., & Tavares, J. M. R. S. (2010). Introduction of human perception in visualization.
International Journal of Imaging and Robotics, 4, 60–70.