ORIGINAL PAPER
An Effective and Fast Hybrid Framework for Color Image Retrieval
Ekta Walia • Sulaiman Vesal • Aman Pal
Received: 22 November 2013 / Revised: 15 April 2014
© Springer Science+Business Media New York 2014
Abstract This paper presents a novel, fast and effective hybrid framework for
color image retrieval through combination of all the low level features, which gives
higher retrieval accuracy than other such systems. The color moment (CMs),
angular radial transform descriptor and edge histogram descriptor (EHD) features
are exploited to capture color, shape and texture information respectively. A mul-
tistage framework is designed to imitate human perception so that in the first stage,
images are retrieved based on their CMs and then the shape and texture descriptors
are utilized to identify the closest matches in the second stage. The scheme employs
division of images into non-overlapping regions for effective computation of CMs
and EHD features. To demonstrate the efficacy of this framework, experiments are
conducted on Wang’s, VisTex and OT-Scene databases. In spite of its multistage
design, the system is observed to be faster than other hybrid approaches.
Keywords Color moments · Angular radial transform · Edge histogram
descriptor · Content based image retrieval · Similarity measures · Time complexity
1 Introduction
Nowadays, with developments in the field of digital image processing and the
increased use of digital images, it has become essential to find an effective
E. Walia (✉) · A. Pal
Department of Computer Science, South Asian University, New Delhi, India
e-mail: [email protected]
A. Pal
e-mail: [email protected]
S. Vesal
Information Extend Technology, Kabul, Afghanistan
e-mail: [email protected]
Sens Imaging (2014) 15:93
DOI 10.1007/s11220-014-0093-9
and efficient method for searching and indexing of images from large image
collections. Conventional annotation approaches rely on mapping images to some
text, keywords or descriptions. Such methods can hardly describe the content
diversity of images. Due to this lack of discrimination power in conventional
methods, content based image retrieval (CBIR) [1] has become an active research
topic over the last few decades. Currently, CBIR techniques work on a combination
of low level and high level features [1]. Color, texture and shape are the prominent
low level features used in CBIR.
Color is one of the most expressive and distinguishing visual features used in
CBIR and object recognition. Many color descriptors have been proposed for
image retrieval. These include the color histogram, which indicates the occurrence
of colors in the image [2]; the dominant color descriptor [3], which describes the
salient distribution of color in a region of interest; and the color correlogram [4],
which describes the probability of finding a pair of colors at a particular distance.
Lu and Chang [5] proposed a new technique for more effective image retrieval
which uses color distributions to represent the global
characteristics of image. Color moments (CMs) are known to yield better retrieval
accuracy as compared to the conventional color based features. In this paper, we use
CMs for extraction of color features. In order to increase the discrimination power
of CMs, we use non-overlapping image tiles and then extract first three moments of
every tile which yields a powerful descriptor.
Texture is another significant visual feature; it can distinguish different objects
without any other information and describes the structural arrangement of a region
of an object. Many statistical techniques have been proposed to describe the texture
features of an image, such as the gray-level co-occurrence matrix, Markov random
fields, Gabor wavelets and the edge histogram descriptor (EHD). These features
have shown better performance and effectiveness in comparison to some of the
traditional features [6, 7].
Amongst the various visual features, shape based features are the most relevant,
because human perception is based on the shape of an object: humans can recognize
objects solely from their shapes. Shape descriptors can be classified into
contour-based descriptors, which extract features from the outer boundary, and
region-based descriptors, which extract features from the entire region [8]. The
important region-based shape descriptors are Zernike moments (ZMs), the angular
radial transform (ART), geometric moments, moment invariants, etc. [9]. ZMs
possess certain desirable properties such as rotation invariance, robustness to noise,
minimum information redundancy and fast computation for each moment order [8],
but they still require complex computation when all moments up to a certain order
are needed. ART is a moment-based method adopted by MPEG-7, and its
robustness and efficiency have been demonstrated by many researchers. Therefore,
in our proposed framework, we use the ART descriptor, which provides almost the
same retrieval accuracy as ZMs with the extra advantage of being quite efficient in
its computation time [10].
Over the past years, most of the studies on CBIR have used only a single feature
amongst various visual features. However, it is hard to achieve satisfactory retrieval
results using a single feature because usually an image contains various
characteristics and diverse contents. Therefore, it is essential to combine visual
features in order to gain satisfactory retrieval results. Since shape, texture and color
features are complementary to each other, their combination is expected to yield
improved retrieval performance. In order to improve retrieval efficiency, the authors
of [16, 19] proposed color CBIR systems that utilize a combination of color and
texture features only. Wang et al. [6] proposed a
color image retrieval framework by using all the low level features. They used color
quantization algorithm with clusters merging to obtain the small number of
dominant colors and their percentages in the image. The texture features are
extracted using steerable filter decomposition and the pseudo-ZMs are used as shape
descriptor. ElAlami [16] proposed a model for image retrieval which depends on
most relevant features (color and texture) according to a Genetic Algorithm based
feature selection technique. Kang and Zhang [17] presented a color image retrieval
scheme combining all three features, i.e. color, texture and shape, which achieved
higher retrieval efficiency; they integrated the color histogram, the gray-level
co-occurrence matrix and ZMs. Hiremath and
Pujari [18] also presented a framework by combining all low level features to
achieve higher retrieval efficiency. In [19], Huang et al. proposed an image retrieval
approach based on color and texture features. They used CMs as color features and
Gabor filter to capture texture features whose similarity results are finally combined
for color image retrieval. Yue et al. [20] have used the fusion of color histogram and
texture features based on a co-occurrence matrix for image retrieval. Banerjee et al.
[21] have presented a content based retrieval system based on significant point
features extracted using a fuzzy set theoretic approach. Jalab [22] has used a
combination of color layout and Gabor texture descriptors for the implementation of
image retrieval system. Liu and Yang [23] presented color difference histogram, a
new image feature representation approach for image retrieval. This descriptor
encodes color, edge orientation and perceptually uniform color difference. Wang
et al. [29] proposed a new CBIR scheme using color and texture information; they
use Zernike chromaticity distribution moments to capture color content, while
texture features are obtained in the contourlet domain.
In this paper, we propose a hybrid framework for color image retrieval which
gives better accuracy than the conventional methods. It is different from other
hybrid approaches because it is based on the human perception and recognition
process. As humans perceive objects initially by their colors and later make sense of
them using shape and texture, our framework imitates this recognition process by
retrieving relevant images in two stages. The work in [30] presents data on different
visual attributes arranged in decreasing order of their perceptivity, showing that
color attributes are more perceivable than texture and shape.
Therefore, we use color features in the first stage and shape as well as texture
features in the second stage. Thus, it is a unique approach as the retrieval is
performed in a multistage fashion. To extract the color features in the first stage, we
use CMs which are more discriminative and give detailed color information of
image. In the second stage, texture and shape features of the images returned by the
first stage are extracted. EHD is used to capture texture features of color images; it
creates an 80-bin histogram representing the distribution of edges in an image.
ART is used for shape feature extraction. It is chosen over other moment based
shape descriptors because of small size of its descriptor. Further, the distances of the
ART and EHD features of the query image and the images belonging to the dataset
obtained in the first stage are combined using certain weights and finally 20 top
ranked images are retrieved for Wang’s database.
To study the retrieval performance of the proposed framework, various
experiments are performed on different databases such as Wang’s database, OT-
Scene and VisTex databases using Euclidean distance as a similarity measure. The
image retrieval performance of the proposed framework is also compared with other
hybrid approaches proposed in this domain. From the experiments, it is observed
that the performance of the proposed technique is better than others. A thorough
analysis of time complexity of the proposed technique is also given.
The rest of the paper is organized as follows. Section 2
describes the various descriptors used i.e. CMs, EHD and ART. Section 3 explains
the methodology employed in the proposed approach. Section 4 presents the
experimental results and analysis of retrieval performance on various databases. It
also gives comparison of the retrieval performance of the proposed scheme with
other such systems. Further, an analysis of time complexity is also presented here.
Section 5 presents conclusions with future directions.
2 Descriptors Used
2.1 Color Moments
The objective of color features is to retrieve all the images whose color composition
is similar to that of the query image. Experiments have shown that the color
histogram does not capture the spatial relationship of color regions, so many
researchers have focused on color indexing approaches based on the global
distribution of color in an image. The most widely used such techniques are the
color correlogram, the color coherence vector and CMs, which perform better than
the traditional techniques. In this paper, we use CMs because their primitives are
more robust for describing color images and they lead to a faster implementation in
comparison to other methods that need expensive computation.
For color images, the RGB model is well known. However, RGB is not a uniform
color model and is less suited to describing colors according to human
interpretation. The HSV (hue, saturation and value) color space is more closely
related to human perception. Therefore, we convert images from RGB to the HSV
model and extract the primitive moments of each component.
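As a concrete illustration, this RGB to HSV conversion can be done with Python's standard `colorsys` module; in this sketch each output component lies in [0, 1], which is the normalization the distance computation in Sect. 3.2 relies on. The function name is ours, not from the paper:

```python
import colorsys

# Convert one RGB pixel (8-bit integer channels) to HSV, with each of the
# hue, saturation and value components normalized to [0, 1].
def rgb_pixel_to_hsv(r, g, b):
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

# Pure red maps to hue 0, full saturation, full value.
h, s, v = rgb_pixel_to_hsv(255, 0, 0)
```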
In [11], to improve the discriminative power of CM technique, the authors
divided the image horizontally into three non-overlapping regions and from each of
the regions, they extracted the first three moments for each color channel and made
a descriptor of 27 floating points. Along similar lines, we divide an image into
sub-images for the computation of CMs; this is described in detail in Sect. 2.2 for EHD.
The first three CMs of an image are given by mean, standard deviation and
skewness of colors. These are depicted in Eq. (1) through Eq. (3).
$$E_{r,j} = \frac{1}{N}\sum_{i=1}^{N} I_{i,j}, \quad j = H, S, V \quad (1)$$

$$\sigma_{r,j} = \left(\frac{1}{N}\sum_{i=1}^{N} \left(I_{i,j} - E_{r,j}\right)^2\right)^{1/2}, \quad j = H, S, V \quad (2)$$

$$S_{r,j} = \left(\frac{1}{N}\sum_{i=1}^{N} \left(I_{i,j} - E_{r,j}\right)^3\right)^{1/3}, \quad j = H, S, V \quad (3)$$
where $I_{i,j}$ represents the intensity of the jth color channel at the ith location in the
image tile, and N represents the number of pixels in each region/tile.
The descriptor size for CMs is thus stated as:

$$\text{size}(F_{CM}) = r \times n_c \times n_{CM} \quad (4)$$
where the feature vector $F_{CM}$ is given as:

$$F_{CM} = \begin{bmatrix} E_{1,H} & E_{1,S} & E_{1,V} \\ \sigma_{1,H} & \sigma_{1,S} & \sigma_{1,V} \\ S_{1,H} & S_{1,S} & S_{1,V} \\ \vdots & \vdots & \vdots \\ E_{r,H} & E_{r,S} & E_{r,V} \\ \sigma_{r,H} & \sigma_{r,S} & \sigma_{r,V} \\ S_{r,H} & S_{r,S} & S_{r,V} \end{bmatrix}, \quad \text{where } r = 16 \quad (5)$$
and $n_c$ is the number of color channels and $n_{CM}$ is the number of CMs. There are
works that describe the direct computation of moment invariants for a color model
like RGB [24, 25]. However, in order to increase retrieval accuracy, we prefer to
compute the moments of each color channel separately. Thus, we divide the image
into r = 16 non-overlapping sub-images of 25 × 25 pixels and extract the primitive
moments of each color channel from every sub-image, obtaining a descriptor of
48 × 3 floating points. Refer to Sect. 4.1 for a detailed analysis.
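A minimal sketch of Eqs. (1)–(3) for a single tile and a single channel follows; the tile is given as a flat list of intensities in [0, 1], and the signed cube root is our implementation choice for handling a negative third moment, a detail the equations leave implicit:

```python
import math

# First three color moments of one tile for one channel, per Eqs. (1)-(3):
# mean, standard deviation, and (signed) skewness of the intensities.
def color_moments(tile):
    n = len(tile)
    mean = sum(tile) / n                                 # Eq. (1)
    var = sum((p - mean) ** 2 for p in tile) / n
    std = math.sqrt(var)                                 # Eq. (2)
    m3 = sum((p - mean) ** 3 for p in tile) / n
    skew = math.copysign(abs(m3) ** (1.0 / 3.0), m3)     # Eq. (3), signed cube root
    return mean, std, skew
```

Repeating this for each of the 16 tiles and 3 HSV channels yields the 48 × 3 descriptor of Eq. (5).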
2.2 Edge Histogram Descriptor (EHD)
Texture is an important visual feature of an image. The texture descriptors provide
measures of properties such as smoothness, coarseness, and regularity [7]. The
distribution of edges is a good texture signature that is useful for image to image
matching even when the underlying texture is not homogeneous [26].
EHD is one of the widely used statistical techniques to capture texture features of
an image. This descriptor proposed for MPEG-7 [7, 26], determines the local edge
distribution in an image.
EHD is obtained by partitioning the entire image (scaled down to 16 × 16 pixels)
into 16 non-overlapping sub-images of 4 × 4 pixels, as shown in Fig. 1.
The edges in EHD are classified into five types: four directional edges (vertical,
horizontal, and diagonal edges at 45° and 135°) and one non-directional edge. If an
image block does not have any directionality, it is counted as a non-directional
edge. After the edges have been extracted from the image blocks, we count the
total number of edges in every sub-image. As there are five different types of
edges, each sub-image has a histogram of five bins; since there are 16 sub-images,
we have a total of 16 × 5 = 80 histogram bins for each image. Every sub-image is
further divided into 4 image blocks (of 2 × 2 pixels) in order to obtain the above
histogram. The semantics of the bins of the resulting histogram are defined in
Table 1. We employ the 2 × 2 filters shown in Fig. 2 to compute the corresponding
edge intensity values of each sub-image. If the intensity value of an edge exceeds a
given threshold, the corresponding image block is considered to be an edge block
[7]. In this paper, we have taken the threshold value as 11, as per [7].
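The edge classification for a single 2 × 2 block can be sketched as follows; the filter coefficients are the MPEG-7 style ones commonly quoted for EHD and are an assumption here, since Fig. 2 is not reproduced:

```python
import math

# 2x2 edge filters (assumed MPEG-7 EHD coefficients), applied to the four
# block sub-averages [top-left, top-right, bottom-left, bottom-right].
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diag_45":         (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diag_135":        (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}

def classify_block(a, threshold=11):
    # Edge strength = absolute filter response; the strongest filter wins.
    strengths = {name: abs(sum(c * v for c, v in zip(f, a)))
                 for name, f in FILTERS.items()}
    best = max(strengths, key=strengths.get)
    # Blocks whose strongest response stays below the threshold (11, per [7])
    # carry no edge and do not contribute to the histogram.
    return best if strengths[best] >= threshold else None

# A bright-left / dark-right block is classified as a vertical edge.
edge = classify_block([200, 10, 200, 10])
```

Counting the winning edge type over the 4 blocks of each of the 16 sub-images fills the 80-bin histogram.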
2.3 Angular Radial Transform (ART)
ART is a moment-based image description method adopted in MPEG-7 as a 2D
region-based shape descriptor. This descriptor has many desirable characteristics
such as compact size, robustness to noise and scaling, invariance to rotation and
ability to describe complex objects. Significant characteristics are its small size and
speed, which ensure a fast image retrieval process. ART is a complex orthogonal
unitary transform defined on a unit disk, based on complex orthogonal sinusoidal
basis functions in polar co-ordinates [9, 10, 12]. The ART coefficients,
$F_{nm}$, of order n and repetition m, are defined by:

$$F_{nm} = \int_0^{2\pi} \int_0^1 V_{n,m}^*(r, \theta)\, f(r, \theta)\, r\, dr\, d\theta \quad (6)$$

where $f(r, \theta)$ is the image intensity function in polar co-ordinates and $V_{n,m}^*(r, \theta)$
is the ART basis function, the complex conjugate of $V_{n,m}(r, \theta)$, which is separable
along the angular and radial directions:

$$V_{n,m}(r, \theta) = R_n(r)\, A_m(\theta) \quad (7)$$
with

$$A_m(\theta) = \frac{1}{2\pi} e^{jm\theta} \quad (8)$$

and

$$R_n(r) = \begin{cases} 1 & (n = 0) \\ 2\cos(\pi n r) & (n > 0) \end{cases} \quad (9)$$
where n and m represent the order and repetition of ART, respectively. For a
discrete image $f(x, y)$ of size $N \times N$ pixels, ART is approximately computed
using Eq. (10), where the integrals of Eq. (6) are replaced by summations:

$$F_{nm} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(x_i, y_j)\, V_{nm}^*(x_i, y_j)\, \Delta x_i\, \Delta y_j, \quad \text{where } x_i^2 + y_j^2 \le 1 \quad (10)$$
Fig. 1 The illustration of sub-images and image blocks
Table 1 The semantics of local edge bins

Histogram bins    Semantics
FEHD[1]           Vertical edge of sub-image at (0,0)
FEHD[2]           Horizontal edge of sub-image at (0,0)
FEHD[3]           45° edge of sub-image at (0,0)
FEHD[4]           135° edge of sub-image at (0,0)
FEHD[5]           Non-directional edge of sub-image at (0,0)
FEHD[6]           Vertical edge of sub-image at (0,1)
...               ...
FEHD[75]          Non-directional edge of sub-image at (3,2)
FEHD[76]          Vertical edge of sub-image at (3,3)
FEHD[77]          Horizontal edge of sub-image at (3,3)
FEHD[78]          45° edge of sub-image at (3,3)
FEHD[79]          135° edge of sub-image at (3,3)
FEHD[80]          Non-directional edge of sub-image at (3,3)
The coordinates $(x_i, y_j)$ in the unit disk are given by

$$x_i = \frac{2i + 1 - N}{D}, \quad y_j = \frac{2j + 1 - N}{D} \quad (11)$$

where i, j = 0, 1, 2, …, N − 1, and

$$D = \begin{cases} N & \text{for the inner circular disk contained in the square image} \\ N\sqrt{2} & \text{for the outer circular disk containing the whole square image} \end{cases} \quad (12)$$

and

$$\Delta x_i = \Delta y_j = \frac{2}{D} \quad (13)$$
It has been observed in [12] that ART with the outer circular disk gives better
results than with the inner circular disk. Therefore, in this paper, we use the outer
circular disk; we also compare the experimental results of both approaches in
Sect. 4.1.1. We use ART coefficients of order (n < 3, m < 12), which gives a feature
vector containing 36 moments. Rotation invariance is achieved by using the
magnitudes of the coefficients. The ART shape descriptor is given by:

$$F_{ART} = (F_{0,0}, F_{0,1}, \ldots, F_{0,11}, F_{1,0}, F_{1,1}, \ldots, F_{1,11}, F_{2,0}, F_{2,1}, \ldots, F_{2,11}) \quad (14)$$
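A direct (unoptimized) sketch of Eq. (10) with the outer circular disk follows; with D = N√2 every pixel center already satisfies x² + y² ≤ 1, so no masking is needed. This is our reading of Eqs. (8)–(13), not the authors' code:

```python
import numpy as np

# ART coefficients of an N x N grayscale image, per Eq. (10), using the
# outer circular disk D = N*sqrt(2) of Eq. (12).
def art_coefficients(img, n_max=3, m_max=12):
    N = img.shape[0]
    D = N * np.sqrt(2)
    coords = (2 * np.arange(N) + 1 - N) / D        # Eq. (11)
    X, Y = np.meshgrid(coords, coords, indexing="ij")
    r = np.sqrt(X ** 2 + Y ** 2)
    theta = np.arctan2(Y, X)
    dxdy = (2 / D) ** 2                            # Eq. (13)
    F = np.zeros((n_max, m_max), dtype=complex)
    for n in range(n_max):
        R = np.ones_like(r) if n == 0 else 2 * np.cos(np.pi * n * r)   # Eq. (9)
        for m in range(m_max):
            A = np.exp(1j * m * theta) / (2 * np.pi)                   # Eq. (8)
            F[n, m] = np.sum(img * np.conj(R * A)) * dxdy              # Eq. (10)
    return F
```

The 36 magnitudes |F_nm| then form the rotation-invariant descriptor of Eq. (14).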
3 Proposed Algorithm
3.1 Framework of Proposed System
In this paper, we propose a novel multistage image retrieval approach which first
classifies the images on the basis of the CMs feature vector, which encompasses a
number of useful properties as described in Sect. 2.1. The second-stage features
are obtained using the ART and EHD descriptors. The performance of the
proposed framework is analyzed against some major available image retrieval
systems. The steps involved in the proposed framework are described
diagrammatically in Fig. 3. It can be observed from the diagram that various tasks
are performed in the two stages of the proposed framework. The following steps
describe the entire approach:
Fig. 2 Filters for edge detection
First stage: As a first step, the color feature vectors are extracted by using CMs.
The procedure is as follows:
1) The query and database images are resized to 100 × 100 pixels from their
original size.
2) We convert the images to the HSV (hue, saturation, value) color space because
RGB is not well suited to describing colors in terms that are practical for
human interpretation [17].
3) Each image (query as well as database), expressed in three HSV color channels,
is then divided into a number of tiles (sub-images). Here, we divide every HSV
component into 16 non-overlapping tiles of 25 × 25 pixels.
4) The CMs for each tile (r) are calculated using Eq. (1) through Eq. (3), and a
descriptor FCM with 48 × 3 features is generated as per Eq. (5).
Fig. 3 Multistage color image retrieval using CMs, ART and EHD
5) The color feature vector of the query image is compared to those of the database
images using the Euclidean distance, as described in Eq. (15) and Eq. (16).
6) At the end of this stage, the top k (e.g. k = 30 for Wang’s database) most
relevant database images are retrieved, and the output of this stage serves as the
set of database images for the next stage.
Second Stage: In this stage, shape and texture features of the query image and the
images in the database generated from previous stage are extracted by using ART
and EHD descriptors. Since ART extracts shape information globally and EHD
extracts texture information locally on the non-overlapping sub-images, the two
types of features are complementary to each other and are used together in this
stage. The steps are as follows:
1) The ART feature vectors are computed for the query image and the subset of
database images, by resizing them to 100 × 100 pixels and converting them to
grayscale form.
2) Further, the coefficients of ART are computed for all orders such that n < 3,
m < 12.
3) The Euclidean distance is then computed to evaluate the similarity between the
feature vectors FART of the query image and the subset of database images, as
per Eq. (17). Each of these distances is further normalized using Eq. (19).
4) As described in Sect. 2.2, EHD feature vectors are also extracted for the query
image and for the subset of database images generated from the previous stage.
Every image is therefore represented by an 80-bin descriptor which shows the
distribution of its edges.
5) The Euclidean distance is again computed to evaluate the similarity between
the feature vectors FEHD of the query image and the subset of database images,
as per Eq. (18). Each of these distances is further normalized using Eq. (20).
6) The distances obtained in steps 3 and 5 are then combined using certain
weights. In the proposed framework, the weights given to the ART and EHD
feature distances are computed adaptively. This combination is based on our
observation that the ART descriptor is more effective in capturing global
details than the EHD descriptor, which captures local edge details. Thus, the
distance between the query image and the images belonging to the subset
retrieved from the first stage is a weighted combination of the distances
between their ART and EHD features. More details on how these distances are
combined are given in Eq. (21) of Sect. 3.2.
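The two-stage flow above can be sketched as follows; `d_color` and `d_art_ehd` are hypothetical callables standing in for the distances of Eq. (15) and Eq. (21), and the images are opaque identifiers:

```python
# Two-stage retrieval skeleton: rank by color-moment distance, keep the k
# closest candidates, then re-rank those by the weighted ART + EHD distance.
def retrieve(query, database, d_color, d_art_ehd, k=30, top=20):
    stage1 = sorted(database, key=lambda img: d_color(query, img))[:k]
    stage2 = sorted(stage1, key=lambda img: d_art_ehd(query, img))
    return stage2[:top]
```

Because the second stage only touches the k first-stage survivors, the expensive ART and EHD computations run on a small subset of the database, which is where the speed advantage of the multistage design comes from.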
3.2 Similarity Matching
Image similarity measures typically assess a distance between sets of image
features; a shorter distance corresponds to higher similarity, and the choice of
metric depends on the type of feature vectors. In order to compute the distance between query
image Q and training images T, we use the Euclidean distance (L2-norm) for CMs.
We use a weighted color channel scheme, wherein we assign different weights to
the CMs distance of each color channel. After many experiments, we established
that saturation has more impact in the HSV color space; therefore we assign more
weight to the S component, i.e. w2 is chosen to be higher than w1 and w3, which
are both equal. The CM similarity measure is therefore defined as follows:
$$D_{COL}(F_{COL}^Q, F_{COL}^T) = w_1 \cdot d_H + w_2 \cdot d_S + w_3 \cdot d_V \quad (15)$$

where $w_1 = 0.25$, $w_2 = 0.50$, $w_3 = 0.25$, and

$$d_j = \sum_{i=1}^{r} \left( \sqrt{\left(E_{i,j}^Q - E_{i,j}^T\right)^2} + \sqrt{\left(\sigma_{i,j}^Q - \sigma_{i,j}^T\right)^2} + \sqrt{\left(S_{i,j}^Q - S_{i,j}^T\right)^2} \right), \quad j = H, S, V \quad (16)$$
where $F_{COL}^Q$ is the color feature vector of the query image and $F_{COL}^T$ is the color
feature vector of a database image. Also, r represents a region/tile of the image.
$E_{i,j}^Q, E_{i,j}^T$ are the region-wise means of intensities of the query and database
images, computed for each color channel j. Similarly, $\sigma_{i,j}^Q, \sigma_{i,j}^T, S_{i,j}^Q, S_{i,j}^T$ are the
region-wise standard deviations and skewness of intensities (of the query and
database images), belonging to each color channel. All three individual distances
$d_H$, $d_S$ and $d_V$ are already normalized because the RGB to HSV conversion
causes each constituent component’s value (i.e. hue, saturation and value) to lie
between 0 and 1.
It is worth mentioning that we have also tested the retrieval performance of CMs
obtained through the technique described in [24] for the RGB and HSV models
directly, and found that their retrieval results are inferior to those of the weighted
color channel scheme proposed here.
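Note that each √((·)²) term in Eq. (16) reduces to an absolute difference, so the per-channel distance is a sum of absolute moment differences over the tiles. A hypothetical sketch (the feature layout is ours):

```python
# Weighted color-moment distance of Eqs. (15)-(16). `q` and `t` map each
# channel ('H', 'S', 'V') to a list of (mean, std, skew) triples, one per tile.
def cm_distance(q, t, weights=None):
    weights = weights or {"H": 0.25, "S": 0.50, "V": 0.25}   # w1, w2, w3
    total = 0.0
    for ch, w in weights.items():
        # Eq. (16): sqrt((x)^2) is just |x|, summed over the r tiles.
        d = sum(abs(eq - et) + abs(sq - st) + abs(kq - kt)
                for (eq, sq, kq), (et, st, kt) in zip(q[ch], t[ch]))
        total += w * d                                       # Eq. (15)
    return total
```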
The similarity distances between the shape features described by ART and the
texture features described by EHD are calculated using the L2-norm, defined as
follows:
$$D_{ART}(F_{ART}^Q, F_{ART}^T) = \sqrt{\sum_{i=1}^{36} \left(F_{ART,i}^Q - F_{ART,i}^T\right)^2} \quad (17)$$

$$D_{EHD}(F_{EHD}^Q, F_{EHD}^T) = \sqrt{\sum_{i=1}^{80} \left(F_{EHD,i}^Q - F_{EHD,i}^T\right)^2} \quad (18)$$
After computing the similarity distances for the ART and EHD features, the two
distances may have very different ranges. Therefore, we use a normalization
method in order to make all the texture and shape feature distances fall in the same
range. The min–max normalization method [27] is
employed to achieve this. It performs a linear transformation on the original data.
Suppose that min_A and max_A are the minimum and maximum values of the
feature vector A; the min–max normalization maps a value v of A to v′ in the range [0, 1].
Thus, the individual distances of the ART and EHD feature vectors i.e. DART and
DEHD are normalized using Eq. (19) and Eq. (20) given below:
$$D_{ART}^n = \frac{D_{ART} - \min\{D_{ART}\}}{\max\{D_{ART}\} - \min\{D_{ART}\}}, \quad \forall\, T \text{ and } Q \quad (19)$$

$$D_{EHD}^n = \frac{D_{EHD} - \min\{D_{EHD}\}}{\max\{D_{EHD}\} - \min\{D_{EHD}\}}, \quad \forall\, T \text{ and } Q \quad (20)$$
Finally, the combination takes place as follows:
$$D_{ART+EHD} = w_4 D_{ART}^n + w_5 D_{EHD}^n \quad (21)$$

where $D_{ART}^n$ is the normalized distance between ART features and $D_{EHD}^n$ is the
normalized distance between EHD features, computed through Eq. (19) and Eq. (20).
The weights $w_4$ and $w_5$ are computed as:

$$w_4 = \frac{P_{ART}}{P_{ART} + P_{EHD}}, \quad w_5 = \frac{P_{EHD}}{P_{ART} + P_{EHD}} \quad (22)$$

where $P_{ART}$ and $P_{EHD}$ are the average precisions (defined in Sect. 3.3) of the ART
and EHD descriptors, respectively. In our experiments, we obtain $P_{ART}$ as 40.70 and
$P_{EHD}$ as 37.0 (for Wang’s database); therefore, $w_4$ is 0.52 and $w_5$ is 0.48.
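A sketch of Eqs. (19)–(22), using the average precisions the paper reports for Wang's database as default weights:

```python
# Min-max normalization of a list of distances, per Eqs. (19)-(20).
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Precision-weighted combination of the normalized ART and EHD distances,
# per Eqs. (21)-(22). Defaults are the reported P_ART and P_EHD values.
def combine(d_art, d_ehd, p_art=40.70, p_ehd=37.0):
    w4 = p_art / (p_art + p_ehd)   # about 0.52
    w5 = p_ehd / (p_art + p_ehd)   # about 0.48
    na, ne = min_max(d_art), min_max(d_ehd)
    return [w4 * a + w5 * e for a, e in zip(na, ne)]
```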
3.3 Parameters to Measure the Retrieval Performance
The performance of a retrieval system can be measured in terms of its precision–
recall (P–R) ratio. Precision measures the ability of the system to retrieve only
images that are relevant, while recall measures the ability of the system to retrieve
all images that are relevant. The P–R ratio is measured using Eq. (23):

$$\text{Precision} = \frac{\text{No. of relevant images retrieved}}{\text{No. of retrieved images}}, \quad \text{Recall} = \frac{\text{No. of relevant images retrieved}}{\text{Total no. of relevant images in the database}} \quad (23)$$
This ratio is used when there is an equal number of images in each class.
Bull’s eye performance (BEP) is used to measure retrieval performance when the
number of images is not equal in each class; it has been applied in various studies
[13, 14]. If the number of images in the database relevant to a query image Q is N,
and out of the 2N retrieved images only X are correct, then BEP is computed as per
Eq. (24):

$$\text{BEP} = \frac{X}{N} \quad (24)$$
We also use the average retrieval rate (ARR) [28], a robust metric for comparing
image retrieval methods. It is computed using Eq. (25):

$$\text{ARR} = \frac{1}{N_Q} \sum_{q=1}^{N_Q} RR(q) \quad (25)$$

where $N_Q$ is the number of queries used to evaluate the descriptor on a dataset, and
$RR(q)$ is the retrieval rate of a single query, computed as per Eq. (26):

$$RR(q) = \frac{n_k}{n_q} \le 1 \quad (26)$$

where $n_k$ is the number of correct retrievals and $n_q$ is the total number of images (in
the database) relevant to the query.
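The evaluation measures of Eqs. (23)–(26) can be sketched as set operations over image identifiers; the function and argument names are ours:

```python
# Precision and recall, per Eq. (23).
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

# Bull's eye performance, per Eq. (24): correct hits X among the top 2N
# retrievals, divided by the class size N.
def bep(retrieved_2n, relevant):
    return len(retrieved_2n & relevant) / len(relevant)

# Average retrieval rate, per Eq. (25): mean of per-query rates RR(q) <= 1.
def arr(retrieval_rates):
    return sum(retrieval_rates) / len(retrieval_rates)
```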
4 Experimental Results and Analysis
To evaluate the retrieval performance of the proposed system on different
databases, experiments are performed in MATLAB version 8.1 on a machine with a
3.40 GHz CPU and 16 GB RAM under a 64-bit Microsoft Windows operating system. The
proposed hybrid framework is evaluated through three sets of experiments. The first
set of experiments is performed on Wang’s database and the second set of
experiments is performed on OT-Scene dataset. The third is conducted on VisTex
database. The P–R ratio is considered for measuring the retrieval performance of
proposed framework on Wang’s database, whereas ARR is used for VisTex
database, and BEP measure is used for OT-Scene database.
Wang’s color image database This database, provided by Wang et al. [15], is a
subset of the COREL image database. It contains 1,000 images, equally divided
into ten categories: African people, beach, building, bus, dinosaur, elephant,
flower, horse, mountain and food. Every database image is of size 256 × 384 or
384 × 256 pixels. Figure 4a shows sample images from each category of this database.
Oliva and Torralba database We also evaluate the proposed framework on scene
classification, for which the database was downloaded from
http://cvcl.mit.edu/database.htm. The dataset from Oliva and Torralba, denoted the
OT-Scene dataset, consists of 2,688 color images from eight scene categories: coast
(360 samples), forest (328 samples), mountain (374 samples), open country (410
samples), highway (260 samples), inside city (308 samples), tall building (356
samples) and street (292 samples). Figure 4b shows sample images from each
category of this database.
VisTex database The VisTex database is a collection of color texture images
created by the MIT Media Lab, with the intention of providing a large set of
high-quality textures for computer vision applications. Each texture image is a
square image of 512 × 512 pixels. The dataset has two main components:
Reference Textures (homogeneous textures in frontal and oblique perspectives)
and Texture Scenes (images containing multiple textures, i.e. “real-world”
scenes). Figure 4c shows sample images from each class of this database.
4.1 Retrieval Performance
4.1.1 Retrieval Performance on Wang’s Database
In the first set of experiments, we randomly select 50 images (five per class) from
Wang’s database as query images, and each time retrieve the top 20 images as the
retrieval results. Further, we calculate the average precision and average recall for
each class.
To validate our approach for computing CMs, we conduct experiments with direct
evaluation of moment invariants for the RGB and HSV color models. With these,
we obtain only 39.40 and 40.53 % precision respectively on Wang’s database.
However, with our method of computing CMs separately for each color channel,
we achieve 62.53 % precision on Wang’s database.
Fig. 4 a Sample images from Wang’s database. b Sample images from OT-Scene database. c Sample images from VisTex database
To determine the optimal size of the non-overlapping sub-images for the
computation of CMs and EHD, we conduct experiments by varying the sub-image
size. Figure 5 shows graphs indicating that 4 × 4 pixels and 25 × 25 pixels are the
optimal sub-image sizes for the EHD and CM descriptors respectively; with these
sizes, we obtain the maximum average precision on Wang’s database. Therefore,
in all the experiments, we use sub-images of the optimal size only.
It is observed from the results that our proposed framework performs better than
the individual methods under consideration. Table 2 indicates the performance of
individual techniques on each class of Wang’s database. It is important to note that
the proposed framework enhances the average retrieval performance of CMs by
13.27 %, as the average precision (computed over ten classes) obtained through
CMs alone is only 62.53 %.
Further, the experimental results with average precision and average recall are
presented in Tables 3 and 4. The proposed results are compared with other image
retrieval approaches reported in the literature [17, 19–23]. It is inferred from the
comparison shown in Table 3 that the proposed framework achieves good results on
Wang’s database with average precision of 75.80 %. In all these works compared here,
50 or 80 images are selected randomly as query and 20 images are retrieved. We have
implemented the Color Difference Histogram technique of Liu and Yang [23] for
comparison purposes. For comparing our retrieval results with that of Hiremath and
Pujari [18], we also make every database image as query and obtain average precision
of 63.22 % for Wang’s database. This is higher than the 54.90 % achieved in [18].
In addition to the comparison given in Table 3, we observed that the retrieval
performance (in terms of average precision) of our proposed method is better than
the performance of hybrid scheme proposed in [6] by Wang et al. It is pertinent to
mention here that we averaged the precision given by them (when 50 random
images are used as query) for different classes to observe that they achieve
approximately 59 % precision (on average) for Wang’s database.
Table 4 shows the average recall of proposed method in comparison with the
method of Kang and Zhang [17]. Figure 6 pictorially depicts the average precision
computed on Wang’s database through the proposed framework and other hybrid
methods. Figure 7 depicts the result of image retrieval for a randomly selected
image (of a horse) as the query. The top 20 images retrieved by the proposed framework
are shown in order of their rank, from left to right, following the query image,
which appears in the top left corner.
To establish the efficacy of our proposed framework, we also conduct
experiments by swapping the descriptors used in the two stages, i.e. we use
ART and EHD in the first stage and CMs in the second. With this, we notice a
significant drop in precision, indicating that our choice of using CMs in the first
stage is correct.
4.1.2 Retrieval Performance on OT-Scene Database
In the second set of experiments performed on OT-Scene database, we randomly
select 40 images from this database, five per class as queries. We calculate the BEP
parameter as follows:
Fig. 5 Average precision of a EHD (region sizes 4 × 4, 8 × 8 and no region) and b CMs (region sizes 12.5 × 12.5, 25 × 25, 50 × 50 and no region) on Wang’s database
Table 2 Average precision of the individual descriptors and the proposed framework on Wang’s database

Class name         Color moments   EHD     ART (in outer disk framework)   Proposed framework
African            56.00           16.00   22.00                           68.00
Sea                64.00           42.00   31.00                           60.00
Building           38.00           18.00   17.00                           63.00
Bus                49.33           23.00   29.00                           88.00
Dinosaur           100.00          97.00   98.00                           100.00
Elephant           62.00           58.00   72.00                           74.00
Flower             70.67           33.00   37.00                           82.00
Horse              93.33           42.00   57.00                           99.00
Mountain           43.33           26.00   25.00                           51.00
Food               48.67           15.00   19.00                           73.00
Average precision  62.53           37.00   40.70                           75.80
Table 3 Average precision comparison of the proposed framework with other methods on Wang’s database

Semantic name      Method [17] (%)  Method [19] (%)  Method [20] (%)  Method [21] (%)  Method [22] (%)  Method [23] (%)  Proposed, ART in outer disk framework (%)  Proposed, ART in inner disk framework (%)
Africa people      69.07            54.00            58.75            61.00            32.00            54.00            68.00                                      69.00
Beach              55.32            42.00            41.19            56.00            61.00            51.00            60.00                                      60.00
Building           56.45            16.00            42.35            63.00            39.00            38.00            63.00                                      59.00
Bus                89.36            67.00            71.69            72.00            40.00            46.00            88.00                                      87.00
Dinosaurs          93.27            99.00            74.53            95.00            100.00           100.00           100.00                                     100.00
Elephants          70.84            40.00            65.08            77.00            56.00            63.00            74.00                                      74.00
Flowers            88.47            97.00            83.24            83.00            89.00            90.00            82.00                                      78.00
Horses             81.37            96.00            69.30            95.00            65.00            93.00            99.00                                      99.00
Mountains          64.58            46.00            44.86            68.00            56.00            48.00            51.00                                      50.00
Food               69.83            79.00            44.54            57.00            44.00            39.00            73.00                                      72.00
Average precision  73.86            63.60            59.55            72.70            58.20            62.20            75.80                                      74.80
1) In the first stage, for each query image, we retrieve 900 images (double the
maximum number of images in any class of this database) using CMs and
applying similarity measures between query image and database images.
2) The ART and EHD feature vectors are computed for the query image and
subset of filtered images.
3) The L2-norm similarity measure is again used to compute the distance between
the respective feature vectors of ART and EHD for the query and the images
filtered in step 1.
4) We then combine these distances (of the ART and EHD feature vectors) using
weights as described in Sect. 3.2.
5) According to Eq. (24) for computing BEP, we retrieve 2N images in this step,
where N is the number of images in a particular class to which the query image
belongs.
6) Further, we count the number of relevant retrieved images and compute BEP
according to Eq. (24).
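Steps 1–4 above can be sketched as follows; the plain L2 distances, the equal fusion weights, and the min-max normalization are assumptions standing in for Eqs. (15)–(16) and the weighting of Sect. 3.2.

```python
import numpy as np

def two_stage_retrieve(q_cm, db_cm, q_art, db_art, q_ehd, db_ehd,
                       n_filter=900, n_final=20, w_art=0.5, w_ehd=0.5):
    """Sketch of the multistage retrieval: filter the database by
    color-moment distance, then re-rank the survivors with a weighted
    fusion of min-max normalized ART and EHD distances."""
    d_cm = np.linalg.norm(db_cm - q_cm, axis=1)            # stage 1 distances
    keep = np.argsort(d_cm)[:n_filter]                     # filtered subset
    d_art = np.linalg.norm(db_art[keep] - q_art, axis=1)   # stage 2 distances
    d_ehd = np.linalg.norm(db_ehd[keep] - q_ehd, axis=1)
    norm = lambda d: (d - d.min()) / (d.max() - d.min() + 1e-12)
    fused = w_art * norm(d_art) + w_ehd * norm(d_ehd)      # score-level fusion
    return keep[np.argsort(fused)[:n_final]]               # final ranked indices
```

For the BEP experiments, `n_final` is set to 2N (N being the class size) before counting the relevant images among the results.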
The results of average BEP are shown in Table 5. It is observed that the retrieval
accuracy for all the eight classes is quite satisfactory.
4.1.3 Retrieval Performance on VisTex Database
We further establish the effectiveness of the proposed framework by conducting
experiments on VisTex database. Here, each texture image is divided into 16 non-
overlapping sub-images of size 128 × 128 pixels. Thus, a total of 640 texture sub-
images, categorized into 40 different classes, are used in our experiments.
We obtain ARR [using Eqs. (25) and (26)] of 61.29 % for the proposed
framework. In the first stage, we filter 25 images from the database of 640 images.
In the second stage, we rank these images based on the fusion of EHD and ART. It
Table 4 Average recall comparison of the proposed framework with method [17] on Wang’s database

Semantic name   Method [17]   Proposed method using ART (in outer disk framework)   Proposed method using ART (in inner disk framework)
Africa people 0.147 0.136 0.138
Beach 0.180 0.120 0.120
Building 0.180 0.126 0.118
Bus 0.138 0.176 0.174
Dinosaurs 0.112 0.200 0.200
Elephants 0.163 0.148 0.148
Flowers 0.127 0.164 0.156
Horses 0.121 0.198 0.198
Mountains 0.190 0.102 0.100
Food 0.157 0.146 0.144
Average recall (%) 15.15 15.16 14.96
is pertinent to mention here that the individual descriptors (i.e. CMs, EHD and
ART) are able to attain an ARR of 55.43, 56.23 and 28.19 % respectively. Clearly,
the proposed framework is able to achieve 5.86, 5.06 and 33.10 % improvement
over CMs, EHD and ART respectively.
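Since Eqs. (25) and (26) are not reproduced here, the sketch below uses a common ARR convention — the fraction of same-class sub-images among the top 16 results, averaged over all queries — as an assumption about the exact formula.

```python
def average_retrieval_rate(ranked_classes_per_query, query_classes, class_size=16):
    """ARR as a percentage: for each query, count same-class images among
    the top `class_size` results, then average over all queries."""
    total = sum(
        sum(1 for c in ranked[:class_size] if c == qc) / class_size
        for ranked, qc in zip(ranked_classes_per_query, query_classes)
    )
    return 100.0 * total / len(query_classes)

# One query whose top 16 contains 8 same-class sub-images gives ARR = 50.0.
```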
4.2 Time Complexity of Proposed Method
The time complexity of the proposed method is based on the complexity of feature
extraction and image retrieval. Since this is a multistage approach, we analyze the
time complexity of both stages separately.
Fig. 6 Average precision comparison of the proposed framework (using ART in the outer and inner disk frameworks) with methods [17] and [19]–[23] on Wang’s database
Fig. 7 The color image retrieval results using the proposed framework for a randomly selected query image (on Wang’s database)
4.2.1 Time Complexity of First Stage
The time complexity of feature extraction for the proposed method in the first stage is
derived from the time taken to compute CMs. The time complexity of computing
CMs is O(M²r), where r is the number of regions into which an M × M image is
divided for the computation of CMs. Further, the distances d_j, for j = H, S, V, are
computed using Eq. (16) with time complexity O(1), and then a comparison of the
distances D_COL computed in Eq. (15) is performed through a sorting algorithm
whose complexity is O(n_d log n_d), where n_d is the number of distances to be
sorted. In the first stage, n_d is the size of the database itself, whereas in the second
stage, it is the number of images extracted in the first stage for a given query. Thus,
the time complexity of the first stage is approximately O(M²r) + O(n_d log n_d). In
this stage, the computation of CMs is the costliest step.
4.2.2 Time Complexity of Second Stage
In the second stage, the time complexity of computing EHD and ART is O(M²r) and
O(M²nm) respectively for an image of size M × M, where n and m are the order and
repetition of ART respectively. Another part of the algorithm is the normalization of
the distances D_ART+EHD computed in this stage; this is a trivial operation with
time complexity O(1). Thus, the time complexity of the second stage is
O(M²r) + O(M²nm) + O(n_d log n_d). Clearly, the time complexity of this stage is
dominated by the ART feature extraction time. This is because we use 16 regions for
computing EHD, whereas we compute 36 ART moments, i.e. n < 3, m < 12. But as
these features are computed for a very small set of images retrieved from the database
(e.g. of the order of 30 for Wang’s database), the time spent in the second stage is
considerably less than the time spent in the first stage.
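The ART computation whose O(M²nm) cost dominates this stage can be sketched as below. The basis functions follow the MPEG-7 ART definition; the Cartesian sampling of the unit disk and the function signature are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def art_moments(img, n_max=3, m_max=12):
    """MPEG-7 style Angular Radial Transform: |F_nm| for n < n_max and
    m < m_max (36 moments for 3 x 12, as used in the text). Basis:
    V_nm(rho, theta) = R_n(rho) * exp(j*m*theta) / (2*pi), with
    R_0(rho) = 1 and R_n(rho) = 2*cos(pi*n*rho) for n > 0."""
    M = img.shape[0]
    ys, xs = np.mgrid[0:M, 0:M]
    # map the pixel grid onto the unit disk centered on the image
    x = (2 * xs - M + 1) / (M - 1.0)
    y = (2 * ys - M + 1) / (M - 1.0)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    feats = np.empty((n_max, m_max))
    for n in range(n_max):
        Rn = np.ones_like(rho) if n == 0 else 2 * np.cos(np.pi * n * rho)
        for m in range(m_max):
            basis = Rn * np.exp(-1j * m * theta) / (2 * np.pi)
            feats[n, m] = abs((img * basis)[inside].sum())
    return feats.ravel()  # 36-dimensional ART descriptor
```

Each of the nm moments requires a pass over all M² pixels, which is exactly the O(M²nm) cost noted above.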
4.2.3 Analysis
We further present the actual CPU time elapsed for feature extraction and image
retrieval in Table 6.
Table 5 Average BEP of the proposed framework on OT-Scene database

Class name   Average BEP (%)
Coast 36.78
Forest 63.11
Highway 60.62
Inside city 51.49
Mountain 38.77
Open country 50.39
Street 61.03
Tall building 42.53
Total average BEP 50.59
It can be verified from this table that the computation of ART features is the
costliest in terms of time. By comparing the feature extraction time computed for
single image, we observe that ART features are computed in almost double the time
required to compute the CMs and EHD features. However, the ART features are
computed in the second stage and the extraction time spent in the second stage is
considerably less (almost one-fourth) than the time spent for computing CMs in the
first stage. This is possible because very few images (e.g. 30 for Wang’s database)
are returned from the first stage, and the computation-intensive ART features are
computed only on this small subset. Thus, the proposed method tactfully avoids the
constraint of high time complexity of computing the ART moments. It can be
observed that the total feature extraction time is dominated by the feature extraction
time of the first stage.
Table 6 shows that the CPU time taken for image retrieval is 4.0145 and 3.9249 s
for Wang’s and VisTex databases respectively. It also reports the retrieval time
elapsed in each stage, which includes the feature extraction time for the entire
database. As far as retrieval time is concerned, since the comparison effort is
greater in the first stage than in the second, the time consumed in first-stage
retrieval is greater than that in the second stage for both databases. Further, we observe
that the time spent in performing the entire retrieval process on Wang’s database is
4.0145 s, which is less than the average retrieval time (4.12 s) reported in [6] by
Wang et al. We have estimated their retrieval time by averaging the time given for
different classes of Wang’s database.
Also, with our proposed framework the average CPU time required for extracting
the features is approximately 3.9859 s (on Wang’s database), which is better than
the feature extraction time reported in [21]. The average CPU time required for
computing the feature vector is approximately 10 s in [21], wherein a new image
retrieval scheme using visually significant point features is proposed. They also find
invariant CMs at the significant image points in RGB domain.
Table 6 CPU time elapsed in feature extraction and retrieval for two different databases
Wang’s database VisTex database
Feature extraction time (in s) for single image
Color moments 0.0445 0.0410
EHD 0.0468 0.0453
ART 0.0867 0.0851
Feature extraction time (in s) for full database
First stage 2.9912 2.7912
Second stage 0.9947 0.9043
Total time 3.9859 3.6955
Retrieval time (in s) for full database
First stage 3.0056 2.9927
Second stage 1.0089 0.9322
Total time 4.0145 3.9249
5 Conclusions and Future Directions
In this paper, we present a novel framework using combination of low level features
in a multistage manner to improve the retrieval accuracy of image retrieval system.
Firstly, we retrieve images using CMs and then apply ART and EHD on the images
filtered in the first stage. Experimental results prove that the proposed scheme
performs exceptionally well and is robust in comparison to conventional hybrid
frameworks in terms of retrieval accuracy computed on Wang’s, VisTex and OT-
Scene databases. Based on the performance analysis, the following conclusions can be
drawn:
1. The retrieval accuracy computed in terms of average precision is 75.80 % on
Wang’s database which is better than many existing hybrid frameworks.
2. In spite of its multistage retrieval, the proposed framework is also observed to be
efficient in terms of time complexity.
3. For all the different databases, the performance of proposed framework is better
than that of the individual descriptors i.e. CMs, ART and EHD alone.
In the future, we would like to extend this framework through the use of a variety
of fusion methods (e.g. feature level fusion) and distance measures in order to
further enhance the accuracy of image retrieval.
Acknowledgments Two of the authors are thankful to South Asian University, New Delhi for financial
support during their research work. We are also extremely grateful to the anonymous reviewers for their
valuable comments that helped us to enormously improve the quality of the paper.
References
1. Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the
new age. ACM Computing Surveys, 40, 1–60.
2. Brunelli, R., & Mich, O. (2008). Histograms analysis for image retrieval. Pattern Recognition, 34,
1625–1637.
3. Rasheed, W., An, Y., Pan, S., Jeong, I., Park, J., & Kang, J. (2008). Image retrieval using maximum
frequency of local histogram based color correlogram. In Second Asia international conference on
modeling & simulation (pp. 322–326).
4. Huang, J., Kumar, S. R., Mitra, M., Zhu, W.-J., & Zabih, R. (1997). Image indexing using color
correlograms, In Proceedings of IEEE conference on computer vision and pattern recognition (pp.
762–768).
5. Lu, T.-C., & Chang, C.-C. (2007). Color image retrieval technique based on color features and
image bitmap. Information Processing and Management, 43, 461–472.
6. Wang, X.-Y., Yu, Y.-J., & Yang, H.-Y. (2011). An effective image retrieval scheme using color,
texture and shape features. Computer Standards & Interfaces, 33, 59–68.
7. Park, D. K., Jeon, Y. S., & Won, C. S. (2000). Efficient use of local edge histogram descriptor, In
Proceedings of the 2000 ACM workshops on multimedia (pp. 51–54).
8. Kim, W. Y., & Kim, Y. S. (2000). A region based shape descriptor using Zernike moments. Journal
of Signal Processing: Image Communication, 16, 95–102.
9. Amanatiadis, A., Kaburlasos, V. G., Gasteratos, A., & Papadakis, S. E. (2011). Evaluation of shape
descriptors for shape-based image retrieval. Image Processing, 5, 493–499.
10. Pooja, C. S. (2012). An effective image retrieval system using region and contour based features. In
IJCA proceedings on international conference on recent advances and future trends in information
technology (pp. 7–12).
11. Singh, S. M., & Hemachandran, K. (2012). Content-based image retrieval using color moment and
gabor texture feature. IJCSI International Journal of Computer Science, 9, 299–309.
12. Pooja, C. S. (2012). An effective image retrieval using the fusion of global and local transforms based
features. Optics & Laser Technology, 44, 2249–2259.
13. Goyal, A., & Walia, E. (2012). An analysis of shape based image retrieval using variants of Zernike
moments as features. International Journal of Imaging and Robotics, 7, 44–69.
14. Zhang, D., & Lu, G. (2002). Shape-based image retrieval using generic Fourier descriptor. Signal
Processing: Image Communication, 17, 825–848.
15. Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated matching
for picture libraries. IEEE Transaction on Pattern Analysis and Machine Intelligence, 23, 947–963.
16. ElAlami, M. E. (2011). A novel image retrieval model based on the most relevant features.
Knowledge-Based Systems, 24, 23–32.
17. Kang, J., & Zhang, W. (2012). A framework for image retrieval with hybrid features. In 24th Chinese
control and decision conference (CCDC) (pp. 1326–1330).
18. Hiremath, P. S., & Pujari, J. (2007). Content based image retrieval using color, texture and shape
features. In International conference on advanced computing and communications (pp. 780–784).
19. Huang, Z.-C., Chan, P. P. K., Ng, W. W. Y., & Yeung, D. S. (2010). Content-based image retrieval
using color moment and Gabor texture feature. In International conference on machine learning and
cybernetics (pp. 719–724).
20. Yue, J., Li, Z., Liu, L., & Fu, Z. (2011). Content-based image retrieval using color and texture fused
features. Mathematical and Computer Modeling, 54, 1121–1127.
21. Banerjee, M., Kundu, M. K., & Maji, P. (2009). Content-based image retrieval using visually sig-
nificant point features. Fuzzy Sets and Systems, 160, 3323–3341.
22. Jalab, H. A. (2011). Image retrieval system based on color layout descriptor and Gabor filters. In
IEEE conference on open systems (ICOS) (pp. 32–36).
23. Liu, G.-H., & Yang, J.-Y. (2013). Content-based image retrieval using color difference histogram.
Pattern Recognition, 46, 188–198.
24. Gong, M., Li, H., & Cao, W. (2013). Moment invariants to affine transformation of colors. Pattern
Recognition Letters, 34, 1240–1251.
25. Mindru, F., Tuytelaars, T., Gool, L. V., & Moons, T. (2004). Moment invariants for recognition
under changing viewpoint and illumination. Computer Vision and Image Understanding, 94, 3–27.
26. Manjunath, B. S., Ohm, J. R., & Vasudevan, V. V. (2001). Color and texture descriptors. IEEE
Transactions on Circuits and Systems for Video Technology, 11, 703–715.
27. Jain, A., Nandakumar, K., & Ross, A. (2005). Score normalization in multimodal biometric systems.
Pattern Recognition, 38, 2270–2285.
28. Guo, J. M., Prasetyo, H., & Su, H. S. (2013). Image indexing using the color and bit pattern feature
fusion. Visual Communication and Image Representation, 24, 1360–1379.
29. Wang, X.-Y., Yang, H.-Y., & Li, D.-M. (2013). A new content-based image retrieval technique using
color and texture information. Computers & Electrical Engineering, 39(3), 746–761.
30. Alexandre D. S., & Tavares, J. M. R. S. (2010). Introduction of human perception in visualization.
International Journal of Imaging and Robotics, 4, 60–70.