INTERNATIONAL JOURNAL OF INFORMATION AND SYSTEMS SCIENCES
Volume 3, Number 3, Pages 349–364
© 2007 Institute for Scientific Computing and Information

VISUAL INFORMATION PROCESSING AND CONTENT MANAGEMENT: AN OVERVIEW

PHILIP O. OGUNBONA

Abstract. Visual information processing and the management of visual content have become a significant part of the contemporary economy. The visual information processing pipeline is divided into several modules: (i) capture and enhancement, (ii) efficient representation for storage and transmission, (iii) processing for efficient and secure distribution, and (iv) representation for efficient archiving and retrieval. Advances in semiconductor technology and optimum signal processing models and algorithms provide tools to improve each module of the processing pipeline. Insight from other areas of study, including psychology, augments and informs the models being developed to understand and design efficient visual content management systems. The paper provides a brief overview of the modules in the pipeline in one place for easy reference.

Key Words. active pixel sensor, CMOS image sensor, CCD, image processing, image coding, video coding, image retrieval, image watermarking.

1. Introduction

Visual information capture, processing, storage, transmission and distribution have become a viable commercial workflow due to significant advances in semiconductor technology and the development of intelligent digital signal processing algorithms. The ability to capture and process digital visual information has found application in diverse areas including space exploration, video surveillance, industrial monitoring, medical imaging for diagnosis, mining, advertising and filming. More recently, the ready availability of professional and consumer image and video capture devices has created a community of visual information content creators. At the same time, the digital nature of visual information and the availability of numerous powerful processing algorithms and computers have placed the power of digital image processing in the hands of professionals and consumers.

In this paper we present an overview of the theoretical principles of visual information processing and the management of visual content. The visual information processing pipeline can be divided into several modules: (i) capture and enhancement, (ii) efficient representation for storage and transmission, (iii) processing for efficient and secure distribution, and (iv) representation for efficient archiving and retrieval. The paper is divided accordingly into four sections. Section 2 explores the principles of image capture and the technical problems associated with the process. We present and review some of the solutions available in contemporary literature. We also present possible future trends. In Section 3, we proceed to explore the problem of image and video coding. In particular we review some of the classical methods of signal representation and the more contemporary multi-resolution approach.

Received by the editors January 1, 2004, and, in revised form, March 22, 2004.
2000 Mathematics Subject Classification. 35R35, 49J40, 60G40.


Orthogonal transform and wavelet techniques of signal representation are reviewed. In Section 4, we review the problem of visual content distribution from the viewpoint of the enabling technologies. In particular, we review techniques developed for copyright protection, tamper detection, etc. The important problem of image and video archiving and retrieval is reviewed in Section 5. In particular, we present the problem from the viewpoint of a multilevel description that starts at the pixel level and ends at the conceptual or semantic level. The problem of the semantic gap is articulated and techniques available to bridge the gap are reviewed. In the concluding section we summarize the trends that have emerged from this overview.

2. Image Capture and Enhancement

The pipeline of processes in a simplified single-sensor digital still camera, shown in Figure 1, comprises an image sensor followed by a series of image enhancement modules that generate the finished image.

Figure 1. A simplified single sensor camera.

The sensor employs a colour filter array (CFA) to separate incoming light into a specific spatial arrangement of colour components. One of the possible patterns is the Red-Green-Blue (RGB) Bayer CFA mosaic [1] shown in Figure 2. Colour interpolation is employed to estimate the missing colours in the mosaic, a process referred to as de-mosaicing. The image obtained through the colour filters needs to be corrected for the white-point and colour saturation in order to reproduce the colour of the original scene with high fidelity and the expected human visual response. It is interesting to note that the white-point and colour correction are dependent on the illuminant. Gamma correction is performed to compensate for the nonlinearity of viewing and printing devices. The automatic exposure module is coupled with the sensor and used to dynamically adjust the integration time of the sensor and produce correctly exposed images.

R G R G ···
G B G B ···
R G R G ···
G B G B ···
··· ··· ··· ···

Figure 2. A Bayer pattern of pixels.
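A minimal sketch of bilinear de-mosaicing for this layout may clarify the idea (a baseline illustration assuming the Bayer geometry of Figure 2; production pipelines use more sophisticated, edge-aware interpolation such as the local-correlation approach of [14]):

```python
import numpy as np

def demosaic_bilinear(raw: np.ndarray) -> np.ndarray:
    """Bilinear de-mosaicing of a Bayer mosaic (R at even/even, B at odd/odd,
    G elsewhere, matching Figure 2). raw is a 2-D array of sensor samples."""
    h, w = raw.shape
    rgb = np.zeros((h, w, 3))
    masks = np.zeros((h, w, 3), dtype=bool)
    masks[0::2, 0::2, 0] = True                         # R locations
    masks[0::2, 1::2, 1] = masks[1::2, 0::2, 1] = True  # G locations
    masks[1::2, 1::2, 2] = True                         # B locations
    weight = np.array([[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]])
    for c in range(3):
        plane = np.where(masks[..., c], raw, 0.0)
        # Normalized convolution: weighted sum of known samples in the 3x3
        # neighbourhood divided by the weighted count of known positions.
        num = sum(weight[i, j] * np.roll(plane, (i - 1, j - 1), (0, 1))
                  for i in range(3) for j in range(3))
        den = sum(weight[i, j] * np.roll(masks[..., c], (i - 1, j - 1), (0, 1))
                  for i in range(3) for j in range(3))
        rgb[..., c] = num / np.maximum(den, 1e-9)
    return rgb

# Sanity check on a flat grey scene: every channel recovers the constant value
mosaic = np.full((8, 8), 128.0)
print(np.allclose(demosaic_bilinear(mosaic), 128.0))  # True
```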

Image or video capture devices are developed using CMOS- or CCD-based image sensor technology. The development of the active pixel sensor (APS) as a replacement for the passive CMOS imager made CMOS competitive with CCD-based imagers, especially in terms of low power dissipation, scaling and the possibility of camera system integration [5, 6]. In CMOS technology, both photodiode and photogate (also PIN photodiode) have been employed as the photodetector. The pixel consists of the photodetector in addition to the readout, amplification, row select and reset transistors. CCD-based image sensors are typically classified into frame-transfer (FT), interline-transfer (IT), and virtual-phase (VP) architectures [8]. In general the CCD image sensor is made up of an image area, a horizontal CCD, and an output amplifier. The image area includes photodiodes and a vertical CCD that capture the illumination and transfer charges to the horizontal CCD and detecting amplifiers [7]. Essentially the photodetector and the associated electronics are responsible for amplification, analog-to-digital conversion and digital readout. In the photodetector, photons are absorbed in the semiconductor material and the absorbed energy releases an electron-hole pair. The integrated charge at the site of minority carrier collection is proportional to the product of the amount of light falling on the pixel and the exposure time [2].

Advances in both CMOS and CCD based image sensors have been made in terms of improvements in the basic performance indicators. Some of these include dark current noise, conversion gain, optical sensitivity, dynamic range, fixed-pattern noise, quantum efficiency and pixel cross-talk. The improvements have been gained from better circuit design and process innovations [3, 4, 5, 6, 12]. Early passive CMOS sensors were beset with fixed pattern noise caused by device mismatch. This has been suppressed with the development of the active pixel sensor architecture and the use of correlated double sampling to suppress the so-called 1/f and kT/C noise figures. Correlated double sampling as a signal processing method has been used successfully in both CMOS and CCD readout circuits. The dark current noise is the statistical fluctuation of the electrons created in the photodetector independent of the light falling on the detector but linearly correlated with the integration time. There is also a portion of the fixed pattern noise that is attributable to dark current nonuniformity. The dark current noise performance of CMOS sensors is poorer than that of CCDs [5]. Techniques used to improve the dark current noise figure include charge pumping and surface pinning. The fill factor (FF) is the ratio of the sensitive area to the total pixel area. In CMOS active pixel sensors, where there is more circuitry than in CCD based sensors, achieving a high fill factor is a challenge. The effective quantum efficiency (QE) is reduced by the fill factor, QE_eff = FF × QE. It is expected that the quantum efficiency can be improved as feature dimensions of CMOS technology shrink and increased fill factor is attained [5]. Conversion gain refers to the charge-to-voltage conversion efficiency of the sensor, and the design of photodiode based CMOS sensors leads to a lower figure of merit than photogate sensors. In general, photodiode designs are more sensitive to visible light, especially in the short-wavelength region of the spectrum. Photogate devices usually have larger pixel areas, but a lower fill factor and much poorer short-wavelength light response than photodiodes. It is possible for photons incident on one pixel to generate carriers that are collected by a different pixel, a phenomenon referred to as pixel cross-talk. The effect of pixel cross-talk is to reduce image sharpness and degrade colorimetric accuracy. The dynamic ranges of CMOS and CCD sensors are determined by different factors. In CCDs, the factors that enter the defining equation include the capacity of the CCD stage, conversion gain, dark current noise and total r.m.s. read noise, whereas the CMOS dynamic range depends on the threshold voltage of the n-device, the combined gain of the pixel and column source followers, and the conversion gain.


3. Efficient Representation for Storage and Transmission

Typically, the spatial resolution of images generated from the capture process is on the order of several million pixels, with each pixel having a depth of 8–12 bits. Images of this size dictate large storage requirements and high transfer rates to be useful in a practical imaging system. The theory of source coding provides a rich set of techniques that have led to the development of efficient compression algorithms. In this paper we are interested in source coding techniques that allow lossy compression because they provide high compression ratios. The performance of such algorithms is not only measured in terms of the compression ratio (or bit rate) but is also based on some measure of the degree of fidelity retained in the reconstructed image. The rate-distortion theorem (cite) gives bounds on the design choices available in selecting the parameters of a practical compression system.

The starting point is the modelling of the captured image, and this can be conveniently achieved by representing it as a sample of a stochastic process. Such a process is completely described by knowledge of its joint probability density. One of the major stochastic models that have found application in image processing is the covariance model in one and two dimensions [15, 16]. The covariance model has formed the basis of very efficient compression algorithms that employ unitary transforms. We refer to these algorithms as transform coders.

For a one-dimensional sequence, {u(n), 0 ≤ n ≤ N − 1}, of image pixels represented as a vector u of size N, a unitary transformation is given by [15],

(1)  v = Au;   v(k) = \sum_{n=0}^{N-1} a(k, n) u(n),   0 ≤ k ≤ N − 1

where the unitary property of the transformation, A, implies A^{-1} = A^{*T}, and thus the inverse transformation is conveniently given as,

(2)  u = A^{*T} v;   u(n) = \sum_{k=0}^{N-1} v(k) a^{*}(k, n),   0 ≤ n ≤ N − 1

It is the energy compaction property of unitary transforms that makes them very useful for image and video compression. If µ_u and R_u denote the mean and covariance of the vector of image pixels, u, the transformed vector v has mean and covariance given by,

(3)  µ_v = A µ_u
(4)  R_v = A R_u A^{*T}

Furthermore, if the components of the vector are highly correlated, the coefficients of the transformation are usually uncorrelated and the structure of the covariance matrix is such that the off-diagonal terms are small compared to the diagonal terms. The Karhunen-Loeve (KL) transform is optimum in terms of the energy packing property, in that the representation of the vector in the transformed domain has most of the energy packed in the fewest number of coefficients. This representation presents an opportunity for compression, since any truncation of the coefficients yields an optimum representation based on the retained coefficients. Ahmed et al. [17] have shown that the performance of the discrete cosine transform (DCT) compares favourably with that of the Karhunen-Loeve transform, in terms of energy packing, for data modeled as a Markov source with high inter-sample correlation.
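As a small illustration of Equations (1)–(4) and the energy packing claim, the sketch below (an assumption-laden toy example: numpy, an 8-point DCT, and an AR(1) covariance model R_u(i, j) = ρ^{|i−j|}) builds the orthonormal DCT matrix, checks the unitary property, and shows that the diagonal of R_v concentrates most of the energy in the first few coefficients.

```python
import numpy as np

N = 8
# Orthonormal DCT-II matrix: a(k, n) = c(k) sqrt(2/N) cos(pi (2n + 1) k / (2N))
n = np.arange(N)
A = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
A[0, :] /= np.sqrt(2)
A *= np.sqrt(2.0 / N)

# Unitary (here real orthogonal) property of Equation (2): A^{-1} = A^T
assert np.allclose(A @ A.T, np.eye(N))

# Covariance of a first-order Markov source: R_u(i, j) = rho^|i - j|
rho = 0.95
Ru = rho ** np.abs(n[:, None] - n[None, :])

# Equation (4): covariance in the transform domain
Rv = A @ Ru @ A.T

# Energy compaction: coefficient variances (diagonal of R_v) decay rapidly
print(np.diag(Rv).round(3))
```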


Figure 3. A typical transform coding system.

Also, the rate-distortion performances are found to be comparable [17, 15]. In [18] it was also shown that the DCT performs best compared to the Walsh-Hadamard and Haar transforms when encoding step changes with arbitrary phase in images. Smooth areas of images have very high correlation and are amenable to high compression because the resulting AC components of the DCT coefficients carry little energy and can be truncated. The edges in images are well represented and thus lead to reasonable reconstructed images at moderate bit rates. There are noticeable ringing artifacts around edges at very low bit rates [18]. The DCT has been successful as the transform of choice in the JPEG, MPEG and H.263 coders because of its good performance at most practical bit rates and the availability of a fast implementation. A typical transform coding system is shown in Figure 3. It is important to stress the fact that the compression is achieved through efficient quantization of the transform coefficients.
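A minimal transform-coding loop in the spirit of Figure 3 might look like the following sketch (a toy example with a plain uniform scalar quantizer; practical coders such as JPEG add perceptually weighted quantization tables and entropy coding of the quantized coefficients):

```python
import numpy as np

def dct_matrix(N: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of size N x N."""
    n = np.arange(N)
    A = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    A[0, :] /= np.sqrt(2)
    return A * np.sqrt(2.0 / N)

def code_block(block: np.ndarray, step: float) -> np.ndarray:
    """Transform-code one square block: 2-D DCT, uniform quantization, inverse."""
    A = dct_matrix(block.shape[0])
    coeffs = A @ block @ A.T            # forward separable 2-D transform
    q = np.round(coeffs / step)         # uniform scalar quantizer
    return A.T @ (q * step) @ A         # dequantize and invert

# A smooth 8x8 ramp block: coarse quantization introduces only small error
x = np.add.outer(np.arange(8.0), np.arange(8.0)) * 8
y = code_block(x, step=16.0)
print(np.abs(x - y).max())              # small reconstruction error
```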

The problem of signal representation for subsequent encoding is related to the efficient representation of the features of the signal. A multiresolution analysis allows the representation of the details of an image at different scales and locations. The development and usefulness of the wavelet transform stem from the inadequacy of analysis techniques based on the Fourier series or Fourier transform. We begin the introduction of the wavelet transform with the idea of a scaling function. We define a set of scaling functions in terms of integer translates of a basic scaling function [19],

(5)  ϕ_k(t) = ϕ(t − k),   k ∈ Z and ϕ ∈ L²

where L² is the space of square integrable functions and Z is the set of integers. For −∞ < k < ∞ these functions generate functions in a subspace of L², and in general the scaling functions can be made to span arbitrarily sized spaces by scaling and translating, as ϕ_{j,k}(t) = 2^{j/2} ϕ(2^j t − k). In a multiresolution development the nesting of the spaces spanned by the scaling function gives rise to a scaled representation of the functions [19]. It turns out that the differences between the spaces spanned by the scaling functions can also provide a description of the signal. A wavelet, ψ_{j,k}(t), spans the differences between the spaces spanned by the scaling functions. We note that the wavelets themselves can be described in terms of the scaling functions. Thus the scaling function and the wavelet provide a means of representing any function g(t) ∈ L²(R) [19],

(6)  g(t) = \sum_{k=-\infty}^{\infty} c(k) ϕ_k(t) + \sum_{j=0}^{\infty} \sum_{k=-\infty}^{\infty} d(j, k) ψ_{j,k}(t)


where the coefficients c(k) and d(j, k) are appropriately determined by inner products if the expansion of Equation (6) represents an orthonormal basis. There is also an alternative development of wavelets in terms of filterbanks [19, 20, 21], and this has formed the basis of several practical implementations and presentations of many properties of the wavelet. The representation given by wavelets forms an unconditional basis and thus leads to a sparse representation in which the coefficients drop off rapidly as j and k increase. It is important to note that the DCT and the wavelet transform capture different features of the image. The DCT is good at capturing the oscillating features of an image while the wavelet is better at capturing point singularities. Wavelets have formed the basis of more recent image compression standards such as JPEG2000 [22]. Several properties of wavelets have been exploited to great advantage to achieve embedded bitstream presentation of encoded images and provide scalability in both spatial resolution and bit rate (signal-to-noise ratio). The fact that wavelets are able to capture point singularities leaves room for further exploration, and there has been continued activity in the research community [23]-[29] to better capture other features including lines and curvatures in the image. The sparse representation provided by both the DCT and the wavelet family of transforms allows efficient quantization and compression of images. In general the multiresolution representation allows a mapping of the wavelet coefficients into a quadtree structure that captures the respective image features. Wavelet-based compression algorithms such as embedded zerotree wavelet (EZW) [30], SPIHT [31], and space frequency quantization (SFQ) [32] exploit this sparsity to achieve compression without incurring significant distortion at very low bit rates.
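The sparsity claim can be made concrete with a single level of the Haar filterbank (a minimal sketch of the filterbank view in [19, 20, 21]; the toy step signal is an illustrative assumption): a piecewise-constant signal with one step edge produces exactly one nonzero detail coefficient.

```python
import numpy as np

def haar_dwt(x: np.ndarray):
    """One level of the orthonormal Haar wavelet transform.

    Returns (approximation, detail): the scaling-function coefficients c(k)
    and wavelet coefficients d(k) of Equation (6) for the Haar basis.
    """
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

# A piecewise-constant signal with a single step (a point singularity)
x = np.concatenate([np.ones(31), 5 * np.ones(33)])
approx, detail = haar_dwt(x)

# Sparsity: all detail coefficients vanish except the one at the step
print(np.count_nonzero(detail), "of", detail.size, "detail coefficients nonzero")
```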

4. Visual Content Distribution

Efficient compression of images has facilitated storage and transmission across heterogeneous networks. In particular, the embedded bitstream of encoded JPEG2000 images and the progressive encoding of JPEG images allow images to be transmitted and viewed over low bandwidth networks. Despite the utility provided by compression, the distribution of visual content for commercial purposes requires guarantees on the preservation of the copyrights of the owners at different levels. Encryption techniques are only useful in guaranteeing the reception of transmitted images by the intended recipient. Once received, the images can be easily replicated and re-distributed. We can identify at least three areas of rights management: authentication (or verification), proof of ownership and covert communication (or steganography). Efforts to meet these requirements through the development of various digital watermarking algorithms have engaged the research community over the last decade. It is interesting to note that digital watermarks are inserted in images or video prior to compression and must at least survive various types and levels of compression, in addition to other image processing operations that the image or video might undergo. In [33], a digital watermark was defined as a set of secondary digital data that is embedded into a primary digital image (also called the host signal) by modifying the pixel values of the primary image. Digital watermarks were further classified according to their appearance and application domain. Three classes were identified [33]: (i) visible, (ii) invisible-robust and (iii) invisible-fragile. Each of these categories imposes requirements on the process of embedding, visibility (or imperceptibility) and robustness. A visible watermark should be obvious in both colour and monochrome images and even visible to people with colour blindness. However, the watermark should not be such that it obscures the image being watermarked.


Figure 4. A Generic Watermark Encoder [34].

Figure 5. A Generic Watermark Decoder [34].

Additionally, the watermark must be difficult to remove. Both invisible-robust and invisible-fragile watermarks must not introduce noticeable artifacts into the watermarked images. Robustness to standard image processing operations is a very important requirement for invisible-robust watermarks, while security of the watermark is the most essential requirement in the class of invisible-fragile watermarks. The amount of watermark that can be embedded in an image is related to the robustness and also correlates with the degree of impairment, and consequently the visibility, of the watermark. These three constraints, viz. robustness, capacity and imperceptibility, have guided the design and evaluation of watermarks. From the viewpoint of extraction we can categorize watermarking into (i) blind, (ii) semi-blind and (iii) non-blind techniques.

More formally, digital watermarking (Figure 4) is the embedding of a given digital signal, w, into another signal, C_o, called the cover signal or host signal, such that its presence is imperceptible. A secret key, K, is employed to ensure security of the output watermarked signal, C_w.

The watermark decoder (Figure 5) can employ the marked and possibly manipulated signal, C_w, the original host signal, C_o, the watermark, w, and the key, K, to produce an estimate of the watermark, w′. The manipulation is often thought of as an attack on the watermark to render it useless for the intended purpose.

This formulation can be written as [34]:

• E_K : O × K × W → O;   E_K(c_o, w) = c_w
• D_K : O × K → W
• C_τ : W² → {0, 1};   C_τ(w′, w) = 1 if c ≥ τ, and 0 if c < τ

where O is the set of all images (marked and unmarked), K is the set of possible keys and W denotes the set of watermark signals. E_K and D_K denote the encoding (or embedding) and decoding functions, respectively. The comparator function, C_τ, compares the extracted watermark with the embedded watermark using a threshold, τ.
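A common instantiation of the comparator C_τ (an assumption on our part; [34] leaves the similarity score c abstract) is the normalized correlation between the extracted and reference watermarks, thresholded at τ:

```python
import numpy as np

def comparator(w_extracted: np.ndarray, w: np.ndarray, tau: float = 0.7) -> int:
    """C_tau: decide whether the extracted watermark matches the reference.

    The similarity score c is the normalized correlation of the two signals.
    """
    c = np.dot(w_extracted, w) / (np.linalg.norm(w_extracted) * np.linalg.norm(w))
    return 1 if c >= tau else 0

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)                    # reference watermark
noisy = w + 0.5 * rng.standard_normal(1024)      # watermark surviving an attack
other = rng.standard_normal(1024)                # unrelated signal
print(comparator(noisy, w), comparator(other, w))  # expected: 1 0
```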


Figure 6. A Generic watermark embedding model of [35].

Figure 7. Equivalent super-channel model [35].

A somewhat unifying model has been proposed by Chen and Wornell [35] based on the role played by the host signal in the watermark extraction scheme. The model starts as the simple information embedding process of Figure 6. An equivalent model that allows the interpretation of the information embedding process as a cascade (Figure 7) of an encoder and a super channel (the super channel in turn is a cascade of an adder and a true channel) is derived in [35, 36]. The host signal C_o is interpreted as a state of the super channel that is known at the encoder. In particular this model allows the categorization of watermarking schemes as either host-interference non-rejecting or host-interference rejecting, and thus depicts the host signal as a source of interference.

In general, knowledge of the presence of the host signal can be exploited at the encoder to produce the class of host-interference rejecting methods. On the other hand, Chen and Wornell [35, 36] showed that a large number of techniques presented in the literature can be categorized as host-interference non-rejecting because they do not exploit the presence of the host at the encoder. Such methods include those that have been described as spread spectrum based watermarking [37]-[40]. The model of [35] also provides a framework to evaluate and compare the performance of these embedding techniques. The performance of spread spectrum techniques is severely limited, and it is only in the non-blind watermark extraction regime (host available at the decoder) that the interference of the host can be obviated [38, 41].

A particularly attractive class of host-interference rejecting watermarking methods is quantization index modulation (QIM), where the embedding function E(C_o, w) is a set of functions of the host signal C_o, indexed by the watermark, w. In order to achieve high imperceptibility the functions should be chosen so that the distortion introduced is minimized. For example, they can be designed as optimal lattice quantizers that minimize some error criterion.


If the reconstruction points of the indexed quantizers do not intersect, QIM yields the required host signal non-interference property. This implementation of QIM is referred to as dither modulation. Another implementation, distortion-compensated QIM, provides post-quantization processing to improve the achievable distortion-robustness tradeoff [35]. Several variations of QIM have been proposed in the literature, with much attention paid to the embedding process. In [43] a set theoretic framework for QIM embedding was introduced in a semi-fragile watermarking scheme that is claimed to be both visually adaptive and tolerant to compression. The mean around randomly selected locations of the image is quantized by quantizers designed to meet detectability constraints. A sliding window embedding scheme that applies local average quantization index modulation to achieve geometric attack robustness was proposed in [42]. Qian and Cox [44] proposed an embedding method based on a human perceptual model. They employed an adaptive quantizer with step size guided by Watson's perceptual model. In addition, Watson's model was modified to allow for the amplitude scaling attack.
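A scalar dither-modulation sketch may help fix ideas (a minimal illustration of the QIM principle, not the exact construction of [35]; the step Δ and the noise level are illustrative): each message bit selects one of two interleaved uniform quantizer lattices, and the blind decoder picks the bit whose lattice lies closer to the received sample.

```python
import numpy as np

DELTA = 8.0  # quantizer step: larger means more robustness but more distortion

def qim_embed(x: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Embed one bit per sample: bit b shifts the quantizer lattice by b*DELTA/2."""
    d = bits * (DELTA / 2.0)
    return np.round((x - d) / DELTA) * DELTA + d

def qim_decode(y: np.ndarray) -> np.ndarray:
    """Blind decoding: choose the bit whose lattice point is nearest to y."""
    err0 = np.abs(y - np.round(y / DELTA) * DELTA)
    err1 = np.abs(y - (np.round((y - DELTA / 2) / DELTA) * DELTA + DELTA / 2))
    return (err1 < err0).astype(int)

rng = np.random.default_rng(1)
host = 100 * rng.standard_normal(1000)                # host signal samples
bits = rng.integers(0, 2, 1000)                       # watermark bits
marked = qim_embed(host, bits)
attacked = marked + rng.uniform(-DELTA / 4, DELTA / 4, 1000)  # bounded noise
print(np.mean(qim_decode(attacked) == bits))          # 1.0: all bits recovered
```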

There have also been efforts to improve the design of the detector to provide robustness against attacks including JPEG compression, geometric distortion, etc. Wu et al. [45, 46] developed and introduced a multilevel embedding scheme that allows the amount of embedded information that can be reliably extracted to be adaptive to the noise introduced by the attack. If knowledge of the statistics of the embedded watermark is available and the watermark satisfies a smoothness constraint, a maximum a posteriori (MAP) detector was shown in [47] to give robust detection of the watermark in the presence of several attacks, notably JPEG compression. Figure 8 shows the original Lena image and the watermarked version using the method described in [47]. The embedded watermark, which is a binary logo, is depicted in Figure 9.

Figure 8. Original (left) and the watermarked Lena, PSNR = 38.8 dB.

The MAP formulation of the detector [47] was compared with the conventional minimum distance decoder used in QIM watermarking schemes. The results shown in Figure 10 attest to the efficacy of the method.

The watermarking model of Figure 4 admits a secret key whose function is to determine certain parameters of the embedding function.


Figure 9. The original logo

Figure 10. MD (left) and MAP (right) decoded logos from the JPEG 50% compressed watermarked Lena.

This gives rise to such schemes as key-dependent transforms, where the attacker does not have knowledge of the transform domain in which the embedding took place [49]. In general, the secret key, K, enters the watermarking process as the input to a function that generates the secret parameters of the embedding or decoding processes [48]. Parameters of interest include the embedding domain, the codebook, the indices of transform coefficients to be watermarked, the direction of the watermarked space, etc. Perez-Freire et al. [48] identified three types of attacks, namely (1) blind watermark removal, (2) key estimation and (3) tampering. This categorization should be considered in the light of intentional and non-intentional (or incidental) attacks. A detailed survey of the evaluation of the security of watermarks, and results on the major techniques, is given in [48].


5. Representation for Efficient Archiving and Retrieval

In Section 3 we showed that wavelets and orthogonal transforms such as the DCT can efficiently describe important features of an image and lead to efficient compression schemes. Visual information retrieval relies on the availability of a representation suitable for description and query formulation. The basic representation of an image (or video) as an arrangement of pixels characterized by some statistical distribution has been exploited successfully in image compression and some pattern recognition problems. This representation is at one end of the scale of descriptions that an image (or video) can be given. Humans will usually employ a semantic or conceptual description when retrieving images or video. The diagram in Figure 11 depicts levels of representation of an image and the ease of utilization by computers and humans as we move up toward the apex of the pyramid. Pixel representation is amenable to computer manipulation but difficult for human usage. While semantic description is natural for humans, it is difficult to infer this form of description computationally because the mapping from an image to a semantic description is one-to-many. Largely, description of an image by humans is both subjective and contextual.

Figure 11. Pyramid of image representation.

Text-based description has played a major role in the retrieval of images and videos from databases, but its efficacy in capturing all relevant semantic descriptions is questionable. This is more so when the multiplicity of context and meaning that can be associated with an image is considered. Pattern recognition and computer vision methods are being employed to extract features and provide description at levels of the representation pyramid (Figure 11) that are closer to human usage. For example, segmentation of an image into homogeneous regions based on colour offers a representation akin to how humans might describe an image made of homogeneous colour patches. In essence these descriptions are intended to bridge the gap between the lower levels of description and the higher level semantic descriptions. This problem has been termed the semantic gap. Methods based on the description of the contents of an image using low level features have given rise to content-based retrieval schemes. Over the last decade there has been a spate of research activity aimed at devising descriptions and indexing techniques suitable for image and video retrieval [50, 51, 52].


Statistical features based on the colour, texture, shape and motion attributes of images and videos have found application in the various content-based retrieval systems. A categorization of content-based retrieval schemes distinguishes among (i) template matching, (ii) global feature matching, and (iii) local feature matching. The underlying assumption of each category determines its applicability and efficiency. Template matching based techniques are limited by possible changes in the orientation and size of target objects. Most content-based retrieval schemes employ global features, in that the selected features are computed over the whole image. This has the disadvantage of not being able to localize and adequately describe local features. Taking the viewpoint of the user-specified query, content-based retrieval systems can be categorized [53] as (i) target search, in which the user wishes to retrieve a specific image, (ii) category search, in which a set of images from a category is to be retrieved, and (iii) open-ended browsing, which allows the user to specify salient visual features that describe the target image.

Figure 12 depicts a generic content-based image retrieval system [52]. It is made up of two subsystems, namely database generation and database query processing. Perhaps a key point to make here is that the features extracted from the images or videos are appropriately selected to match the application at hand. The database generation subsystem takes images or video data as input and extracts the selected features that are used to generate the associated indexed metadata. On the other hand, the database query processing subsystem takes user input and employs the associated features to index into the database content. The features could be generated from an example image (as in query by example, QBE) or from user-selected features.
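A minimal sketch of the QBE path through such a system is shown below (colour histograms as the selected feature and Euclidean distance as the similarity measure are illustrative assumptions on our part, not prescriptions of [52]):

```python
import numpy as np

def colour_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Global feature: concatenated, normalized per-channel histograms."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(feats).astype(float)
    return h / h.sum()

# Database generation: extract and index the feature of each stored image
rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(100)]
index = np.stack([colour_histogram(img) for img in database])

# Query processing (QBE): rank database images by distance to the example
query = colour_histogram(database[42])
distances = np.linalg.norm(index - query, axis=1)
print(np.argsort(distances)[:5])   # five best matches; image 42 ranks first
```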

The feature extraction process transforms the image or video into a point in the feature space, which is assumed to be metric, and the retrieval problem becomes one of similarity measurement in the feature space. In [54] a distinction is made between perceived dissimilarity (or similarity) and judged dissimilarity. Assume we have two image stimuli S_A and S_B, and denote their perceived similarity by s(S_A, S_B) and their perceived dissimilarity by d(S_A, S_B). If we denote the judged dissimilarity between S_A and S_B as δ(S_A, S_B), the relationship is given (approximately) by

(7)  δ(S_A, S_B) = g(d(S_A, S_B))

where g(·) is a monotonic function. In [54] the link between recognition and similarity is related to the use of distance-based similarity measures to predict the confusions in recognition experiments. The most significant impact of the work for content-based retrieval systems is that general recognition theory contains Euclidean distance models of similarity as a special case and is not constrained by any distance axioms. In other words, perceived dissimilarity need not correspond to the well known notion of a metric space. It is thus possible that similarity based on some selected feature and metric does not correspond to perceived similarity. Unfortunately, most similarity (or dissimilarity) measures used in content-based retrieval systems are Euclidean. Santini and Jain [55] have proposed an alternative similarity measure based on fuzzy logic and Tversky's feature-contrast model in an attempt to derive a similarity measure that is independent of the selected feature. The concept of "betweenness" was introduced in [56] to provide a ranking that does not require a metric.
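For contrast with Euclidean measures, here is a sketch of Tversky's feature-contrast model over binary feature sets (the model that [55] builds on, not their fuzzy formulation; the feature sets and weights θ, α, β are illustrative assumptions). With α ≠ β the score is asymmetric, so it cannot satisfy the metric axioms.

```python
def tversky_similarity(a: set, b: set, theta=1.0, alpha=0.7, beta=0.3) -> float:
    """Feature-contrast model: common features raise similarity, while the
    distinctive features of either stimulus lower it."""
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

img_a = {"red", "round", "textured"}
img_b = {"red", "round", "smooth", "large"}
# Asymmetric: s(A, B) = 0.7 but s(B, A) = 0.3, violating the metric axioms
print(tversky_similarity(img_a, img_b), tversky_similarity(img_b, img_a))
```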

Relevance feedback, a technique borrowed from the information retrieval community, has been proposed to improve the retrieval results from CBIR systems [57, 58]. The idea of introducing the user into the process of refining the query based on retrieved images takes the form of the user being required to select, from the returned images, those that are deemed more relevant to the query.


Figure 12. Generic content-based retrieval system [52].

Likewise, the images that are deemed non-relevant to the query are indicated. This scheme then affords a means of modifying the query and re-submitting it to the system. An in-depth consideration and evaluation of relevance feedback schemes that have been proposed for CBIR systems is given in [59].
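A classical instantiation of this query-update idea is the Rocchio formula from text retrieval, sketched here for image feature vectors (the weights α, β, γ and the toy vectors are illustrative assumptions; [57, 58] develop richer reweighting schemes):

```python
import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector toward the relevant feature vectors and away
    from the non-relevant ones (Rocchio-style relevance feedback)."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

# Toy 3-D feature vectors marked by the user after a first retrieval round
query = [0.2, 0.5, 0.3]
relevant = [[0.1, 0.7, 0.2], [0.15, 0.65, 0.2]]
non_relevant = [[0.8, 0.1, 0.1]]
print(rocchio_update(query, relevant, non_relevant))  # refined query vector
```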

6. Conclusions

In this paper we have provided a brief overview of the processing pipeline involved in the capture of an image or video, compression for efficient storage and transmission, secure distribution and eventual consumption during retrieval. The treatment explores the processes and the attendant problems and solutions.

As technology matures and solutions to the problems of scaling are found, image sensors will see more integration to provide smart information on the chip. The ability to design new wavelet families of filters that allow both multiresolution description and more complete feature description will lead to more efficient compression schemes.


The possibility of taking advantage of the richer feature description to jointly compress and embed watermarks in images is very attractive. Such schemes are already being pursued in the bi-level compression and watermarking of text documents. The greater insight provided by the super channel model of watermarking and other information-theoretic models will lead to the development of watermarking techniques that can optimize the triple constraints of robustness, capacity and imperceptibility. There are already attempts to develop similarity measures that correspond to the human perceptual understanding of similarity. However, more empirical work needs to be done with humans in order to accurately model the relationship between human perceptual similarity and computational feature-based similarity. This overview did not discuss the important problem of automatic image annotation, which consists of labelling images with semantically significant keywords that describe their content. It is closely related to the notions of similarity, categorization and recognition, and the solutions proposed in the perceptual similarity domain will be applicable.

References

[1] Bayer, B. E., Color Imaging Array, U.S. Patent 3971065, July 1976.
[2] Lavine, J. P., Trabaka, E. A., Burkey, B. C., Trewell, T. J., Nelson, E. T. and Anagnostopoulos, Steady-State Photocarrier Collection in Silicon Imaging Devices, IEEE Trans. Electron Devices, Vol. ED-30, No. 9, pp. 1123–1134, Sept. 1983.
[3] Wong, H-S. P., Frank, D., Solomon, P. M., Wann, C. H. J. and Wesler, J. J., Nanoscale CMOS, Proceedings of the IEEE, Vol. 87, No. 4, pp. 537–570, Apr. 1999.
[4] Iwai, H., CMOS Technology - Year 2010 and Beyond, IEEE Journal of Solid-State Circuits, Vol. 34, No. 3, pp. 357–366, Mar. 1999.
[5] Blanksby, A. J. and Loinaz, M. J., Performance Analysis of a Color CMOS Photogate Image Sensor, IEEE Trans. on Electron Devices, Vol. 47, No. 1, pp. 55–64, Jan. 2000.
[6] Fossum, E., Active Pixel Image Sensors - Are CCD's Dinosaurs?, Proc. SPIE, Vol. 1900, pp. 2–14, Feb. 1993.
[7] Furumiya, M., Hatano, K., Nakashiba, Y., Murakami, I., Yamada, T., Nakano, T., Kawakami, Y., Kawasaki, T. and Hokari, Y., A 1/2-in, 1.3M-Pixel Progressive-Scan CCD Image Sensor Employing 0.25-µm Gap Single-Layer Poly-Si Electrodes, IEEE Journal of Solid-State Circuits, Vol. 34, No. 12, pp. 1835–1842, Dec. 1999.
[8] Tabei, M., Kobayashi, K. and Shizukuishi, M., A New CCD Architecture of High-Resolution and Sensitivity for Color Digital Still Picture, IEEE Trans. on Electron Devices, Vol. 38, No. 5, pp. 1052–1058, May 1991.
[9] Theuwissen, A., Peek, H., Centen, P., Boesten, R., Cox, J., Hartog, P., Kokshoorn, A., van Kuijk, H., O'Dwyer, B., Oppers, J. and Vledder, F., A 2.2 Mpixel FT-CCD Imager, According to the Eureka HDTV-Standard, International Electron Devices Meeting, Technical Digest, pp. 167–170, 8-11 Dec. 1991.
[10] Cho, K-B., Krymski, A. and Fossum, E., A 3-pin 1.5 V 550 mW 176 x 144 Self-Clocked CMOS Active Pixel Image Sensor, Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pp. 316–321, August 2001.
[11] Niclass, C., Sergio, M. and Charbon, E., A Single Photon Avalanche Diode Array Fabricated in Deep-Submicron CMOS Technology, Proceedings of the Conference on Design, Automation and Test in Europe, pp. 81–86, 2006.
[12] Abe, H., Device Technologies for High Quality and Smaller Pixel in CCD and CMOS Image Sensors, IEEE International Electron Devices Meeting, Technical Digest, pp. 989–992, 13-15 Dec. 2004.
[13] Peters, I. M., Kleimann, A., Polderdijk, F., Klaassens, W., Frost, R. and Bosiers, J. T., Dark Current Reduction in Very-Large Area CCD Imagers for Professional DSC Applications, IEEE International Electron Devices Meeting, Technical Digest, pp. 993–996, 13-15 Dec. 2004.
[14] Lukac, R., Plataniotis, K. N. and Venetsanopoulos, A. N., Bayer Pattern Demosaicking Using Local-Correlation Approach, Computational Science - ICCS 2004, LNCS 3039, pp. 26–33, 2004.
[15] Jain, A. K., Fundamentals of Digital Image Processing, Prentice-Hall Inc., 1989.
[16] Pratt, W. K., Digital Image Processing, 2nd Edition, John Wiley & Sons, Inc., 1991.


[17] Ahmed, N., Natarajan, T. and Rao, K. R., Discrete Cosine Transform, IEEE Trans. Computers, Vol. C-23, pp. 90–93, Jan. 1974.
[18] Andrew, J. P. and Ogunbona, P. O., On the Step Response of the DCT, IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 44, No. 3, pp. 260–262, Mar. 1997.
[19] Burrus, C. S., Gopinath, R. A. and Guo, H., Introduction to Wavelets and Wavelet Transforms - A Primer, Prentice-Hall Inc., 1998.
[20] Strang, G. and Nguyen, T., Wavelets and Filter Banks, Wellesley-Cambridge Press, 1996.
[21] Vaidyanathan, P. P., Multirate Digital Filters, Filter Banks, Polyphase Networks and Applications: A Tutorial, Proceedings of the IEEE, Vol. 78, No. 1, 1990.
[22] Taubman, D. S. and Marcellin, M. W., JPEG2000: Image Compression Fundamentals, Standards, and Practice, Boston, MA: Kluwer, 2002.
[23] Froment, J., Image Compression Through Level Lines and Wavelet Packets, in Wavelets in Signal and Image Analysis, A. A. Petrosian and F. G. Meyer, Eds., Norwell, MA: Kluwer, 2001.
[24] Candès, E. and Donoho, D. L., New Tight Frames of Curvelets and Optimal Representations of Objects with Piecewise C² Singularities, Commun. Pure Appl. Math., Vol. 57, pp. 219–266, 2004.
[25] Shukla, R., Dragotti, P. L., Do, M. and Vetterli, M., Rate-Distortion Optimized Tree-Structured Compression Algorithms for Piecewise Polynomial Images, IEEE Trans. Image Process., Vol. 14, No. 3, pp. 343–359, Mar. 2005.
[26] Do, M. N. and Vetterli, M., The Contourlet Transform: An Efficient Directional Multiresolution Image Representation, IEEE Trans. Image Process., Vol. 14, No. 12, pp. 2091–2106, Dec. 2005.
[27] Le Pennec, E. and Mallat, S., Sparse Geometric Image Representations with Bandelets, IEEE Trans. Image Process., Vol. 14, No. 4, pp. 423–438, Apr. 2005.
[28] Arandiga, F., Cohen, A., Doblas, M., Donat, R. and Matei, B., Sparse Representations of Images by Edge Adapted Nonlinear Multiscale Transforms, Proc. IEEE Int. Conf. Image Processing, Barcelona, Spain, Sep. 2003, pp. 701–704.
[29] Wakin, M. B., Romberg, J. K., Choi, H. and Baraniuk, R. G., Wavelet-Domain Approximation and Compression of Piecewise Smooth Images, IEEE Trans. Image Process., Vol. 15, No. 5, pp. 1071–1087, May 2006.
[30] Shapiro, J., Embedded Image Coding Using Zerotrees of Wavelet Coefficients, IEEE Trans. Signal Process., Vol. 41, No. 12, pp. 3445–3462, Dec. 1993.
[31] Said, A. and Pearlman, W. A., A New Fast and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees, IEEE Trans. Circuits Syst. Video Technol., Vol. 6, No. 3, pp. 243–250, Jun. 1996.
[32] Xiong, Z., Ramchandran, K. and Orchard, M. T., Space-Frequency Quantization for Wavelet Image Coding, IEEE Trans. Image Process., Vol. 6, No. 5, pp. 677–693, May 1997.
[33] Yeung, M. M., Mintzer, F. C., Braudaway, G. W. and Rao, A. R., Digital Watermarking for High-Quality Imaging, First IEEE Workshop on Multimedia Signal Processing, Princeton, NJ, 23-25 Jun. 1997, pp. 357–362.
[34] Hartung, F. and Kutter, M., Multimedia Watermarking Techniques, Proceedings of the IEEE, Vol. 87, No. 7, pp. 1079–1107, July 1999.
[35] Chen, B. and Wornell, G. W., Quantization Index Modulation Methods for Digital Watermarking and Information Embedding of Multimedia, Journal of VLSI Signal Process., No. 27, pp. 7–33, 2001.
[36] Chen, B. and Wornell, G. W., Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding, IEEE Trans. Information Theory, Vol. 47, pp. 1423–1443, 2001.
[37] Cox, I. J., Kilian, J., Leighton, T. and Shamoon, T., A Secure Robust Watermark for Multimedia, First International Workshop on Information Hiding, Jun. 1996, pp. 185–206.
[38] Cox, I. J., Kilian, J., Leighton, T. and Shamoon, T., Secure Spread Spectrum Watermarking for Multimedia, IEEE Trans. on Image Processing, Vol. 6, pp. 1673–1687, 1997.
[39] Smith, J. R. and Comiskey, B. O., Modulation and Information Hiding in Images, First International Workshop on Information Hiding, Jun. 1996, pp. 207–226.
[40] Bender, W., Gruhl, D., Morimoto, N. and Lu, A., Techniques for Data Hiding, IBM Systems Journal, Vol. 35, No. 3/4, pp. 313–336, 1996.
[41] Podilchuk, C. I. and Zeng, W., Image-Adaptive Watermarking Using Visual Models, IEEE Journal on Selected Areas in Communications, Vol. 16, pp. 525–539, 1998.


[42] Lu, W., Li, W., Safavi-Naini, R. and Ogunbona, P., A New QIM-Based Image Watermarking Method and System, Proceedings Asia-Pacific Workshop on Visual Information Processing, Hong Kong, 2005, pp. 160–164.
[43] Altun, O., Sharma, G. and Bocko, M., Set Theoretic Quantization Index Modulation Watermarking, Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, May 14-19, 2006, Volume 2, pp. 229–232.
[44] Qian, L. and Cox, I. J., Using Perceptual Models to Improve Fidelity and Provide Invariance to Valumetric Scaling for Quantization Index Modulation Watermarking, Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, March 18-23, 2005, Volume 2, pp. 1–4.
[45] Wu, M. and Liu, B., Data Hiding in Image and Video: Part I - Fundamental Issues and Solutions, IEEE Trans. Image Process., Vol. 12, No. 6, pp. 685–695, June 2003.
[46] Wu, M., Yu, H. and Liu, B., Data Hiding in Image and Video: Part II - Designs and Applications, IEEE Trans. Image Process., Vol. 12, No. 6, pp. 696–705, June 2003.
[47] Lu, W., Li, W., Safavi-Naini, R. and Ogunbona, P., A Pixel-Based Robust Image Watermarking System, ICME 2006 (accepted).
[48] Perez-Freire, L., Comesana, P., Troncoso-Pastoriza, J. R. and Perez-Gonzalez, F., Watermarking Security: A Survey, in Y. Q. Shi (Ed.), Transactions on DHMS I, LNCS 4300, pp. 41–72, 2006.
[49] Fridrich, J., Key-Dependent Random Image Transforms and Their Applications in Image Watermarking, Proceedings, International Conf. on Imaging Science, Systems and Technology, pp. 237–243, 1999.
[50] Ahanger, G. and Little, T. D. C., A Survey of Technologies for Parsing and Indexing Digital Video, Journal of Visual Communication and Image Representation, Vol. 10, No. 2, pp. 28–43, 1996.
[51] Rui, Y., Huang, T. S. and Chang, S. F., Image Retrieval: Current Techniques, Promising Directions, and Open Issues, Journal of Visual Communication and Image Representation, Vol. 10, No. 1, pp. 39–62, 1999.
[52] Antani, S., Kasturi, R. and Jain, R., A Survey on the Use of Pattern Recognition Methods for Abstraction, Indexing and Retrieval of Images and Video, Pattern Recognition, Vol. 35, pp. 945–965, 2002.
[53] Meilhac, C. and Nastar, C., Relevance Feedback and Category Search in Image Databases, IEEE International Conference on Multimedia Computing and Systems, Vol. 1, 1999, pp. 512–517.
[54] Ashby, F. G. and Perrin, N. A., Towards a Unified Theory of Similarity and Recognition, Psychological Review, Vol. 3, pp. 179–202, 1996.
[55] Santini, S. and Jain, R., Similarity Measures, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 21, No. 9, pp. 871–883, 1999.
[56] Brinke, W., Squire, D. McG. and Bigelow, J., Similarity: Measurement, Ordering and Betweenness, Knowledge-Based Intelligent Information and Engineering Systems, Lecture Notes in Computer Science, Vol. 3214, 2004.
[57] Rui, Y., Huang, T. S. and Mehrotra, S., Content-Based Image Retrieval with Relevance Feedback in MARS, Proceedings International Conference on Image Processing, Vol. 2, pp. 815–818, 26-29 Oct. 1997.
[58] Rui, Y., Huang, T. S., Ortega, M. and Mehrotra, S., Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 644–655, Sept. 1998.
[59] Doulamis, N. and Doulamis, A., Evaluation of Relevance Feedback Schemes in Content-Based Retrieval Systems, Signal Processing: Image Communication, Vol. 21, pp. 334–357, 2006.

School of Information Technology and Computer Science, University of Wollongong, NSW 2522, Australia

E-mail: [email protected]

