Journal of Vision (20??) ?, 1–? http://journalofvision.org/?/?/? 1
Visual Saliency in Noisy Images
Chelhwon Kim, Electrical Engineering Department, University of California, Santa Cruz, CA, USA
Peyman Milanfar, Google, Inc. and Electrical Engineering Department, University of California, Santa Cruz, CA, USA
The human visual system possesses the remarkable ability to pick out salient objects in images. Even more impressive
is its ability to do the very same in the presence of disturbances. In particular, the ability persists despite the presence
of noise, poor weather, and other impediments to perfect vision. Meanwhile, noise can significantly degrade the accuracy
of automated computational saliency detection algorithms. In this paper we set out to remedy this shortcoming. Existing
computational saliency models generally assume that the given image is clean and a fundamental and explicit treatment
of saliency in noisy images is missing from the literature. Here we propose a novel and statistically sound method for
estimating saliency based on a non-parametric regression framework. We also investigate the stability of saliency models for
noisy images, analyzing how state-of-the-art computational models respond to noisy visual stimuli. The proposed model
of saliency at a pixel of interest is a data-dependent weighted average of dissimilarities between a center patch around that
pixel and other patches. To further enhance both accuracy in predicting human fixations and stability
to noise, we incorporate a global and multi-scale approach, extending the local analysis window to the entire input image
and even to multiple scaled copies of the image. Our method consistently outperforms six other state-of-the-art models
(Zhang, Tong, & Marks, 2008; Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman, Zelnik-Manor, & Tal., 2010; Seo &
Milanfar, 2009; Garcia-Diaz, Fdez-Vidal, Pardo, & Dosil, 2012) for both noise-free and noisy cases.
Keywords: Saliency, non-parametric regression, saliency for noisy images, global approach, multi-scale approach
1. Introduction
Visual saliency is an important aspect of human vision, as it
directs our attention to what we want to perceive. It also affects
the processing of information in that it allocates limited perceptual
resources to objects of interest and suppresses our awareness of ar-
eas worth ignoring in our visual field. In computer vision tasks,
finding salient regions in the visual field is also essential because it
allows computer vision systems to process a flood of visual infor-
mation and allocate limited resources to relatively small, but inter-
esting regions or a few objects. In recent years, extensive research
has focused on finding saliency in natural images and predicting
where humans look in the image. As such, a wide diversity of
computational saliency models have been introduced (Seo & Mi-
lanfar, 2009; Itti, Koch, & Niebur, 1998; Zhang et al., 2008; Bruce
& Tsotsos, 2009; Gao, Mahadevan, & Vasoncelos, 2008; Gofer-
man et al., 2010; Garcia-Diaz et al., 2012; Hou & Zhang, 2007)
and aimed at transforming a given image into a scalar-valued map
(the saliency map) representing visual saliency in that image. This
saliency map has been useful in many applications such as ob-
ject detection (Rosin, 2009; Zhicheng & Itti, 2011; Rutishauser,
Walther, Koch, & Perona, 2004; Seo & Milanfar, 2010), image
quality assessment (Ma & Zhang, 2008; Niassi, LeMeur, Lecallet,
& Barba, 2007) and action detection (Seo & Milanfar, 2009) and
more.
Most saliency models are biologically inspired and based on
a bottom-up computational model. Itti et al. (Itti et al., 1998) in-
troduced a model based on the biologically plausible architecture
proposed by (Koch & Ullman, 1985) and measured center-surround
doi: Received: December 20, 2012. ISSN 1534–7362. © 20?? ARVO
Figure 1: The results of the state-of-the-art saliency models given a noisy image. The noise added to the test image is white Gaussian noise with variance σ² = 0.05.
contrast using a Difference of Gaussians (DoG) approach. Bruce
& Tsotsos (Bruce & Tsotsos, 2009) proposed Attention based on
Information Maximization (AIM) model. They measured saliency
at a pixel in the image by Shannon’s self-information of that loca-
tion with respect to its surrounding context. To estimate the prob-
ability density of a visual feature in the high dimensional space,
they employed a representation based on independent components,
which are determined from natural scenes. Zhang et al. (Zhang et
al., 2008)’s saliency model uses natural image statistics within a
Bayesian framework from which bottom-up saliency emerges nat-
urally as the self-information of visual features. Seo & Milanfar
(Seo & Milanfar, 2009) proposed the self-resemblance mechanism
to measure saliency. At each pixel, they first extract visual fea-
tures (local regression kernels) that are robust in extracting local
geometry of the image. Then, matrix cosine similarity (Seo & Mi-
lanfar, 2009, 2010) is employed to measure the resemblance of
each pixel to its surroundings. Hou & Zhang (Hou & Zhang, 2007)
derived saliency by measuring the spectral residual of an image
which is the difference between the log spectrum of the image and
its smoothed version. They posit that the statistical singularities
in the spectrum may be responsible for anomalous regions in the
image. Goferman et al. (Goferman et al., 2010) proposed context-
aware saliency. Their saliency model aims to detect not only the
dominant objects but also the parts of their surroundings that con-
vey the context. This type of model is useful in applications where
the context of the dominant objects is just as essential as the objects
themselves. Garcia-Diaz et al. (Garcia-Diaz et al., 2012) proposed
the Adaptive Whitening Saliency (AWS) model. The whitening
process (decorrelation and variance normalization) is applied to the
chromatic components of the image. Then, the multi-oriented and
multi-scale local energy representation of the image is obtained
by applying a bank of log-Gabor filters parameterized by different
scales and orientations. Visual saliency is measured by a simple
vector norm computation in the obtained representation.
Despite the wide variety of computational saliency models,
they all assume the given image is clean and free of distortions.
However, as in Fig. 1, when we feed a noisy image instead of
a clean image into existing saliency detection algorithms, many
fail, spuriously declaring saliency throughout the noisy image. In particular,
for the models of (Zhang et al., 2008), (Bruce & Tsotsos, 2009),
(Goferman et al., 2010), and (Seo & Milanfar, 2009), it is apparent
that any applications using these noisy saliency maps cannot per-
form well. In contrast, (Hou & Zhang, 2007) and (Garcia-Diaz et
al., 2012) provide more stable results because they implicitly sup-
press the noise during the process of computing saliency. In the
model by Hou & Zhang, spectral filtering of the image will sup-
press the noise in the image. In the AWS model by Garcia-Diaz et al.,
incorporating multi-oriented and multi-scale representation with
their whitening process also implicitly suppresses the noise. Al-
though their results appear somewhat insensitive to
noise, a fundamental and explicit treatment of saliency in noisy
images is missing from the literature (Le Meur, 2011). We provide
such a treatment in this paper. Furthermore, we will demonstrate that
the price for this apparent insensitivity to noise is that the over-
all performance over a large range of noise strengths is dimin-
ished. In this paper, we aim to achieve two goals simultaneously.
First, we propose a simple and statistically well-motivated compu-
tational saliency model which achieves a high degree of accuracy
in predicting where humans look. Second, we illustrate that the
proposed model is stable when a noise-corrupted image is given
Figure 2: Overview of saliency detection: We observe dissimilarity of a center patch around xj relative to other patches. The proposed saliency model is a weighted average of the observed dissimilarities.
and improves on other state-of-the-art models over a large range of
noise strengths.
The proposed saliency model is based on a bottom-up com-
putational model. As such, an underlying hypothesis is that human
eye fixations are driven to conspicuous regions in the test image,
which stand out from their surroundings. In order to measure this
distinctiveness of a region, we observe dissimilarities between a cen-
ter patch of the region and other patches (Fig. 2). Once we have
measured these dissimilarities, the problem of interest is how to
aggregate them to obtain an estimate of the underlying saliency of
that region. We look at this problem from an estimation theory
point of view and propose a novel and statistically sound saliency
model. We assume that each observed dissimilarity has an under-
lying true value, which is measured with uncertainty. Given these
noisy observations, we estimate the underlying saliency by solving
a local data-dependent weighted least squares problem. As we will
see in the next section, this results in an aggregation of the dissimi-
larities with weights depending on a kernel function to be specified.
We define the kernel function so that it gives higher weight to simi-
lar patch pairs than dissimilar patch pairs. Giving higher weights to
more similar patch pairs would seem counter-intuitive at first. But
this process will ensure that only truly salient objects would be de-
clared so, sparing us from too many false declarations of saliency.
The proposed estimate of saliency at pixel x_j is defined as:

s(x_j) = \sum_{i=1}^{N} w_{ij} y_i    (1)

where y_i and w_{ij} are, respectively, the observed dissimilarity (to be defined in the next section) and the weight for the ij-th patch pair.
It is important to highlight the direct relation of our approach
to two earlier approaches of Seo & Milanfar and Goferman et al.
We make this comparison explicit here because these methods too
involve aggregation of local dissimilarities. While this was not
made entirely clear in either (Goferman et al., 2010) or (Seo &
Milanfar, 2009), it is interesting to note that these methods em-
ployed arithmetic and harmonic averaging of local dissimilarities,
respectively. In (Seo & Milanfar, 2009), they defined the estimate
of saliency at pixel x_j by

s(x_j) = \frac{\exp(1/\tau)}{N} \underbrace{\frac{N}{\sum_{i=1}^{N} 1/y_i}}_{\text{harmonic mean}}    (2)

where y_i = \exp(-\rho_i/\tau) and \rho_i is the cosine similarity between visual features extracted from the center patch around the pixel x_j and its i-th nearby patch. This saliency model is (to within a constant) the harmonic mean of the dissimilarities y_i.
In (Goferman et al., 2010), they formulated the saliency at pixel x_j as

s(x_j) = 1 - \exp\Big(-\underbrace{\frac{1}{N}\sum_{i=1}^{N} y_i}_{\text{arithmetic mean}}\Big)    (3)

where y_i is the dissimilarity measure between a center patch around
the pixel x_j and any other patch observed in the test image. This
saliency model is essentially the arithmetic mean of the y_i's. Besides the use of
the exponential, the important difference as compared to our ap-
proach is that they use constant weights wij = 1/N for the aggre-
gation of dissimilarities, whereas we use data-dependent weights.
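The three aggregation rules above can be contrasted in a short NumPy sketch. This is a toy illustration, not the authors' implementation: the sample dissimilarity values and function names are ours, and the bandwidth h is arbitrary.

```python
import numpy as np

def harmonic_agg(y, tau=1.0):
    # Seo & Milanfar-style rule (Eq. 2): scaled harmonic mean of the y_i
    n = len(y)
    return np.exp(1.0 / tau) / n * (n / np.sum(1.0 / y))

def arithmetic_agg(y):
    # Goferman et al.-style rule (Eq. 3): 1 - exp(-arithmetic mean)
    return 1.0 - np.exp(-np.mean(y))

def kernel_agg(y, h=0.1):
    # Proposed rule (Eq. 1): data-dependent weighted average with the
    # reference y_r = min_i y_i and Gaussian kernel weights (Eq. 7)
    y_r = np.min(y)
    w = np.exp(-(y - y_r) ** 2 / h)
    return np.sum(w * y) / np.sum(w)

# Dissimilarities for a background pixel: mostly small, one spurious outlier.
y_bg = np.array([0.05, 0.06, 0.05, 0.07, 0.9])
# Dissimilarities for a truly salient pixel: uniformly large.
y_sal = np.array([0.85, 0.9, 0.8, 0.95, 0.9])

# The data-dependent weights discount the outlier for the background
# pixel (low score), while the salient pixel keeps a high score.
print(kernel_agg(y_bg), kernel_agg(y_sal))
```

Note how the kernel weighting keeps the background score low despite the single large dissimilarity, which is exactly the "avoid easily declaring saliency" behavior argued for above.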
In summary, among those saliency models in which dissim-
ilarities (either local or global) are combined by different aggre-
gation techniques, our proposed method is simpler, better justified,
and indeed a more effective arithmetic aggregation based on kernel
regression.
Many saliency models have leveraged the multi-scale approach
(Gao et al., 2008; Goferman et al., 2010; Zhicheng & Itti, 2011;
Walther & Koch, 2006; Zhang et al., 2008; Garcia-Diaz et al.,
2012). In the proposed model, we also exploit a global and multi-
scale approach by extending the analysis window to the whole image. By
doing so, we enhance the degree of accuracy in predicting human
fixations and further realize strong stability to noise as well.
The paper is organized as follows. In the next section, we
provide further technical details about the proposed saliency model
and describe the global and multi-scale approach to the compu-
tational saliency model. In the performance evaluation section, we demon-
strate the efficacy of this saliency model in predicting human fix-
ations with six other state-of-the-art models (Zhang et al., 2008;
Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman et al.,
2010; Seo & Milanfar, 2009; Garcia-Diaz et al., 2012) and investi-
gate the stability of our method in the presence of noise. In the last
section, we conclude the paper.
2. Technical details
2.1. Non-parametric regression for saliency
In this section, we propose a measure of saliency at a pixel of
interest from observations of dissimilarity between a center patch
around the pixel and its nearby patches (See Fig. 2). Let us denote
by ρi the similarity between a patch centered at a pixel of interest,
and its i-th neighboring patch. Then, the dissimilarity is measured
as a decreasing function of ρ as follows:
y_i = e^{-\rho_i}, \quad i = 1, \ldots, N    (4)
The similarity function ρ can be measured in a variety of ways
(Seo & Milanfar, 2009; Swain & Ballard, 1991; Rubner, Tomasi,
& Guibas, 2000), for instance using the matrix cosine similarity
between visual features computed in the two patches (Seo & Mi-
lanfar, 2009, 2010). For our experiments, we shall use the LARK
features as defined in (Takeda, Farsiu, & Milanfar, 2007), which
have been shown to be robust to the presence of noise and other
distortions. A more detailed description of these features is given
in (Takeda et al., 2007; Takeda, Milanfar, Protter, & Elad, 2009).
We note that the effectiveness of LARK as a visual descriptor has
led to its use for object and action detection and recognition, even
in the presence of significant noise (Seo & Milanfar, 2009, 2010).
From an estimation theory point of view, we assume that each ob-
servation yi is in essence a measurement of the true saliency, but
measured with some error. This observation model can be posed
as:
y_i = s_i + \eta_i, \quad i = 1, \ldots, N    (5)
where ηi is noise. Given these observations, we assume a locally
constant model of saliency and estimate the expected saliency at
pixel xj by solving the weighted least squares problem
s(x_j) = \arg\min_{s(x_j)} \sum_{i=1}^{N} [y_i - s(x_j)]^2 K(y_i, y_r)    (6)

where y_r is a reference observation. We choose y_r = \min_i y_i, where i = 1, ..., N ranges over a neighborhood of j; that is, y_r corresponds to the patch most similar to the center patch at j. Depending on the differ-
ence between this reference observation yr and each observation
yi, the kernel function K(·) gives higher or lower weight to each
observation as follows:
K(y_i, y_r) = e^{-(y_i - y_r)^2 / h}    (7)
Therefore, the weight function gives higher weight to similar patch
pairs than dissimilar patch pairs. The rationale behind this way of
weighting is to avoid easily declaring saliency; that is, the aggrega-
tion of dissimilarities for a truly salient region should be still high
even if we put more weight on the most similar patch pairs. Put
yet another way, we do not easily allow any region to be declared
salient and thus we reduce the likelihood of false alarms. We set
the weight of the reference observation itself to w_r = \max_i w_i, the largest of the other weights. This
setting avoids the excessive weighting of the reference observation
in the average. The parameter h controls the decay of the weights,
and is determined empirically to get best performance.
Minimizing Equation 6 yields merely a weighted average of the measured dissimilarities, where the weights are computed based on "distances" between each observation and the reference observation:

s(x_j) = \frac{\sum_{i=1}^{N} K(y_i, y_r) y_i}{\sum_{i=1}^{N} K(y_i, y_r)} = \sum_{i=1}^{N} w_{ij} y_i    (8)
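As a concrete sketch, the per-pixel estimate of Equations 4-8 fits in a few lines of NumPy. The similarity values ρ_i would in practice come from the matrix cosine similarity of LARK features; the values below and the bandwidth h are illustrative only.

```python
import numpy as np

def saliency_at_pixel(rho, h=0.1):
    """Kernel-regression saliency at one pixel (Eqs. 4-8).

    rho: similarities between the center patch and its N other patches;
    values near 1 mean nearly identical patches."""
    y = np.exp(-rho)                      # dissimilarities (Eq. 4)
    i_r = np.argmin(y)                    # reference: most similar patch pair
    w = np.exp(-(y - y[i_r]) ** 2 / h)    # kernel weights (Eq. 7)
    # Cap the reference's own weight at the largest of the other weights,
    # so it is not excessively weighted in the average.
    w[i_r] = np.delete(w, i_r).max()
    return np.sum(w * y) / np.sum(w)      # weighted LS solution (Eq. 8)

# A pixel whose patch resembles its surroundings (high rho) scores low;
# a pixel whose patch stands out (low rho) scores high.
s_bg = saliency_at_pixel(np.array([0.90, 0.95, 0.92, 0.88, 0.30]))
s_sal = saliency_at_pixel(np.array([0.10, 0.15, 0.05, 0.20, 0.12]))
print(s_bg, s_sal)
```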
2.2. Global and multi-scale saliency
So far, the underlying idea is that the saliency of a pixel is
measured by the distinctiveness of a center patch around that pixel
relative to its neighbors. In this section, we extend our local anal-
ysis window (gray dashed rectangle in Fig. 2) to the entire input
image. By doing this, we aggregate all dissimilarities between the
center patch and all patches observed from the entire image. This
is a sensible and well-motivated extension because in general it
is consistent with the way the human visual system inspects the
global field of view at once to determine saliency. Furthermore,
we incorporate a multi-scale approach by taking the patches (to be
compared to the center patch) from the multiscale Gaussian pyra-
mid constructed from the given image. Fig. 3 illustrates the global
and multi-scale saliency computation. In our implementation, we
follow the same general procedure as in (Goferman et al., 2010).
First, we denote by R = {r1, r2, ..., rM} the multiple scales ap-
plied to the input image (the horizontal axis in Fig. 3), and then
at each scale rm, where 1 ≤ m ≤ M , we compute the dissim-
ilarity of the center patch relative to all patches observed in the
images whose scales are Rq = {rm, rm/2, rm/4} (the vertical
axis in Fig. 3). Consequently, for M scales, M saliency maps are
computed and resized to the original image size by bilinear inter-
polation. The resulting multiple saliency maps are then combined
into one by simple averaging. Fig. 4 demonstrates the difference
between those saliency maps obtained at different scales. While
the fine scale result detects details such as textures and edges, the
coarse scale result detects global features. Note that we fixed the
size of the patch at each scale rm as the yellow rectangle shown in
Fig 3.
In order to rewrite the saliency equation in a multi-scale fash-
ion, we denote again the dissimilarity measure defined in Equation
4 by y_i = e^{-\rho(p_i, p_j)}, where p_i is the i-th patch observed across the
multi-scale pyramid and pj is the center patch at pixel xj . There-
fore, we can rewrite the saliency equation at each scale rm as fol-
lows:
s(x_j)^{(r_m)} = \sum_{p_i \in R_q} w_{ij} e^{-\rho(p_i, p_j)}    (9)
The saliency at pixel xj is taken as the mean of its saliency
across all scales:
s(x_j) = \frac{1}{M} \sum_{m=1}^{M} s(x_j)^{(r_m)}    (10)
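A minimal sketch of this multi-scale combination (Equations 9-10), assuming SciPy's `zoom` for bilinear resizing; the gradient-magnitude stand-in for the per-scale estimator is ours, for illustration only.

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_saliency(img, scales=(1.0, 0.6, 0.4), per_scale=None):
    """Compute a saliency map at each scale r_m, resize each map back to
    the input size by bilinear interpolation (order=1), and average the
    maps (Eq. 10). `per_scale` stands in for the per-scale estimator of
    Eq. 9; a gradient-magnitude placeholder is used here."""
    if per_scale is None:
        def per_scale(x):
            gy, gx = np.gradient(x)
            return np.hypot(gx, gy)
    H, W = img.shape
    maps = []
    for r in scales:
        small = zoom(img, r, order=1)   # the image at scale r_m
        s = per_scale(small)            # saliency map at that scale
        back = zoom(s, (H / s.shape[0], W / s.shape[1]), order=1)
        maps.append(back)
    return np.mean(maps, axis=0)        # average over the M scales

# Toy input: a flat image with a bright square; the square's boundary
# should receive the highest combined saliency.
img = np.zeros((40, 40))
img[15:25, 15:25] = 1.0
smap = multiscale_saliency(img)
```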
In the next section we first evaluate our saliency model for
clean images against six existing saliency models (Zhang et al.,
2008; Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman et
al., 2010; Seo & Milanfar, 2009; Garcia-Diaz et al., 2012), and
then investigate the stability of our saliency model for noisy im-
ages. We also see the effect of the global and multi-scale approach
on overall performance.
3. Performance evaluation
3.1. Predicting human fixation data
In this section, we evaluate the proposed saliency model in
predicting human eye fixations on Bruce & Tsotsos’s dataset (Avail-
able at http://www-sop.inria.fr/members/Neil.Bruce/). This is a
dataset of 120 indoor and outdoor natural images and has been
commonly used to validate many state-of-the-art saliency models
(Seo & Milanfar, 2009; Zhang et al., 2008; Bruce & Tsotsos, 2009;
Garcia-Diaz et al., 2012). The subjects were given no instructions
except to observe the images and the eye fixations were recorded
over a 4 s viewing period (Bruce & Tsotsos, 2009).
The six state-of-the-art models used for comparison are the
Saliency Using Natural statistics (SUN) model by Zhang et al.,
the Attention based on Information Maximization (AIM) model by
Bruce & Tsotsos, the Spectral Residual model by Hou & Zhang,
the Context Aware model by Goferman et al., the Self-resemblance
model by Seo & Milanfar and the Adaptive Whitening Saliency
(AWS) model by Garcia-Diaz et al. For each model, we used the
default parameters suggested by the respective authors.
In our implementation, the similarity function ρ in Equation 4
is computed using the matrix cosine similarity between the LARK
features as in the model of Seo & Milanfar. We sample patches of
Figure 3: Global and multi-scale saliency computation: At each scale rm ∈ R (column), we search all patches to be compared to the center patch (yellow rectangle) across multiple images whose scales are Rq = {rm, rm/2, rm/4}.
size 7 × 7 with 50% overlap from multiple scale images. We use
three scales (M = 3), R = {1.0, 0.6, 0.4} and the smallest scale
allowed in Rq is 20% of the original size as in (Goferman et al.,
2010).
Fig. 5 demonstrates a qualitative comparison of the proposed
model with the fixation density map and the saliency maps pro-
duced by the six competing models. All saliency maps were nor-
malized to range between zero and one in order to make a com-
parison with equalized contrast. There is qualitative similarity be-
tween Seo & Milanfar and our model except that ours are seen to
have fewer spurious salient regions. We note that our global
and multi-scale approach also contributes to the advantage of our
model. More precisely, as similarly argued in (Goferman et al.,
2010), background pixels are likely to have similar patches in the
entire image at multiple scales, while the salient pixels have sim-
ilar patches in the nearby region, and at a few scales. Therefore,
incorporating the global and multi-scale approach not only emphasizes
the contrast between salient and non-salient regions, but also sup-
presses frequently occurring features in the background. Fig. 6
shows three different test images, each of which has one or two
salient objects. The saliency models with the global and multi-
scale considerations such as Goferman et al.’s model and the pro-
posed model produce more reliable results than others. It seems
that Goferman et al.'s model suppresses saliency on the frequently
occurring features more efficiently than ours. However, we note
that Goferman et al. simulated the visual contextual effect by iden-
tifying the attended areas where saliency value exceeds a certain
threshold and weighting each pixel outside the attended areas ac-
cording to its Euclidean distance to the closest attended pixel. This
apparently suppresses more saliency on features outside the
attended areas, such as the bricks in the first test image. By contrast,
the models of Zhang et al., Bruce & Tsotsos and Seo & Milanfar
find high saliency in the uninteresting background. In particular,
the saliency model by Seo & Milanfar is seen to be most sensitive
to the frequently occurring features. While it seems that Hou &
Zhang’s model is also robust to the frequently occurring features
as seen in the first and third image, it still declares high saliency
values in the pile of green peppers.
For quantitative performance analysis, we use area under the
receiver operating characteristic curve (AUC) and Spearman’s rank
Figure 4: The saliency maps obtained at different scales. The multi-scale approach not only gives high saliency values at object edges (from the fine scale result), but also detects global features (from the coarse scale result).
correlation coefficient (SCC). The AUC metric determines how
well fixated and non-fixated locations can be discriminated by the
saliency map using a simple threshold (Tatler, Baddeley, & Gilchrist,
2010). If the saliency value at a location exceeds the threshold, then
we declare that location fixated. By sweeping the threshold between
the minimum and maximum values in the saliency map, the true
positive rate (declaring fixated locations as fixated) and the false
positive rate (declaring non-fixated locations as fixated) are cal-
culated and the receiver operating characteristic (ROC) curve is
constructed by plotting the true positive rate as a function of the
false positive rate across all possible thresholds. The SCC metric
measures the degree of similarity between the ranked values of the
saliency map and the fixation density map (see Fig. 5 for examples of the fixation den-
sity map). If they are not well-matched, the correlation coefficient
is zero.
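Under the simplifying assumption that ties can be ignored (reasonable for continuous-valued maps), the SCC computation can be sketched with NumPy alone; the function name and toy maps are ours.

```python
import numpy as np

def rank_correlation(sal_map, density_map):
    """Spearman's rank correlation between a saliency map and a fixation
    density map: rank the pixel values of each map, then take the Pearson
    correlation of the ranks. Ties get their order of appearance rather
    than average ranks, which suffices for continuous-valued maps."""
    ra = np.argsort(np.argsort(sal_map.ravel()))      # ranks of saliency values
    rb = np.argsort(np.argsort(density_map.ravel()))  # ranks of density values
    return np.corrcoef(ra, rb)[0, 1]

# Monotonically related maps have SCC = 1; reversed maps have SCC = -1.
s = np.arange(16.0).reshape(4, 4)
print(rank_correlation(s, s ** 2), rank_correlation(s, -s))
```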
Zhang et al. (Zhang et al., 2008) pointed out two problems in
using the AUC metric: First, simply using a Gaussian blob centered
in the middle of the image as the saliency map produces excellent
results because most human eye fixation data have a center bias as
photographers tend to place objects of interest in the center (Tatler
et al., 2010; Parkhurst & Niebur, 2003). Secondly, some saliency
models (Seo & Milanfar, 2009; Bruce & Tsotsos, 2009) exhibit im-
age border effects due to invalid filter responses at the borders of
images, which also produces an artificial improvement in the AUC
metric (Zhang et al., 2008). To avoid these problems, they set the
non-fixated locations of a test image as the fixated locations in an-
other image from the same test set. We follow the same procedure:
For each test image, we first compute a histogram of saliency at the
fixated locations of the test image and a histogram of saliency at the
fixated locations, but of a randomly chosen image from the test set.
Then, we compute all possible true positive and false positive rates
by varying the threshold on these two histograms respectively. Fi-
nally, we compute the AUC. All AUC’s computed for the various
images in the database are averaged to derive the reported overall
AUC. Because the test images for the non-fixations are randomly
chosen, we repeat this procedure 100 times and report the mean
and the standard error of the results in Table 1. As this shows, our
saliency model outperforms most other state-of-the-art models in
the AUC metric. Only AWS is slightly better in AUC than ours, but
the difference is roughly within the standard error bounds. In con-
trast to the AUC metric, our model ranks third in the SCC metric.
However, we have more confidence in the AUC metric that is based
on the human fixations rather than the SCC metric that is based on
the fixation density map produced by a 2D Gaussian kernel density
estimate based on the human fixations.
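The shuffled-AUC procedure above can be sketched as follows. Sweeping the threshold over all values is equivalent to the Mann-Whitney pair-counting form used here; the function and variable names are ours.

```python
import numpy as np

def shuffled_auc(sal, fix_rc, other_fix_rc):
    """Shuffled AUC: positives are saliency values at the test image's
    fixated locations; negatives are saliency values at the fixated
    locations of another image from the same set, which compensates for
    center bias and border effects. Coordinates are (row, col)."""
    pos = np.array([sal[r, c] for r, c in fix_rc], dtype=float)
    neg = np.array([sal[r, c] for r, c in other_fix_rc], dtype=float)
    # Fraction of (positive, negative) pairs ordered correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Perfect separation gives AUC = 1; identical values give chance level 0.5.
sal = np.zeros((10, 10))
sal[2, 2], sal[5, 5] = 1.0, 0.8
print(shuffled_auc(sal, [(2, 2), (5, 5)], [(0, 0), (9, 9)]))
```

In the actual evaluation this score would be averaged over all images and over 100 random draws of the "other" image, as described above.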
We note that eye tracking data may contain errors which orig-
inate from systematic error in calibrating the eye
tracker and from its limited accuracy. Therefore, we simulate
this error by adding Gaussian noise to the fixated locations
in the image. Table 2 shows all AUC’s computed for the various
standard deviations of Gaussian noise. We observed that this does
not much affect the performance (at least for standard deviations
less than 10 and for the AUC metric), and our method still outper-
forms most other state-of-the art models.
In the next section, we will see that the proposed model is
more stable than others when the input images are corrupted by
noise, and thus produces better performance overall across a large
range of noise strengths.
Figure 5: Examples of results. For comparison, the fixation density maps produced from the fixation points provided by Bruce & Tsotsos are also shown.
Figure 6: Examples of saliency on images containing frequently occurring features.
Table 1: Performance in predicting human fixations in clean images
Model AUC (SE) SCC
Proposed method 0.713 (0.0007) 0.386
Garcia-Diaz et al. 0.714 (0.0008) 0.362
Seo & Milanfar 0.696 (0.0007) 0.346
Goferman et al. 0.686 (0.0008) 0.405
Hou & Zhang 0.672 (0.0007) 0.317
Bruce & Tsotsos 0.672 (0.0007) 0.424
Zhang et al. 0.639 (0.0007) 0.243
Table 2: AUC for the various standard deviations (std) from the original fixation data.
Model std(0) std(5) std(10)
Proposed method 0.713 0.713 0.709
Garcia-Diaz et al. 0.714 0.713 0.710
Seo & Milanfar 0.696 0.695 0.693
Goferman et al. 0.686 0.686 0.681
Hou & Zhang 0.672 0.670 0.668
Bruce & Tsotsos 0.672 0.672 0.672
Zhang et al. 0.639 0.638 0.638
3.2. Stability of saliency models for noisy images
In this section, we investigate the stability of saliency mod-
els for noisy images. The same original test images from Bruce
and Tsotsos's dataset are used, and the noise added to the test im-
ages is white Gaussian with variance σ² equal to 0.01, 0.05,
0.1, or 0.2 (the intensity value of each pixel ranges
from 0 to 1). The saliency maps computed from the noisy im-
ages are compared to the human fixations through the same pro-
cedure. One may be concerned that the human fixations used in
this evaluation were recorded from noise-free images and not the
corrupted images. However, we focus on investigating the sen-
sitivity of computational models of visual attention subjected to
visual degradations rather than evaluating the performance in pre-
dicting human fixation data in noisy images. Therefore, we use the
same human fixations to see if the computational models achieve
the same performance as in the noise-free case. Also, to the best
of our knowledge, there is no available public fixation database on
noisy images. So, we resorted instead to analyzing how state-of-
the-art computational models respond to noisy visual stimuli.
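The corruption step itself is straightforward; a sketch of the noise model used throughout this section, assuming intensities in [0, 1] (the function name and seed handling are ours):

```python
import numpy as np

def add_gaussian_noise(img, var, seed=None):
    """Add white Gaussian noise of variance `var` to an image with
    intensities in [0, 1], clipping the result back to the valid range.
    The experiments here use var in {0.01, 0.05, 0.1, 0.2}."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(var), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example: mid-gray image corrupted at the strongest level used here.
clean = np.full((64, 64), 0.5)
noisy = add_gaussian_noise(clean, 0.2, seed=0)
```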
Examples of noisy test images and their saliency maps are de-
picted in Fig. 7. We observe that the proposed model shows more
stable responses in flat regions in the background such as sky, road
and wall than in the regions containing high-frequency textured
area such as leaves of tree. This phenomenon can be explained as
Figure 7: Saliency maps produced by the proposed method on increasingly noisy images. From left to right, a clean image, and noisy images with noise variance σ² = {0.01, 0.05, 0.1, 0.2}, respectively.
follows: Background pixels in such flat regions are likely to have
more similar patches in the entire image while salient pixels have
similar patches in the nearby region. Furthermore, since we give
higher weights to similar patch pairs than dissimilar ones when we
aggregate the dissimilarities of those patches, we tend to average
more dissimilarities for the background pixels and suppress more
noise than on the salient pixels.
For quantitative performance analysis, we plot the AUC val-
ues for each method against the noise strength. As one may expect,
the performance in predicting human fixations generally decreases
as the noise strength increases (See the curves in Fig. 8). However,
our saliency model outperforms the six other state-of-the-art mod-
els over a wide range of noise strengths. Only Garcia-Diaz et al.’s
model shows similar performance for the noise-free case, but the
proposed model shows better performance for the noisy case.
As we alluded to earlier, most saliency models implicitly sup-
press the noise by blurring and downsampling the input image.
Hou & Zhang and Seo & Milanfar downsampled the input im-
age to 64×64. Bruce & Tsotsos, Zhang et al. and Garcia-Diaz
et al. also used the input image downsampled by a factor of two. In
Goferman et al.’s model, the input image is downsampled to 250
pixels. However, as illustrated in Fig. 8, the price for this im-
plicit treatment is that the overall performance over a large range
of noise strengths is diminished except in Hou & Zhang’s model.
Since Hou & Zhang's model removes redundancies in the frequency do-
main after the input image is downsampled, it suppresses more
noise and shows stable results. However, we note that their method
does not achieve a high degree of accuracy overall in predicting
human fixations. In contrast, our regression based saliency model
achieves a high degree of accuracy for noise-free and noisy cases
simultaneously, and improves on competing models over a large
range of noise strengths.
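The AUC scores used throughout can be reproduced, in spirit, with a simple rank-based ROC computation. The sketch below treats saliency values at fixated pixels as positives and all remaining pixels as negatives; it is an illustrative stand-in and does not reproduce the exact evaluation protocol (e.g., any shuffled baseline).

```python
import numpy as np

def fixation_auc(sal_map, fix_mask):
    """Rank-based ROC AUC (Mann-Whitney U statistic): the probability
    that a fixated pixel receives a higher saliency value than a
    non-fixated one."""
    pos = sal_map[fix_mask].ravel()
    neg = sal_map[~fix_mask].ravel()
    # Double argsort yields 1-based ranks; ties are broken arbitrarily
    # rather than averaged, which suffices for a sketch.
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[:pos.size].sum() - pos.size * (pos.size + 1) / 2.0
    return u / (pos.size * neg.size)
```

A saliency map that perfectly separates fixated from non-fixated pixels scores 1.0; chance performance is 0.5.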
We also investigated how state-of-the-art computational models
respond to noisy visual stimuli. According to the Helmholtz prin-
ciple (Desolneux, Moisan, & Morel, 2008), the human visual sys-
tem does not perceive structure in a uniform random image; a struc-
ture is perceived only when some relatively large deviation from
randomness occurs. By this principle, bottom-up approaches should
produce saliency maps roughly similar to those obtained from clean
images, because the random features in the input image are largely
suppressed. That is to say, a good computational saliency model
should behave similarly in the presence of noise and return stable
results. We made several noisy synthetic images by adding differ-
ent amounts of white Gaussian noise to a 128×128 gray image con-
taining a 19×19 black square in the center (Fig. 9). The saliency
maps computed from these noisy synthetic images are normalized
to range from zero to one. We note that the input images were not
downsampled or blurred before calculating the saliency map, so the
implicit noise suppression was excluded from this experiment. Fig.
9 shows the results produced by the six other state-of-the-art mod-
els and the proposed model. Only Garcia-Diaz et al.'s model and
the proposed model remained robust to the noise. We also observed
that the saliency maps from Seo & Milanfar's model and Hou &
Zhang's model in the second row (noise variance 0.05) are severely
degraded compared to the ones in Fig. 1. In addition, Hou &
Zhang's model detects only details (notice the white boundary of
the square), and Zhang et al.'s model does not respond to a black
square of the given size. We believe that each model has a differ-
ent inherent sensitivity to noise and responds differently to image
features at a given scale; therefore, different degrees of blurring and
downsampling will no doubt affect the result of each model differ-
ently. To investigate the inherent sensitivity of each model to noise,
we performed the same evaluation on the saliency maps, but with
the same degree of resizing and blurring applied to the input im-
ages: we downsampled all images to the same size of 250 pixels.
Fig. 10 shows the resulting performance in predicting human fix-
ations. The proposed model still outperforms the other models and
achieves a high degree of accuracy for both the noise-free and noisy
cases.
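The synthetic stimuli used in this experiment are easy to regenerate. The sketch below follows the description in the text; the mid-gray background level of 0.5 and the clipping of noisy values to [0, 1] are assumptions not stated above.

```python
import numpy as np

def make_noisy_square_stimuli(variances=(0.01, 0.05, 0.1, 0.2), seed=0):
    """A 128x128 gray image with a 19x19 black square at the center,
    plus copies corrupted by white Gaussian noise of the given variances."""
    rng = np.random.default_rng(seed)
    img = np.full((128, 128), 0.5)          # assumed mid-gray background
    c = 64
    img[c - 9:c + 10, c - 9:c + 10] = 0.0   # 19x19 black square, centered
    noisy = [np.clip(img + rng.normal(0.0, np.sqrt(v), img.shape), 0.0, 1.0)
             for v in variances]
    return img, noisy
```

The saliency maps computed from these images would then be normalized to [0, 1] before comparison, as described in the text.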
Finally, for the sake of completeness, we show the effect of the
global and multi-scale approaches on our computational saliency
model. To this end, we first evaluated the proposed model without
the global and multi-scale extensions; in other words, we consid-
ered only the patches in a 7×7 local window when measuring the
dissimilarities for each pixel (denoted "Local + Single-scale" in Fig.
11). We then extended the local analysis window to the entire im-
age and evaluated the model again ("Global + Single-scale"). Last,
we further drew patches from multiple scaled copies of the image
("Global + Multi-scale"). As seen in Fig. 11, the global approach
yields better performance. The multi-scale approach also improves
performance, but the improvement is not as significant as that ob-
tained with the global approach.
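The "Global + Multi-scale" variant can be sketched by pooling patches from the full image and from downscaled copies. The scale factors and the nearest-neighbor resizing below are illustrative choices, not our exact settings.

```python
import numpy as np

def multiscale_patch_pool(img, patch=7, scales=(1.0, 0.5, 0.25)):
    """Gather patches globally from several scaled copies of the image,
    so that each pixel's dissimilarities can be measured against the
    whole pool rather than a local 7x7 window."""
    r = patch // 2
    pools = []
    for s in scales:
        h = max(patch, int(img.shape[0] * s))
        w = max(patch, int(img.shape[1] * s))
        # Nearest-neighbor resize keeps the sketch dependency-free.
        yi = (np.arange(h) * img.shape[0] / h).astype(int)
        xi = (np.arange(w) * img.shape[1] / w).astype(int)
        small = img[np.ix_(yi, xi)]
        pad = np.pad(small, r, mode='reflect')
        pools.append(np.array([pad[y:y + patch, x:x + patch].ravel()
                               for y in range(h) for x in range(w)]))
    return np.vstack(pools)
```

Each pixel's center patch is then compared against this enlarged pool, which is what distinguishes the global and multi-scale variants from the purely local one.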
Figure 8: AUC versus noise variance for Zhang et al., Bruce & Tsotsos, Hou & Zhang, Goferman et al., Seo & Milanfar, Garcia-Diaz et al., and the proposed method. The performance in predicting the human fixations decreases as the amount of noise increases; however, the proposed method outperforms the six other state-of-the-art models over a wide range of noise strengths.
Figure 9: Examples of saliency on noisy synthetic images (rows: noise variance 0.01, 0.05, 0.1, 0.2; columns: Zhang et al., Bruce & Tsotsos, Hou & Zhang, Goferman et al., Seo & Milanfar, Garcia-Diaz et al., and the proposed method).
Figure 10: The performance in predicting the human fixations (AUC versus noise variance for the same seven models). The same degree of resizing and blurring was applied to all input images.
Figure 11: The effect of the global and multi-scale approaches on our saliency model (AUC versus noise variance for the Local + Single-scale, Global + Single-scale, and Global + Multi-scale variants).
4. Conclusion and future work
In this paper, we have proposed a simple and statistically well-
motivated saliency model based on non-parametric regression, in
which the saliency of a pixel is a data-dependent weighted combi-
nation of the dissimilarities observed in the given image. The pro-
posed method is practically appealing and effective because of its
simple mathematical form. To enhance its performance, we incor-
porate a global and multi-scale approach by extending the local
analysis window to the entire input image, and further to multiple
scaled copies of the image. Experiments on challenging sets of hu-
man fixation data demonstrate that the proposed saliency model not
only achieves a high degree of accuracy in the standard noise-free
scenario, but also improves on other state-of-the-art models for
noisy images. Owing to its robustness to noise, we expect the pro-
posed model to be quite effective in other computer vision applica-
tions subject to severe degradation by noise.
We investigated how different computational saliency models
predict human fixations on images corrupted by white Gaussian
noise. For future work, it would be interesting to investigate how
they fare on other types of distortion, such as blur, low resolution,
snow, rain, or air turbulence, which occur often in real-world ap-
plications. In addition, it would be interesting to pre-process de-
graded data before attempting to calculate the saliency map. As
observed in our earlier work (Kim & Milanfar, 2012), the perfor-
mance of saliency models can be improved by applying a denois-
ing step first. Unfortunately, this is not consistent with the way the
human visual system operates, so an algorithm based on filtering
first would not seem to be well-motivated by biology. In any event,
imperfect denoising might further distort the data, and thus this ap-
proach is at best sub-optimal.
Notes and Acknowledgements
Source code for the proposed algorithm will be released on-
line upon acceptance of our paper at the following link:
http://users.soe.ucsc.edu/~chkim/SaliencyDetection.html
This work was supported by AFOSR Grant FA9550-07-1-
0365 and NSF Grant CCF-1016018.
References
Bruce, N. D. B., & Tsotsos, J. K. (2009). Saliency, attention, and
visual search: An information theoretic approach. Journal
of Vision, 9(3), 1–24.
Desolneux, A., Moisan, L., & Morel, J. (2008). From Gestalt
Theory to Image Analysis: A Probabilistic Approach. Inter-
disciplinary Applied Mathematics. Springer.
Gao, D., Mahadevan, V., & Vasconcelos, N. (2008). On the plausi-
bility of the discriminant center-surround hypothesis for vi-
sual saliency. Journal of Vision, 8(7):13, 1–18.
Garcia-Diaz, A., Fdez-Vidal, X. R., Pardo, X. M., & Dosil, R.
(2012). Saliency from hierarchical adaptation through decor-
relation and variance normalization. Image and Vision Com-
puting, 30(1), 51–64.
Goferman, S., Zelnik-Manor, L., & Tal., A. (2010). Context-aware
saliency detection. IEEE International Conference on Com-
puter Vision and Pattern Recognition.
Hou, X., & Zhang, L. (2007). Saliency detection: A spectral
residual approach. Proc. IEEE Conf. Computer Vision and
Pattern Recognition.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-
based visual attention for rapid scene analysis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 20,
1254–1259.
Kim, C., & Milanfar, P. (2012). Finding saliency in noisy images.
SPIE Conference on Computational Imaging.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention:
Towards the underlying neural circuitry. Human Neurobiol-
ogy, 4(4), 219–227.
Le Meur, O. (2011). Robustness and repeatability of saliency
models subjected to visual degradations. IEEE International
Conference on Image Processing, 3285–3288.
Ma, Q., & Zhang, L. (2008). Saliency-based image quality assess-
ment criterion. Advanced Intelligent Computing Theories
and Applications. With Aspects of Theoretical and Method-
ological Issues, 5226, 1124–1133.
Ninassi, A., Le Meur, O., Le Callet, P., & Barba, D. (2007). Does
where you gaze on an image affect your perception of qual-
ity? Applying visual attention to image quality metric. IEEE
International Conference on Image Processing, II-169–II-
172.
Parkhurst, D., & Niebur, E. (2003). Scene content selected by
active vision. Spatial Vision, 16(2), 125–154.
Rosin, P. L. (2009). A simple method for detecting salient regions.
Pattern Recognition, 42(11), 2363–2371.
Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover's
distance as a metric for image retrieval. International Jour-
nal of Computer Vision, 40, 99–121.
Rutishauser, U., Walther, D., Koch, C., & Perona, P. (2004). Is
bottom-up attention useful for object recognition? IEEE
Conference on Computer Vision and Pattern Recognition, 2,
II-37–II-44.
Seo, H., & Milanfar, P. (2009). Static and space-time visual
saliency detection by self-resemblance. Journal of Vision,
9(12), 1–27.
Seo, H., & Milanfar, P. (2010). Training-free, generic object de-
tection using locally adaptive regression kernel. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 32(9), 1688–
1704.
Swain, M. J., & Ballard, D. H. (1991). Color indexing. Interna-
tional Journal of Computer Vision, 7(1), 11–32.
Takeda, H., Farsiu, S., & Milanfar, P. (2007). Kernel regression for
image processing and reconstruction. IEEE Trans. on Image
Processing, 16(2), 349–366.
Takeda, H., Milanfar, P., Protter, M., & Elad, M. (2009). Super-
resolution without explicit subpixel motion estimation. IEEE
Trans. on Image Processing, 18, 1958–1975.
Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. (2005). Visual
correlates of fixation selection: Effects of scale and time.
Vision Research, 45(5), 643–659.
Walther, D., & Koch, C. (2006). Modeling attention to salient
proto-objects. Neural Networks, 19(9), 1395–1407.
Zhang, L., Tong, M. H., & Marks, T. K. (2008). SUN: A Bayesian
framework for saliency using natural statistics. Journal of
Vision, 8(7), 1–20.
Li, Z., & Itti, L. (2011). Saliency and gist features for
target detection in satellite images. IEEE Transactions on
Image Processing, 20(7), 2017–2029.