Journal of Vision (20??) ?, 1–? http://journalofvision.org/?/?/? 1
Visual Saliency in Noisy Images
Chelhwon Kim, Electrical Engineering Department, University of California, Santa Cruz, CA, USA
Peyman Milanfar, Google, Inc. and Electrical Engineering Department, University of California, Santa Cruz, CA, USA
The human visual system possesses the remarkable ability to pick out salient objects in images. Even more impressive
is its ability to do the very same in the presence of disturbances. In particular, the ability persists despite the presence
of noise, poor weather, and other impediments to perfect vision. Meanwhile, noise can significantly degrade the accuracy
of automated computational saliency detection algorithms. In this paper we set out to remedy this shortcoming. Existing
computational saliency models generally assume that the given image is clean and a fundamental and explicit treatment
of saliency in noisy images is missing from the literature. Here we propose a novel and statistically sound method for
estimating saliency based on a non-parametric regression framework. We also investigate the stability of saliency models for
noisy images, analyzing how state-of-the-art computational models respond to noisy visual stimuli. The proposed model
of saliency at a pixel of interest is a data-dependent weighted average of dissimilarities between a center patch around that
pixel and other patches. To further enhance both accuracy in predicting human fixations and stability
to noise, we incorporate a global and multi-scale approach, extending the local analysis window to the entire input image
and even to multiple scaled copies of the image. Our method consistently outperforms six other state-of-the-art models
(Zhang, Tong, & Marks, 2008; Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman, Zelnik-Manor, & Tal., 2010; Seo &
Milanfar, 2009; Garcia-Diaz, Fdez-Vidal, Pardo, & Dosil, 2012) for both noise-free and noisy cases.
Keywords: Saliency, non-parametric regression, saliency for noisy images, global approach, multi-scale approach
1. Introduction
Visual saliency is an important aspect of human vision, as it
directs our attention to what we want to perceive. It also affects
the processing of information in that it allocates limited perceptual
resources to objects of interest and suppresses our awareness of ar-
eas worth ignoring in our visual field. In computer vision tasks,
finding salient regions in the visual field is also essential because it
allows computer vision systems to process a flood of visual infor-
mation and allocate limited resources to relatively small, but inter-
esting regions or a few objects. In recent years, extensive research
has focused on finding saliency in natural images and predicting
where humans look in the image. As such, a wide diversity of
computational saliency models have been introduced (Seo & Mi-
lanfar, 2009; Itti, Koch, & Niebur, 1998; Zhang et al., 2008; Bruce
& Tsotsos, 2009; Gao, Mahadevan, & Vasoncelos, 2008; Gofer-
man et al., 2010; Garcia-Diaz et al., 2012; Hou & Zhang, 2007)
and aimed at transforming a given image into a scalar-valued map
(the saliency map) representing visual saliency in that image. This
saliency map has been useful in many applications such as ob-
ject detection (Rosin, 2009; Zhicheng & Itti, 2011; Rutishauser,
Walther, Koch, & Perona, 2004; Seo & Milanfar, 2010), image
quality assessment (Ma & Zhang, 2008; Niassi, LeMeur, Lecallet,
& Barba, 2007) and action detection (Seo & Milanfar, 2009) and
more.
Most saliency models are biologically inspired and based on
a bottom-up computational model. Itti et al. (Itti et al., 1998) in-
troduced a model based on the biologically plausible architecture
proposed by (Koch & Ullman, 1985) and measured center-surround
doi: Received: December 20, 2012. ISSN 1534–7362. © 20?? ARVO
Figure 1: The results of the state-of-the-art saliency models given a noisy image. The noise added to the test image is white Gaussian noise with variance σ² = 0.05.
contrast using a Difference of Gaussians (DoG) approach. Bruce
& Tsotsos (Bruce & Tsotsos, 2009) proposed Attention based on
Information Maximization (AIM) model. They measured saliency
at a pixel in the image by Shannon’s self-information of that loca-
tion with respect to its surrounding context. To estimate the prob-
ability density of a visual feature in the high dimensional space,
they employed a representation based on independent components,
which are determined from natural scenes. Zhang et al. (Zhang et
al., 2008)’s saliency model uses natural image statistics within a
Bayesian framework from which bottom-up saliency emerges nat-
urally as the self-information of visual features. Seo & Milanfar
(Seo & Milanfar, 2009) proposed the self-resemblance mechanism
to measure saliency. At each pixel, they first extract visual fea-
tures (local regression kernels) that are robust in extracting local
geometry of the image. Then, matrix cosine similarity (Seo & Mi-
lanfar, 2009, 2010) is employed to measure the resemblance of
each pixel to its surroundings. Hou & Zhang (Hou & Zhang, 2007)
derived saliency by measuring the spectral residual of an image
which is the difference between the log spectrum of the image and
its smoothed version. They posit that the statistical singularities
in the spectrum may be responsible for anomalous regions in the
image. Goferman et al. (Goferman et al., 2010) proposed context-
aware saliency. Their saliency model aims to detect not only the
dominant objects but also the parts of their surroundings that con-
vey the context. This type of model is useful in applications where
the context of the dominant objects is just as essential as the objects
themselves. Garcia-Diaz et al. (Garcia-Diaz et al., 2012) proposed
the Adaptive Whitening Saliency (AWS) model. The whitening
process (decorrelation and variance normalization) is applied to the
chromatic components of the image. Then, the multi-oriented and
multi-scale local energy representation of the image is obtained
by applying a bank of log-Gabor filters parameterized by different
scales and orientations. Visual saliency is measured by a simple
vector norm computation in the obtained representation.
Despite the wide variety of computational saliency models,
they all assume the given image is clean and free of distortions.
However, as in Fig. 1, when we feed a noisy image instead of
a clean image into existing saliency detection algorithms, many
fail, spuriously declaring saliency throughout the noisy image. In particular,
for the models of (Zhang et al., 2008), (Bruce & Tsotsos, 2009),
(Goferman et al., 2010), and (Seo & Milanfar, 2009), it is apparent
that any applications using these noisy saliency maps cannot per-
form well. In contrast, (Hou & Zhang, 2007) and (Garcia-Diaz et
al., 2012) provide more stable results because they implicitly sup-
press the noise during the process of computing saliency. In the
model by Hou & Zhang, spectral filtering of the image will sup-
press the noise in the image. In the AWS model by Garcia-Diaz et al.,
incorporating multi-oriented and multi-scale representation with
their whitening process also implicitly suppresses the noise. Al-
though their results appear somewhat insensitive to
noise, a fundamental and explicit treatment of saliency in noisy
images is missing from the literature (Le Meur, 2011). We provide
such a treatment in this paper. Furthermore, we will demonstrate that
the price for this apparent insensitivity to noise is that the over-
all performance over a large range of noise strengths is dimin-
ished. In this paper, we aim to achieve two goals simultaneously.
First, we propose a simple and statistically well-motivated compu-
tational saliency model which achieves a high degree of accuracy
in predicting where humans look. Second, we illustrate that the
proposed model is stable when a noise-corrupted image is given
Figure 2: Overview of saliency detection: We observe dissimilarity of a center patch around xj relative to other patches. The proposed saliency model is a weighted average of the observed dissimilarities.
and improves on other state-of-the-art models over a large range of
noise strengths.
The proposed saliency model is based on a bottom-up com-
putational model. As such, an underlying hypothesis is that human
eye fixations are driven to conspicuous regions in the test image,
which stand out from their surroundings. In order to measure this
distinctiveness of a region, we observe dissimilarities between a cen-
ter patch of the region and other patches (Fig. 2). Once we have
measured these dissimilarities, the problem of interest is how to
aggregate them to obtain an estimate of the underlying saliency of
that region. We look at this problem from an estimation theory
point of view and propose a novel and statistically sound saliency
model. We assume that each observed dissimilarity has an under-
lying true value, which is measured with uncertainty. Given these
noisy observations, we estimate the underlying saliency by solving
a local data-dependent weighted least squares problem. As we will
see in the next section, this results in an aggregation of the dissimi-
larities with weights depending on a kernel function to be specified.
We define the kernel function so that it gives higher weight to simi-
lar patch pairs than dissimilar patch pairs. Giving higher weights to
more similar patch pairs would seem counter-intuitive at first. But
this process will ensure that only truly salient objects would be de-
clared so, sparing us from too many false declarations of saliency.
The proposed estimate of saliency at pixel x_j is defined as:

s(x_j) = \sum_{i=1}^{N} w_{ij} y_i    (1)

where y_i and w_{ij} are, respectively, the observed dissimilarity (to be defined in the next section) and the weight for the ij-th patch pair.
It is important to highlight the direct relation of our approach
to two earlier approaches of Seo & Milanfar and Goferman et al.
We make this comparison explicit here because these methods too
involve aggregation of local dissimilarities. While this was not
made entirely clear in either (Goferman et al., 2010) or (Seo &
Milanfar, 2009), it is interesting to note that these methods em-
ployed arithmetic and harmonic averaging of local dissimilarities,
respectively. In (Seo & Milanfar, 2009), they defined the estimate
of saliency at pixel x_j by

s(x_j) = \frac{\exp(1/\tau)}{N} \underbrace{\frac{N}{\sum_{i=1}^{N} 1/y_i}}_{\text{harmonic mean}}    (2)

where y_i = \exp(-\rho_i/\tau) and \rho_i is the cosine similarity between visual features extracted from the center patch around the pixel x_j and its i-th nearby patch. This saliency model is (to within a constant) the harmonic mean of the dissimilarities y_i.
In (Goferman et al., 2010), they formulated the saliency at pixel x_j as

s(x_j) = 1 - \exp\Big(-\underbrace{\frac{1}{N}\sum_{i=1}^{N} y_i}_{\text{arithmetic mean}}\Big)    (3)

where y_i is the dissimilarity measure between a center patch around
the pixel x_j and any other patch observed in the test image. This
saliency model is essentially the arithmetic mean of the y_i's. Besides the use of
the exponential, the important difference as compared to our ap-
proach is that they use constant weights wij = 1/N for the aggre-
gation of dissimilarities, whereas we use data-dependent weights.
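The three aggregation rules above can be contrasted in a short NumPy sketch. This is a toy illustration, not the authors' implementation: the sample dissimilarity values and function names are ours, and the bandwidth h is arbitrary.

```python
import numpy as np

def harmonic_agg(y, tau=1.0):
    # Seo & Milanfar-style rule (Eq. 2): scaled harmonic mean of the y_i
    n = len(y)
    return np.exp(1.0 / tau) / n * (n / np.sum(1.0 / y))

def arithmetic_agg(y):
    # Goferman et al.-style rule (Eq. 3): 1 - exp(-arithmetic mean)
    return 1.0 - np.exp(-np.mean(y))

def kernel_agg(y, h=0.1):
    # Proposed rule (Eq. 1): data-dependent weighted average with the
    # reference y_r = min_i y_i and Gaussian kernel weights (Eq. 7)
    y_r = np.min(y)
    w = np.exp(-(y - y_r) ** 2 / h)
    return np.sum(w * y) / np.sum(w)

# Dissimilarities for a background pixel: mostly small, one spurious outlier.
y_bg = np.array([0.05, 0.06, 0.05, 0.07, 0.9])
# Dissimilarities for a truly salient pixel: uniformly large.
y_sal = np.array([0.85, 0.9, 0.8, 0.95, 0.9])

# The data-dependent weights discount the outlier for the background
# pixel (low score), while the salient pixel keeps a high score.
print(kernel_agg(y_bg), kernel_agg(y_sal))
```

Note how the kernel weighting keeps the background score low despite the single large dissimilarity, which is exactly the "avoid easily declaring saliency" behavior argued for above.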
In summary, among those saliency models in which dissim-
ilarities (either local or global) are combined by different aggre-
gation techniques, our proposed method is simpler, better justified,
and indeed a more effective arithmetic aggregation based on kernel
regression.
Many saliency models have leveraged the multi-scale approach
(Gao et al., 2008; Goferman et al., 2010; Zhicheng & Itti, 2011;
Walther & Koch, 2006; Zhang et al., 2008; Garcia-Diaz et al.,
2012). In the proposed model, we also exploit a global and multi-
scale approach by extending the analysis window to the whole image. By
doing so, we enhance the degree of accuracy in predicting human
fixations and further realize strong stability to noise as well.
The paper is organized as follows. In the next section, we
provide further technical details about the proposed saliency model
and describe the global and multi-scale approach to the compu-
tational saliency model. In the performance evaluation section, we demon-
strate the efficacy of this saliency model in predicting human fix-
ations with six other state-of-the-art models (Zhang et al., 2008;
Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman et al.,
2010; Seo & Milanfar, 2009; Garcia-Diaz et al., 2012) and investi-
gate the stability of our method in the presence of noise. In the last
section, we conclude the paper.
2. Technical details
2.1. Non-parametric regression for saliency
In this section, we propose a measure of saliency at a pixel of
interest from observations of dissimilarity between a center patch
around the pixel and its nearby patches (See Fig. 2). Let us denote
by ρi the similarity between a patch centered at a pixel of interest,
and its i-th neighboring patch. Then, the dissimilarity is measured
as a decreasing function of ρ as follows:
y_i = e^{-\rho_i}, \quad i = 1, \ldots, N    (4)
The similarity function ρ can be measured in a variety of ways
(Seo & Milanfar, 2009; Swain & Ballard, 1991; Rubner, Tomasi,
& Guibas, 2000), for instance using the matrix cosine similarity
between visual features computed in the two patches (Seo & Mi-
lanfar, 2009, 2010). For our experiments, we shall use the LARK
features as defined in (Takeda, Farsiu, & Milanfar, 2007), which
have been shown to be robust to the presence of noise and other
distortions. A more detailed description of these features is given
in (Takeda et al., 2007; Takeda, Milanfar, Protter, & Elad, 2009).
We note that the effectiveness of LARK as a visual descriptor has
led to its use for object and action detection and recognition, even
in the presence of significant noise (Seo & Milanfar, 2009, 2010).
From an estimation theory point of view, we assume that each ob-
servation yi is in essence a measurement of the true saliency, but
measured with some error. This observation model can be posed
as:
y_i = s_i + \eta_i, \quad i = 1, \ldots, N    (5)
where ηi is noise. Given these observations, we assume a locally
constant model of saliency and estimate the expected saliency at
pixel xj by solving the weighted least squares problem
s(x_j) = \arg\min_{s(x_j)} \sum_{i=1}^{N} [y_i - s(x_j)]^2 K(y_i, y_r)    (6)

where y_r is a reference observation. We choose y_r = \min_i y_i, where i = 1, ..., N ranges over a neighborhood of j; that is, y_r corresponds to the patch most similar to the center patch at j. Depending on the differ-
ence between this reference observation yr and each observation
yi, the kernel function K(·) gives higher or lower weight to each
observation as follows:
K(y_i, y_r) = e^{-(y_i - y_r)^2 / h}    (7)
Therefore, the weight function gives higher weight to similar patch
pairs than dissimilar patch pairs. The rationale behind this way of
weighting is to avoid easily declaring saliency; that is, the aggrega-
tion of dissimilarities for a truly salient region should be still high
even if we put more weight on the most similar patch pairs. Put
yet another way, we do not easily allow any region to be declared
salient and thus we reduce the likelihood of false alarms. We set
the weight of the reference observation itself to w_r = \max_i w_i, the largest of the other weights. This
setting avoids the excessive weighting of the reference observation
in the average. The parameter h controls the decay of the weights,
and is determined empirically to get best performance.
Minimizing Equation 6 yields merely a weighted average of the measured dissimilarities, where the weights are computed based on "distances" between each observation and the reference observation:

s(x_j) = \frac{\sum_{i=1}^{N} K(y_i, y_r) y_i}{\sum_{i=1}^{N} K(y_i, y_r)} = \sum_{i=1}^{N} w_{ij} y_i    (8)
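As a concrete sketch, the per-pixel estimate of Equations 4-8 fits in a few lines of NumPy. The similarity values ρ_i would in practice come from the matrix cosine similarity of LARK features; the values below and the bandwidth h are illustrative only.

```python
import numpy as np

def saliency_at_pixel(rho, h=0.1):
    """Kernel-regression saliency at one pixel (Eqs. 4-8).

    rho: similarities between the center patch and its N other patches;
    values near 1 mean nearly identical patches."""
    y = np.exp(-rho)                      # dissimilarities (Eq. 4)
    i_r = np.argmin(y)                    # reference: most similar patch pair
    w = np.exp(-(y - y[i_r]) ** 2 / h)    # kernel weights (Eq. 7)
    # Cap the reference's own weight at the largest of the other weights,
    # so it is not excessively weighted in the average.
    w[i_r] = np.delete(w, i_r).max()
    return np.sum(w * y) / np.sum(w)      # weighted LS solution (Eq. 8)

# A pixel whose patch resembles its surroundings (high rho) scores low;
# a pixel whose patch stands out (low rho) scores high.
s_bg = saliency_at_pixel(np.array([0.90, 0.95, 0.92, 0.88, 0.30]))
s_sal = saliency_at_pixel(np.array([0.10, 0.15, 0.05, 0.20, 0.12]))
print(s_bg, s_sal)
```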
2.2. Global and multi-scale saliency
So far, the underlying idea is that the saliency of a pixel is
measured by the distinctiveness of a center patch around that pixel
relative to its neighbors. In this section, we extend our local anal-
ysis window (gray dashed rectangle in Fig. 2) to the entire input
image. By doing this, we aggregate all dissimilarities between the
center patch and all patches observed from the entire image. This
is a sensible and well-motivated extension because in general it
is consistent with the way the human visual system inspects the
global field of view at once to determine saliency. Furthermore,
we incorporate a multi-scale approach by taking the patches (to be
compared to the center patch) from the multiscale Gaussian pyra-
mid constructed from the given image. Fig. 3 illustrates the global
and multi-scale saliency computation. In our implementation, we
follow the same general procedure as in (Goferman et al., 2010).
First, we denote by R = {r1, r2, ..., rM} the multiple scales ap-
plied to the input image (the horizontal axis in Fig. 3), and then
at each scale rm, where 1 ≤ m ≤ M , we compute the dissim-
ilarity of the center patch relative to all patches observed in the
images whose scales are Rq = {rm, rm/2, rm/4} (the vertical
axis in Fig. 3). Consequently, for M scales, M saliency maps are
computed and resized to the original image size by bilinear inter-
polation. The resulting multiple saliency maps are then combined
into one by simple averaging. Fig. 4 demonstrates the difference
between those saliency maps obtained at different scales. While
the fine scale result detects details such as textures and edges, the
coarse scale result detects global features. Note that we fixed the
size of the patch at each scale rm as the yellow rectangle shown in
Fig 3.
In order to rewrite the saliency equation in a multi-scale fash-
ion, we denote again the dissimilarity measure defined in Equation
4 by y_i = e^{-\rho(p_i, p_j)}, where p_i is the i-th patch observed across the
multi-scale pyramid and pj is the center patch at pixel xj . There-
fore, we can rewrite the saliency equation at each scale rm as fol-
lows:
s(x_j)^{(r_m)} = \sum_{p_i \in R_q} w_{ij} e^{-\rho(p_i, p_j)}    (9)
The saliency at pixel xj is taken as the mean of its saliency
across all scales:
s(x_j) = \frac{1}{M} \sum_{m=1}^{M} s(x_j)^{(r_m)}    (10)
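A minimal sketch of this multi-scale combination (Equations 9-10), assuming SciPy's `zoom` for bilinear resizing; the gradient-magnitude stand-in for the per-scale estimator is ours, for illustration only.

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_saliency(img, scales=(1.0, 0.6, 0.4), per_scale=None):
    """Compute a saliency map at each scale r_m, resize each map back to
    the input size by bilinear interpolation (order=1), and average the
    maps (Eq. 10). `per_scale` stands in for the per-scale estimator of
    Eq. 9; a gradient-magnitude placeholder is used here."""
    if per_scale is None:
        def per_scale(x):
            gy, gx = np.gradient(x)
            return np.hypot(gx, gy)
    H, W = img.shape
    maps = []
    for r in scales:
        small = zoom(img, r, order=1)   # the image at scale r_m
        s = per_scale(small)            # saliency map at that scale
        back = zoom(s, (H / s.shape[0], W / s.shape[1]), order=1)
        maps.append(back)
    return np.mean(maps, axis=0)        # average over the M scales

# Toy input: a flat image with a bright square; the square's boundary
# should receive the highest combined saliency.
img = np.zeros((40, 40))
img[15:25, 15:25] = 1.0
smap = multiscale_saliency(img)
```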
In the next section we first evaluate our saliency model for
clean images against six existing saliency models (Zhang et al.,
2008; Bruce & Tsotsos, 2009; Hou & Zhang, 2007; Goferman et
al., 2010; Seo & Milanfar, 2009; Garcia-Diaz et al., 2012), and
then investigate the stability of our saliency model for noisy im-
ages. We also see the effect of the global and multi-scale approach
on overall performance.
3. Performance evaluation
3.1. Predicting human fixation data
In this section, we evaluate the proposed saliency model in
predicting human eye fixations on Bruce & Tsotsos’s dataset (Avail-
able at http://www-sop.inria.fr/members/Neil.Bruce/). This is a
dataset of 120 indoor and outdoor natural images and has been
commonly used to validate many state-of-the-art saliency models
(Seo & Milanfar, 2009; Zhang et al., 2008; Bruce & Tsotsos, 2009;
Garcia-Diaz et al., 2012). The subjects were given no instructions
except to observe the images and the eye fixations were recorded
over a 4 s viewing period (Bruce & Tsotsos, 2009).
The six state-of-the-art models used for comparison are the
Saliency Using Natural statistics (SUN) model by Zhang et al.,
the Attention based on Information Maximization (AIM) model by
Bruce & Tsotsos, the Spectral Residual model by Hou & Zhang,
the Context Aware model by Goferman et al., the Self-resemblance
model by Seo & Milanfar and the Adaptive Whitening Saliency
(AWS) model by Garcia-Diaz et al. For each model, we used the
default parameters suggested by the respective authors.
In our implementation, the similarity function ρ in Equation 4
is computed using the matrix cosine similarity between the LARK
features as in the model of Seo & Milanfar. We sample patches of
Figure 3: Global and multi-scale saliency computation: At each scale rm ∈ R (column), we search all patches to be compared to the center patch (yellow rectangle) across multiple images whose scales are Rq = {rm, rm/2, rm/4}.
size 7 × 7 with 50% overlap from multiple scale images. We use
three scales (M = 3), R = {1.0, 0.6, 0.4} and the smallest scale
allowed in Rq is 20% of the original size as in (Goferman et al.,
2010).
Fig. 5 demonstrates a qualitative comparison of the proposed
model with the fixation density map and the saliency maps pro-
duced by the six competing models. All saliency maps were nor-
malized to range between zero and one in order to make a com-
parison with equalized contrast. There is qualitative similarity be-
tween Seo & Milanfar and our model except that ours are seen to
have fewer spurious salient regions. We note that our global
and multi-scale approach also contributes to the advantage of our
model. More precisely, as similarly argued in (Goferman et al.,
2010), background pixels are likely to have similar patches in the
entire image at multiple scales, while the salient pixels have sim-
ilar patches in the nearby region, and at a few scales. Therefore,
incorporating the global and multi-scale approach not only emphasizes
the contrast between salient and non-salient regions, but also sup-
presses frequently occurring features in the background. Fig. 6
shows three different test images, each of which has one or two
salient objects. The saliency models with the global and multi-
scale considerations such as Goferman et al.’s model and the pro-
posed model produce more reliable results than others. It seems
that Goferman et al.'s model suppresses saliency on the frequently
occurring features more efficiently than ours. However, we note
that Goferman et al. simulated the visual contextual effect by iden-
tifying the attended areas where saliency value exceeds a certain
threshold and weighting each pixel outside the attended areas ac-
cording to its Euclidean distance to the closest attended pixel. This
apparently suppresses more saliency on features outside the
attended areas, such as the bricks in the first test image. By contrast,
the models of Zhang et al., Bruce & Tsotsos and Seo & Milanfar
find high saliency in the uninteresting background. In particular,
the saliency model by Seo & Milanfar is seen to be most sensitive
to the frequently occurring features. While it seems that Hou &
Zhang’s model is also robust to the frequently occurring features
as seen in the first and third image, it still declares high saliency
values in the pile of green peppers.
For quantitative performance analysis, we use area under the
receiver operating characteristic curve (AUC) and Spearman’s rank
Figure 4: The saliency maps obtained at different scales. The multi-scale approach not only gives high saliency values at object edges (from the fine scale result), but also detects global features (from the coarse scale result).
correlation coefficient (SCC). The AUC metric determines how
well fixated and non-fixated locations can be discriminated by the
saliency map using a simple threshold (Tatler, Baddeley, & Gilchrist,
2010). If the saliency value at a location exceeds the threshold, then
we declare that location fixated. By sweeping the threshold between
the minimum and maximum values in the saliency map, the true
positive rate (declaring fixated locations as fixated) and the false
positive rate (declaring non-fixated locations as fixated) are cal-
culated and the receiver operating characteristic (ROC) curve is
constructed by plotting the true positive rate as a function of the
false positive rate across all possible thresholds. The SCC metric
measures the degree of similarity between the ranked values of the
saliency map and the fixation density map (see Fig. 5 for examples of the fixation den-
sity map). If they are not well-matched, the correlation coefficient
is zero.
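Under the simplifying assumption that ties can be ignored (reasonable for continuous-valued maps), the SCC computation can be sketched with NumPy alone; the function name and toy maps are ours.

```python
import numpy as np

def rank_correlation(sal_map, density_map):
    """Spearman's rank correlation between a saliency map and a fixation
    density map: rank the pixel values of each map, then take the Pearson
    correlation of the ranks. Ties get their order of appearance rather
    than average ranks, which suffices for continuous-valued maps."""
    ra = np.argsort(np.argsort(sal_map.ravel()))      # ranks of saliency values
    rb = np.argsort(np.argsort(density_map.ravel()))  # ranks of density values
    return np.corrcoef(ra, rb)[0, 1]

# Monotonically related maps have SCC = 1; reversed maps have SCC = -1.
s = np.arange(16.0).reshape(4, 4)
print(rank_correlation(s, s ** 2), rank_correlation(s, -s))
```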
Zhang et al. (Zhang et al., 2008) pointed out two problems in
using the AUC metric: First, simply using a Gaussian blob centered
in the middle of the image as the saliency map produces excellent
results because most human eye fixation data have a center bias as
photographers tend to place objects of interest in the center (Tatler
et al., 2010; Parkhurst & Niebur, 2003). Secondly, some saliency
models (Seo & Milanfar, 2009; Bruce & Tsotsos, 2009) exhibit im-
age border effects due to invalid filter responses at the borders of
images, which also produces an artificial improvement in the AUC
metric (Zhang et al., 2008). To avoid these problems, they set the
non-fixated locations of a test image as the fixated locations in an-
other image from the same test set. We follow the same procedure:
For each test image, we first compute a histogram of saliency at the
fixated locations of the test image and a histogram of saliency at the
fixated locations, but of a randomly chosen image from the test set.
Then, we compute all possible true positive and false positive rates
by varying the threshold on these two histograms respectively. Fi-
nally, we compute the AUC. All AUC’s computed for the various
images in the database are averaged to derive the reported overall
AUC. Because the test images for the non-fixations are randomly
chosen, we repeat this procedure 100 times and report the mean
and the standard error of the results in Table 1. As this shows, our
saliency model outperforms most other state-of-the-art models in
the AUC metric. Only AWS is slightly better in AUC than ours, but
the difference is roughly within the standard error bounds. In con-
trast to the AUC metric, our model ranks third in the SCC metric.
However, we have more confidence in the AUC metric that is based
on the human fixations rather than the SCC metric that is based on
the fixation density map produced by a 2D Gaussian kernel density
estimate based on the human fixations.
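The shuffled-AUC procedure above can be sketched as follows. Sweeping the threshold over all values is equivalent to the Mann-Whitney pair-counting form used here; the function and variable names are ours.

```python
import numpy as np

def shuffled_auc(sal, fix_rc, other_fix_rc):
    """Shuffled AUC: positives are saliency values at the test image's
    fixated locations; negatives are saliency values at the fixated
    locations of another image from the same set, which compensates for
    center bias and border effects. Coordinates are (row, col)."""
    pos = np.array([sal[r, c] for r, c in fix_rc], dtype=float)
    neg = np.array([sal[r, c] for r, c in other_fix_rc], dtype=float)
    # Fraction of (positive, negative) pairs ordered correctly; ties count half.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Perfect separation gives AUC = 1; identical values give chance level 0.5.
sal = np.zeros((10, 10))
sal[2, 2], sal[5, 5] = 1.0, 0.8
print(shuffled_auc(sal, [(2, 2), (5, 5)], [(0, 0), (9, 9)]))
```

In the actual evaluation this score would be averaged over all images and over 100 random draws of the "other" image, as described above.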
We note that eye tracking data may contain errors which orig-
inate from systematic error in calibrating the eye
tracker and from its limited accuracy. Therefore, we simulate
this error by adding Gaussian noise to the fixated locations
in the image. Table 2 shows all AUC’s computed for the various
standard deviations of Gaussian noise. We observed that this does
not much affect the performance (at least for standard deviations
less than 10 and for the AUC metric), and our method still outper-
forms most other state-of-the art models.
In the next section, we will see that the proposed model is
more stable than others when the input images are corrupted by
noise, and thus produces better performance overall across a large
range of noise strengths.
Figure 5: Examples of results. For comparison, the fixation density maps produced from the fixation points provided by Bruce & Tsotsos are also shown.
Figure 6: Examples of saliency on images containing frequently occurring features.
Table 1: Performance in predicting human fixations in clean images
Model AUC (SE) SCC
Proposed method 0.713 (0.0007) 0.386
Garcia-Diaz et al. 0.714 (0.0008) 0.362
Seo & Milanfar 0.696 (0.0007) 0.346
Goferman et al. 0.686 (0.0008) 0.405
Hou & Zhang 0.672 (0.0007) 0.317
Bruce & Tsotsos 0.672 (0.0007) 0.424
Zhang et al. 0.639 (0.0007) 0.243
Table 2: AUC for the various standard deviations (std) from the original fixation data.
Model std(0) std(5) std(10)
Proposed method 0.713 0.713 0.709
Garcia-Diaz et al. 0.714 0.713 0.710
Seo & Milanfar 0.696 0.695 0.693
Goferman et al. 0.686 0.686 0.681
Hou & Zhang 0.672 0.670 0.668
Bruce & Tsotsos 0.672 0.672 0.672
Zhang et al. 0.639 0.638 0.638
3.2. Stability of saliency models for noisy images
In this section, we investigate the stability of saliency mod-
els for noisy images. The same original test images from Bruce
and Tsotsos's dataset are used, and the noise added to the test im-
ages is white Gaussian with variance σ² equal to 0.01, 0.05,
0.1, or 0.2 (the intensity value of each pixel ranges
from 0 to 1). The saliency maps computed from the noisy im-
ages are compared to the human fixations through the same pro-
cedure. One may be concerned that the human fixations used in
this evaluation were recorded from noise-free images and not the
corrupted images. However, we focus on investigating the sen-
sitivity of computational models of visual attention subjected to
visual degradations rather than evaluating the performance in pre-
dicting human fixation data in noisy images. Therefore, we use the
same human fixations to see if the computational models achieve
the same performance as in the noise-free case. Also, to the best
of our knowledge, there is no available public fixation database on
noisy images. So, we resorted instead to analyzing how state-of-
the-art computational models respond to noisy visual stimuli.
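The corruption step itself is straightforward; a sketch of the noise model used throughout this section, assuming intensities in [0, 1] (the function name and seed handling are ours):

```python
import numpy as np

def add_gaussian_noise(img, var, seed=None):
    """Add white Gaussian noise of variance `var` to an image with
    intensities in [0, 1], clipping the result back to the valid range.
    The experiments here use var in {0.01, 0.05, 0.1, 0.2}."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(var), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example: mid-gray image corrupted at the strongest level used here.
clean = np.full((64, 64), 0.5)
noisy = add_gaussian_noise(clean, 0.2, seed=0)
```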
Examples of noisy test images and their saliency maps are de-
picted in Fig. 7. We observe that the proposed model shows more
stable responses in flat regions in the background such as sky, road
and wall than in the regions containing high-frequency textured
area such as leaves of tree. This phenomenon can be explained as
Figure 7: Saliency maps produced by the proposed method on increasingly noisy images. From left to right, a clean image, and noisy images with noise variance σ² = {0.01, 0.05, 0.1, 0.2}, respectively.
follows: Background pixels in such flat regions are likely to have
more similar patches in the entire image while salient pixels have
similar patches in the nearby region. Furthermore, since we give
higher weights to similar patch pairs than dissimilar ones when we
aggregate the dissimilarities of those patches, we tend to average
more dissimilarities for the background pixels and suppress more
noise than on the salient pixels.
For quantitative performance analysis, we plot the AUC val-
ues for each method against the noise strength. As one may expect,
the performance in predicting human fixations generally decreases
as the noise strength increases (See the curves in Fig. 8). However,
our saliency model outperforms the six other state-of-the-art mod-
els over a wide range of noise strengths. Only Garcia-Diaz et al.’s
model shows similar performance for the noise-free case, but the
proposed model shows better performance for the noisy case.
As we alluded to earlier, most saliency models implicitly sup-
press the noise by blurring and downsampling the input image.
Hou & Zhang and Seo & Milanfar downsampled the input im-
age to 64×64. Bruce & Tsotsos, Zhang et al. and Garcia-Diaz
et al. also used the input image downsampled by a factor of two. In
Goferman et al.’s model, the input image is downsampled to 250
pixels. However, as illustrated in Fig. 8, the price for this im-
plicit treatment is that the overall performance over a large range
of noise strengths is diminished except in Hou & Zhang’s model.
Since Hou & Zhang's model removes redundancies in the frequency do-
main after the input image is downsampled, it suppresses more
noise and shows stable results. However, we note that their method
does not achieve a high degree of accuracy overall in predicting
human fixations. In contrast, our regression based saliency model
achieves a high degree of accuracy for noise-free and noisy cases
simultaneously, and improves on competing models over a large
range of noise strengths.
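The AUC scores used throughout can be reproduced, in spirit, with a simple rank-based ROC computation. The sketch below treats saliency values at fixated pixels as positives and all remaining pixels as negatives; it is an illustrative stand-in and does not reproduce the exact evaluation protocol (e.g., any shuffled baseline).

```python
import numpy as np

def fixation_auc(sal_map, fix_mask):
    """Rank-based ROC AUC (Mann-Whitney U statistic): the probability
    that a fixated pixel receives a higher saliency value than a
    non-fixated one."""
    pos = sal_map[fix_mask].ravel()
    neg = sal_map[~fix_mask].ravel()
    # Double argsort yields 1-based ranks; ties are broken arbitrarily
    # rather than averaged, which suffices for a sketch.
    ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1
    u = ranks[:pos.size].sum() - pos.size * (pos.size + 1) / 2.0
    return u / (pos.size * neg.size)
```

A saliency map that perfectly separates fixated from non-fixated pixels scores 1.0; chance performance is 0.5.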
We also investigated how state-of-the-art computational models
respond to noisy visual stimuli. According to the Helmholtz prin-
ciple (Desolneux, Moisan, & Morel, 2008), the human visual sys-
tem does not perceive structure in a uniform random image; a struc-
ture is perceived only when some relatively large deviation from
randomness occurs. By this principle, bottom-up approaches should
produce saliency maps roughly similar to those obtained from clean
images, because the random features in the input image are largely
suppressed. That is to say, a good computational saliency model
should behave similarly in the presence of noise and return stable
results. We made several noisy synthetic images by adding differ-
ent amounts of white Gaussian noise to a 128×128 gray image con-
taining a 19×19 black square in the center (Fig. 9). The saliency
maps computed from these noisy synthetic images are normalized
to range from zero to one. We note that the input images were not
downsampled or blurred before calculating the saliency map, so the
implicit noise suppression was excluded from this experiment. Fig.
9 shows the results produced by the six other state-of-the-art mod-
els and the proposed model. Only Garcia-Diaz et al.'s model and
the proposed model remained robust to the noise. We also observed
that the saliency maps from Seo & Milanfar's model and Hou &
Zhang's model in the second row (noise variance 0.05) are severely
degraded compared to the ones in Fig. 1. In addition, Hou &
Zhang's model detects only details (notice the white boundary of
the square), and Zhang et al.'s model does not respond to a black
square of the given size. We believe that each model has a differ-
ent inherent sensitivity to noise and responds differently to image
features at a given scale; therefore, different degrees of blurring and
downsampling will no doubt affect the result of each model differ-
ently. To investigate the inherent sensitivity of each model to noise,
we performed the same evaluation on the saliency maps, but with
the same degree of resizing and blurring applied to the input im-
ages: we downsampled all images to the same size of 250 pixels.
Fig. 10 shows the resulting performance in predicting human fix-
ations. The proposed model still outperforms the other models and
achieves a high degree of accuracy for both the noise-free and noisy
cases.
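The synthetic stimuli used in this experiment are easy to regenerate. The sketch below follows the description in the text; the mid-gray background level of 0.5 and the clipping of noisy values to [0, 1] are assumptions not stated above.

```python
import numpy as np

def make_noisy_square_stimuli(variances=(0.01, 0.05, 0.1, 0.2), seed=0):
    """A 128x128 gray image with a 19x19 black square at the center,
    plus copies corrupted by white Gaussian noise of the given variances."""
    rng = np.random.default_rng(seed)
    img = np.full((128, 128), 0.5)          # assumed mid-gray background
    c = 64
    img[c - 9:c + 10, c - 9:c + 10] = 0.0   # 19x19 black square, centered
    noisy = [np.clip(img + rng.normal(0.0, np.sqrt(v), img.shape), 0.0, 1.0)
             for v in variances]
    return img, noisy
```

The saliency maps computed from these images would then be normalized to [0, 1] before comparison, as described in the text.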
Finally, for the sake of completeness, we show the effect of the
global and multi-scale approaches on our computational saliency
model. To this end, we first evaluated the proposed model without
the global and multi-scale extensions; in other words, we consid-
ered only the patches in a 7×7 local window when measuring the
dissimilarities for each pixel (denoted "Local + Single-scale" in Fig.
11). We then extended the local analysis window to the entire im-
age and evaluated the model again ("Global + Single-scale"). Last,
we further drew patches from multiple scaled copies of the image
("Global + Multi-scale"). As seen in Fig. 11, the global approach
yields better performance. The multi-scale approach also improves
performance, but the improvement is not as significant as that ob-
tained with the global approach.
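The "Global + Multi-scale" variant can be sketched by pooling patches from the full image and from downscaled copies. The scale factors and the nearest-neighbor resizing below are illustrative choices, not our exact settings.

```python
import numpy as np

def multiscale_patch_pool(img, patch=7, scales=(1.0, 0.5, 0.25)):
    """Gather patches globally from several scaled copies of the image,
    so that each pixel's dissimilarities can be measured against the
    whole pool rather than a local 7x7 window."""
    r = patch // 2
    pools = []
    for s in scales:
        h = max(patch, int(img.shape[0] * s))
        w = max(patch, int(img.shape[1] * s))
        # Nearest-neighbor resize keeps the sketch dependency-free.
        yi = (np.arange(h) * img.shape[0] / h).astype(int)
        xi = (np.arange(w) * img.shape[1] / w).astype(int)
        small = img[np.ix_(yi, xi)]
        pad = np.pad(small, r, mode='reflect')
        pools.append(np.array([pad[y:y + patch, x:x + patch].ravel()
                               for y in range(h) for x in range(w)]))
    return np.vstack(pools)
```

Each pixel's center patch is then compared against this enlarged pool, which is what distinguishes the global and multi-scale variants from the purely local one.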
Figure 8: AUC versus noise variance for Zhang et al., Bruce & Tsotsos, Hou & Zhang, Goferman et al., Seo & Milanfar, Garcia-Diaz et al., and the proposed method. The performance in predicting the human fixations decreases as the amount of noise increases; however, the proposed method outperforms the six other state-of-the-art models over a wide range of noise strengths.
Figure 9: Examples of saliency on noisy synthetic images (rows: noise variance 0.01, 0.05, 0.1, 0.2; columns: Zhang et al., Bruce & Tsotsos, Hou & Zhang, Goferman et al., Seo & Milanfar, Garcia-Diaz et al., and the proposed method).
Figure 10: The performance in predicting the human fixations (AUC versus noise variance for the same seven models). The same degree of resizing and blurring was applied to all input images.
Figure 11: The effect of the global and multi-scale approaches on our saliency model (AUC versus noise variance for the Local + Single-scale, Global + Single-scale, and Global + Multi-scale variants).
4. Conclusion and future work
In this paper, we have proposed a simple and statistically well-
motivated saliency model based on non-parametric regression, in
which the saliency of a pixel is a data-dependent weighted combi-
nation of the dissimilarities observed in the given image. The pro-
posed method is practically appealing and effective because of its
simple mathematical form. To enhance its performance, we incor-
porate a global and multi-scale approach by extending the local
analysis window to the entire input image, and further to multiple
scaled copies of the image. Experiments on challenging sets of hu-
man fixation data demonstrate that the proposed saliency model not
only achieves a high degree of accuracy in the standard noise-free
scenario, but also improves on other state-of-the-art models for
noisy images. Owing to its robustness to noise, we expect the pro-
posed model to be quite effective in other computer vision applica-
tions subject to severe degradation by noise.
We investigated how different computational saliency models
predict human fixations on images corrupted by white Gaussian
noise. For future work, it would be interesting to investigate how
they fare on other types of distortion, such as blur, low resolution,
snow, rain, or air turbulence, which occur often in real-world ap-
plications. In addition, it would be interesting to pre-process de-
graded data before attempting to calculate the saliency map. As
observed in our earlier work (Kim & Milanfar, 2012), the perfor-
mance of saliency models can be improved by applying a denois-
ing step first. Unfortunately, this is not consistent with the way the
human visual system operates, so an algorithm based on filtering
first would not seem to be well-motivated by biology. In any event,
imperfect denoising might further distort the data, and thus this ap-
proach is at best sub-optimal.
Notes and Acknowledgements
Source code for the proposed algorithm will be released on-
line upon acceptance of our paper at the following link:
http://users.soe.ucsc.edu/~chkim/SaliencyDetection.html
This work was supported by AFOSR Grant FA9550-07-1-
0365 and NSF Grant CCF-1016018.
References
Bruce, N. D. B., & Tsotsos, J. K. (2009). Saliency, attention, and
visual search: An information theoretic approach. Journal
of Vision, 9(3), 1–24.
Desolneux, A., Moisan, L., & Morel, J. (2008). From Gestalt
Theory to Image Analysis: A Probabilistic Approach. Inter-
disciplinary Applied Mathematics. Springer.
Gao, D., Mahadevan, V., & Vasconcelos, N. (2008). On the plausi-
bility of the discriminant center-surround hypothesis for vi-
sual saliency. Journal of Vision, 8(7):13, 1–18.
Garcia-Diaz, A., Fdez-Vidal, X. R., Pardo, X. M., & Dosil, R.
(2012). Saliency from hierarchical adaptation through decor-
relation and variance normalization. Image and Vision Com-
puting, 30(1), 51–64.
Goferman, S., Zelnik-Manor, L., & Tal., A. (2010). Context-aware
saliency detection. IEEE International Conference on Com-
puter Vision and Pattern Recognition.
Hou, X., & Zhang, L. (2007). Saliency detection: A spectral
residual approach. Proc. IEEE Conf. Computer Vision and
Pattern Recognition.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-
based visual attention for rapid scene analysis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 20,
1254–1259.
Kim, C., & Milanfar, P. (2012). Finding saliency in noisy images.
SPIE Conference on Computational Imaging.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention:
Towards the underlying neural circuitry. Human Neurobiol-
ogy, 4(4), 219–227.
Le Meur, O. (2011). Robustness and repeatability of saliency
models subjected to visual degradations. IEEE International
Conference on Image Processing, 3285–3288.
Ma, Q., & Zhang, L. (2008). Saliency-based image quality assess-
ment criterion. Advanced Intelligent Computing Theories
and Applications. With Aspects of Theoretical and Method-
ological Issues, 5226, 1124–1133.
Ninassi, A., Le Meur, O., Le Callet, P., & Barba, D. (2007). Does
where you gaze on an image affect your perception of qual-
ity? Applying visual attention to image quality metric. IEEE
International Conference on Image Processing, II-169–II-
172.
Parkhurst, D., & Niebur, E. (2003). Scene content selected by
active vision. Spatial Vision, 16(2), 125–154.
Rosin, P. L. (2009). A simple method for detecting salient regions.
Pattern Recognition, 42(11), 2363–2371.
Rubner, Y., Tomasi, C., & Guibas, L. J. (2000). The earth mover's
distance as a metric for image retrieval. International Jour-
nal of Computer Vision, 40, 99–121.
Rutishauser, U., Walther, D., Koch, C., & Perona, P. (2004). Is
bottom-up attention useful for object recognition? IEEE
Conference on Computer Vision and Pattern Recognition, 2,
II-37–II-44.
Seo, H., & Milanfar, P. (2009). Static and space-time visual
saliency detection by self-resemblance. Journal of Vision,
9(12), 1–27.
Seo, H., & Milanfar, P. (2010). Training-free, generic object de-
tection using locally adaptive regression kernel. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 32(9), 1688–
1704.
Swain, M. J., & Ballard, D. H. (1991). Color indexing. Interna-
tional Journal of Computer Vision, 7(1), 11–32.
Takeda, H., Farsiu, S., & Milanfar, P. (2007). Kernel regression for
image processing and reconstruction. IEEE Trans. on Image
Processing, 16(2), 349–366.
Takeda, H., Milanfar, P., Protter, M., & Elad, M. (2009). Super-
resolution without explicit subpixel motion estimation. IEEE
Trans. on Image Processing, 18, 1958–1975.
Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. (2005). Visual
correlates of fixation selection: Effects of scale and time.
Vision Research, 45(5), 643–659.
Walther, D., & Koch, C. (2006). Modeling attention to salient
proto-objects. Neural Networks, 19(9), 1395–1407.
Zhang, L., Tong, M. H., & Marks, T. K. (2008). SUN: A Bayesian
framework for saliency using natural statistics. Journal of
Vision, 8(7), 1–20.
Li, Z., & Itti, L. (2011). Saliency and gist features for
target detection in satellite images. IEEE Transactions on
Image Processing, 20(7), 2017–2029.