Convergence Guarantees for Generalized Adaptive Stochastic Search Methods for Continuous Global Optimization
Rommel G. Regis
Mathematics Department
Saint Joseph’s University, Philadelphia, PA 19131, USA, [email protected]
June 23, 2010

This paper presents some simple technical conditions that guarantee the convergence of a general
class of adaptive stochastic global optimization algorithms. By imposing some conditions on the
probability distributions that generate the iterates, these stochastic algorithms can be shown to
converge to the global optimum in a probabilistic sense. These results also apply to global optimiza-
tion algorithms that combine local and global stochastic search strategies and also those algorithms
that combine deterministic and stochastic search strategies. This makes the results applicable to
a wide range of global optimization algorithms that are useful in practice. Moreover, this paper
provides convergence conditions involving the conditional densities of the random vector iterates
that are easy to verify in practice. It also provides some convergence conditions in the special
case when the iterates are generated by elliptical distributions such as the multivariate Normal
and Cauchy distributions. These results are then used to prove the convergence of some practical
stochastic global optimization algorithms, including an evolutionary programming algorithm. In
addition, this paper introduces the notion of a stochastic algorithm being probabilistically dense in
the domain of the function and shows that, under simple assumptions, this is equivalent to seeing
any point in the domain with probability 1. This, in turn, is equivalent to almost sure convergence
to the global minimum. Finally, some simple results on convergence rates are also proved.
Key words: global optimization; stochastic search; random search; global search; local search;
convergence; elliptical distribution; evolutionary algorithm; evolutionary programming
1 Introduction
Let f be a deterministic function defined on a set D ⊆ Rd. A point x∗ ∈ D such that f(x∗) ≤ f(x)
for all x ∈ D is said to be a global minimum point (or global minimizer) for f on D. If f is continuous
and D is compact, then a global minimum point for f on D is guaranteed to exist. However, a
global minimizer may also exist even if D is not compact or even if f is discontinuous on certain
regions in D. We shall prove simple conditions that guarantee the convergence of a general class of
adaptive stochastic algorithms for finding the global minimum of f on D. In particular, we prove
a theorem that extends the result given on page 40 of Spall (2003) and derive consequences of this
theorem for stochastic algorithms that are used in practice.
We mention that there is a substantial body of literature on stochastic methods for local op-
timization of noisy loss functions (e.g. see Chin (1997) or Spall (2003)). These methods are
called stochastic approximation algorithms and they typically use approximations of the gradient
of the loss function. Examples of these methods include the standard finite-difference stochastic
approximation (FDSA) algorithm (Kiefer and Wolfowitz 1952) and the Simultaneous Perturbation
Stochastic Approximation (SPSA) algorithm (Spall 1992). However, the focus of this paper is on
stochastic search methods for global optimization of a deterministic function.
Many results have been provided on the convergence of stochastic search algorithms for global
optimization (e.g. Solis and Wets 1981, Pinter 1996, Stephens and Baritompa 1998, Maryak and
Chin 2001, Zabinsky 2003). However, many of these convergence conditions are cumbersome to
verify for algorithms that are used in practice. Moreover, some of these convergence results apply
only to a specific type of stochastic search algorithm. For example, Maryak and Chin (2001) showed
that under certain conditions, the Simultaneous Perturbation Stochastic Approximation (SPSA)
algorithm converges in probability to the global optimum. In addition, many of these convergence
conditions are usually applied to the uniform distribution or its variants but seldom applied to
other distributions that are used in practical stochastic global optimization algorithms like the
Normal, Cauchy or even the Triangular distributions. For example, the Pure Adaptive Search (PAS)
algorithm by Zabinsky and Smith (1992) uses the uniform distribution on the improving level set
in each iteration. However, in many heuristic optimization algorithms such as evolution strategies
and evolutionary programming algorithms, the multivariate Normal distribution is typically used.
Section 3 will provide a simple and general framework that can capture a wide range
of stochastic global optimization algorithms that can be designed in practice. The main goal of
this paper is to provide a set of convergence conditions for this general framework that are easy to
verify and that apply to commonly used probability distributions in practice.
2 Notations
Because the algorithms are stochastic, we treat the iterates as d-dimensional random vectors whose
realizations are in D ⊆ Rd. Consider a stochastic algorithm whose iterates are given by the
sequence of random vectors {Yn}n≥1 defined on a probability space (Ω,B, P ), where the random
vector Yn : (Ω,B) → (D,B(D)) represents the nth function evaluation point. Here, Ω is the sample
space, B is a σ-field of subsets of Ω, and B(D) are the Borel sets in D. For maximum generality, we
also focus on adaptive algorithms. Here, adaptive means that Yn possibly depends on the random
vectors Y1, . . . , Yn−1 for all n > 1. Our use of the term adaptive is more general than the one used
by Zabinsky (2003) in the Pure Adaptive Search (PAS) algorithm. In fact, in PAS, each Yn has
the uniform distribution on the improving level set {x ∈ D : f(x) < f(Yi), i = 1, . . . , n − 1}.

In practical global optimization algorithms, it is not uncommon to combine both deterministic
and stochastic strategies for the selection of function evaluation points. For example, a practitioner
might want to start with a set of predetermined function evaluation points that he or she believes
would be good starting guesses for the location of the global minimum. If D is a closed hypercube,
another possibility is to begin with an optimal space-filling experimental design. Moreover, one
might also have a sequence of deterministically selected points in between sequences of randomly
selected points. For example, after doing some stochastic search, one might refine the current best
solution by running a deterministic gradient-based local minimization solver from it. To capture
global optimization algorithms in practice, we also allow for the possibility that some of the Yn’s
are degenerate random vectors. Here, a degenerate random vector in Rd is one whose mass is
concentrated at a single point in Rd. Note that when a particular Yn is degenerate this essentially
means that Yn is deterministically selected. Note that in this case, it is still possible that Yn depends
on the realizations of the previous random vectors in the sequence Y1, . . . , Yn−1.

In practice, it could happen that a randomly generated point falls outside the domain D. For
example, if Y is a multivariate Normal random vector with a positive definite covariance matrix,
then theoretically its realization can be anywhere on Rd even if its mean vector is restricted to
D. In this case, the runaway random point is typically replaced by a suitable point in D. More
precisely, let Y : (Ω,B) → (Rd,B(Rd)) be a random vector and let D ⊆ Rd. It would be useful to
have a deterministic function ρD : Rd → D with the property that ρD(x) = x for all x ∈ D. We
refer to such a function as an absorbing transformation for D since it “absorbs” any point of Rd
into D. For example, if D is compact, ρD : Rd → D may be defined such that ρD(x) is a point in
D with ‖x − ρD(x)‖ = infy∈D ‖x − y‖. That is, ρD(x) is a point in D that is as close as possible
to x ∈ Rd. When D = [a, b] ⊆ Rd is a hypercube, we have ρD(x) = min(max(a, x), b), where the
max and min are taken componentwise. Another possible choice of ρD is reflection across a
boundary of D, if D has one.
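For a hypercube domain, the componentwise clipping transformation ρD(x) = min(max(a, x), b) described above can be sketched in a few lines. This is an illustrative implementation only; the function name `rho_D` and the list-based vector representation are our own choices, not from the paper.

```python
# Sketch of the componentwise clipping absorbing transformation
# rho_D(x) = min(max(a, x), b) for a hypercube D = [a, b] in R^d.
# The name rho_D follows the text; the implementation is illustrative.

def rho_D(x, a, b):
    """Clip each coordinate of x into [a_i, b_i]; points already in D are fixed."""
    return [min(max(lo, xi), hi) for xi, lo, hi in zip(x, a, b)]
```

For example, with D = [0, 1]^2, the runaway point (1.5, −0.2) is absorbed to (1.0, 0.0), while any point already in D is returned unchanged, so ρD(x) = x for all x ∈ D, as required.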
3 Main Convergence Results
Our goal is to provide simple conditions that guarantee the convergence of stochastic algorithms
that follow the following framework, which generalizes the well-known Simple (or “Blind”) Random
Search that is given on page 38 of Spall (2003):
Algorithm A: Generalized Adaptive Random Search (GARS) Framework
Inputs:
(1) The objective function f : D → R, where D ⊆ Rd.
(2) A deterministic absorbing transformation ρD : Rd → D, i.e. ρD(x) = x for all x ∈ D.
(3) A collection of intermediate random elements {Λi,j : (Ω,B) → (Ωi,j ,Bi,j) : i ≥ 1 and j =
1, 2, . . . , ki} that are used to determine the trial random vector iterates. These Λi,j ’s could
be random variables, random vectors or other types of random elements defined on the same
probability space (Ω,B, P ).
For convenience, we define the following notation. Define E0 = ∅, and for each n ≥ 1, define the
collection of random elements

En := {Λi,j : i = 1, 2, . . . , n; j = 1, 2, . . . , ki} = En−1 ∪ {Λn,1, Λn,2, . . . , Λn,kn}.
Step 0. Set n = 1.
Step 1. Generate a realization of the random vector Yn : (Ω,B) → (Rd,B(Rd)) as follows:
Step 1.1 For each j = 1, . . . , kn, generate a realization of the intermediate random element
Λn,j : (Ω,B) → (Ωn,j ,Bn,j) according to some probability distribution.
Step 1.2 Set Yn = Θn(En) for some deterministic function Θn. That is, Yn is a deterministic
function of the random elements in En. (Hence, Y1, Y2, . . . , Yn are not necessarily
independent.)
Step 2. Set Xn = ρD(Yn) and evaluate f(Xn).
Step 3. Increment n = n + 1 and go back to Step 1.
We refer to a stochastic algorithm that follows the GARS framework above as a GARS algorithm.
In a GARS algorithm, the actual sequence of iterates (i.e. function evaluation points) is given by
{Xn}n≥1. Note that the realization of the random vector Xn is the same as that of Yn if the
realization of Yn belongs to D. Now we can also define the sequence {X∗n}n≥1, where X∗1 = X1, and
for n > 1, X∗n = Xn if f(Xn) < f(X∗n−1), while X∗n = X∗n−1 otherwise. Note that {X∗n}n≥1 is the
sequence of best points encountered by the algorithm. We say that a GARS algorithm converges
to the global minimum of f on D in probability (or almost surely) if {f(X∗n)}n≥1 converges to
f∗ := infx∈D f(x) in probability (or almost surely). In this paper, we are interested in conditions
that guarantee the convergence of a GARS algorithm in a probabilistic sense. In the convergence
theorems below, we shall assume that the GARS algorithm under consideration is allowed to run
indefinitely.
In Step 1 of the GARS framework, each Λi,j could be a random variable, random vector or any
type of random element. If Λi,j is a random variable, then Ωi,j = R and Bi,j = B(R). If Λi,j is an
m-dimensional random vector, then Ωi,j = Rm and Bi,j = B(Rm). Here, we allow for the possibility
that we first generate the realizations of possibly several intermediate random elements before we
determine the realization of the trial random vector iterate Yn. Moreover, the realization of Yn
possibly depends on the realizations of the current and previous intermediate random elements
{Λi,j : i = 1, 2, . . . , n; j = 1, 2, . . . , ki} through a deterministic function Θn. The introduction
of these intermediate random elements provides flexibility to the framework so that it can capture
some practical stochastic global optimization algorithms.
The GARS framework also covers the special case where the Yn’s are generated independently.
In this case, kn = 1 and Yn = Λn,1 for all n ≥ 1, and the d-dimensional random vectors {Λn,1}n≥1
are independently generated. This shows that the GARS framework is indeed an extension of the
well-known Simple Random (“Blind”) Search algorithm, which is given on page 38 of Spall (2003).
However, we are more interested in the general case where the random vectors {Yn}n≥1 are possibly
dependent. For example, Yn may be defined by adding zero-mean, Normally distributed random
perturbations to the components of X∗n−1 (the best solution after n − 1 function evaluations).
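This dependent case can be sketched concretely. The following is a minimal, hypothetical instance of the GARS framework on D = [0, 1]^d: each trial point Yn perturbs the current best point X∗n−1 with independent zero-mean Normal components, and Xn = ρD(Yn) clips the result back into D. The function name `gars_normal` and all parameter values are our own illustrative choices, not from the paper.

```python
import random

def gars_normal(f, d, n_iters, sigma=0.3, seed=0):
    """Sketch of a GARS instance: Y_n = X*_{n-1} + Normal(0, sigma^2 I),
    and X_n = rho_D(Y_n) with rho_D = componentwise clipping into [0, 1]^d."""
    rng = random.Random(seed)
    x_best = [rng.random() for _ in range(d)]   # X_1 drawn uniformly in D
    f_best = f(x_best)
    history = [f_best]                          # records f(X*_n)
    for _ in range(n_iters - 1):
        y = [xi + rng.gauss(0.0, sigma) for xi in x_best]  # trial iterate Y_n
        x = [min(max(0.0, yi), 1.0) for yi in y]           # X_n = rho_D(Y_n)
        fx = f(x)
        if fx < f_best:                                    # update the best point X*_n
            x_best, f_best = x, fx
        history.append(f_best)
    return x_best, history

# Example: minimize a simple quadratic with minimizer at (0.5, 0.5).
sphere = lambda x: sum((xi - 0.5) ** 2 for xi in x)
best, hist = gars_normal(sphere, d=2, n_iters=200)
```

Because every iteration uses a nondegenerate Normal perturbation with a fixed variance, this sketch satisfies the hypotheses of Corollary 2 below (infk≥1 λk = sigma² > 0), and the recorded best values f(X∗n) form a nonincreasing sequence by construction.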
There are many convergence results on random search (e.g. Pinter 1996, Solis and Wets 1981,
Spall 2003, Zhigljavsky 1991). However, many of these convergence results (e.g. Solis and Wets
1981) are hard to verify in practice. In addition, except for the trivial case of the uniform distribu-
tion, previous results on random search did not include a verification of the convergence conditions
for other commonly used distributions in practice such as the multivariate Normal and Cauchy
distributions.
The following theorem is an extension of the theorem on page 40 of Spall (2003). It presents
a condition that guarantees the convergence of a GARS algorithm. The main differences between
this theorem and the one given by Spall (2003) are: (i) the sampling in one iteration is not necessarily
independent of the previous iterations; (ii) the convergence condition is only required for some
subsequence of the iterations; and (iii) the objective function might have multiple global minima
(not just multiple local minima) over the given domain. Having a convergence theorem that can
handle (i) and (ii) is important because these conditions hold in many practical stochastic algo-
rithms. Moreover, (iii) is also important since in the most general case, the objective function in a
global optimization problem might have multiple global minima, possibly even an infinite number
of global minima.
Before we present this theorem, we first introduce additional notation. Recall the definition of
En in the GARS framework. For each n ≥ 0, we also define σ(En) to be the σ-field generated by
the random elements in En. We can think of σ(En) as representing all the information that can be
derived from the random elements in En.
Theorem 1. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞.
Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 <
L(ε) < 1 such that

(Global Search Condition 1) P[Xnk ∈ {x ∈ D : f(x) < f∗ + ε} | σ(E(nk)−1)] ≥ L(ε) (1)

for some subsequence {nk}k≥1. Then, f(X∗n) −→ f∗ almost surely (a.s.).
The proof is similar to the one given in Spall (2003) but we include it to illustrate that it is
not necessary to assume independence among the Xn’s and that we do not need to require the
above global search condition on every Xn. In addition, we also allow for the possibility that f has
multiple global minimizers on D.
Proof. Fix ε > 0 and define Sε := {x ∈ D : f(x) < f∗ + ε}. By assumption,

P[Xnk ∈ Sε | σ(E(nk)−1)] ≥ L(ε), for any k ≥ 1.
Now for each k ≥ 1, we have

P[Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] = ∏_{i=1}^{k} P[Xni ∉ Sε | Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xn(i−1) ∉ Sε].

By conditioning on the random elements in E(ni)−1, it is easy to check that for each ε > 0, we have
P[Xni ∈ Sε | Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xn(i−1) ∉ Sε] ≥ L(ε). Thus,

P[Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] = ∏_{i=1}^{k} P[Xni ∉ Sε | Xn1 ∉ Sε, . . . , Xn(i−1) ∉ Sε]
    = ∏_{i=1}^{k} (1 − P[Xni ∈ Sε | Xn1 ∉ Sε, . . . , Xn(i−1) ∉ Sε]) ≤ (1 − L(ε))^k.
Observe that if i is the smallest index such that Xi ∈ Sε, it follows that X∗i = Xi and X∗n ∈ Sε
for all n ≥ i. Consequently, if X∗nk ∉ Sε, then Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε. Hence, for each
k ≥ 1, we have

P[f(X∗nk) − f∗ ≥ ε] = P[f(X∗nk) ≥ f∗ + ε] = P[X∗nk ∉ Sε]
    ≤ P[Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] ≤ (1 − L(ε))^k,

and so,

lim_{k→∞} P[f(X∗nk) − f∗ ≥ ε] = 0,

i.e. f(X∗nk) −→ f∗ in probability. By a standard result in probability theory (e.g. see Resnick
1999, Theorem 6.3.1(b)), it follows that f(X∗nk(i)) −→ f∗ almost surely (a.s.) as i → ∞ for some
subsequence {nk(i)}i≥1.
Next, since f∗ > −∞ and {f(X∗n)}n≥1 is monotonically nonincreasing, it follows that lim_{n→∞} f(X∗n(ω))
exists for each underlying sample point ω. Finally, since the subsequence f(X∗nk(i)) −→ f∗ a.s., it
follows that f(X∗n) −→ f∗ a.s.
Note that the global search condition in the above theorem is expressed in terms of the random
vectors {Xnk}k≥1, which are the images of {Ynk}k≥1 under the map ρD. It would be more convenient
if the global search condition were expressed in terms of the random vectors {Ynk}k≥1, since these
are the ones that are randomly generated by the algorithm. The following corollary gives the result
that we want.
Corollary 1. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞.
Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 <
L(ε) < 1 such that

(Global Search Condition 2) P[Ynk ∈ {x ∈ D : f(x) < f∗ + ε} | σ(E(nk)−1)] ≥ L(ε) (2)

for some subsequence {nk}k≥1. Then, f(X∗n) −→ f∗ almost surely (a.s.).
Proof. Using the notation in the proof of Theorem 1, the event [Xnk ∈ Sε] contains [Ynk ∈ Sε] for
all k ≥ 1. Hence, for all k ≥ 1, we have

P[Xnk ∈ Sε | σ(E(nk)−1)] ≥ P[Ynk ∈ Sε | σ(E(nk)−1)] ≥ L(ε),
and so, the result follows from Theorem 1.
Note that the global search conditions in (1) and (2) are only required for some subsequence of
the iterations in order to guarantee almost sure convergence to the global minimum. This result
is important since many practical global optimization algorithms use a combination of global and
local search iterations. Local search iterations are meant to explore small regions of the search space
in order to refine the current best solution and cannot usually satisfy (1) or (2). Moreover, it is also
common to combine both stochastic and deterministic search strategies in a single algorithm. A
deterministic iteration corresponds to a degenerate random vector Yn whose mass is concentrated
at a single point in Rd. By imposing the above global search condition on only a subsequence
of the iterations, we still obtain a theoretical guarantee of almost sure convergence to the global
minimum.
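The point that the global search condition need only hold on a subsequence can be illustrated with a small sketch: here, every K-th iteration samples uniformly on D = [0, 1]^d (the global subsequence satisfying the condition), while all other iterations take a simple deterministic probe from the current best point, i.e. a degenerate Yn. The function name `hybrid_search`, the probe rule, and all parameter values are hypothetical illustrations, not from the paper.

```python
import random

def hybrid_search(f, d, n_iters, K=5, step=0.05, seed=1):
    """Sketch: uniform global sampling on the subsequence n = K, 2K, 3K, ...
    and a deterministic local probe (a degenerate Y_n) on the other iterations.
    Only the global subsequence is needed for the convergence guarantee."""
    rng = random.Random(seed)
    x_best = [rng.random() for _ in range(d)]
    f_best = f(x_best)
    for n in range(2, n_iters + 1):
        if n % K == 0:
            # global iteration: Y_n uniform on D = [0, 1]^d
            x = [rng.random() for _ in range(d)]
        else:
            # local iteration: deterministic step along coordinate j = n mod d
            j = n % d
            x = list(x_best)
            x[j] = min(max(0.0, x[j] + step), 1.0)
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x, fx
    return x_best, f_best

best, f_best = hybrid_search(lambda x: sum((xi - 0.5) ** 2 for xi in x), d=2, n_iters=100)
```

The deterministic iterations here refine the incumbent without satisfying (1) or (2); the uniform iterations alone form the subsequence {nk}k≥1 that drives the almost sure convergence.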
The next theorem deals with the case when f has a unique global minimizer x∗ over D. In this
situation, the sequence of best solutions {X∗n}n≥1 converges to x∗ almost surely. This result was
mentioned on p. 40 of Spall (2003) but a proof was not given. We include a proof of this theorem
for the sake of completeness.
Theorem 2. Let f be a real-valued function on D ⊆ Rd and suppose that x∗ is the unique global
minimizer of f over D in the sense that

f(x∗) = infx∈D f(x) > −∞ and inf{f(x) : x ∈ D, ‖x − x∗‖ ≥ η} > f(x∗) for all η > 0.

Furthermore, suppose that a GARS algorithm satisfies the property that f(X∗n) −→ f(x∗)
a.s. Then, X∗n −→ x∗ a.s.
Proof. Fix ε > 0 and let fε := inf{f(x) : x ∈ D, ‖x − x∗‖ ≥ ε}. Since f(X∗n) −→ f(x∗) a.s., there
exists a set N ⊆ Ω with P(N) = 0 such that f(X∗n(ω)) −→ f(x∗) for all ω in the complement N^c.
By assumption, fε − f(x∗) > 0. Hence, for each ω ∈ N^c, there is an integer N0 such that for all
n ≥ N0, we have

f(X∗n(ω)) − f(x∗) = |f(X∗n(ω)) − f(x∗)| < fε − f(x∗),

or equivalently, f(X∗n(ω)) < fε. If ‖X∗n(ω) − x∗‖ ≥ ε, then f(X∗n(ω)) ≥ inf{f(x) : x ∈ D,
‖x − x∗‖ ≥ ε} = fε, which is a contradiction. Hence, we must have ‖X∗n(ω) − x∗‖ < ε. This shows
that X∗n(ω) −→ x∗ for each ω ∈ N^c. Thus, X∗n −→ x∗ a.s.
The next theorem presents a convergence condition that is easier to verify than the one provided
by Theorem 1 or Corollary 1 in the case where f is continuous at a global minimizer over its
domain D.
Theorem 3. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞. More-
over, let x∗ be a global minimizer of f over D and suppose that f is continuous at x∗. Furthermore,
suppose that a GARS algorithm satisfies the following property: For any z ∈ D and δ > 0, there
exists 0 < ν(z, δ) < 1 such that

(Global Search Condition 3) P[Ynk ∈ B(z, δ) ∩ D | σ(E(nk)−1)] ≥ ν(z, δ), (3)

for some subsequence {nk}k≥1. Here, B(z, δ) is the open ball centered at z with radius δ. Then,
f(X∗n) −→ f∗ a.s. Moreover, if x∗ is the unique global minimizer of f over D in the sense of
Theorem 2, then X∗n −→ x∗ a.s.
Note that in the above theorem, we again allow for the possibility that f has multiple global
minimizers over D. In this case, the hypothesis of the theorem only requires that f be continuous
at one of these global minimizers.
Proof. Fix ε > 0 and k ≥ 1. Since f is continuous at x∗, there exists δ(ε) > 0 such that
|f(x) − f(x∗)| < ε whenever ‖x − x∗‖ < δ(ε). Hence, the event [Xnk ∈ D : f(Xnk) < f(x∗) + ε] =
[Xnk ∈ D : |f(Xnk) − f(x∗)| < ε] ⊇ [Xnk ∈ D : ‖Xnk − x∗‖ < δ(ε)], and so,

P[Xnk ∈ D : f(Xnk) < f(x∗) + ε | σ(E(nk)−1)] ≥ P[Xnk ∈ D : ‖Xnk − x∗‖ < δ(ε) | σ(E(nk)−1)]
    = P[Xnk ∈ B(x∗, δ(ε)) ∩ D | σ(E(nk)−1)] ≥ P[Ynk ∈ B(x∗, δ(ε)) ∩ D | σ(E(nk)−1)] ≥ ν(x∗, δ(ε)) =: L(ε).
Clearly, L(ε) > 0 since δ(ε) > 0. By Theorem 1, it follows that f(X∗n) −→ f∗ a.s.
Theorem 3 essentially says that the algorithm must be able to reach any neighborhood of any
point in D with probability bounded away from zero for all random vector iterates
in the subsequence. However, in the above proof, we only use the assumption in Global Search
Condition 3 at a global minimizer x∗. The reason for requiring the condition for all z ∈ D is that
the location of the global minimizer x∗ is not known. The algorithm is only guaranteed to reach
x∗ (in the most general case) if the algorithm is capable of reaching any neighborhood of any point
in D when it is run indefinitely.
Note also that the requirements of Theorem 3 are somewhat mild. It only requires continuity
of f at one of its global minimizers over the domain. Hence, the convergence of the algorithm is
still guaranteed even on problems where f is discontinuous on certain regions of the search space.
This property is important in practice since there are many real-world optimization problems whose
objective functions contain discontinuities.
The next theorem presents a sufficient condition that guarantees the convergence of a GARS
algorithm in terms of the infimum of the conditional densities of the trial random vector iterates.
It also applies to the case where f is continuous at a global minimizer of f over a compact set D.
Again, f might have multiple global minimizers over D and the theorem only requires that f be
continuous at one of these global minimizers.
Theorem 4. Let D be a subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0,
where µ is the Lebesgue measure on Rd. Let f be a real-valued function on D such that f∗ :=
infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f over
D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such that for
each k ≥ 1, Ynk has a conditional density gnk(y | σ(E(nk)−1)) satisfying the following condition:
µ({y ∈ D : h(y) = 0}) = 0, where h(y) := infk≥1 gnk(y | σ(E(nk)−1)). Then f(X∗n) −→ f∗
a.s. Again, if x∗ is the unique global minimizer of f over D in the sense of Theorem 2, then
X∗n −→ x∗ a.s.
Proof. Fix δ > 0 and z ∈ D. For all k ≥ 1, we have

P[Ynk ∈ B(z, δ) ∩ D | σ(E(nk)−1)] = ∫_{B(z,δ)∩D} gnk(y | σ(E(nk)−1)) dy ≥ ∫_{B(z,δ)∩D} h(y) dy =: ν(z, δ).

Since h(y) is a nonnegative function on D, µ({y ∈ D : h(y) = 0}) = 0 and µ(B(z, δ) ∩ D) ≥ ψD(δ) >
0, it follows that ν(z, δ) > 0. By Theorem 3, we have f(X∗n) −→ f∗ a.s.
Theorem 4 basically requires that the infimum of the conditional densities of all random candidate
vectors over some subsequence of iterations be strictly positive except on a set of Lebesgue
measure zero. This result is useful since the requirement is easy to verify for many distributions
used in practice.
Next, consider the situation where the random vector iterates are generated using different
probability distributions on the different components of an iterate. More precisely, for each j =
1, . . . , d, let v(j) denote the jth component of a point v ∈ Rd, i.e. v = (v(1), . . . , v(d))T. Moreover, for
a given random vector Y : (Ω,B) → (Rd,B(Rd)), let Y(j) be the random variable that corresponds
to the jth component of Y. Also, for any D ⊆ Rd, define D(j) := {y ∈ R : y = v(j) for some v ∈ D}.
Theorem 5. Let D be a subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0,
where µ is the Lebesgue measure on Rd. Let f be a real-valued function defined on D such that
f∗ := infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f
over D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such that
the following properties hold:
[A] For each k ≥ 1, the random variables Ynk(1), . . . , Ynk(d) are conditionally independent given the
random elements in E(nk)−1; and

[B] For each k ≥ 1 and for each 1 ≤ j ≤ d, the random variable Ynk(j) has a conditional density
gnk(j)(u | σ(E(nk)−1)), and h(j)(u) := infk≥1 gnk(j)(u | σ(E(nk)−1)) satisfies:

µ({u ∈ D(j) : h(j)(u) = 0}) = 0.

Here, µ is the Lebesgue measure on R.
Then f(X∗n) −→ f∗ a.s. In addition, if x∗ is the unique global minimizer of f on D in the sense of
Theorem 2, then X∗n −→ x∗ a.s.
Proof. By Property [A] above, it follows that for each k ≥ 1, Ynk has a conditional density given
by:

gnk(y | σ(E(nk)−1)) = ∏_{j=1}^{d} gnk(j)(y(j) | σ(E(nk)−1)).
As in Theorem 4, define h(y) := infk≥1 gnk(y | σ(E(nk)−1)), and note that

h(y) = infk≥1 ∏_{j=1}^{d} gnk(j)(y(j) | σ(E(nk)−1)) ≥ ∏_{j=1}^{d} (infk≥1 gnk(j)(y(j) | σ(E(nk)−1))) = ∏_{j=1}^{d} h(j)(y(j)).
Now

{y ∈ D : h(y) = 0} ⊆ ⋃_{j=1}^{d} {y ∈ D : h(j)(y(j)) = 0},

and so,

µ({y ∈ D : h(y) = 0}) ≤ µ(⋃_{j=1}^{d} {y ∈ D : h(j)(y(j)) = 0}) ≤ ∑_{j=1}^{d} µ({y ∈ D : h(j)(y(j)) = 0}).
Next, we have

{y ∈ D : h(j)(y(j)) = 0} ⊆ D(1) × · · · × D(j−1) × {v ∈ D(j) : h(j)(v) = 0} × D(j+1) × · · · × D(d),

and so,

µ({y ∈ D : h(j)(y(j)) = 0}) ≤ µ(D(1)) · · · µ(D(j−1)) µ({v ∈ D(j) : h(j)(v) = 0}) µ(D(j+1)) · · · µ(D(d)) = 0.

Hence, µ({y ∈ D : h(y) = 0}) = 0, and by Theorem 4, it follows that f(X∗n) −→ f∗ a.s.
4 Applications Involving Commonly Used Distributions
Throughout this section, we will assume that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0.
In particular, we will show that this assumption holds when D is a compact (i.e. closed and
bounded) hypercube in Rd. This special case is important since optimization over a compact
hypercube is common in many real-world applications. Note that if D is a compact hyperrectangle,
the optimization problem can be easily transformed into one where D = [0, 1]d by an appropriate
rescaling of the variables.
We will consider stochastic algorithms that follow the GARS framework in Section 3. Below, we
consider some possibilities for the distributions of the random vectors in a subsequence {Ynk}k≥1
of the trial random vector iterates {Yn}n≥1.
4.1 Uniform Distributions
The simplest case is when we have a subsequence {Ynk}k≥1 of random vectors where each Ynk =
Λnk,1 has the uniform distribution over D. Here,

gnk(y | σ(E(nk)−1)) = 1/µ(D), for all k ≥ 1,

and so,

h(y) := infk≥1 gnk(y | σ(E(nk)−1)) = 1/µ(D).

Clearly, h satisfies µ({y ∈ D : h(y) = 0}) = 0, and so, by Theorem 4, f(X∗n) −→ f∗ = infx∈D f(x)
a.s.
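This uniform case is exactly the Simple ("Blind") Random Search of Spall (2003), where every Yn = Λn,1 is drawn independently and uniformly on D. A minimal sketch on D = [0, 1]^d follows; the function name and parameters are illustrative, not from the paper.

```python
import random

def blind_random_search(f, d, n_iters, seed=2):
    """Sketch of Simple ('Blind') Random Search: each Y_n = Lambda_{n,1} is
    drawn independently and uniformly on D = [0, 1]^d, so every iteration
    satisfies the density condition of Theorem 4 with h(y) = 1/mu(D)."""
    rng = random.Random(seed)
    x_best, f_best = None, float("inf")
    for _ in range(n_iters):
        x = [rng.random() for _ in range(d)]   # Y_n uniform on D (so X_n = Y_n)
        fx = f(x)
        if fx < f_best:                        # track the best point X*_n
            x_best, f_best = x, fx
    return x_best, f_best

xb, fb = blind_random_search(lambda x: sum((xi - 0.5) ** 2 for xi in x), d=2, n_iters=500)
```

Here µ(D) = 1, so the conditional density is identically 1 on D and the infimum condition of Theorem 4 holds trivially at every iteration, not just on a subsequence.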
4.2 Elliptical Distributions
Let u ∈ Rd, let V be a symmetric nonnegative definite d × d matrix, and let φ : [0,∞) → R be a
function. A d-dimensional random vector Y is said to have an elliptical distribution with parameters
u, V and φ, written Y ∼ EC(u, V, φ), if its characteristic function has the form exp(itT u)φ(tT V t).
Let Y : (Ω,B) → (Rd,B(Rd)) be a random vector that has an elliptical distribution. If Y has a
density, then it has the form (Fang and Zhang 1990)

g(y) = γ [det(V)]^{−1/2} Ψ((y − u)T V^{−1} (y − u)), y ∈ Rd, (4)

where u ∈ Rd, V is a symmetric and positive definite matrix, Ψ is a nonnegative function over the
positive reals such that ∫_0^∞ y^{(d/2)−1} Ψ(y) dy < ∞, and γ is the normalizing constant given by

γ = Γ(d/2) / ( 2 π^{d/2} ∫_0^∞ y^{d−1} Ψ(y²) dy ). (5)
Elliptical distributions include some of the more important distributions used in the design of
practical stochastic algorithms. For example, if we choose Ψ(y) = e−y/2 in the above definition,
then we get a nondegenerate multivariate Normal distribution, which is widely used in evolutionary
algorithms such as evolution strategies and evolutionary programming. Moreover, if we choose
Ψ(y) = (1 + y)^{−(d+1)/2}, then we get the multivariate Cauchy distribution. An example of a stochastic
algorithm that uses the Cauchy distribution is given in Yao et al. (1999). The following theorem
shows that practical algorithms that arise out of elliptical distributions, where Ψ is monotonically
nonincreasing, converge to the global minimum almost surely.
Theorem 6. Let D be a bounded subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ)∩D) > 0 for all
δ > 0, where µ is the Lebesgue measure on Rd. Let f be a real-valued function defined on D such that
f∗ := infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f over
D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such that for each
k ≥ 1, we have Ynk = Uk + Wk, where Uk = Φk(E(nk)−1) for some deterministic function Φk and
Wk is a random vector whose conditional distribution given σ(E(nk)−1) is an elliptical distribution
with conditional density given by

qk(w | σ(E(nk)−1)) = γ [det(Vk)]^{−1/2} Ψ(wT Vk^{−1} w), w ∈ Rd,
where γ is defined in (5). For each k ≥ 1, let λk be the smallest eigenvalue of Vk. Furthermore,
suppose that the following properties are satisfied:
[P1] Ψ is monotonically nonincreasing; and
[P2] infk≥1 λk > 0.
Then f(X∗n) −→ f∗ a.s. In addition, if x∗ is the unique global minimizer of f on D in the sense of
Theorem 2, then X∗n −→ x∗ a.s.
Before we prove the theorem, we note that in practice we typically have Uk = Φk(E(nk)−1) =
X∗(nk)−1, which is the random vector representing the best solution found after (nk) − 1 function
evaluations. However, in the above theorem, Uk can be any random vector that is a function of the
random elements in E(nk)−1 whose realizations are in Rd but do not have to be in D. In addition,
Uk may be any fixed vector in Rd.
Proof. For each k ≥ 1, observe that the conditional distribution of Ynk given σ(E(nk)−1) is an
elliptical distribution with conditional density

gnk(y | σ(E(nk)−1)) = γ [det(Vk)]^{−1/2} Ψ((y − uk)T Vk^{−1} (y − uk)), y ∈ Rd,

where uk = Φk({λi,j : i = 1, 2, . . . , (nk) − 1; j = 1, 2, . . . , ki}) and the λi,j ’s are the realizations of
the random elements in E(nk)−1. Now for each k ≥ 1 and y ∈ D, we have

(y − uk)T Vk^{−1} (y − uk) ≤ ‖y − uk‖₂ ‖Vk^{−1}(y − uk)‖₂ ≤ ‖y − uk‖₂² ‖Vk^{−1}‖₂ ≤ diam(D)² / λk,
where diam(D) = supx,y∈D ‖x − y‖. Since D is bounded, it follows that diam(D) < ∞. Moreover,
since Ψ is monotonically nonincreasing, we obtain

Ψ((y − uk)T Vk^{−1} (y − uk)) ≥ Ψ(diam(D)² / λk).
Moreover, since the determinant of a matrix is the product of its eigenvalues, it follows that det(Vk) ≤
(λ∗k)^d, where λ∗k is the largest eigenvalue of Vk. Thus, for each y ∈ D,

gnk(y | σ(E(nk)−1)) ≥ γ (λ∗k)^{−d/2} Ψ(diam(D)² / λk) ≥ γ (supk≥1 λ∗k)^{−d/2} Ψ(diam(D)² / infk≥1 λk) > 0.

Hence, we have

h(y) := infk≥1 gnk(y | σ(E(nk)−1)) ≥ γ (supk≥1 λ∗k)^{−d/2} Ψ(diam(D)² / infk≥1 λk) > 0.

By Theorem 4, it follows that f(X∗n) −→ f∗ a.s.
It is easy to check that Ψ is monotonically nonincreasing for the multivariate Normal (Ψ(y) =
e^{−y/2}) and multivariate Cauchy (Ψ(y) = (1 + y)^{−(d+1)/2}) distributions. Hence, Theorem 6 holds when
these distributions are used. Finally, we note that Ynk in Theorem 6 could yield a realization that
is outside of D. This is why it is important to apply some absorbing transformation ρD : Rd → D
to ensure that the new iterate will be inside D.
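The density lower bound in the proof of Theorem 6 can be checked numerically in a small assumed setup: the bivariate Normal case with Vk = σ²I on D = [0, 1]², so that λk = λ∗k = σ², γ = (2π)^{−d/2}, and Ψ(y) = e^{−y/2}. This is a sanity-check sketch; the grid, σ², and variable names are our own choices.

```python
import math

# Numeric check of the bound g(y) >= gamma * (lam_star)^(-d/2) * Psi(diam(D)^2 / lam)
# for N(u, sigma^2 I) on D = [0, 1]^2 with the mean u restricted to D (assumed setup).
d, sigma2 = 2, 0.25
gamma = (2.0 * math.pi) ** (-d / 2)          # normalizing constant for Psi(y) = exp(-y/2)
Psi = lambda y: math.exp(-y / 2.0)
diam2 = 2.0                                  # diam(D)^2 for the unit square
lam = lam_star = sigma2                      # smallest = largest eigenvalue of sigma^2 I
bound = gamma * lam_star ** (-d / 2) * Psi(diam2 / lam)

def normal_density(y, u):
    """Density of N(u, sigma^2 I) at y: gamma * det(V)^(-1/2) * Psi((y-u)^T V^(-1) (y-u))."""
    q = sum((yi - ui) ** 2 for yi, ui in zip(y, u)) / sigma2
    return gamma * sigma2 ** (-d / 2) * Psi(q)

# The bound holds at every grid point y in D for every mean u in D:
grid = [(i / 4.0, j / 4.0) for i in range(5) for j in range(5)]
ok = all(normal_density(y, u) >= bound for y in grid for u in grid)
```

Note that the check places the mean u in D, matching the typical case Uk = X∗(nk)−1; the worst case is attained when y and u sit at opposite corners of the square, where the quadratic form equals diam(D)²/λ.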
Corollary 2. Let D be a bounded subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for
all δ > 0, where µ is the Lebesgue measure on Rd. Let f be a real-valued function defined on D
such that f∗ := infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer
x∗ of f over D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such
that for each k ≥ 1, we have Ynk = X∗(nk)−1 + Wk, where Wk is a random vector whose conditional
distribution given σ(E(nk)−1) is multivariate Normal with mean vector 0 and covariance matrix Vk.
For each k ≥ 1, let λk be the smallest eigenvalue of Vk. If infk≥1 λk > 0, then f(X∗n) −→ f∗
a.s. In addition, if x∗ is the unique global minimizer of f on D in the sense of Theorem 2, then
X∗n −→ x∗ a.s.
Proof. As mentioned earlier, the multivariate Normal distribution is a special case of the elliptical
distribution. Moreover, note that
X∗(nk)−1 = ∑i=1,...,(nk)−1 Xi 1Ei,
where 1Ei is an indicator function and Ei is the event defined by
Ei = [f(Xi) ≤ f(Xj) for all j = 1, . . . , (nk)−1 and i is the smallest index with this property]
= [f(ρD(Yi)) ≤ f(ρD(Yj)) for all j = 1, . . . , (nk)−1 and i is the smallest index with this property].
For each i = 1, 2, . . . , (nk)− 1, Yi is a deterministic function of the random elements in Ei. Hence,
X∗(nk)−1 is a deterministic function of the random elements in E(nk)−1. Since infk≥1 λk > 0, it follows
that for all k ≥ 1, the matrix Vk is invertible and Wk has an elliptical distribution with conditional
density given σ(E(nk)−1) given by
qk(w | σ(E(nk)−1)) = γ [det(Vk)]^{−1/2} Ψ(w^T Vk^{−1} w), w ∈ Rd,
where Ψ(y) = e−y/2 and γ = (2π)−d/2. Clearly, Ψ(y) = e−y/2 is monotonically nonincreasing and
we have infk≥1 λk > 0 by assumption. The conclusion now follows from Theorem 6.
4.3 Hypercube Domains
The succeeding result shows that the condition on ψD(δ) in Theorems 4–6 is easily satisfied in the
special case where D is a closed and bounded hypercube in Rd. Note that optimization over a
closed hypercube is typical in many applications.
Proposition 1. Let D = [a, b] ⊆ Rd be a closed hypercube and let ℓ(D) be the length of one side of D. If 0 < δ ≤ (1/2)ℓ(D), then
ψD(δ) = (δ/2)^d π^{d/2}/Γ(d/2 + 1).
If δ > (1/2)ℓ(D), then
ψD(δ) ≥ (ℓ(D)/4)^d π^{d/2}/Γ(d/2 + 1).
Here, Γ is the gamma function defined by Γ(n) := ∫0^∞ x^{n−1} e^{−x} dx.
Proof. First, consider the case where 0 < δ ≤ (1/2)ℓ(D). Let ei be the ith unit vector in Rd, i.e., the ith component of ei is 1 and all other components are 0. Fix z ∈ D. Since δ ≤ (1/2)ℓ(D), there exist α1, . . . , αd ∈ {−1, +1} such that z + δαi ei ∈ D for all i = 1, . . . , d. Let D′ be the hypercube determined by the points z, z + δα1 e1, . . . , z + δαd ed. Note that D′ ⊆ D, and so,
µ(B(z, δ) ∩ D) ≥ µ(B(z, δ) ∩ D′) = (1/2^d) µ(B(z, δ)) = (δ/2)^d π^{d/2}/Γ(d/2 + 1).
Hence, ψD(δ) ≥ (δ/2)^d π^{d/2}/Γ(d/2 + 1). Finally, note that when z is a corner point of D, then µ(B(z, δ) ∩ D) = (δ/2)^d π^{d/2}/Γ(d/2 + 1), and so, this lower bound is attained.
Next, consider the case where δ > (1/2)ℓ(D). Again, fix z ∈ D. In this case, there exist α1, . . . , αd ∈ {−1, +1} such that z + (1/2)ℓ(D)αi ei ∈ D for all i = 1, . . . , d. The proof now follows in a similar manner as before, using µ(B(z, δ) ∩ D) ≥ µ(B(z, (1/2)ℓ(D)) ∩ D).
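As a quick sanity check on Proposition 1, the closed-form value of ψD(δ) for the unit hypercube can be compared against a Monte Carlo estimate of µ(B(z, δ) ∩ D) at a corner point, where the infimum is attained. The sketch below is our own illustration (function names are not from the paper) and uses only the Python standard library:

```python
import math
import random

def ball_cube_volume_at_corner(d, delta, samples=200_000, seed=0):
    """Monte Carlo estimate of mu(B(z, delta) ∩ [0, 1]^d) for the corner z = 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Sample uniformly from [0, delta]^d, which contains B(0, delta) ∩ [0, 1]^d
        y = [rng.uniform(0.0, delta) for _ in range(d)]
        if sum(v * v for v in y) <= delta * delta:
            hits += 1
    return (hits / samples) * delta**d  # rescale by the sampling-cube volume

def psi_lower_bound(d, delta):
    """Proposition 1: psi_D(delta) = (delta/2)^d * pi^(d/2) / Gamma(d/2 + 1)."""
    return (delta / 2.0)**d * math.pi**(d / 2.0) / math.gamma(d / 2.0 + 1.0)

d, delta = 3, 0.4
mc = ball_cube_volume_at_corner(d, delta)
exact = psi_lower_bound(d, delta)
print(mc, exact)  # the two values should agree closely
```

For d = 3 and δ = 0.4 the corner intersection is exactly one octant of the ball, which is what the formula predicts.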
4.4 Triangular Distributions
A random variable Y is said to have a triangular distribution with lower limit a, upper limit b, and
mode c if its density function is given by
g(y) = 2(y − a)/((b − a)(c − a)) if a ≤ y ≤ c,
g(y) = 2(b − y)/((b − a)(b − c)) if c ≤ y ≤ b.
Now suppose D = [a, b] ⊆ Rd is a compact hypercube and that there is a subsequence {Ynk}k≥1 of random vectors where each Ynk has the following properties:
[A] The random variables Ynk^(1), . . . , Ynk^(d) (which are the components of Ynk) are conditionally independent given the random elements in E(nk)−1; and
[B] For each 1 ≤ j ≤ d, the random variable Ynk^(j) has a triangular distribution with lower limit a^(j), upper limit b^(j), and mode (x∗(nk)−1)^(j). (Recall that x∗(nk)−1 is the best solution found so far after nk − 1 function evaluations.)
In this case, each Ynk^(j) has conditional density
gnk^(j)(u | σ(E(nk)−1)) = 2(u − a^(j))/[(b^(j) − a^(j))((x∗(nk)−1)^(j) − a^(j))] if a^(j) ≤ u ≤ (x∗(nk)−1)^(j),
gnk^(j)(u | σ(E(nk)−1)) = 2(b^(j) − u)/[(b^(j) − a^(j))(b^(j) − (x∗(nk)−1)^(j))] if (x∗(nk)−1)^(j) ≤ u ≤ b^(j).
Let h^(j)(u) := infk≥1 gnk^(j)(u | σ(E(nk)−1)). Now for each 1 ≤ j ≤ d, define
q^(j)(u) = 2(u − a^(j))/(b^(j) − a^(j))^2 if a^(j) ≤ u ≤ (a^(j) + b^(j))/2,
q^(j)(u) = 2(b^(j) − u)/(b^(j) − a^(j))^2 if (a^(j) + b^(j))/2 ≤ u ≤ b^(j).
It is easy to check that h^(j)(u) ≥ q^(j)(u) for all 1 ≤ j ≤ d and for all a^(j) ≤ u ≤ b^(j). Since
µ({u ∈ D^(j) : q^(j)(u) = 0}) = µ({u ∈ [a^(j), b^(j)] : q^(j)(u) = 0}) = µ({a^(j), b^(j)}) = 0,
it follows that µ({u ∈ D^(j) : h^(j)(u) = 0}) = 0. Hence, by Theorem 5, f(X∗n) −→ f∗ a.s.
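The envelope inequality holds because the mode always lies in [a^(j), b^(j)], so each denominator factor involving the mode is at most b^(j) − a^(j). The following NumPy check (our own illustration; the names are not from the paper) verifies the envelope numerically over a grid of modes:

```python
import numpy as np

def tri_pdf(u, a, b, c):
    """Density of the triangular distribution on [a, b] with mode c."""
    u = np.asarray(u, dtype=float)
    rising = 2 * (u - a) / ((b - a) * (c - a))
    falling = 2 * (b - u) / ((b - a) * (b - c))
    return np.where(u <= c, rising, falling)

def q_envelope(u, a, b):
    """Mode-independent lower envelope: q(u) = 2*min(u - a, b - u)/(b - a)^2."""
    u = np.asarray(u, dtype=float)
    return 2 * np.minimum(u - a, b - u) / (b - a)**2

a, b = -1.0, 3.0
u = np.linspace(a, b, 401)
# The envelope must hold no matter where the best point so far (the mode) lies.
ok = all(np.all(tri_pdf(u, a, b, c) >= q_envelope(u, a, b) - 1e-12)
         for c in np.linspace(a + 1e-6, b - 1e-6, 50))
print(ok)  # True
```

Since q^(j) vanishes only at the two endpoints, the lower envelope is positive almost everywhere, which is exactly what Theorem 5 requires.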
5 Application to Practical Stochastic Search Algorithms
5.1 Localized Random Search
The following general algorithm for finding the global minimum of the function f(x) over D ⊆ Rd is
given on page 44 of Spall (2003) but a convergence proof was not provided in that book. Here, we
consider a special case that involves the multivariate Normal distribution and provide a convergence
proof using the results from the previous sections.
Algorithm B: Localized Random Search
Inputs:
(1) The objective function f : D → R, where D ⊆ Rd.
(2) A deterministic absorbing transformation ρD : Rd → D, i.e. ρD(x) = x for all x ∈ D.
Step 1. Pick an initial guess Y1 either randomly or deterministically and set X1 = ρD(Y1). Set
n = 2.
Step 2. Generate a new candidate iterate Yn = X∗n−1 + Zn, where X∗
n−1 is the best solution
after n − 1 function evaluations and Zn is a random vector whose conditional distribution
given σ(En−1) is a Normal distribution with mean vector 0 and diagonal covariance matrix
defined by
Cov(Zn) = diag((σn^(1))^2, (σn^(2))^2, . . . , (σn^(d))^2).
Here, E1 = {Y1} and En−1 = {Y1, Z2, . . . , Zn−1} for n > 2.
Step 3. Set Xn = ρD(Yn) and evaluate f(Xn).
Step 4. Increment n = n + 1 and go back to Step 2.
In the above algorithm, Step 2 is equivalent to
Yn^(j) = (X∗n−1)^(j) + Zn^(j), j = 1, . . . , d.
(Recall that Y^(j) is the jth component of the random vector Y.) Since Zn has a Normal distribution and Cov(Zn) is a diagonal matrix, it follows that the random variables Zn^(1), . . . , Zn^(d) are conditionally independent given σ(En−1) and each Zn^(j) has a normal distribution with mean 0 and standard deviation σn^(j). That is, σn^T = (σn^(1), . . . , σn^(d)) is the vector of mutation standard deviations for the individual components of X∗n−1.
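A minimal NumPy sketch of Algorithm B on a box domain, assuming componentwise clipping as one concrete choice of the absorbing transformation ρD and a fixed mutation scale σ (which trivially keeps the σn^(j) bounded away from 0); the objective and all parameter values below are illustrative:

```python
import numpy as np

def localized_random_search(f, lo, hi, sigma, n_evals, seed=0):
    """Algorithm B sketch on the box D = [lo, hi]^d with clipping as the
    absorbing transformation rho_D and a fixed mutation scale sigma."""
    rng = np.random.default_rng(seed)
    d = len(lo)
    x_best = rng.uniform(lo, hi)                      # Step 1: random initial guess in D
    f_best = f(x_best)
    for _ in range(n_evals - 1):
        y = x_best + rng.normal(0.0, sigma, size=d)   # Step 2: Gaussian perturbation
        x = np.clip(y, lo, hi)                        # Step 3: absorb iterate into D
        fx = f(x)
        if fx < f_best:                               # track the best point so far
            x_best, f_best = x, fx
    return x_best, f_best

sphere = lambda x: float(np.sum(x**2))   # toy objective with minimum 0 at the origin
lo, hi = np.full(3, -2.0), np.full(3, 2.0)
x_best, f_best = localized_random_search(sphere, lo, hi, sigma=0.3, n_evals=2000)
print(f_best)  # should be close to 0
```

The fixed σ here is the simplest way to meet the condition on the mutation scales in the convergence result below; in practice σ is often adapted, and convergence is preserved as long as it does not shrink to 0 along every subsequence.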
Corollary 3. Let D be a bounded subset of Rd such that ψD(δ) > 0 for all δ > 0. Suppose the above localized random search algorithm is applied to a real-valued function f on D such that f∗ := infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f over D. Furthermore, suppose that there exists a subsequence {nk}k≥1 such that infk≥1 min1≤j≤d σnk^(j) > 0. (Equivalently, lim supn→∞ min1≤j≤d σn^(j) > 0.) Then f(X∗n) −→ f∗ a.s.
Proof. First, we check that the Localized Random Search algorithm above follows the GARS framework. We have k1 = 1 and Λ1,1 = Y1, and for n ≥ 2, we have kn = 1 and Λn,1 = Zn. Moreover, as in the proof of Corollary 2, X∗n−1 is a deterministic function of the random elements in En−1, and so, Yn is a deterministic function of the random elements in En = En−1 ∪ {Zn} for all n ≥ 2. Now for the Localized Random Search algorithm, suppose that there exists a subsequence {nk}k≥1 such that infk≥1 min1≤j≤d σnk^(j) > 0. We have
Ynk = X∗(nk)−1 + Znk, for all k ≥ 1,
where Znk is a random vector whose conditional distribution given σ(E(nk)−1) is the Normal distribution with mean vector 0 and covariance matrix
Cov(Znk) = diag((σnk^(1))^2, . . . , (σnk^(d))^2).
Define Wk = Znk and Vk = Cov(Wk) for all k ≥ 1. Note that we have
Ynk = X∗(nk)−1 + Wk, for all k ≥ 1.
The eigenvalues of Vk are the variances (σnk^(1))^2, . . . , (σnk^(d))^2 of the normal random perturbations for the different components of X∗(nk)−1. Hence, the smallest eigenvalue of Vk is λk := min1≤j≤d (σnk^(j))^2. Since infk≥1 min1≤j≤d σnk^(j) > 0, it follows that infk≥1 λk > 0. The result now follows from Corollary 2.
5.2 Evolutionary Algorithms
There are three main types of evolutionary algorithms for global optimization (Back 1996): genetic
algorithms, evolution strategies and evolutionary programming algorithms. Below, we apply one of
the convergence results from the previous sections to a simple evolutionary programming algorithm
where the selection of parents in each generation is done in a greedy manner. The results in this
paper do not directly apply to a standard genetic algorithm for continuous optimization. However,
it may be possible to extend some of the results in this paper to prove the convergence of more
complex evolutionary programming algorithms and also evolution strategies but this topic is beyond
the scope of this paper and it will be the focus of future work.
Algorithm C: Evolutionary Programming with Greedy Parent Selection
Inputs:
(1) The objective function f : D → R, where D ⊆ Rd.
(2) A nonnegative fitness function F : D → R+.
(3) The number of offspring in every generation, denoted by µ.
(4) A deterministic absorbing transformation ρD : Rd → D, i.e. ρD(x) = x for all x ∈ D.
Step 1. (Initialization) Set t = 0 and for each i = 1, 2, . . . , µ, generate Yi according to some prob-
ability distribution whose realizations are on Rd, where Yi possibly depends on Y1, . . . , Yi−1,
and set Xi = ρD(Yi). Moreover, for each i = 1, 2, . . . , µ, set Pi(0) = Xi. The set P(0) = {P1(0), P2(0), . . . , Pµ(0)} = {X1, X2, . . . , Xµ} is the initial parent population.
Step 2. (Evaluate the Initial Parent Population) For each i = 1, 2, . . . , µ, evaluate the objective
function value f(Xi) and the fitness value F(Xi).
Step 3. (Iterate) While termination condition is not satisfied do
Step 3.1. (Update Generation Counter) Reset t := t + 1.
Step 3.2. (Generate Offspring by Mutation) For each i = 1, 2, . . . , µ, set Ytµ+i = Mut(Pi(t−1)) and Xtµ+i = ρD(Ytµ+i). The set M(t) := {Xtµ+1, Xtµ+2, . . . , Xtµ+µ} constitutes the offspring for the current generation.
Step 3.3. (Evaluate the Offspring) For each i = 1, 2, . . . , µ, evaluate f(Xtµ+i) and the fitness
value F(Xtµ+i).
Step 3.4. (Select New Parent Population) Select the parent population for the next generation: P(t) = Sel(P(t−1) ∪ M(t)).
End.
In Step 3.2, the mutation operator is defined as follows: For each t ≥ 1 and i = 1, 2, . . . , µ,
Ytµ+i = Mut(Pi(t−1)) = Pi(t−1) + Ztµ+i,
where Ztµ+i is a random vector whose conditional distribution given σ(Etµ+i−1) is a Normal distribution with mean vector 0 and diagonal covariance matrix
Cov(Ztµ+i) = diag((σtµ+i^(1))^2, (σtµ+i^(2))^2, . . . , (σtµ+i^(d))^2).
Here, Etµ+i−1 = {Y1, . . . , Yµ, Zµ+1, Zµ+2, . . . , Ztµ+i−1}. Moreover, σtµ+i^T = (σtµ+i^(1), . . . , σtµ+i^(d)) is the vector of mutation standard deviations for the individual components of the parent vector Pi(t−1). That is, for each j = 1, 2, . . . , d, the conditional distribution of the random variable Ztµ+i^(j) given σ(Etµ+i−1) is a Normal distribution with mean 0 and standard deviation σtµ+i^(j). Moreover, we set the algorithm parameter σtµ+i^(j) = √F(Pi(t−1)) as noted in Back (1993).
In Step 3.4, the selection of the parent solutions for the next generation is usually accomplished by probabilistic q-tournament selection as described in Back (1993). As q increases, this q-tournament selection procedure becomes more and more greedy. For simplicity, we assume that the selection of parent solutions proceeds in a completely greedy manner. That is, P(t) is simply the collection of µ solutions from P(t−1) ∪ M(t) with the best fitness values. The more general case of q-tournament selection will be addressed in future work.
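A compact NumPy sketch of Algorithm C with greedy selection. The fitness F(x) = f(x) + τ used below is our own illustrative choice (valid for the nonnegative toy objective, and it satisfies F ≥ τ); since this F is monotone in f, selecting the µ best fitness values is the same as selecting the µ smallest objective values:

```python
import numpy as np

def ep_greedy(f, lo, hi, mu=10, generations=100, tau=1e-3, seed=0):
    """Algorithm C sketch: evolutionary programming with greedy parent selection
    on the box [lo, hi]^d, with clipping as the absorbing transformation rho_D."""
    rng = np.random.default_rng(seed)
    d = len(lo)
    parents = rng.uniform(lo, hi, size=(mu, d))        # Step 1: initial parent population
    fvals = np.array([f(p) for p in parents])          # Step 2: evaluate the parents
    for _ in range(generations):                       # Step 3
        sigma = np.sqrt(fvals + tau)                   # sigma^(j) = sqrt(F(parent)), F = f + tau
        offspring = parents + rng.normal(size=(mu, d)) * sigma[:, None]  # Step 3.2: mutate
        offspring = np.clip(offspring, lo, hi)         # absorb offspring into D
        ofvals = np.array([f(o) for o in offspring])   # Step 3.3: evaluate the offspring
        pool = np.vstack([parents, offspring])         # Step 3.4: greedily keep the mu
        pvals = np.concatenate([fvals, ofvals])        # best of parents + offspring
        keep = np.argsort(pvals)[:mu]
        parents, fvals = pool[keep], pvals[keep]
    return parents[0], fvals[0]                        # best parent found

sphere = lambda x: float(np.sum(x**2))                 # toy objective, minimum 0 at origin
lo, hi = np.full(2, -5.0), np.full(2, 5.0)
x_best, f_best = ep_greedy(sphere, lo, hi)
print(f_best)
```

Because F ≥ τ, the mutation variances stay bounded below by τ, which is precisely the hypothesis of the convergence result that follows.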
Corollary 4. Let D be a bounded subset of Rd such that ψD(δ) > 0 for all δ > 0. Suppose the
above EP algorithm is applied to a real-valued function f on D such that f∗ := infx∈D f(x) > −∞.
Moreover, suppose that f is continuous at a global minimizer x∗ of f over D. Furthermore, suppose that the fitness function F : D → R+ satisfies F(x) ≥ τ for all x ∈ D, for some constant τ > 0. Then f(X∗n) −→ f∗ a.s.
Proof. As before, we first check that the above EP algorithm follows the GARS framework. For
n = 1, 2, . . . , µ, we have kn = 1 and Λn,1 = Yn. Moreover, for all n ≥ µ + 1, we have kn = 1 and
Λn,1 = Zn. Since the selection of the new parent population in Step 3.4 is done in a greedy manner,
it follows that for each integer t ≥ 1 and i = 1, 2, . . . , µ, Pi(t − 1) is a deterministic function of
Y1, Y2, . . . , Ytµ. This also implies that for each integer t ≥ 1 and i = 1, 2, . . . , µ, Pi(t − 1) is also a
deterministic function of Y1, Y2, . . . , Ytµ+i−1. Hence, for each integer t ≥ 1 and i = 1, 2, . . . , µ, we
have
Ytµ+i = Φ(t−1)µ+i(Etµ+i−1) + Ztµ+i,
for some deterministic function Φ(t−1)µ+i. Note that this implies that
Yµ+k = Φk(Eµ+k−1) + Zµ+k, for all k ≥ 1
where Zµ+k is a random vector whose conditional distribution given σ(Eµ+k−1) is a Normal distribution with mean vector 0 and diagonal covariance matrix
Cov(Zµ+k) = diag((σµ+k^(1))^2, (σµ+k^(2))^2, . . . , (σµ+k^(d))^2).
For each integer k ≥ 1 and j = 1, 2, . . . , d, we have (σµ+k^(j))^2 = F(Pi(t−1)) ≥ τ, where i and t are the unique integers such that 1 ≤ i ≤ µ, t ≥ 1 and k = (t−1)µ + i. Define the subsequence {nk}k≥1 by nk = µ + k for all k ≥ 1. Then we have
Ynk = Φk(E(nk)−1) + Wk, for all k ≥ 1,
where Wk = Znk. Let λk be the smallest eigenvalue of Cov(Wk). Since the eigenvalues of Cov(Wk) are (σµ+k^(1))^2, . . . , (σµ+k^(d))^2, we have
λk = min1≤j≤d (σµ+k^(j))^2 ≥ τ,
and so, infk≥1 λk ≥ τ > 0. Moreover, the conditional distribution of Wk given σ(E(nk)−1) is an elliptical distribution with conditional density given by
qk(w | σ(E(nk)−1)) = γ [det(Vk)]^{−1/2} Ψ(w^T Vk^{−1} w), w ∈ Rd,
where Vk := Cov(Wk), Ψ(y) = e^{−y/2} and γ = (2π)^{−d/2}. Again, Ψ(y) = e^{−y/2} is monotonically nonincreasing, and so, the conclusion follows from Theorem 6.
6 Related Convergence Results
The purpose of this section is to explore some connections between the results in Section 3 and
the results in a paper by Stephens and Baritompa (1998). Recall the notation in Section 3. Let
D be a subset of Rd and let C(D) be the set of all continuous functions f : D → R. Moreover,
let X∗f be the set of all global minimizers of f : D → R. If D is compact, it is well-known that X∗f ≠ ∅ for all f ∈ C(D). Now suppose that a stochastic global minimization algorithm applied to f : D → R generates the random vector iterates {Xn}n≥1. The range of the sequence of random vectors {Xn}n≥1 is denoted by range({Xn}n≥1), the set of subsequential limit points of {Xn}n≥1 is denoted by slp({Xn}n≥1), and the closure of {Xn}n≥1 is denoted by cl({Xn}n≥1). Note that range({Xn}n≥1), slp({Xn}n≥1) and cl({Xn}n≥1) are random sets of points in D and cl({Xn}n≥1) = range({Xn}n≥1) ∪ slp({Xn}n≥1). Following the definition in Stephens and Baritompa (1998), the algorithm is said to see the global minimum if cl({Xn}n≥1) ∩ X∗f ≠ ∅. In addition, this algorithm is said to see the point x ∈ D if x ∈ cl({Xn}n≥1).
Stephens and Baritompa (1998) introduced the notion of a deterministic (or stochastic) sequen-
tial sampling algorithm as an algorithm where the next iterate (or sample point) depends only on
what they call local information that can be obtained from the previous iterates and on an instance
of a random element (in the stochastic case). Examples of local information include the best sample point, the maximum slope between sample points, or an interpolating polynomial through the
sample points. On the other hand, examples of nonlocal information include the Lipschitz constant,
level set associated with a function value, the number of local minima, or the global minimum itself.
The GARS framework (Algorithm A, Section 3) is more general than the idea of a stochastic sequential sampling algorithm since it could incorporate nonlocal information through the parameters
of the probability distributions in Step 1.1 or through the deterministic function Θn in Step 1.2.
Moreover, in Step 1.2 of the GARS framework, the current iterate possibly depends not only on
the previous iterates and their function values (as in stochastic sequential sampling algorithms) but
also on the intermediate random elements that were used to generate the previous iterates. Hence,
the results in this paper apply to a wider class of algorithms. The following theorem is a special
case of Theorem 3.3 in Stephens and Baritompa (1998) when restricted to continuous functions.
Note that this also applies to GARS algorithms that satisfy the definition of a stochastic sequential
sampling algorithm.
Theorem (Stephens and Baritompa 1998). For any probability p and for any stochastic sequential sampling algorithm,
P(algorithm sees the global minimum of f) ≥ p, ∀f ∈ C(D)
if and only if
P(x ∈ cl({Xn}n≥1)) ≥ p, ∀x ∈ D, ∀f ∈ C(D).
The above theorem provides a necessary and sufficient condition for an algorithm to see the global minimum of f with a specified probability. In particular, the special case where p = 1 provides a result similar to those we proved in Section 3. The above result provides a more general convergence criterion in that p could be less than 1. However, given a practical stochastic algorithm, it is typically not straightforward to check whether P(x ∈ cl({Xn}n≥1)) ≥ p for all x ∈ D and all f ∈ C(D) for a given value of p, especially when p is strictly between 0 and 1. The next theorem says that for stochastic algorithms applied to continuous functions defined over compact domains, converging to the global minimum almost surely is equivalent to seeing the global minimum almost surely. Recall that the global search conditions in Section 3 imply convergence to the global minimum almost surely. Hence, these same conditions can be used to guarantee that the algorithm sees the global minimum almost surely.
Theorem 7. Let D be a compact subset of Rd and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of random vector iterates {Xn}n≥1 whose sequence of best points found so far is given by {X∗n}n≥1. Then f(X∗n) −→ f∗ := infx∈D f(x) > −∞ a.s. if and only if cl({Xn}n≥1) ∩ X∗f ≠ ∅ a.s. (i.e., the algorithm sees the global minimum of f a.s.).
Proof. First, assume that f(X∗n) −→ f∗ a.s. Then there exists a set N such that P(N) = 0 and f(X∗n(ω)) −→ f∗ for all ω ∈ N^c. Fix ω ∈ N^c. We wish to show that cl({Xn(ω)}n≥1) ∩ X∗f ≠ ∅. Suppose this is not the case. Let f̄ := inf{f(x) : x ∈ cl({Xn(ω)}n≥1)}. Since cl({Xn(ω)}n≥1) is a closed subset of the compact set D, it follows that cl({Xn(ω)}n≥1) is also compact. Moreover, since f is a continuous function, it follows that f̄ = f(x̄) for some x̄ ∈ cl({Xn(ω)}n≥1). By assumption, cl({Xn(ω)}n≥1) ∩ X∗f = ∅, and so, f̄ = f(x̄) > f∗. Next,
range({X∗n(ω)}n≥1) ⊆ range({Xn(ω)}n≥1) ⊆ cl({Xn(ω)}n≥1).
Hence, f(X∗n(ω)) ≥ f̄ > f∗ for all n ≥ 1, and so, f(X∗n(ω)) cannot converge to f∗, which is a contradiction. Hence, cl({Xn(ω)}n≥1) ∩ X∗f ≠ ∅ for this ω ∈ N^c. Thus, the algorithm sees the global minimum of f a.s.
To prove the converse, assume that cl({Xn}n≥1) ∩ X∗f ≠ ∅ a.s. Then there exists a set N such that P(N) = 0 and cl({Xn(ω)}n≥1) ∩ X∗f ≠ ∅ for all ω ∈ N^c. Fix ω ∈ N^c and let x∗ ∈ cl({Xn(ω)}n≥1) ∩ X∗f. Note that x∗ ∈ range({Xn(ω)}n≥1) or x∗ ∈ slp({Xn(ω)}n≥1), or both.
Suppose x∗ ∈ range({Xn(ω)}n≥1). Then x∗ = Xn̄(ω) for some integer n̄. Since x∗ ∈ X∗f, it follows that X∗n(ω) = x∗ for all n ≥ n̄. Hence, f(X∗n(ω)) = f(x∗) = f∗ for all n ≥ n̄, and so, f(X∗n(ω)) −→ f∗.
On the other hand, suppose x∗ ∈ slp({Xn(ω)}n≥1). Then there exists a subsequence {Xnk(ω)}k≥1 such that Xnk(ω) −→ x∗ as k → ∞. Since f is continuous, f(Xnk(ω)) −→ f(x∗) = f∗ as k → ∞. Moreover, 0 ≤ f(X∗nk(ω)) − f∗ ≤ f(Xnk(ω)) − f∗ −→ 0 as k → ∞, and so, f(X∗nk(ω)) −→ f∗ as k → ∞. Next, since f∗ > −∞ and {f(X∗n(ω))}n≥1 is monotonically nonincreasing, it follows that limn→∞ f(X∗n(ω)) exists, and so, limn→∞ f(X∗n(ω)) = f∗.
In either case, f(X∗n(ω)) −→ f∗ for the given ω ∈ N^c. Thus, f(X∗n) −→ f∗ a.s.
Torn and Zilinskas (1989) proved that a deterministic global minimization algorithm converges
to the global minimum of any continuous function on a compact set D ⊆ Rd if and only if the
sequence of iterates of the algorithm is dense in D. Stephens and Baritompa (1998) extended this
result to any deterministic sequential sampling algorithm on what they call a sufficiently rich class
of functions (including continuous functions) and they also proved a stochastic version of their
result. Theorem 8 below is another stochastic version of the result by Torn and Zilinskas (1989)
that is different from the theorem proved by Stephens and Baritompa (1998).
Consider a sequence of random vectors {Xn}n≥1 whose realizations are in D ⊆ Rd. We say that {Xn}n≥1 is probabilistically dense in D with guarantee p if
P(range({Xn}n≥1) ∩ B(x, δ) ≠ ∅) ≥ p, ∀x ∈ D, ∀δ > 0.
As before, B(x, δ) is the open ball in Rd centered at x with radius δ. If the above condition holds with p = 1, we simply say that {Xn}n≥1 is probabilistically dense in D.
The next theorem shows that the sequence of random vector iterates of a stochastic global
minimization algorithm is probabilistically dense in D with guarantee p if and only if the algorithm
sees any point x ∈ D (including the global minimum of any function f on D) with probability at
least p. To prove the next theorem, we need the following lemma.
Lemma 1. Let {Xn}n≥1 be a sequence of random vectors whose realizations are in D ⊆ Rd. For any x ∈ D, the following events are equal:
[x ∈ cl({Xn}n≥1)] = [range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0].
Proof. Since cl({Xn}n≥1) = range({Xn}n≥1) ∪ slp({Xn}n≥1), we have [x ∈ cl({Xn}n≥1)] ⊆ [range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Next, suppose ω ∈ [range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Then range({Xn(ω)}n≥1) ∩ B(x, δ) ≠ ∅ for all δ > 0. Note that either x ∈ range({Xn(ω)}n≥1) ⊆ cl({Xn(ω)}n≥1) or x ∉ range({Xn(ω)}n≥1). If x ∉ range({Xn(ω)}n≥1), then there are infinitely many elements of range({Xn(ω)}n≥1) that are contained in any open ball around x. This implies that there exists an integer n1 such that Xn1(ω) ∈ B(x, 1). Moreover, there exists an integer n2 > n1 such that Xn2(ω) ∈ B(x, 1/2). In general, for any integer k > 1, there exists an integer nk > nk−1 such that Xnk(ω) ∈ B(x, 1/k). Clearly, Xnk(ω) −→ x as k → ∞, and so, x ∈ slp({Xn(ω)}n≥1) ⊆ cl({Xn(ω)}n≥1). In either case, we have ω ∈ [x ∈ cl({Xn}n≥1)], and so, [range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0] ⊆ [x ∈ cl({Xn}n≥1)].
Theorem 8. Let D ⊆ Rd and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {Xn}n≥1 whose realizations are in D. Then {Xn}n≥1 is probabilistically dense in D with guarantee p if and only if P(x ∈ cl({Xn}n≥1)) ≥ p for all x ∈ D (i.e., the algorithm sees any point of D with probability at least p).
Proof. First, suppose that {Xn}n≥1 is probabilistically dense in D with guarantee p. Fix x ∈ D. From the previous lemma, we have
[x ∈ cl({Xn}n≥1)] = [range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0].
Moreover, it is easy to check that
[range({Xn}n≥1) ∩ B(x, δ) ≠ ∅ ∀δ > 0] = [range({Xn}n≥1) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1].
Hence,
P(x ∈ cl({Xn}n≥1)) = P(range({Xn}n≥1) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1).
For every integer k ≥ 1, define the event
Sk = [range({Xn}n≥1) ∩ B(x, 1/k) ≠ ∅].
Since {Xn}n≥1 is probabilistically dense in D with guarantee p, it follows that P(Sk) ≥ p for all k ≥ 1. Moreover, since Sk ⊇ Sk+1 for all k ≥ 1, it follows that {P(Sk)}k≥1 is a nonincreasing sequence that is bounded below, and so, limk→∞ P(Sk) exists. Hence,
P(x ∈ cl({Xn}n≥1)) = P(⋂k≥1 Sk) = P(limk→∞ ⋂i=1,...,k Si) = P(limk→∞ Sk) = limk→∞ P(Sk) ≥ p.
To prove the converse, suppose that P(x ∈ cl({Xn}n≥1)) ≥ p for all x ∈ D. Fix x ∈ D and δ > 0. Note that
P(range({Xn}n≥1) ∩ B(x, δ) ≠ ∅) ≥ P(range({Xn}n≥1) ∩ B(x, δ′) ≠ ∅ ∀δ′ > 0) = P(x ∈ cl({Xn}n≥1)) ≥ p,
where the above equality holds because of the previous lemma. Hence, {Xn}n≥1 is probabilistically dense in D with guarantee p.
Corollary 5. Let D ⊆ Rd and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {Xn}n≥1 whose realizations are in D. Moreover, suppose that X∗f ≠ ∅ and {Xn}n≥1 is probabilistically dense in D with guarantee p. Then P(algorithm sees the global minimum of f) ≥ p.
Proof. Let x∗ ∈ X∗f. Then
P(algorithm sees the global minimum of f) = P(cl({Xn}n≥1) ∩ X∗f ≠ ∅)
≥ P(x∗ ∈ cl({Xn}n≥1)) = P(algorithm sees the point x∗) ≥ p,
where the last inequality follows from Theorem 8.
The next theorem (Theorem 9) shows that if a GARS algorithm satisfies the global search
condition in Theorem 3 of Section 3, then the resulting sequence of iterates will be probabilistically
dense in D, and so, by Corollary 5, the algorithm sees the global minimum of f with probability 1.
Note that this end result can also be obtained from Theorems 3 and 7. Hence, Theorem 9 and
Corollary 5 provide an alternative proof that any GARS algorithm satisfying the global search
conditions in Theorem 3 also sees the global minimum of f with probability 1.
Theorem 9. Let f be a real-valued function on D ⊆ Rd. Suppose that a GARS algorithm applied to f satisfies Global Search Condition 3 from Theorem 3: For any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that
P[Ynk ∈ B(z, δ) ∩ D | σ(E(nk)−1)] ≥ ν(z, δ),
for some subsequence {nk}k≥1. Then {Xn}n≥1 is probabilistically dense in D, and consequently, P(x ∈ cl({Xn}n≥1)) = 1 for all x ∈ D.
Proof. Fix x ∈ D and δ > 0. We have
P(range({Xn}n≥1) ∩ B(x, δ) = ∅) = P(Xn ∉ B(x, δ) ∀n ≥ 1)
= P(Xn ∉ (B(x, δ) ∩ D) ∀n ≥ 1) (since {Xn}n≥1 ⊆ D for a GARS algorithm)
≤ P(Yn ∉ (B(x, δ) ∩ D) ∀n ≥ 1) ≤ P(Ynk ∉ (B(x, δ) ∩ D) ∀k ≥ 1)
= P(⋂i≥1 [Yni ∉ (B(x, δ) ∩ D)]) = P(limk→∞ ⋂i=1,...,k [Yni ∉ (B(x, δ) ∩ D)])
= limk→∞ P(⋂i=1,...,k [Yni ∉ (B(x, δ) ∩ D)])
= limk→∞ ∏i=1,...,k P(Yni ∉ (B(x, δ) ∩ D) | Yn1 ∉ (B(x, δ) ∩ D), . . . , Yni−1 ∉ (B(x, δ) ∩ D))
= limk→∞ ∏i=1,...,k (1 − P(Yni ∈ (B(x, δ) ∩ D) | Yn1 ∉ (B(x, δ) ∩ D), . . . , Yni−1 ∉ (B(x, δ) ∩ D)))
≤ limk→∞ ∏i=1,...,k (1 − ν(x, δ)) = limk→∞ (1 − ν(x, δ))^k = 0 (since 0 < ν(x, δ) < 1).
Hence, P(range({Xn}n≥1) ∩ B(x, δ) = ∅) = 0, and so, P(range({Xn}n≥1) ∩ B(x, δ) ≠ ∅) = 1 for all x ∈ D and for all δ > 0, i.e., {Xn}n≥1 is probabilistically dense in D. Moreover, by Theorem 8, P(x ∈ cl({Xn}n≥1)) = 1 for all x ∈ D.
Following the definition in Stephens and Baritompa (1998), we say that an algorithm localizes the global minimizers of a function f : D → R if its sequence of iterates {Xn}n≥1 satisfies ∅ ≠ slp({Xn}n≥1) ⊆ X∗f.
Theorem 10. Let D be a compact subset of Rd and suppose a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vectors {Xn}n≥1 whose realizations are in the compact set D. If {Xn}n≥1 is probabilistically dense in D and each of the random vectors in {Xn}n≥1 has a continuous probability distribution, then P(x ∈ slp({Xn}n≥1)) = 1 for all x ∈ D and P(algorithm does not localize the global minimizers of f) = 1.
Proof. From Theorem 8, we have P(x ∈ cl({Xn}n≥1)) = 1 for all x ∈ D. Fix x ∈ D. Note that
P(x ∈ cl({Xn}n≥1)) = P(x ∈ slp({Xn}n≥1)) + P(x ∉ slp({Xn}n≥1) and x = Xn for some integer n).
Now
P(x ∉ slp({Xn}n≥1) and x = Xn for some integer n) ≤ P(x = Xn for some integer n)
= P(⋃n≥1 [x = Xn]) ≤ ∑n≥1 P(x = Xn) = 0 (since each Xn has a continuous distribution).
Hence, P(x ∈ slp({Xn}n≥1)) = P(x ∈ cl({Xn}n≥1)) = 1. In addition, if x ∈ D \ X∗f, then
P(algorithm does not localize the global minimizers of f)
= P(slp({Xn}n≥1) ⊈ X∗f) + P(slp({Xn}n≥1) = ∅) = P(slp({Xn}n≥1) ⊈ X∗f) (since D is compact)
≥ P(x ∈ slp({Xn}n≥1)) = 1.
Consider any GARS algorithm that satisfies the global search condition from Theorem 3. By Theorem 9, the sequence of iterates {Xn}n≥1 will be probabilistically dense in D. Hence, in the case where each Xn has a continuous probability distribution, Theorem 10 implies that each point of D is a subsequential limit of the sequence of iterates of the algorithm with probability 1, and so, the algorithm does not localize the global minimizers of f with probability 1. This suggests that the worst-case convergence rate of the algorithm will be slow (and this will be confirmed in the next section) since the algorithm essentially searches the whole domain to find the global minimum. However, if we focus on the sequence of best points {X∗n}n≥1, the next theorem says that any subsequential limit of {X∗n}n≥1 is a global minimizer almost surely.
Theorem 11. Let D be a compact subset of Rd and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of iterates {Xn}n≥1 whose sequence of best points found so far {X∗n}n≥1 satisfies f(X∗n) −→ f∗ := infx∈D f(x) > −∞ a.s. Then slp({X∗n}n≥1) ⊆ X∗f a.s.
Proof. Since f(X∗n) −→ f∗ a.s., there exists a set N such that P(N) = 0 and f(X∗n(ω)) −→ f∗ for all ω ∈ N^c. Fix ω ∈ N^c and let x ∈ D \ X∗f. We wish to show that x ∉ slp({X∗n(ω)}n≥1).
From the proof of Theorem 7, there exists x∗ ∈ X∗f such that x∗ ∈ cl({Xn(ω)}n≥1). Since x ∉ X∗f, we have f(x) − f(x∗) > 0. Now, since f is continuous over D, there is a δ > 0 such that whenever y ∈ B(x∗, δ) ∩ D, we have
|f(y) − f(x∗)| = f(y) − f(x∗) ≤ (1/2)(f(x) − f(x∗)), or equivalently, f(y) ≤ (1/2)(f(x) + f(x∗)).
Furthermore, since x∗ ∈ cl({Xn(ω)}n≥1), there is an integer n̄ such that Xn̄(ω) ∈ B(x∗, δ). Note that for any integer n ≥ n̄, we have
f(X∗n(ω)) ≤ f(X∗n̄(ω)) ≤ f(Xn̄(ω)) ≤ (1/2)(f(x) + f(x∗)) < f(x),
and so, lim supn→∞ f(X∗n(ω)) < f(x). Suppose x ∈ slp({X∗n(ω)}n≥1). Then there is a subsequence {X∗nk(ω)}k≥1 that converges to x. Since f is continuous, f(X∗nk(ω)) −→ f(x) as k → ∞, and so, lim supn→∞ f(X∗n(ω)) ≥ f(x), which yields a contradiction. Hence, x ∉ slp({X∗n(ω)}n≥1). This implies that
D \ X∗f ⊆ D \ slp({X∗n(ω)}n≥1),
or equivalently, slp({X∗n(ω)}n≥1) ⊆ X∗f. Thus, slp({X∗n}n≥1) ⊆ X∗f a.s.
7 Convergence Rates
In the GARS framework, we allowed for the possibility that the trial random vector iterates Ynn≥1
are dependent. Moreover, in the convergence results for the GARS framework, the global search
conditions that are necessary for almost sure convergence to the global minimum are only required
for some subsequence of the iterations. Because of these, it is hard to perform any kind of con-
vergence analysis for the GARS framework. However, in the special case of simple random search,
where the Yn’s are all independent and identically distributed random vectors, there had been some
convergence analyses (e.g. Spall (2003)). Below, we provide a simple result that applies to the case
when the Yn’s are not necessarily independent and not necessarily identically distributed but the
result assumes that Global Search Condition 3 in Theorem 3 is satisfied by all the iterates.
Theorem 12. Let f be a real-valued function that has a unique global minimizer x∗ on a set D ⊆ Rd in the sense of Theorem 2 and let f be continuous at x∗. Suppose that a GARS algorithm satisfies Global Search Condition 3 from Theorem 3 for all iterations: For any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that
P[Yn ∈ B(z, δ) ∩ D | σ(En−1)] ≥ ν(z, δ) for all n ≥ 1. (6)
Then P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗, δ))^n.
Proof. By Theorem 3, X∗n −→ x∗ a.s. Moreover, the probability that the algorithm lands in the region B(x∗, δ) ∩ D within n function evaluations is given by
P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n]
= 1 − P[X1 ∉ B(x∗, δ) ∩ D, X2 ∉ B(x∗, δ) ∩ D, . . . , Xn ∉ B(x∗, δ) ∩ D]
= 1 − ∏i=1,...,n P[Xi ∉ B(x∗, δ) ∩ D | X1 ∉ B(x∗, δ) ∩ D, . . . , Xi−1 ∉ B(x∗, δ) ∩ D]
= 1 − ∏i=1,...,n (1 − P[Xi ∈ B(x∗, δ) ∩ D | X1 ∉ B(x∗, δ) ∩ D, . . . , Xi−1 ∉ B(x∗, δ) ∩ D]).
Since
P[Xi ∈ B(x∗, δ) ∩ D | σ(Ei−1)] ≥ P[Yi ∈ B(x∗, δ) ∩ D | σ(Ei−1)] ≥ ν(x∗, δ)
for all i = 1, . . . , n, it follows that
P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗, δ))^n.
The above theorem provides a lower bound on the probability that the algorithm lands within a δ-neighborhood of the global minimizer within n iterations (or function evaluations). Now fix the neighborhood radius δ > 0 and the probability requirement 0 < ξ < 1. If we set n = ⌈log(1 − ξ)/log(1 − ν(x∗, δ))⌉, then P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. We then determine how large this particular n would be in special cases.
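The sample-size formula n = ⌈log(1 − ξ)/log(1 − ν(x∗, δ))⌉ is easy to evaluate. The short sketch below (illustrative code; the function names are ours) computes it on the unit hypercube using the worst-case ball-volume value of ν derived later with β = 1, and shows how quickly n grows with the dimension d:

```python
import math

def nu_unit_cube(d, delta, beta=1.0):
    """nu(x*, delta) = beta * (delta/2)^d * pi^(d/2) / Gamma(d/2 + 1) on D = [0, 1]^d."""
    return beta * (delta / 2.0)**d * math.pi**(d / 2.0) / math.gamma(d / 2.0 + 1.0)

def required_evals(nu, xi):
    """Smallest n with 1 - (1 - nu)^n >= xi, i.e., n = ceil(log(1 - xi)/log(1 - nu))."""
    return math.ceil(math.log(1.0 - xi) / math.log(1.0 - nu))

# Function evaluations needed to land within delta = 0.1 of x* with probability 0.9.
ns = [required_evals(nu_unit_cube(d, 0.1), 0.9) for d in (1, 2, 5, 10)]
print(ns)  # grows rapidly with d
```

The rapid growth of these values is made precise by Proposition 2 at the end of this section.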
Corollary 6. Let f be a real-valued function on D = [0, 1]^d such that f∗ := infx∈D f(x) > −∞. Moreover, suppose that f has a unique global minimizer x∗ on D in the sense of Theorem 2 and that f is continuous at x∗. Consider a GARS algorithm and suppose that for each n ≥ 1, Yn has a conditional density gn(y | σ(En−1)) satisfying the following condition: h(y) := infn≥1 gn(y | σ(En−1)) ≥ β for all y ∈ D, where β > 0 is a constant. By Theorem 4, X∗n −→ x∗ a.s. Fix the neighborhood radius 0 < δ ≤ 1/2 and the probability requirement 0 < ξ < 1. If we set
n = ⌈log(1 − ξ)/log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1))⌉, (7)
then P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ.
Proof. Fix z ∈ D and 0 < δ ≤ 1/2. For any n ≥ 1, we have

P[Yn ∈ B(z, δ) ∩ D | σ(En−1)] = ∫_{B(z,δ)∩D} gn(y | σ(En−1)) dy ≥ ∫_{B(z,δ)∩D} h(y) dy
≥ ∫_{B(z,δ)∩D} β dy = β µ(B(z, δ) ∩ D) ≥ β ψD(δ) = (δ/2)^d βπ^{d/2}/Γ(d/2 + 1) =: ν(z, δ),

where ψD(δ) is defined in Theorem 4 and its value in this case is given by Proposition 1. Hence,
(6) from Theorem 12 holds, and so, for the above choice of n, we get

P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x∗, δ))^n ≥ ξ.
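A short numerical sketch of (7) (illustrative, not from the paper): the helper below evaluates ν(z, δ), which is β times the volume of a d-ball of radius δ/2, and the corresponding n.

```python
import math

def n_from_eq7(d, delta, xi, beta):
    """n in (7): nu is beta times the volume of a d-ball of radius delta/2."""
    nu = (delta / 2) ** d * beta * math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return math.ceil(math.log(1 - xi) / math.log(1 - nu))

# d = 1, delta = 1/2, beta = 1: nu = (1/4) * sqrt(pi) / Gamma(3/2) = 1/2,
# so reaching the delta-ball with probability 0.9 requires n = 4 iterations.
print(n_from_eq7(1, 0.5, 0.9, 1.0))
```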
Consider the special case of Corollary 6 where Yn has the uniform distribution on the closed
hypercube D = [0, 1]^d for all n ≥ 1. Then Xn ≡ Yn for all n ≥ 1. In this case, h(y) = 1 for all
y ∈ D and we can choose β = 1. Again, fix 0 < δ ≤ 1/2 and 0 < ξ < 1. If we set n equal to the
value given by (7) with β = 1, then P[Xi ∈ B(x∗, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. Hence, this
gives a value of n that guarantees that the well-known random ("blind") search algorithm lands
within a δ-neighborhood of the global minimizer with probability at least ξ. But how large is this
value of n? That is, what is the complexity of n in terms of the problem dimension d?
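Before answering, the blind-search guarantee is easy to check empirically. The following Monte Carlo sketch (not in the paper) places x∗ at the center of the cube so that B(x∗, δ) lies inside the domain; the function name and sample sizes are illustrative.

```python
import math
import random

def blind_search_hit_rate(d, delta, n, trials=2000, seed=0):
    """Fraction of runs in which uniform ('blind') sampling on [0,1]^d
    lands in B(x*, delta) within n draws, with x* at the cube's center."""
    rng = random.Random(seed)
    xstar = [0.5] * d
    hits = 0
    for _ in range(trials):
        if any(math.dist([rng.random() for _ in range(d)], xstar) < delta
               for _ in range(n)):
            hits += 1
    return hits / trials

# With d = 2 and delta = 0.5, each draw hits with probability pi/4 ~ 0.785,
# so n = 4 draws should succeed in nearly every run.
rate = blind_search_hit_rate(2, 0.5, 4)
```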
Proposition 2. Consider the assumptions of Corollary 6. The value of n given in (7) grows
exponentially in d.
Proof. From elementary calculus, it is easy to verify that −1/log(1 − x) ≥ 1/(2x) for all x ∈ (0, 1/2].
Now, since 0 < δ ≤ 1/2 and d ≥ 1, it follows that ((δ/2)π^{1/2})^d ≤ √π/4 and Γ(d/2 + 1) ≥ √π/2,
and so

(δ/2)^d π^{d/2}/Γ(d/2 + 1) ≤ 1/2.

Moreover, observe that 0 < β ≤ 1. (If β > 1, then gn(y | σ(En−1)) in Corollary 6 cannot be a
conditional density for each n ≥ 1.) Hence, (δ/2)^d βπ^{d/2}/Γ(d/2 + 1) ≤ 1/2, and so it
follows that

−1/log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1)) ≥ 2^d Γ(d/2 + 1)/(2δ^d βπ^{d/2}) ≥ (√π/(4β)) (4/√π)^d.

Multiplying the above inequality by −log(1 − ξ) > 0, it follows that the value of n in (7) above
satisfies

n ≥ log(1 − ξ)/log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1)) ≥ −(√π/(4β)) log(1 − ξ) (4/√π)^d.

Since −(√π/(4β)) log(1 − ξ) is a strictly positive constant and 4/√π > 2, this shows that the value
of n given by (7) grows exponentially in d.
Hence, this proposition shows that the number of iterations needed to guarantee landing near the
global minimizer grows rapidly as the dimension of the problem increases. However, as pointed out
by Spall (2003), the no free lunch theorems (Wolpert and Macready 1997) indicate that this simple
GARS algorithm is no worse than any other algorithm when performance is averaged over the entire
range of possible optimization problems. In practice, though, some algorithms are tailored to
perform better than others on certain classes of problems with specific characteristics.
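To make the exponential growth concrete, the snippet below (illustrative, using δ = 1/2, ξ = 0.9, β = 1) compares the exact value of n in (7) with the lower bound −(√π/(4β)) log(1 − ξ) (4/√π)^d established in the proof of Proposition 2.

```python
import math

def n_eq7(d, delta=0.5, xi=0.9, beta=1.0):
    """Exact value of n from (7)."""
    nu = (delta / 2) ** d * beta * math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return math.ceil(math.log(1 - xi) / math.log(1 - nu))

def prop2_bound(d, xi=0.9, beta=1.0):
    """Exponential lower bound from the proof of Proposition 2."""
    c = -math.sqrt(math.pi) / (4 * beta) * math.log(1 - xi)
    return c * (4 / math.sqrt(math.pi)) ** d

# n grows at least like (4/sqrt(pi))^d ~ 2.26^d as d increases.
for d in (1, 2, 5, 10):
    assert n_eq7(d) >= prop2_bound(d)
```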
8 Summary
We proved some results that guarantee almost sure convergence to the global minimum for a general
class of adaptive stochastic search algorithms that follow the GARS (Generalized Adaptive Ran-
dom Search) framework. The GARS framework is an extension of the well-known simple (“blind”)
random search algorithm where the random iterates are not necessarily independent. In the GARS
framework, we allow for the possibility that a number of intermediate random elements are first
generated before the trial random vector iterate is computed. Moreover, if the trial iterate falls out-
side the domain of the problem, then it is mapped to the domain via an absorbing transformation.
By imposing some conditions on the random vector iterates and the probability distributions that
generate them (i.e., the global search conditions in Theorems 1 and 3 and subsequent theorems),
the convergence of a GARS algorithm that satisfies these conditions is guaranteed. In addition,
the global search condition only needs to be satisfied by a subsequence of the iterations in order
to guarantee convergence. This makes the results applicable to a wide range of practical stochastic
global optimization algorithms, including those that perform both local and global search and also
including those that combine both stochastic and deterministic search strategies.
We also proved convergence results (Theorems 4 and 5) that involve simple requirements on the
conditional densities of the trial random vector iterates that are easy to verify in practice. More-
over, in Theorem 6, we provided some simple conditions that guarantee convergence when using
an elliptical distribution, such as the multivariate Normal or Cauchy distribution, to generate the
trial random vector iterates. In Section 5, we provided a convergence proof for some practical
stochastic global optimization algorithms, including an evolutionary programming algorithm with
greedy parent selection. In Section 6, we explored some connections with the results by Stephens
and Baritompa (1998). In particular, we showed that for stochastic global minimization algorithms
applied to continuous functions defined over compact domains, converging almost surely to the
global minimum is equivalent to seeing the global minimum as defined by Stephens and Barito-
mpa (1998). Moreover, we introduced the notion of a sequence of random vector iterates being
probabilistically dense in the domain and showed that this is also equivalent to seeing the global
minimum with probability 1 under the usual assumptions. In addition, we proved that a GARS
algorithm satisfying the global search condition in Theorem 3 generates a sequence of iterates that
is probabilistically dense in the domain, and consequently, the algorithm sees any point of the do-
main (including the global minimizers) with probability 1. Finally, in Section 7, we proved some
simple results on the convergence rate of a GARS algorithm.
Acknowledgements
I would like to thank Prof. Shane Henderson from the School of Operations Research & Information
Engineering at Cornell University for some helpful comments during the early stages of this paper.
I would also like to thank the anonymous referees for their helpful comments and suggestions.
References
1. Bäck, T. Evolutionary Algorithms in Theory and Practice. Oxford University Press: New
York; 1996.
2. Bäck, T., Rudolph, G., and Schwefel, H.-P. Evolutionary programming and evolution strate-
gies: similarities and differences. In: Fogel, D.B. and Atmar, J.W. (Eds.), Proceedings of
the Second Annual Conference on Evolutionary Programming. Evolutionary Programming
Society: La Jolla, CA; 1993. pp. 11–22.
3. Chin, D.C. Comparative study of stochastic algorithms for system optimization based on
gradient approximations. IEEE Transactions on Systems, Man, and Cybernetics - B 1997;
27; 244–249.
4. Fang, K.-T. and Zhang, Y.-T. Generalized Multivariate Analysis. Science Press: Beijing.
Springer-Verlag: Berlin; 1990.
5. Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.
Annals of Mathematical Statistics 1952; 23(3); 462–466.
6. Maryak, J.L. and Chin, D.C. Global random optimization by simultaneous perturbation
stochastic approximation. IEEE Transactions on Automatic Control 2008; 53(3); 780–783.
7. Pintér, J.D. Global Optimization in Action. Kluwer Academic Publishers: Dordrecht; 1996.
8. Resnick, S.I. A Probability Path. Birkhäuser: Boston; 1999.
9. Solis, F.J. and Wets, R.J.-B. Minimization by random search techniques. Mathematics of
Operations Research 1981; 6(1); 19–30.
10. Spall, J.C. Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc.:
New Jersey; 2003.
11. Spall, J.C. Multivariate stochastic approximation using a simultaneous perturbation gradient
approximation. IEEE Transactions on Automatic Control 1992; 37; 332–341.
12. Stephens, C.P. and Baritompa, W. Global optimization requires global information. Journal
of Optimization Theory and Applications 1998; 96(3); 575–588.
13. Wolpert, D.H. and Macready, W.G. No free lunch theorems for optimization. IEEE Trans-
actions on Evolutionary Computation 1997; 1(1); 67–82.
14. Yao, X., Liu, Y., and Lin, G. Evolutionary programming made faster. IEEE Transactions on
Evolutionary Computation 1999; 3(2); 82–102.
15. Zabinsky, Z.B. Stochastic Adaptive Search in Global Optimization. Kluwer Academic Pub-
lishers; 2003.
16. Zabinsky, Z.B. and Smith, R.L. Pure adaptive search in global optimization. Mathematical
Programming 1992; 53; 323–338.
17. Zhigljavsky, A.A. Theory of Global Random Search. Kluwer Academic Publishers: Dordrecht;
1991.

18. Törn, A. and Žilinskas, A. Global Optimization. Springer-Verlag: Berlin, Germany; 1989.