Convergence Guarantees for Generalized Adaptive Stochastic Search Methods for Continuous Global Optimization

Rommel G. Regis
Mathematics Department

Saint Joseph’s University, Philadelphia, PA 19131, USA, [email protected]

June 23, 2010

This paper presents some simple technical conditions that guarantee the convergence of a general

class of adaptive stochastic global optimization algorithms. By imposing some conditions on the

probability distributions that generate the iterates, these stochastic algorithms can be shown to

converge to the global optimum in a probabilistic sense. These results also apply to global optimization

algorithms that combine local and global stochastic search strategies and also those algorithms

that combine deterministic and stochastic search strategies. This makes the results applicable to

a wide range of global optimization algorithms that are useful in practice. Moreover, this paper

provides convergence conditions involving the conditional densities of the random vector iterates

that are easy to verify in practice. It also provides some convergence conditions in the special

case when the iterates are generated by elliptical distributions such as the multivariate Normal

and Cauchy distributions. These results are then used to prove the convergence of some practical

stochastic global optimization algorithms, including an evolutionary programming algorithm. In

addition, this paper introduces the notion of a stochastic algorithm being probabilistically dense in

the domain of the function and shows that, under simple assumptions, this is equivalent to seeing

any point in the domain with probability 1. This, in turn, is equivalent to almost sure convergence

to the global minimum. Finally, some simple results on convergence rates are also proved.

Key words: global optimization; stochastic search; random search; global search; local search;

convergence; elliptical distribution; evolutionary algorithm; evolutionary programming

1 Introduction

Let f be a deterministic function defined on a set D ⊆ Rd. A point x∗ ∈ D such that f(x∗) ≤ f(x)

for all x ∈ D is said to be a global minimum point (or global minimizer) for f on D. If f is continuous

and D is compact, then a global minimum point for f on D is guaranteed to exist. However, a

global minimizer may also exist even if D is not compact or even if f is discontinuous on certain

regions in D. We shall prove simple conditions that guarantee the convergence of a general class of

adaptive stochastic algorithms for finding the global minimum of f on D. In particular, we prove

a theorem that extends the result given on page 40 of Spall (2003) and derive consequences of this

theorem for stochastic algorithms that are used in practice.


We mention that there is a substantial body of literature on stochastic methods for local optimization

of noisy loss functions (e.g. see Chin (1997) or Spall (2003)). These methods are

called stochastic approximation algorithms and they typically use approximations of the gradient

of the loss function. Examples of these methods include the standard finite-difference stochastic

approximation (FDSA) algorithm (Kiefer and Wolfowitz 1952) and the Simultaneous Perturbation

Stochastic Approximation (SPSA) algorithm (Spall 1992). However, the focus of this paper is on

stochastic search methods for global optimization of a deterministic function.

Many results have been provided on the convergence of stochastic search algorithms for global

optimization (e.g. Solis and Wets 1981, Pinter 1996, Stephens and Baritompa 1998, Maryak and

Chin 2001, Zabinsky 2003). However, many of these convergence conditions are cumbersome to

verify for algorithms that are used in practice. Moreover, some of these convergence results apply

only to a specific type of stochastic search algorithm. For example, Maryak and Chin (2001) showed

that under certain conditions, the Simultaneous Perturbation Stochastic Approximation (SPSA)

algorithm converges in probability to the global optimum. In addition, many of these convergence

conditions are usually applied to the uniform distribution or its variants but seldom applied to

other distributions that are used in practical stochastic global optimization algorithms like the

Normal, Cauchy or even the Triangular distributions. For example, the Pure Adaptive Search (PAS)

algorithm by Zabinsky and Smith (1992) uses the uniform distribution on the improving level set

in each iteration. However, in many heuristic optimization algorithms such as evolution strategies

and evolutionary programming algorithms, the multivariate Normal distribution is typically used.

Section 3 will provide a very simple and very general framework that can capture a wide range

of stochastic global optimization algorithms that can be designed in practice. The main goal of

this paper is to provide a set of convergence conditions for this general framework that are easy to

verify and that apply to commonly used probability distributions in practice.

2 Notations

Because the algorithms are stochastic, we treat the iterates as d-dimensional random vectors whose

realizations are in D ⊆ Rd. Consider a stochastic algorithm whose iterates are given by the

sequence of random vectors {Yn}n≥1 defined on a probability space (Ω,B, P ), where the random

vector Yn : (Ω,B) → (D,B(D)) represents the nth function evaluation point. Here, Ω is the sample

space, B is a σ-field of subsets of Ω, and B(D) are the Borel sets in D. For maximum generality, we

also focus on adaptive algorithms. Here, adaptive means that Yn possibly depends on the random

vectors Y1, . . . , Yn−1 for all n > 1. Our use of the term adaptive is more general than the one used


by Zabinsky (2003) in the Pure Adaptive Search (PAS) algorithm. In fact, in PAS, each Yn has

the uniform distribution on the improving level set {x ∈ D : f(x) < f(Yi), i = 1, . . . , n − 1}.

In practical global optimization algorithms, it is not uncommon to combine both deterministic

and stochastic strategies for the selection of function evaluation points. For example, a practitioner

might want to start with a set of predetermined function evaluation points that he or she believes

would be good starting guesses for the location of the global minimum. If D is a closed hypercube,

another possibility is to begin with an optimal space-filling experimental design. Moreover, one

might also have a sequence of deterministically selected points in between sequences of randomly

selected points. For example, after doing some stochastic search, one might refine the current best

solution by running a deterministic gradient-based local minimization solver from it. To capture

global optimization algorithms in practice, we also allow for the possibility that some of the Yn’s

are degenerate random vectors. Here, a degenerate random vector in Rd is one whose mass is

concentrated at a single point in Rd. Note that when a particular Yn is degenerate, this essentially

means that Yn is deterministically selected. Note that in this case, it is still possible that Yn depends

on the realizations of the previous random vectors in the sequence Y1, . . . , Yn−1.

In practice, it could happen that a randomly generated point falls outside the domain D. For

example, if Y is a multivariate Normal random vector with a positive definite covariance matrix,

then theoretically its realization can be anywhere on Rd even if its mean vector is restricted to

D. In this case, the runaway random point is typically replaced by a suitable point in D. More

precisely, let Y : (Ω,B) → (Rd,B(Rd)) be a random vector and let D ⊆ Rd. It would be useful to

have a deterministic function ρD : Rd → D with the property that ρD(x) = x for all x ∈ D. We

refer to such a function as an absorbing transformation for D since it “absorbs” any point of Rd

into D. For example, if D is compact, ρD : Rd → D may be defined such that ρD(x) is a point in

D with ‖x − ρD(x)‖ = infy∈D ‖x − y‖. That is, ρD(x) is a point in D that is as close as possible

to x ∈ Rd. When D = [a, b] ⊆ Rd is a hypercube, we have ρD(x) = min(max(a, x), b), where the

max and min are taken componentwise. Another example of ρD is one that reflects a runaway point

across a boundary of D, if D has one.
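As an illustration (not from the paper), the two absorbing transformations just mentioned, componentwise clipping for a hypercube D = [a, b] and reflection across a violated boundary, can be sketched as follows. The function names are ours.

```python
def clip_to_box(x, a, b):
    """Absorbing transformation rho_D for D = [a, b]:
    componentwise min(max(a, x), b); leaves points of D unchanged."""
    return [min(max(lo, xi), hi) for xi, lo, hi in zip(x, a, b)]

def reflect_into_box(x, a, b):
    """Alternative rho_D: reflect a runaway coordinate across the
    boundary it crossed, then clip in case of a very large overshoot."""
    y = []
    for xi, lo, hi in zip(x, a, b):
        if xi < lo:
            xi = 2 * lo - xi      # reflect across the lower bound
        elif xi > hi:
            xi = 2 * hi - xi      # reflect across the upper bound
        y.append(min(max(lo, xi), hi))
    return y
```

Both functions fix every point of D, as the definition of an absorbing transformation requires.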

3 Main Convergence Results

Our goal is to provide simple conditions that guarantee the convergence of stochastic algorithms

that fit the following framework, which generalizes the well-known Simple (or “Blind”) Random

Search that is given on page 38 of Spall (2003):


Algorithm A: Generalized Adaptive Random Search (GARS) Framework

Inputs:

(1) The objective function f : D → R, where D ⊆ Rd.

(2) A deterministic absorbing transformation ρD : Rd → D, i.e. ρD(x) = x for all x ∈ D.

(3) A collection of intermediate random elements {Λi,j : (Ω,B) → (Ωi,j ,Bi,j) : i ≥ 1 and j =

1, 2, . . . , ki} that are used to determine the trial random vector iterates. These Λi,j ’s could

be random variables, random vectors or other types of random elements defined on the same

probability space (Ω,B, P ).

For convenience, we define the following notation. Define E0 = ∅, and for each n ≥ 1, define the

collection of random elements

En := {Λi,j : i = 1, 2, . . . , n; j = 1, 2, . . . , ki} = En−1 ∪ {Λn,1, Λn,2, . . . , Λn,kn}.

Step 0. Set n = 1.

Step 1. Generate a realization of the random vector Yn : (Ω,B) → (Rd,B(Rd)) as follows:

Step 1.1 For each j = 1, . . . , kn, generate a realization of the intermediate random element

Λn,j : (Ω,B) → (Ωn,j ,Bn,j) according to some probability distribution.

Step 1.2 Set Yn = Θn(En) for some deterministic function Θn. That is, Yn is a deterministic

function of the random elements in En. (Hence, Y1, Y2, . . . , Yn are not necessarily

independent.)

Step 2. Set Xn = ρD(Yn) and evaluate f(Xn).

Step 3. Increment n = n + 1 and go back to Step 1.
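As a concrete illustration, the loop in Steps 0 to 3 can be sketched in Python. This is our own minimal sketch, not code from the paper; the uniform sampler, the quadratic objective, and the domain D = [−1, 1]^2 are illustrative assumptions.

```python
import random

def gars(f, sample_Y, rho_D, n_iters, seed=0):
    """Generic GARS loop: draw the trial point Y_n (Step 1), absorb it
    into D (Step 2), evaluate f, and track the best point found."""
    rng = random.Random(seed)
    history = []                    # past realizations, playing the role of E_{n-1}
    x_best, f_best = None, float("inf")
    for n in range(1, n_iters + 1):
        y = sample_Y(rng, history)  # Step 1: Y_n = Theta_n(E_n)
        x = rho_D(y)                # Step 2: X_n = rho_D(Y_n)
        fx = f(x)
        history.append((y, x, fx))
        if fx < f_best:             # best point after n evaluations
            x_best, f_best = x, fx
    return x_best, f_best

# Illustrative instance (assumed): D = [-1, 1]^2, f(x) = ||x||^2,
# uniform sampling that may leave D, clipping as the absorbing map.
f = lambda x: sum(v * v for v in x)
sample = lambda rng, hist: [rng.uniform(-1.5, 1.5) for _ in range(2)]
rho = lambda y: [min(max(-1.0, v), 1.0) for v in y]
x_star, f_star = gars(f, sample, rho, n_iters=2000)
```

Here the sampler ignores `history`, but an adaptive algorithm would use it, e.g. perturbing the best point recorded so far.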

We refer to a stochastic algorithm that follows the GARS framework above as a GARS algorithm.

In a GARS algorithm, the actual sequence of iterates (i.e. function evaluation points) is given by

{Xn}n≥1. Note that the realization of the random vector Xn is the same as that of Yn if the

realization of Yn belongs to D. Now we can also define the sequence {X∗n}n≥1 where X∗1 = X1, and

for n > 1, X∗n = Xn if f(Xn) < f(X∗n−1) while X∗n = X∗n−1 otherwise. Note that {X∗n}n≥1 is the

sequence of best points encountered by the algorithm. We say that a GARS algorithm converges

to the global minimum of f on D in probability (or almost surely) if {f(X∗n)}n≥1 converges to


f∗ := infx∈D f(x) in probability (or almost surely). In this paper, we are interested in conditions

that guarantee the convergence of a GARS algorithm in a probabilistic sense. In the convergence

theorems below, we shall assume that the GARS algorithm under consideration is allowed to run

indefinitely.

In Step 1 of the GARS framework, each Λi,j could be a random variable, random vector or any

type of random element. If Λi,j is a random variable, then Ωi,j = R and Bi,j = B(R). If Λi,j is an

m-dimensional random vector, then Ωi,j = Rm and Bi,j = B(Rm). Here, we allow for the possibility

that we first generate the realizations of possibly several intermediate random elements before we

determine the realization of the trial random vector iterate Yn. Moreover, the realization of Yn

possibly depends on the realizations of the current and previous intermediate random elements

{Λi,j : i = 1, 2, . . . , n and j = 1, 2, . . . , ki} through a deterministic function Θn. The introduction

of these intermediate random elements provides flexibility to the framework so that it can capture

some practical stochastic global optimization algorithms.

The GARS framework also covers the special case where the Yn’s are generated independently.

In this case, kn = 1 and Yn = Λn,1 for all n ≥ 1, and the d-dimensional random vectors {Λn,1}n≥1

are independently generated. This shows that the GARS framework is indeed an extension of the

well-known Simple Random (“Blind”) Search algorithm, which is given on page 38 of Spall (2003).

However, we are more interested in the general case where the random vectors {Yn}n≥1 are possibly

dependent. For example, Yn may be defined by adding random perturbations to the components of

X∗n−1 (the best solution after n − 1 function evaluations) that are Normally distributed with zero

mean.

There are many convergence results on random search (e.g. Pinter 1996, Solis and Wets 1981,

Spall 2003, Zhigljavsky 1991). However, many of these convergence results (e.g. Solis and Wets

1981) are hard to verify in practice. In addition, except for the trivial case of the uniform distribution,

previous results on random search did not include a verification of the convergence conditions

for other commonly used distributions in practice such as the multivariate Normal and Cauchy

distributions.

The following theorem is an extension of the theorem on page 40 of Spall (2003). It presents

a condition that guarantees the convergence of a GARS algorithm. The main differences between

this theorem and one given by Spall (2003) are: (i) the sampling in one iteration is not necessarily

independent of the previous iterations; (ii) the convergence condition is only required for some

subsequence of the iterations; and (iii) the objective function might have multiple global minima

(not just multiple local minima) over the given domain. Having a convergence theorem that can


handle (i) and (ii) is important because these conditions hold in many practical stochastic algorithms.

Moreover, (iii) is also important since in the most general case, the objective function in a

global optimization problem might have multiple global minima, possibly even an infinite number

of global minima.

Before we present this theorem, we first introduce additional notation. Recall the definition of

En in the GARS framework. For each n ≥ 0, we also define σ(En) to be the σ-field generated by

the random elements in En. We can think of σ(En) as representing all the information that can be

derived from the random elements in En.

Theorem 1. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞.

Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 <

L(ε) < 1 such that

(Global Search Condition 1) P [Xnk ∈ {x ∈ D : f(x) < f∗ + ε} | σ(E(nk)−1)] ≥ L(ε) (1)

for some subsequence {nk}k≥1. Then, f(X∗n) −→ f∗ almost surely (a.s.).

The proof is similar to the one given in Spall (2003) but we include it to illustrate that it is

not necessary to assume independence among the Xn’s and that we do not need to require the

above global search condition on every Xn. In addition, we also allow for the possibility that f has

multiple global minimizers on D.

Proof. Fix ε > 0 and define Sε := {x ∈ D : f(x) < f∗ + ε}. By assumption,

P [Xnk ∈ Sε | σ(E(nk)−1)] ≥ L(ε), for any k ≥ 1.

Now for each k ≥ 1, we have

P [Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] = ∏_{i=1}^{k} P [Xni ∉ Sε | Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xn(i−1) ∉ Sε].

By conditioning on the random elements in E(ni)−1, it is easy to check that for each ε > 0, we have

P [Xni ∈ Sε | Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xn(i−1) ∉ Sε] ≥ L(ε). Thus,

P [Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] = ∏_{i=1}^{k} P [Xni ∉ Sε | Xn1 ∉ Sε, . . . , Xn(i−1) ∉ Sε]

= ∏_{i=1}^{k} (1 − P [Xni ∈ Sε | Xn1 ∉ Sε, . . . , Xn(i−1) ∉ Sε]) ≤ (1 − L(ε))^k.


Observe that if i is the smallest index such that Xi ∈ Sε, it follows that X∗i = Xi and X∗n ∈ Sε

for all n ≥ i. Consequently, if X∗nk ∉ Sε, then Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε. Hence, for each

k ≥ 1, we have

P [f(X∗nk) − f∗ ≥ ε] = P [f(X∗nk) ≥ f∗ + ε] = P [X∗nk ∉ Sε]

≤ P [Xn1 ∉ Sε, Xn2 ∉ Sε, . . . , Xnk ∉ Sε] ≤ (1 − L(ε))^k,

and so,

lim_{k→∞} P [f(X∗nk) − f∗ ≥ ε] = 0,

i.e. f(X∗nk) −→ f∗ in probability. By a standard result in probability theory (e.g. see Resnick

1999, Theorem 6.3.1(b)) it follows that f(X∗nk(i)) −→ f∗ almost surely (a.s.) as i → ∞ for some

subsequence {nk(i)}i≥1.

Next, since f∗ > −∞ and {f(X∗n)}n≥1 is monotonically nonincreasing, it follows that lim_{n→∞} f(X∗n(ω))

exists for each underlying sample point ω. Finally, since the subsequence f(X∗nk(i)) −→ f∗ a.s., it

follows that f(X∗n) −→ f∗ a.s.
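The bound P [f(X∗nk) − f∗ ≥ ε] ≤ (1 − L(ε))^k obtained in the proof decays geometrically in k, so the failure probability drops below any target α once k ≥ log α / log(1 − L(ε)). A quick numeric check of this observation (the values of L(ε) and α are illustrative, not from the paper):

```python
import math

def iterations_needed(L_eps, alpha):
    """Smallest k with (1 - L_eps)**k <= alpha, i.e. the number of
    'global' subsequence iterations that pushes the geometric bound
    (1 - L(eps))^k below the target failure probability alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - L_eps))

k = iterations_needed(L_eps=0.01, alpha=0.01)  # 459 iterations
```

So even a per-iteration success probability as small as L(ε) = 0.01 drives the bound below 1% within a few hundred subsequence iterations.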

Note that the global search condition in the above theorem is expressed in terms of the random

vectors {Xnk}k≥1, which are the images of {Ynk}k≥1 under the map ρD. It would be more convenient

if the global search condition is expressed in terms of the random vectors {Ynk}k≥1 since these are

the ones that are randomly generated by the algorithm. The following corollary gives the result

that we want.

Corollary 1. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞.

Suppose that a GARS algorithm satisfies the following property: For any ε > 0, there exists 0 <

L(ε) < 1 such that

(Global Search Condition 2) P [Ynk ∈ {x ∈ D : f(x) < f∗ + ε} | σ(E(nk)−1)] ≥ L(ε) (2)

for some subsequence {nk}k≥1. Then, f(X∗n) −→ f∗ almost surely (a.s.).

Proof. Using the notation in the proof of Theorem 1, the event [Xnk ∈ Sε] contains [Ynk ∈ Sε] for

all k ≥ 1. Hence, for all k ≥ 1, we have

P [Xnk ∈ Sε | σ(E(nk)−1)] ≥ P [Ynk ∈ Sε | σ(E(nk)−1)] ≥ L(ε),

and so, the result follows from Theorem 1.


Note that the global search conditions in (1) and (2) are only required for some subsequence of

the iterations in order to guarantee almost sure convergence to the global minimum. This result

is important since many practical global optimization algorithms use a combination of global and

local search iterations. Local search iterations are meant to explore small regions of the search space

in order to refine the current best solution and cannot usually satisfy (1) or (2). Moreover, it is also

common to combine both stochastic and deterministic search strategies in a single algorithm. A

deterministic iteration corresponds to a degenerate random vector Yn whose mass is concentrated

at a single point in Rd. By imposing the above global search condition on only a subsequence

of the iterations, we still obtain a theoretical guarantee of almost sure convergence to the global

minimum.
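For instance, an algorithm may perform Gaussian local refinement around the current best point on most iterations and reserve every m-th iteration for a uniform global sample; the global search condition then only needs to hold along the subsequence n = m, 2m, . . . . The following sketch is ours, with an assumed objective and assumed parameter values:

```python
import random

def hybrid_search(f, lo, hi, d, n_iters, m=10, sigma=0.05, seed=1):
    """Local Gaussian steps around the best point, with a uniform
    'global' iteration every m-th step; only that subsequence needs
    to satisfy the global search condition."""
    rng = random.Random(seed)
    clip = lambda v: min(max(lo, v), hi)      # absorbing transformation
    x_best = [rng.uniform(lo, hi) for _ in range(d)]
    f_best = f(x_best)
    for n in range(1, n_iters + 1):
        if n % m == 0:                        # global iteration: uniform on D
            y = [rng.uniform(lo, hi) for _ in range(d)]
        else:                                 # local iteration around X*_{n-1}
            y = [xi + rng.gauss(0.0, sigma) for xi in x_best]
        x = [clip(v) for v in y]
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x, fx
    return x_best, f_best
```

The local iterations refine the current best solution, while the periodic uniform iterations supply the bounded-below probability of hitting Sε required by the theorems.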

The next theorem deals with the case when f has a unique global minimizer x∗ over D. In this

situation, the sequence of best solutions {X∗n}n≥1 converges to x∗ almost surely. This result was

mentioned on p. 40 of Spall (2003) but a proof was not given. We include a proof of this theorem

for the sake of completeness.

Theorem 2. Let f be a real-valued function on D ⊆ Rd and suppose that x∗ is the unique global

minimizer of f over D in the sense that f(x∗) = infx∈D f(x) > −∞ and inf{x∈D : ‖x−x∗‖≥η} f(x) > f(x∗) for all

η > 0. Furthermore, suppose that a GARS algorithm satisfies the property that f(X∗n) −→ f(x∗)

a.s. Then, X∗n −→ x∗ a.s.

Proof. Fix ε > 0 and let f̄ := inf{x∈D : ‖x−x∗‖≥ε} f(x). Since f(X∗n) −→ f(x∗) a.s., it follows that there

exists a set N ⊆ Ω with P (N ) = 0 such that f(X∗n(ω)) −→ f(x∗) for all ω ∈ N^c. By assumption,

f̄ − f(x∗) > 0. Hence, for each ω ∈ N^c, there is an integer N such that for all n ≥ N , we have

f(X∗n(ω)) − f(x∗) = |f(X∗n(ω)) − f(x∗)| < f̄ − f(x∗),

or equivalently, f(X∗n(ω)) < f̄. If ‖X∗n(ω) − x∗‖ ≥ ε, then f(X∗n(ω)) ≥ inf{x∈D : ‖x−x∗‖≥ε} f(x) = f̄, which

is a contradiction. Hence, we must have ‖X∗n(ω) − x∗‖ < ε. This shows that X∗n(ω) −→ x∗ for

each ω ∈ N^c. Thus, X∗n −→ x∗ a.s.

The next theorem presents a convergence condition that is easier to verify than the one provided

by Theorem 1 or Corollary 1 in the case where f is continuous at a global minimizer over its

domain D.

Theorem 3. Let f be a real-valued function on D ⊆ Rd such that f∗ := infx∈D f(x) > −∞. More-

over, let x∗ be a global minimizer of f over D and suppose that f is continuous at x∗. Furthermore,


suppose that a GARS algorithm satisfies the following property: For any z ∈ D and δ > 0, there

exists 0 < ν(z, δ) < 1 such that

(Global Search Condition 3) P [Ynk ∈ B(z, δ) ∩ D | σ(E(nk)−1)] ≥ ν(z, δ), (3)

for some subsequence {nk}k≥1. Here, B(z, δ) is the open ball centered at z with radius δ. Then,

f(X∗n) −→ f∗ a.s. Moreover, if x∗ is the unique global minimizer of f over D in the sense of

Theorem 2, then X∗n −→ x∗ a.s.

Note that in the above theorem, we again allow for the possibility that f has multiple global

minimizers over D. In this case, the hypothesis of the theorem only requires that f be continuous

at one of these global minimizers.

Proof. Fix ε > 0 and k ≥ 1. Since f is continuous at x∗, there exists δ(ε) > 0 such that |f(x) − f(x∗)| < ε whenever ‖x − x∗‖ < δ(ε). Hence, the event [f(Xnk) < f(x∗) + ε] = [|f(Xnk) − f(x∗)| < ε] ⊇ [‖Xnk − x∗‖ < δ(ε)], and so,

P [f(Xnk) < f(x∗) + ε | σ(E(nk)−1)] ≥ P [‖Xnk − x∗‖ < δ(ε) | σ(E(nk)−1)]

= P [Xnk ∈ B(x∗, δ(ε)) ∩ D | σ(E(nk)−1)] ≥ P [Ynk ∈ B(x∗, δ(ε)) ∩ D | σ(E(nk)−1)] ≥ ν(x∗, δ(ε)) =: L(ε).

Clearly, L(ε) > 0 since δ(ε) > 0. By Theorem 1, it follows that f(X∗n) −→ f∗ a.s.

Theorem 3 essentially says that the algorithm must be able to reach any neighborhood of any

point in D with probability bounded away from zero for all random vector iterates

in the subsequence. However, in the above proof, we only use the assumption in Global Search

Condition 3 at a global minimizer x∗. The reason for requiring the condition for all z ∈ D is that

the location of the global minimizer x∗ is not known. The algorithm is only guaranteed to reach

x∗ (in the most general case) if the algorithm is capable of reaching any neighborhood of any point

in D when it is run indefinitely.

Note also that the requirements of Theorem 3 are somewhat mild. It only requires continuity

of f at one of its global minimizers over the domain. Hence, the convergence of the algorithm is

still guaranteed even on problems where f is discontinuous on certain regions of the search space. This

property is important in practice since there are many real-world optimization problems whose

objective functions contain discontinuities.

The next theorem presents a sufficient condition that guarantees the convergence of a GARS

algorithm in terms of the infimum of the conditional densities of the trial random vector iterates.


It also applies to the case where f is continuous at a global minimizer of f over a compact set D.

Again, f might have multiple global minimizers over D and the theorem only requires that f be

continuous at one of these global minimizers.

Theorem 4. Let D be a subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0,

where µ is the Lebesgue measure on Rd. Let f be a real-valued function on D such that f∗ :=

infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f over

D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such that for

each k ≥ 1, Ynk has a conditional density gnk(y | σ(E(nk)−1)) satisfying the following condition:

µ({y ∈ D : h(y) = 0}) = 0, where h(y) := infk≥1 gnk(y | σ(E(nk)−1)). Then f(X∗n) −→ f∗

a.s. Again, if x∗ is the unique global minimizer of f over D in the sense of Theorem 2, then

X∗n −→ x∗ a.s.

Proof. Fix δ > 0 and z ∈ D. For all k ≥ 1, we have

P [Ynk ∈ B(z, δ) ∩ D | σ(E(nk)−1)] = ∫_{B(z,δ)∩D} gnk(y | σ(E(nk)−1)) dy ≥ ∫_{B(z,δ)∩D} h(y) dy =: ν(z, δ).

Since h(y) is a nonnegative function on D, µ({y ∈ D : h(y) = 0}) = 0 and µ(B(z, δ) ∩ D) ≥ ψD(δ) >

0, it follows that ν(z, δ) > 0. By Theorem 3, we have f(X∗n) −→ f∗ a.s.

Theorem 4 basically requires the infimum of the conditional densities of all random candidate

vectors over some subsequence of iterations to be strictly positive except on a set of Lebesgue

measure zero. This result is useful since the requirement is easy to verify for many distributions

used in practice.
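For example, suppose each trial iterate is Normal with mean anywhere in D = [0, 1] and a fixed standard deviation σ (a one-dimensional illustration with assumed values, ours rather than the paper's). Since the Normal density decreases with distance from its mean, its value at any y ∈ [0, 1] is at least its value at distance 1, so the infimum h is strictly positive and the condition of Theorem 4 holds:

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def density_lower_bound(sigma, lo=0.0, hi=1.0):
    """If the mean lies in [lo, hi], then for y in [lo, hi] we have
    |y - mean| <= hi - lo, so the density is at least its value at
    that maximal distance: a positive lower bound on h(y)."""
    return normal_pdf(hi - lo, 0.0, sigma)

h_min = density_lower_bound(sigma=0.2)   # strictly positive infimum
```

The same monotonicity argument works for any elliptical density with a nonincreasing Ψ, which is the route taken more generally in Section 4.2.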

Next, consider the situation where the random vector iterates are generated using different

probability distributions on the different components of an iterate. More precisely, for each j =

1, . . . , d, let v(j) denote the jth component of a point v ∈ Rd, i.e. v = (v(1), . . . , v(d))T . Moreover, for

a given random vector Y : (Ω,B) → (Rd,B(Rd)), let Y (j) be the random variable that corresponds

to the jth component of Y . Also, for any D ⊆ Rd, define D(j) := {y ∈ R : y = v(j) for some v ∈ D}.

Theorem 5. Let D be a subset of Rd such that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0,

where µ is the Lebesgue measure on Rd. Let f be a real-valued function defined on D such that

f∗ := infx∈D f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x∗ of f

over D. Consider a GARS algorithm and suppose that there is a subsequence {nk}k≥1 such that

the following properties hold:


[A] For each k ≥ 1, the random variables Y(1)nk , . . . , Y(d)nk are conditionally independent given the

random elements in E(nk)−1; and

[B] For each k ≥ 1 and for each 1 ≤ j ≤ d, the random variable Y(j)nk has a conditional density

g(j)nk(u | σ(E(nk)−1)) and h(j)(u) := infk≥1 g(j)nk(u | σ(E(nk)−1)) satisfies:

µ({u ∈ D(j) : h(j)(u) = 0}) = 0.

Here, µ is the Lebesgue measure on R.

Then f(X∗n) −→ f∗ a.s. In addition, if x∗ is the unique global minimizer of f on D in the sense of

Theorem 2, then X∗n −→ x∗ a.s.

Proof. By Property [A] above, it follows that for each k ≥ 1, Ynk has a conditional density given by:

gnk(y | σ(E(nk)−1)) = ∏_{j=1}^{d} g(j)nk(y(j) | σ(E(nk)−1)).

As in Theorem 4, define h(y) := infk≥1 gnk(y | σ(E(nk)−1)), and note that

h(y) = infk≥1 ∏_{j=1}^{d} g(j)nk(y(j) | σ(E(nk)−1)) ≥ ∏_{j=1}^{d} (infk≥1 g(j)nk(y(j) | σ(E(nk)−1))) = ∏_{j=1}^{d} h(j)(y(j)).

Now

{y ∈ D : h(y) = 0} ⊆ ⋃_{j=1}^{d} {y ∈ D : h(j)(y(j)) = 0},

and so,

µ({y ∈ D : h(y) = 0}) ≤ µ(⋃_{j=1}^{d} {y ∈ D : h(j)(y(j)) = 0}) ≤ ∑_{j=1}^{d} µ({y ∈ D : h(j)(y(j)) = 0}).

Next, we have

{y ∈ D : h(j)(y(j)) = 0} ⊆ D(1) × · · · × D(j−1) × {v ∈ D(j) : h(j)(v) = 0} × D(j+1) × · · · × D(d),

and so,

µ({y ∈ D : h(j)(y(j)) = 0}) ≤ µ(D(1)) · · · µ(D(j−1)) µ({v ∈ D(j) : h(j)(v) = 0}) µ(D(j+1)) · · · µ(D(d)) = 0.

Hence, µ({y ∈ D : h(y) = 0}) = 0, and by Theorem 4, it follows that f(X∗n) −→ f∗ a.s.


4 Applications Involving Commonly Used Distributions

Throughout this section, we will assume that ψD(δ) := infz∈D µ(B(z, δ) ∩ D) > 0 for all δ > 0.

In particular, we will show that this assumption holds when D is a compact (i.e. closed and

bounded) hypercube in Rd. This special case is important since optimization over a compact

hypercube is common in many real-world applications. Note that if D is a compact hyperrectangle,

the optimization problem can be easily transformed into one where D = [0, 1]d by an appropriate

rescaling of the variables.

We will consider stochastic algorithms that follow the GARS framework in Section 3. Below, we

consider some possibilities for the distributions of the random vectors in a subsequence {Ynk}k≥1

of the trial random vector iterates {Yn}n≥1.

4.1 Uniform Distributions

The simplest case is when we have a subsequence {Ynk}k≥1 of random vectors where each Ynk = Λnk,1 has the uniform distribution over D. Here,

gnk(y | σ(E(nk)−1)) = 1/µ(D), for all k ≥ 1,

and so,

h(y) := infk≥1 gnk(y | σ(E(nk)−1)) = 1/µ(D).

Clearly, h satisfies µ({y ∈ D : h(y) = 0}) = 0, and so, by Theorem 4, f(X∗n) −→ f∗ = infx∈D f(x) a.s.
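The argument above covers pure ("blind") random search. A small sanity-check sketch (our own, with an assumed multimodal objective whose global minimum value is 0 at the origin) exhibits the monotone best-so-far sequence f(X∗n) under uniform sampling:

```python
import math
import random

def blind_random_search(f, lo, hi, d, n_iters, seed=2):
    """Each Y_n is uniform on D and independent of the past, so the
    conditional density is the constant 1/mu(D) and Theorem 4 applies."""
    rng = random.Random(seed)
    best, trace = float("inf"), []
    for _ in range(n_iters):
        x = [rng.uniform(lo, hi) for _ in range(d)]
        best = min(best, f(x))
        trace.append(best)          # realization of f(X*_n)
    return trace

# Assumed test objective: multimodal, global minimum 0 at the origin.
f = lambda x: sum(v * v + 0.5 * math.sin(7.0 * v) ** 2 for v in x)
trace = blind_random_search(f, -1.0, 1.0, d=2, n_iters=3000)
```

The recorded trace is nonincreasing by construction and, consistent with the theory, creeps toward f∗ = 0 as the iteration budget grows.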

4.2 Elliptical Distributions

Let u ∈ Rd, let V be a symmetric nonnegative definite d × d matrix, and let φ : [0,∞) → R be a

function. A d-dimensional random vector Y is said to have an elliptical distribution with parameters

u, V and φ, written Y ∼ EC(u, V, φ), if its characteristic function has the form exp(i t^T u) φ(t^T V t).

Let Y : (Ω,B) → (Rd,B(Rd)) be a random vector that has an elliptical distribution. If Y has a

density, then it has the form (Fang and Zhang 1990)

g(y) = γ [det(V )]^{−1/2} Ψ((y − u)^T V^{−1}(y − u)), y ∈ Rd, (4)

where u ∈ Rd, V is a symmetric and positive definite matrix, Ψ is a nonnegative function over the

positive reals such that ∫_0^∞ y^{(d/2)−1} Ψ(y) dy < ∞, and γ is the normalizing constant given by

γ = Γ(d/2) / (2π^{d/2} ∫_0^∞ y^{d−1} Ψ(y²) dy). (5)


Elliptical distributions include some of the more important distributions used in the design of practical stochastic algorithms. For example, if we choose Ψ(y) = e^{−y/2} in the above definition, then we get a nondegenerate multivariate Normal distribution, which is widely used in evolutionary algorithms such as evolution strategies and evolutionary programming. Moreover, if we choose Ψ(y) = (1 + y)^{−(d+1)/2}, then we get the multivariate Cauchy distribution. An example of a stochastic algorithm that uses the Cauchy distribution is given in Yao et al. (1999). The following theorem shows that practical algorithms that arise from elliptical distributions where Ψ is monotonically nonincreasing converge to the global minimum almost surely.
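The two radial generators Ψ mentioned above can be checked numerically for the monotonicity property required in the theorem below; this small sketch (dimension d = 3 and the grid are arbitrary illustrative choices) does exactly that.

```python
import numpy as np

# Radial generators Psi for the two elliptical families above (d = dimension):
# multivariate Normal:  Psi(y) = exp(-y/2)
# multivariate Cauchy:  Psi(y) = (1 + y)^(-(d+1)/2)
d = 3
psi_normal = lambda y: np.exp(-y / 2.0)
psi_cauchy = lambda y: (1.0 + y) ** (-(d + 1) / 2.0)

# Verify that both generators are monotonically nonincreasing on a grid of y >= 0,
# which is property [P1] in the theorem that follows.
grid = np.linspace(0.0, 50.0, 1001)
normal_ok = bool(np.all(np.diff(psi_normal(grid)) <= 0))
cauchy_ok = bool(np.all(np.diff(psi_cauchy(grid)) <= 0))
```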

Theorem 6. Let D be a bounded subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function defined on D such that f* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that for each k ≥ 1, we have Y_{n_k} = U_k + W_k, where U_k = Φ_k(E_{(n_k)−1}) for some deterministic function Φ_k and W_k is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density given by

    q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),   w ∈ R^d,

where γ is defined in (5). For each k ≥ 1, let λ_k be the smallest eigenvalue of V_k. Furthermore, suppose that the following properties are satisfied:

[P1] Ψ is monotonically nonincreasing; and

[P2] inf_{k≥1} λ_k > 0.

Then f(X*_n) −→ f* a.s. In addition, if x* is the unique global minimizer of f on D in the sense of Theorem 2, then X*_n −→ x* a.s.

Before we prove the theorem, we note that in practice we typically have U_k = Φ_k(E_{(n_k)−1}) = X*_{(n_k)−1}, which is the random vector representing the best solution found after (n_k) − 1 function evaluations. However, in the above theorem, U_k can be any random vector that is a function of the random elements in E_{(n_k)−1}; its realizations must be in R^d but do not have to be in D. In addition, U_k may be any fixed vector in R^d.

Proof. For each k ≥ 1, observe that the conditional distribution of Y_{n_k} given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density

    g_{n_k}(y | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ((y − u_k)^T V_k^{−1} (y − u_k)),   y ∈ R^d,

where u_k = Φ_k(λ_{i,j} : i = 1, 2, . . . , (n_k) − 1; j = 1, 2, . . . , k_i) and the λ_{i,j}'s are the realizations of the random elements in E_{(n_k)−1}. Now for each k ≥ 1 and y ∈ D, we have

    (y − u_k)^T V_k^{−1} (y − u_k) ≤ ‖y − u_k‖_2 ‖V_k^{−1}(y − u_k)‖_2 ≤ ‖y − u_k‖_2² ‖V_k^{−1}‖_2 ≤ diam(D)² / λ_k,

where diam(D) = sup_{x,y∈D} ‖x − y‖. Since D is bounded, it follows that diam(D) < ∞. Moreover, since Ψ is monotonically nonincreasing, we obtain

    Ψ((y − u_k)^T V_k^{−1} (y − u_k)) ≥ Ψ(diam(D)² / λ_k).

Moreover, since the determinant of a matrix is the product of its eigenvalues, it follows that det(V_k) ≤ (λ*_k)^d, where λ*_k is the largest eigenvalue of V_k. Thus, for each y ∈ D,

    g_{n_k}(y | σ(E_{(n_k)−1})) ≥ γ (λ*_k)^{−d/2} Ψ(diam(D)² / λ_k) ≥ γ (sup_{k≥1} λ*_k)^{−d/2} Ψ(diam(D)² / inf_{k≥1} λ_k) > 0.

Hence, we have

    h(y) := inf_{k≥1} g_{n_k}(y | σ(E_{(n_k)−1})) ≥ γ (sup_{k≥1} λ*_k)^{−d/2} Ψ(diam(D)² / inf_{k≥1} λ_k) > 0.

By Theorem 4, it follows that f(X*_n) −→ f* a.s.

It is easy to check that Ψ is monotonically nonincreasing for the multivariate Normal (Ψ(y) = e^{−y/2}) and multivariate Cauchy (Ψ(y) = (1 + y)^{−(d+1)/2}) distributions. Hence, Theorem 6 holds when these distributions are used. Finally, we note that Y_{n_k} in Theorem 6 could yield a realization that is outside of D. This is why it is important to apply some absorbing transformation ρ_D : R^d → D to ensure that the new iterate will be inside D.
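For hypercube domains, one natural choice of absorbing transformation is componentwise clipping; the short sketch below (the specific vectors are only illustrative) checks the defining property ρ_D(x) = x for x ∈ D.

```python
import numpy as np

# Componentwise clipping as an absorbing transformation rho_D for the
# hypercube D = [a, b]: it maps all of R^d into D and fixes every point of D.
def rho_D(y, a, b):
    return np.clip(y, a, b)

a, b = np.zeros(3), np.ones(3)
x = rho_D(np.array([-0.4, 0.5, 1.7]), a, b)        # an outside point is absorbed into D
x_fixed = rho_D(np.array([0.2, 0.2, 0.2]), a, b)   # a point of D is left unchanged
```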

Corollary 2. Let D be a bounded subset of R^d such that ψ_D(δ) := inf_{z∈D} µ(B(z, δ) ∩ D) > 0 for all δ > 0, where µ is the Lebesgue measure on R^d. Let f be a real-valued function defined on D such that f* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x* of f over D. Consider a GARS algorithm and suppose that there is a subsequence {n_k}_{k≥1} such that for each k ≥ 1, we have Y_{n_k} = X*_{(n_k)−1} + W_k, where W_k is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is multivariate Normal with mean vector 0 and covariance matrix V_k. For each k ≥ 1, let λ_k be the smallest eigenvalue of V_k. If inf_{k≥1} λ_k > 0, then f(X*_n) −→ f* a.s. In addition, if x* is the unique global minimizer of f on D in the sense of Theorem 2, then X*_n −→ x* a.s.


Proof. As mentioned earlier, the multivariate Normal distribution is a special case of the elliptical distribution. Moreover, note that

    X*_{(n_k)−1} = Σ_{i=1}^{(n_k)−1} X_i 1_{E_i},

where 1_{E_i} is an indicator function and E_i is the event defined by

    E_i = [f(X_i) ≤ f(X_j) for all j = 1, . . . , (n_k) − 1 and i is the smallest index with this property]
        = [f(ρ_D(Y_i)) ≤ f(ρ_D(Y_j)) for all j = 1, . . . , (n_k) − 1 and i is the smallest index with this property].

For each i = 1, 2, . . . , (n_k) − 1, Y_i is a deterministic function of the random elements in E_i. Hence, X*_{(n_k)−1} is a deterministic function of the random elements in E_{(n_k)−1}. Since inf_{k≥1} λ_k > 0, it follows that for all k ≥ 1, the matrix V_k is invertible and W_k has an elliptical conditional distribution given σ(E_{(n_k)−1}) with density

    q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),   w ∈ R^d,

where Ψ(y) = e^{−y/2} and γ = (2π)^{−d/2}. Clearly, Ψ(y) = e^{−y/2} is monotonically nonincreasing and we have inf_{k≥1} λ_k > 0 by assumption. The conclusion now follows from Theorem 6.

4.3 Hypercube Domains

The following result shows that the condition on ψ_D(δ) in Theorems 4–6 is easily satisfied in the special case where D is a closed and bounded hypercube in R^d. Note that optimization over a closed hypercube is typical in many applications.

Proposition 1. Let D = [a, b] ⊆ R^d be a closed hypercube and let ℓ(D) be the length of one side of D. If 0 < δ ≤ ½ℓ(D), then

    ψ_D(δ) = (δ/2)^d · π^{d/2} / Γ(d/2 + 1).

If δ > ½ℓ(D), then

    ψ_D(δ) ≥ (ℓ(D)/4)^d · π^{d/2} / Γ(d/2 + 1).

Here, Γ is the gamma function defined by Γ(n) := ∫_0^∞ x^{n−1} e^{−x} dx.

Proof. First, consider the case where 0 < δ ≤ ½ℓ(D). Let e_i be the ith unit vector in R^d, i.e. e_i^{(i)} = 1 and e_i^{(j)} = 0 for all j ≠ i. Fix z ∈ D. Since δ ≤ ½ℓ(D), there exist α_1, . . . , α_d ∈ {−1, +1} such that z + δα_i e_i ∈ D for all i = 1, . . . , d. Let D′ be the hypercube determined by the points z, z + δα_1 e_1, . . . , z + δα_d e_d. Note that D′ ⊆ D, and so,

    µ(B(z, δ) ∩ D) ≥ µ(B(z, δ) ∩ D′) = (1/2^d) µ(B(z, δ)) = (δ/2)^d · π^{d/2} / Γ(d/2 + 1).

Hence, ψ_D(δ) ≥ (δ/2)^d · π^{d/2} / Γ(d/2 + 1). Finally, note that when z is a corner point of D, then µ(B(z, δ) ∩ D) = (δ/2)^d · π^{d/2} / Γ(d/2 + 1), so this bound is attained.

Next, consider the case where δ > ½ℓ(D). Again, fix z ∈ D. In this case, there exist α_1, . . . , α_d ∈ {−1, +1} such that z + ½ℓ(D)α_i e_i ∈ D for all i = 1, . . . , d. The proof now proceeds in a similar manner as before.
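The corner case driving the equality in Proposition 1 can be checked numerically. The sketch below (unit square, δ = 0.3; all values illustrative) compares the closed-form value (δ/2)^d π^{d/2}/Γ(d/2 + 1) against a Monte Carlo estimate of µ(B(z, δ) ∩ D) at a corner z, where the intersection is exactly one orthant of the ball.

```python
import numpy as np
from math import gamma, pi

def psi_unit_cube(delta, d):
    """psi_D(delta) = (delta/2)^d * pi^(d/2) / Gamma(d/2 + 1) for the
    unit hypercube D = [0,1]^d when 0 < delta <= 1/2 (Proposition 1)."""
    return (delta / 2.0) ** d * pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)

# Monte Carlo check at the worst case, the corner z = 0 of D = [0,1]^2:
# mu(B(0, delta) ∩ D) is a quarter disk of area pi * delta^2 / 4.
rng = np.random.default_rng(0)
delta, d = 0.3, 2
pts = rng.uniform(0.0, delta, size=(200_000, d))    # uniform on the corner box
frac = np.mean(np.sum(pts ** 2, axis=1) < delta ** 2)
mc_area = frac * delta ** d                         # estimate of mu(B(0, delta) ∩ D)
```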

4.4 Triangular Distributions

A random variable Y is said to have a triangular distribution with lower limit a, upper limit b, and mode c if its density function is given by

    g(y) = 2(y − a) / [(b − a)(c − a)]   if a ≤ y ≤ c,
    g(y) = 2(b − y) / [(b − a)(b − c)]   if c ≤ y ≤ b.

Now suppose D = [a, b] ⊆ R^d is a compact hypercube and that there is a subsequence {Y_{n_k}}_{k≥1} of random vectors where each Y_{n_k} has the following properties:

[A] The random variables Y_{n_k}^{(1)}, . . . , Y_{n_k}^{(d)} (which are the components of Y_{n_k}) are conditionally independent given the random elements in E_{(n_k)−1}; and

[B] For each 1 ≤ j ≤ d, the random variable Y_{n_k}^{(j)} has a triangular distribution with lower limit a^{(j)}, upper limit b^{(j)}, and mode (x*_{(n_k)−1})^{(j)}. (Recall that x*_{(n_k)−1} is the best solution found so far after (n_k) − 1 function evaluations.)

In this case, each Y_{n_k}^{(j)} has conditional density

    g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})) = 2(u − a^{(j)}) / [(b^{(j)} − a^{(j)})((x*_{(n_k)−1})^{(j)} − a^{(j)})]   if a^{(j)} ≤ u ≤ (x*_{(n_k)−1})^{(j)},
    g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})) = 2(b^{(j)} − u) / [(b^{(j)} − a^{(j)})(b^{(j)} − (x*_{(n_k)−1})^{(j)})]   if (x*_{(n_k)−1})^{(j)} ≤ u ≤ b^{(j)}.

Let h^{(j)}(u) := inf_{k≥1} g_{n_k}^{(j)}(u | σ(E_{(n_k)−1})). Now for each 1 ≤ j ≤ d, define

    q^{(j)}(u) = 2(u − a^{(j)}) / (b^{(j)} − a^{(j)})²   if a^{(j)} ≤ u ≤ (a^{(j)} + b^{(j)})/2,
    q^{(j)}(u) = 2(b^{(j)} − u) / (b^{(j)} − a^{(j)})²   if (a^{(j)} + b^{(j)})/2 ≤ u ≤ b^{(j)}.

It is easy to check that h^{(j)}(u) ≥ q^{(j)}(u) for all 1 ≤ j ≤ d and for all a^{(j)} ≤ u ≤ b^{(j)}. Since µ({u ∈ D^{(j)} : q^{(j)}(u) = 0}) = µ({u ∈ [a^{(j)}, b^{(j)}] : q^{(j)}(u) = 0}) = µ({a^{(j)}, b^{(j)}}) = 0, it follows that µ({u ∈ D^{(j)} : h^{(j)}(u) = 0}) = 0. Hence, by Theorem 5, f(X*_n) −→ f* a.s.
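A trial point of the kind described by properties [A] and [B] is easy to generate componentwise. The sketch below (unit square, an arbitrary stand-in for the current best point, 5000 draws) shows one such step; note that triangular sampling on [a, b] never leaves D, so no absorbing transformation is needed in this case.

```python
import numpy as np

def triangular_step(best_x, a, b, rng):
    """One GARS trial point for Section 4.4: the components are independent
    triangular random variables on [a_j, b_j] with mode at the corresponding
    component of the current best point (properties [A] and [B])."""
    return rng.triangular(left=a, mode=best_x, right=b)

rng = np.random.default_rng(1)
a, b = np.zeros(2), np.ones(2)
best = np.array([0.3, 0.7])                  # stand-in for x*_{(n_k)-1}
samples = np.array([triangular_step(best, a, b, rng) for _ in range(5000)])
# Every sample already lies in D = [a, b].
in_D = bool(np.all((samples >= a) & (samples <= b)))
```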


5 Application to Practical Stochastic Search Algorithms

5.1 Localized Random Search

The following general algorithm for finding the global minimum of the function f(x) over D ⊆ Rd is

given on page 44 of Spall (2003) but a convergence proof was not provided in that book. Here, we

consider a special case that involves the multivariate Normal distribution and provide a convergence

proof using the results from the previous sections.

Algorithm B: Localized Random Search

Inputs:

(1) The objective function f : D → R, where D ⊆ R^d.

(2) A deterministic absorbing transformation ρ_D : R^d → D, i.e. ρ_D(x) = x for all x ∈ D.

Step 1. Pick an initial guess Y_1 either randomly or deterministically and set X_1 = ρ_D(Y_1). Set n = 2.

Step 2. Generate a new candidate iterate Y_n = X*_{n−1} + Z_n, where X*_{n−1} is the best solution after n − 1 function evaluations and Z_n is a random vector whose conditional distribution given σ(E_{n−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

    Cov(Z_n) = diag( (σ_n^{(1)})², (σ_n^{(2)})², . . . , (σ_n^{(d)})² ).

Here, E_1 = {Y_1} and E_{n−1} = {Y_1, Z_2, . . . , Z_{n−1}} for n > 2.

Step 3. Set X_n = ρ_D(Y_n) and evaluate f(X_n).

Step 4. Increment n := n + 1 and go back to Step 2.

In the above algorithm, Step 2 is equivalent to

    Y_n^{(j)} = (X*_{n−1})^{(j)} + Z_n^{(j)},   j = 1, . . . , d.

(Recall that Y^{(j)} is the jth component of the random vector Y.) Since Z_n has a Normal distribution and Cov(Z_n) is a diagonal matrix, it follows that the random variables Z_n^{(1)}, . . . , Z_n^{(d)} are conditionally independent given σ(E_{n−1}) and each Z_n^{(j)} has a Normal distribution with mean 0 and standard deviation σ_n^{(j)}. That is, σ_n^T = (σ_n^{(1)}, . . . , σ_n^{(d)}) is the vector of mutations for the individual components of X*_{n−1}.
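A minimal runnable sketch of Algorithm B on D = [0, 1]^d follows. The clipping map, the constant σ, and the test objective are illustrative choices; holding σ fixed ensures inf_n min_j σ_n^{(j)} = σ > 0, so the hypothesis of the corollary below is satisfied.

```python
import numpy as np

def localized_random_search(f, d, sigma=0.2, n_iter=5000, seed=0):
    """Sketch of Algorithm B on D = [0,1]^d.  rho_D is componentwise
    clipping (a valid absorbing map: it fixes every point of D), and
    the mutation standard deviation sigma is held constant."""
    rng = np.random.default_rng(seed)
    rho_D = lambda y: np.clip(y, 0.0, 1.0)
    best_x = rho_D(rng.uniform(0.0, 1.0, size=d))    # X_1 = rho_D(Y_1)
    best_f = f(best_x)
    for _ in range(n_iter):
        y = best_x + rng.normal(0.0, sigma, size=d)  # Step 2: Y_n = X*_{n-1} + Z_n
        x = rho_D(y)                                 # Step 3: X_n = rho_D(Y_n)
        fx = f(x)
        if fx < best_f:                              # update the best point X*_n
            best_x, best_f = x, fx
    return best_x, best_f

x_best, f_best = localized_random_search(lambda x: float(np.sum((x - 0.25) ** 2)), d=2)
```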

Corollary 3. Let D be a bounded subset of R^d such that ψ_D(δ) > 0 for all δ > 0. Suppose the above Localized Random Search algorithm is applied to a real-valued function f on D such that f* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x* of f over D. Furthermore, suppose that there exists a subsequence {n_k}_{k≥1} such that inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0. (Equivalently, lim sup_{n→∞} min_{1≤j≤d} σ_n^{(j)} > 0.) Then f(X*_n) −→ f* a.s.

Proof. First, we check that the Localized Random Search algorithm above follows the GARS framework. We have k_1 = 1 and Λ_{1,1} = Y_1, and for n ≥ 2, we have k_n = 1 and Λ_{n,1} = Z_n. Moreover, as in the proof of Corollary 2, X*_{n−1} is a deterministic function of the random elements in E_{n−1}, and so, Y_n is a deterministic function of the random elements in E_n = E_{n−1} ∪ {Z_n} for all n ≥ 2. Now for the Localized Random Search algorithm, suppose that there exists a subsequence {n_k}_{k≥1} such that inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0. We have

    Y_{n_k} = X*_{(n_k)−1} + Z_{n_k},   for all k ≥ 1,

where Z_{n_k} is a random vector whose conditional distribution given σ(E_{(n_k)−1}) is the Normal distribution with mean vector 0 and covariance matrix

    Cov(Z_{n_k}) = diag( (σ_{n_k}^{(1)})², . . . , (σ_{n_k}^{(d)})² ).

Define W_k = Z_{n_k} and V_k = Cov(W_k) for all k ≥ 1. Note that we have

    Y_{n_k} = X*_{(n_k)−1} + W_k,   for all k ≥ 1.

The eigenvalues of V_k are the variances (σ_{n_k}^{(1)})², . . . , (σ_{n_k}^{(d)})² of the Normal random perturbations for the different components of X*_{(n_k)−1}. Hence, the smallest eigenvalue of V_k is λ_k := min_{1≤j≤d} (σ_{n_k}^{(j)})². Since inf_{k≥1} min_{1≤j≤d} σ_{n_k}^{(j)} > 0, it follows that inf_{k≥1} λ_k > 0. The result now follows from Corollary 2.

5.2 Evolutionary Algorithms

There are three main types of evolutionary algorithms for global optimization (Back 1996): genetic

algorithms, evolution strategies and evolutionary programming algorithms. Below, we apply one of

18

Page 19: Convergence Guarantees for Generalized Adaptive Stochastic

the convergence results from the previous sections on a simple evolutionary programming algorithm

where the selection of parents in each generation is done in a greedy manner. The results in this

paper do not directly apply to a standard genetic algorithm for continuous optimization. However,

it may be possible to extend some of the results in this paper to prove the convergence of more

complex evolutionary programming algorithms and also evolution strategies but this topic is beyond

the scope of this paper and it will be the focus of future work.

Algorithm C: Evolutionary Programming with Greedy Parent Selection

Inputs:

(1) The objective function f : D → R, where D ⊆ R^d.

(2) A nonnegative fitness function F : D → R₊.

(3) The number of offspring in every generation, denoted by µ.

(4) A deterministic absorbing transformation ρ_D : R^d → D, i.e. ρ_D(x) = x for all x ∈ D.

Step 1. (Initialization) Set t = 0 and for each i = 1, 2, . . . , µ, generate Y_i according to some probability distribution whose realizations are on R^d, where Y_i possibly depends on Y_1, . . . , Y_{i−1}, and set X_i = ρ_D(Y_i). Moreover, for each i = 1, 2, . . . , µ, set P_i(0) = X_i. The set P(0) = {P_1(0), P_2(0), . . . , P_µ(0)} = {X_1, X_2, . . . , X_µ} is the initial parent population.

Step 2. (Evaluate the Initial Parent Population) For each i = 1, 2, . . . , µ, evaluate the objective function value f(X_i) and the fitness value F(X_i).

Step 3. (Iterate) While the termination condition is not satisfied do

Step 3.1. (Update Generation Counter) Reset t := t + 1.

Step 3.2. (Generate Offspring by Mutation) For each i = 1, 2, . . . , µ, set Y_{tµ+i} = Mut(P_i(t−1)) and X_{tµ+i} = ρ_D(Y_{tµ+i}). The set M(t) := {X_{tµ+1}, X_{tµ+2}, . . . , X_{tµ+µ}} constitutes the offspring for the current generation.

Step 3.3. (Evaluate the Offspring) For each i = 1, 2, . . . , µ, evaluate f(X_{tµ+i}) and the fitness value F(X_{tµ+i}).

Step 3.4. (Select New Parent Population) Select the parent population for the next generation: P(t) = Sel(P(t−1) ∪ M(t)).

End.


In Step 3.2, the mutation operator is defined as follows: For each t ≥ 1 and i = 1, 2, . . . , µ,

    Y_{tµ+i} = Mut(P_i(t−1)) = P_i(t−1) + Z_{tµ+i},

where Z_{tµ+i} is a random vector whose conditional distribution given σ(E_{tµ+i−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

    Cov(Z_{tµ+i}) = diag( (σ_{tµ+i}^{(1)})², (σ_{tµ+i}^{(2)})², . . . , (σ_{tµ+i}^{(d)})² ).

Here, E_{tµ+i−1} = {Y_1, . . . , Y_µ, Z_{µ+1}, Z_{µ+2}, . . . , Z_{tµ+i−1}}. Moreover, σ_{tµ+i}^T = (σ_{tµ+i}^{(1)}, . . . , σ_{tµ+i}^{(d)}) is the vector of mutations for the individual components of the parent vector P_i(t−1). That is, for each j = 1, 2, . . . , d, the conditional distribution of the random variable Z_{tµ+i}^{(j)} given σ(E_{tµ+i−1}) is a Normal distribution with mean 0 and standard deviation σ_{tµ+i}^{(j)}. Moreover, we set the algorithm parameter σ_{tµ+i}^{(j)} = √F(P_i(t−1)), as noted in Back (1993).

In Step 3.4, the selection of the parent solutions for the next generation is usually accomplished by probabilistic q-tournament selection as described in Back (1993). As q increases, this q-tournament selection procedure becomes more and more greedy. For simplicity, we assume that the selection of parent solutions proceeds in a completely greedy manner. That is, P(t) is simply the collection of µ solutions from P(t−1) ∪ M(t) with the best fitness values. The more general case of q-tournament selection will be addressed in future work.
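A runnable sketch of Algorithm C with completely greedy selection follows. The fitness F(x) = f(x) + τ, the clipping map, and all parameter values are illustrative assumptions (they presuppose f ≥ 0 on D); what matters for the corollary below is that F ≥ τ > 0 and that the mutation standard deviation is √F(P_i(t−1)).

```python
import numpy as np

def ep_greedy(f, d, mu=10, n_gen=200, tau=1e-3, seed=0):
    """Sketch of Algorithm C on D = [0,1]^d with greedy parent selection.
    Assumes f >= 0 and uses the fitness F(x) = f(x) + tau, so F >= tau > 0;
    the mutation standard deviation is sqrt(F(P_i)) as in Step 3.2, and
    rho_D is componentwise clipping."""
    rng = np.random.default_rng(seed)
    rho_D = lambda y: np.clip(y, 0.0, 1.0)
    parents = rho_D(rng.uniform(0.0, 1.0, size=(mu, d)))   # initial population P(0)
    f_par = np.array([f(p) for p in parents])
    for _ in range(n_gen):
        sigma = np.sqrt(f_par + tau)[:, None]              # sigma^(j) = sqrt(F(P_i))
        offspring = rho_D(parents + sigma * rng.normal(size=(mu, d)))  # Mut + rho_D
        f_off = np.array([f(p) for p in offspring])
        pool = np.vstack([parents, offspring])             # P(t-1) U M(t)
        f_pool = np.concatenate([f_par, f_off])
        keep = np.argsort(f_pool)[:mu]                     # keep the mu best (greedy Sel)
        parents, f_par = pool[keep], f_pool[keep]
    return parents[0], f_par[0]

x_best, f_best = ep_greedy(lambda x: float(np.sum((x - 0.5) ** 2)), d=2)
```

Because the mutation scale shrinks with the fitness but never below √τ, the algorithm both refines good solutions and retains the nondegenerate perturbations needed for the guarantee below.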

Corollary 4. Let D be a bounded subset of R^d such that ψ_D(δ) > 0 for all δ > 0. Suppose the above EP algorithm is applied to a real-valued function f on D such that f* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f is continuous at a global minimizer x* of f over D. Furthermore, suppose that the fitness function F : D → R₊ satisfies F(x) ≥ τ for all x ∈ D, for some τ > 0. Then f(X*_n) −→ f* a.s.

Proof. As before, we first check that the above EP algorithm follows the GARS framework. For n = 1, 2, . . . , µ, we have k_n = 1 and Λ_{n,1} = Y_n. Moreover, for all n ≥ µ + 1, we have k_n = 1 and Λ_{n,1} = Z_n. Since the selection of the new parent population in Step 3.4 is done in a greedy manner, it follows that for each integer t ≥ 1 and i = 1, 2, . . . , µ, P_i(t−1) is a deterministic function of Y_1, Y_2, . . . , Y_{tµ}. This also implies that for each integer t ≥ 1 and i = 1, 2, . . . , µ, P_i(t−1) is a deterministic function of Y_1, Y_2, . . . , Y_{tµ+i−1}. Hence, for each integer t ≥ 1 and i = 1, 2, . . . , µ, we have

    Y_{tµ+i} = Φ_{(t−1)µ+i}(E_{tµ+i−1}) + Z_{tµ+i},

for some deterministic function Φ_{(t−1)µ+i}. Note that this implies that

    Y_{µ+k} = Φ_k(E_{µ+k−1}) + Z_{µ+k},   for all k ≥ 1,

where Z_{µ+k} is a random vector whose conditional distribution given σ(E_{µ+k−1}) is a Normal distribution with mean vector 0 and diagonal covariance matrix

    Cov(Z_{µ+k}) = diag( (σ_{µ+k}^{(1)})², (σ_{µ+k}^{(2)})², . . . , (σ_{µ+k}^{(d)})² ).

For each integer k ≥ 1 and j = 1, 2, . . . , d, we have (σ_{µ+k}^{(j)})² = F(P_i(t−1)) ≥ τ, where i and t are the unique integers such that 1 ≤ i ≤ µ, t ≥ 1 and k = (t−1)µ + i. Define the subsequence {n_k}_{k≥1} by n_k = µ + k for all k ≥ 1. Then we have

    Y_{n_k} = Φ_k(E_{(n_k)−1}) + W_k,   for all k ≥ 1,

where W_k = Z_{n_k}. Let λ_k be the smallest eigenvalue of V_k := Cov(W_k). Since the eigenvalues of Cov(W_k) are (σ_{µ+k}^{(1)})², . . . , (σ_{µ+k}^{(d)})², we have

    λ_k = min_{1≤j≤d} (σ_{µ+k}^{(j)})² ≥ τ,

and so, inf_{k≥1} λ_k ≥ τ > 0. Moreover, the conditional distribution of W_k given σ(E_{(n_k)−1}) is an elliptical distribution with conditional density given by

    q_k(w | σ(E_{(n_k)−1})) = γ [det(V_k)]^{−1/2} Ψ(w^T V_k^{−1} w),   w ∈ R^d,

where Ψ(y) = e^{−y/2} and γ = (2π)^{−d/2}. Again, Ψ(y) = e^{−y/2} is monotonically nonincreasing, and so, the conclusion follows from Theorem 6.

6 Related Convergence Results

The purpose of this section is to explore some connections between the results in Section 3 and the results in a paper by Stephens and Baritompa (1998). Recall the notation in Section 3. Let D be a subset of R^d and let C(D) be the set of all continuous functions f : D → R. Moreover, let X*_f be the set of all global minimizers of f : D → R. If D is compact, it is well known that X*_f ≠ ∅ for all f ∈ C(D). Now suppose that a stochastic global minimization algorithm applied to f : D → R generates the random vector iterates {X_n}_{n≥1}. The range of the sequence of random vectors {X_n}_{n≥1} is denoted by range({X_n}_{n≥1}), the set of subsequential limit points of {X_n}_{n≥1} is denoted by slp({X_n}_{n≥1}), and the closure of {X_n}_{n≥1} is denoted by cl({X_n}_{n≥1}). Note that range({X_n}_{n≥1}), slp({X_n}_{n≥1}) and cl({X_n}_{n≥1}) are random sets of points in D and cl({X_n}_{n≥1}) = range({X_n}_{n≥1}) ∪ slp({X_n}_{n≥1}). Following the definition in Stephens and Baritompa (1998), the algorithm is said to see the global minimum if cl({X_n}_{n≥1}) ∩ X*_f ≠ ∅. In addition, this algorithm is said to see the point x ∈ D if x ∈ cl({X_n}_{n≥1}).

Stephens and Baritompa (1998) introduced the notion of a deterministic (or stochastic) sequential sampling algorithm: an algorithm in which the next iterate (or sample point) depends only on what they call local information obtained from the previous iterates and, in the stochastic case, on an instance of a random element. Examples of local information include the best sample point, the maximum slope between sample points, or an interpolating polynomial through the sample points. On the other hand, examples of nonlocal information include the Lipschitz constant, the level set associated with a function value, the number of local minima, or the global minimum itself.

The GARS framework (Algorithm A, Section 3) is more general than the idea of a stochastic sequential sampling algorithm since it can incorporate nonlocal information through the parameters of the probability distributions in Step 1.1 or through the deterministic function Θ_n in Step 1.2. Moreover, in Step 1.2 of the GARS framework, the current iterate possibly depends not only on the previous iterates and their function values (as in stochastic sequential sampling algorithms) but also on the intermediate random elements that were used to generate the previous iterates. Hence, the results in this paper apply to a wider class of algorithms. The following theorem is a special case of Theorem 3.3 in Stephens and Baritompa (1998) when restricted to continuous functions. Note that it also applies to GARS algorithms that satisfy the definition of a stochastic sequential sampling algorithm.

Theorem (Stephens and Baritompa 1998). For any probability p and for any stochastic sequential sampling algorithm,

    P(the algorithm sees the global minimum of f) ≥ p,   ∀f ∈ C(D),

if and only if

    P(x ∈ cl({X_n}_{n≥1})) ≥ p,   ∀x ∈ D, ∀f ∈ C(D).

The above theorem provides a necessary and sufficient condition for an algorithm to see the global minimum of f with a specified probability. In particular, the special case where p = 1 yields a result similar to those we proved in Section 3. The above result provides a more general convergence criterion in that p could be less than 1. However, given a practical stochastic algorithm, it is typically not straightforward to check whether P(x ∈ cl({X_n}_{n≥1})) ≥ p for all x ∈ D and all f ∈ C(D) for a given value of p, especially when p is strictly between 0 and 1. The next theorem says that for stochastic algorithms applied to continuous functions defined over compact domains, converging to the global minimum almost surely is equivalent to seeing the global minimum almost surely. Recall that the global search conditions in Section 3 imply convergence to the global minimum almost surely. Hence, these same conditions can be used to guarantee that the algorithm sees the global minimum almost surely.

Theorem 7. Let D be a compact subset of R^d and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of random vector iterates {X_n}_{n≥1} whose sequence of best points found so far is given by {X*_n}_{n≥1}. Then f(X*_n) −→ f* := inf_{x∈D} f(x) > −∞ a.s. if and only if cl({X_n}_{n≥1}) ∩ X*_f ≠ ∅ a.s. (i.e., the algorithm sees the global minimum of f a.s.).

Proof. First, assume that f(X*_n) −→ f* a.s. Then there exists a set N such that P(N) = 0 and f(X*_n(ω)) −→ f* for all ω ∈ N^c. Fix ω ∈ N^c. We wish to show that cl({X_n(ω)}_{n≥1}) ∩ X*_f ≠ ∅. Suppose this is not the case. Let f̄ := inf_{x ∈ cl({X_n(ω)}_{n≥1})} f(x). Since cl({X_n(ω)}_{n≥1}) is a closed subset of the compact set D, it follows that cl({X_n(ω)}_{n≥1}) is also compact. Moreover, since f is a continuous function, it follows that f̄ = f(x̄) for some x̄ ∈ cl({X_n(ω)}_{n≥1}). By assumption, cl({X_n(ω)}_{n≥1}) ∩ X*_f = ∅, and so, f̄ = f(x̄) > f*. Next,

    range({X*_n(ω)}_{n≥1}) ⊆ range({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}).

Hence, f(X*_n(ω)) ≥ f̄ > f* for all n ≥ 1, and so, f(X*_n(ω)) cannot converge to f*, which is a contradiction. Hence, cl({X_n(ω)}_{n≥1}) ∩ X*_f ≠ ∅ for this ω ∈ N^c. Thus, the algorithm sees the global minimum of f a.s.

To prove the converse, assume that cl({X_n}_{n≥1}) ∩ X*_f ≠ ∅ a.s. Then there exists a set N such that P(N) = 0 and cl({X_n(ω)}_{n≥1}) ∩ X*_f ≠ ∅ for all ω ∈ N^c. Fix ω ∈ N^c and let x* ∈ cl({X_n(ω)}_{n≥1}) ∩ X*_f. Note that x* ∈ range({X_n(ω)}_{n≥1}) or x* ∈ slp({X_n(ω)}_{n≥1}), or both.

Suppose x* ∈ range({X_n(ω)}_{n≥1}). Then x* = X_n̄(ω) for some integer n̄. Since x* ∈ X*_f, it follows that X*_n(ω) = x* for all n ≥ n̄. Hence, f(X*_n(ω)) = f(x*) = f* for all n ≥ n̄, and so, f(X*_n(ω)) −→ f*.

On the other hand, suppose x* ∈ slp({X_n(ω)}_{n≥1}). Then there exists a subsequence {X_{n_k}(ω)}_{k≥1} such that X_{n_k}(ω) −→ x* as k → ∞. Since f is continuous, f(X_{n_k}(ω)) −→ f(x*) = f* as k → ∞. Moreover, 0 ≤ f(X*_{n_k}(ω)) − f* ≤ f(X_{n_k}(ω)) − f* −→ 0 as k → ∞, and so, f(X*_{n_k}(ω)) −→ f* as k → ∞. Next, since f* > −∞ and {f(X*_n(ω))}_{n≥1} is monotonically nonincreasing, it follows that lim_{n→∞} f(X*_n(ω)) exists, and so, lim_{n→∞} f(X*_n(ω)) = f*.

In either case, f(X*_n(ω)) −→ f* for the given ω ∈ N^c. Thus, f(X*_n) −→ f* a.s.

Torn and Zilinskas (1989) proved that a deterministic global minimization algorithm converges

to the global minimum of any continuous function on a compact set D ⊆ Rd if and only if the

sequence of iterates of the algorithm is dense in D. Stephens and Baritompa (1998) extended this

result to any deterministic sequential sampling algorithm on what they call a sufficiently rich class

of functions (including continuous functions) and they also proved a stochastic version of their

result. Theorem 8 below is another stochastic version of the result by Torn and Zilinskas (1989)

that is different from the theorem proved by Stephens and Baritompa (1998).

Consider a sequence of random vectors {X_n}_{n≥1} whose realizations are in D ⊆ R^d. We say that {X_n}_{n≥1} is probabilistically dense in D with guarantee p if

    P(range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅) ≥ p,   ∀x ∈ D, ∀δ > 0.

As before, B(x, δ) is the open ball in R^d centered at x with radius δ. If the above condition holds with p = 1, we simply say that {X_n}_{n≥1} is probabilistically dense in D.

The next theorem shows that the sequence of random vector iterates of a stochastic global

minimization algorithm is probabilistically dense in D with guarantee p if and only if the algorithm

sees any point x ∈ D (including the global minimum of any function f on D) with probability at

least p. To prove the next theorem, we need the following lemma.

Lemma 1. Let {X_n}_{n≥1} be a sequence of random vectors whose realizations are in D ⊆ R^d. For any x ∈ D, the following events are equal:

    [x ∈ cl({X_n}_{n≥1})] = [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0].

Proof. Since cl({X_n}_{n≥1}) = range({X_n}_{n≥1}) ∪ slp({X_n}_{n≥1}), we have [x ∈ cl({X_n}_{n≥1})] ⊆ [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Next, suppose ω ∈ [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0]. Then range({X_n(ω)}_{n≥1}) ∩ B(x, δ) ≠ ∅ for all δ > 0. Note that either x ∈ range({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}) or x ∉ range({X_n(ω)}_{n≥1}). If x ∉ range({X_n(ω)}_{n≥1}), then there are infinitely many elements of range({X_n(ω)}_{n≥1}) contained in any open ball around x. This implies that there exists an integer n_1 such that X_{n_1}(ω) ∈ B(x, 1). Moreover, there exists an integer n_2 > n_1 such that X_{n_2}(ω) ∈ B(x, 1/2). In general, for any integer k > 1, there exists an integer n_k > n_{k−1} such that X_{n_k}(ω) ∈ B(x, 1/k). Clearly, X_{n_k}(ω) −→ x as k → ∞, and so, x ∈ slp({X_n(ω)}_{n≥1}) ⊆ cl({X_n(ω)}_{n≥1}). In either case, we have ω ∈ [x ∈ cl({X_n}_{n≥1})], and so, [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0] ⊆ [x ∈ cl({X_n}_{n≥1})].

Theorem 8. Let D ⊆ R^d and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {X_n}_{n≥1} whose realizations are in D. Then {X_n}_{n≥1} is probabilistically dense in D with guarantee p if and only if P(x ∈ cl({X_n}_{n≥1})) ≥ p for all x ∈ D (i.e., the algorithm sees any point of D with probability at least p).

Proof. First, suppose that {X_n}_{n≥1} is probabilistically dense in D with guarantee p. Fix x ∈ D. From the previous lemma, we have

    [x ∈ cl({X_n}_{n≥1})] = [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0].

Moreover, it is easy to check that

    [range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅ ∀δ > 0] = [range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1].

Hence,

    P(x ∈ cl({X_n}_{n≥1})) = P(range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅ for all integers k ≥ 1).

For every integer k ≥ 1, define the event

    S_k = [range({X_n}_{n≥1}) ∩ B(x, 1/k) ≠ ∅].

Since {X_n}_{n≥1} is probabilistically dense in D with guarantee p, it follows that P(S_k) ≥ p for all k ≥ 1. Moreover, since S_k ⊇ S_{k+1} for all k ≥ 1, it follows that {P(S_k)}_{k≥1} is a nonincreasing sequence that is bounded below, and so, lim_{k→∞} P(S_k) exists. Hence,

    P(x ∈ cl({X_n}_{n≥1})) = P(∩_{i=1}^∞ S_i) = P(lim_{k→∞} ∩_{i=1}^k S_i) = P(lim_{k→∞} S_k) = lim_{k→∞} P(S_k) ≥ p.

To prove the converse, suppose that P(x ∈ cl({X_n}_{n≥1})) ≥ p for all x ∈ D. Fix x ∈ D and δ > 0. Note that

    P(range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅) ≥ P(range({X_n}_{n≥1}) ∩ B(x, δ′) ≠ ∅ ∀δ′ > 0) = P(x ∈ cl({X_n}_{n≥1})) ≥ p,

where the equality holds because of the previous lemma. Hence, {X_n}_{n≥1} is probabilistically dense in D with guarantee p.

Corollary 5. Let D ⊆ R^d and suppose that a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vector iterates {X_n}_{n≥1} whose realizations are in D. Moreover, suppose that X*_f ≠ ∅ and {X_n}_{n≥1} is probabilistically dense in D with guarantee p. Then P(the algorithm sees the global minimum of f) ≥ p.

Proof. Let x* ∈ X*_f. Then

    P(the algorithm sees the global minimum of f) = P(cl({X_n}_{n≥1}) ∩ X*_f ≠ ∅) ≥ P(x* ∈ cl({X_n}_{n≥1})) = P(the algorithm sees the point x*) ≥ p,

where the last inequality follows from Theorem 8.

The next theorem (Theorem 9) shows that if a GARS algorithm satisfies the global search

condition in Theorem 3 of Section 3, then the resulting sequence of iterates will be probabilistically

dense in D, and so, by Corollary 5, the algorithm sees the global minimum of f with probability 1.

Note that this end result can also be obtained from Theorems 3 and 7. Hence, Theorem 9 and

Corollary 5 provide an alternative proof that any GARS algorithm satisfying the global search

conditions in Theorem 3 also sees the global minimum of f with probability 1.

Theorem 9. Let f be a real-valued function on D ⊆ R^d. Suppose that a GARS algorithm applied to f satisfies Global Search Condition 3 from Theorem 3: for any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that

    P[Y_{n_k} ∈ B(z, δ) ∩ D | σ(E_{(n_k)−1})] ≥ ν(z, δ),

for some subsequence {n_k}_{k≥1}. Then {X_n}_{n≥1} is probabilistically dense in D, and consequently, P(x ∈ cl({X_n}_{n≥1})) = 1 for all x ∈ D.

Proof. Fix x ∈ D and δ > 0. We have

    P(range({X_n}_{n≥1}) ∩ B(x, δ) = ∅) = P(X_n ∉ B(x, δ) ∀n ≥ 1)
    = P(X_n ∉ B(x, δ) ∩ D ∀n ≥ 1)   (since {X_n}_{n≥1} ⊆ D for a GARS algorithm)
    ≤ P(Y_n ∉ B(x, δ) ∩ D ∀n ≥ 1) ≤ P(Y_{n_k} ∉ B(x, δ) ∩ D ∀k ≥ 1) = P(∩_{i=1}^∞ [Y_{n_i} ∉ B(x, δ) ∩ D])
    = P(lim_{k→∞} ∩_{i=1}^k [Y_{n_i} ∉ B(x, δ) ∩ D]) = lim_{k→∞} P(∩_{i=1}^k [Y_{n_i} ∉ B(x, δ) ∩ D])
    = lim_{k→∞} ∏_{i=1}^k P(Y_{n_i} ∉ B(x, δ) ∩ D | Y_{n_1} ∉ B(x, δ) ∩ D, . . . , Y_{n_{i−1}} ∉ B(x, δ) ∩ D)
    = lim_{k→∞} ∏_{i=1}^k (1 − P(Y_{n_i} ∈ B(x, δ) ∩ D | Y_{n_1} ∉ B(x, δ) ∩ D, . . . , Y_{n_{i−1}} ∉ B(x, δ) ∩ D))
    ≤ lim_{k→∞} ∏_{i=1}^k (1 − ν(x, δ)) = lim_{k→∞} (1 − ν(x, δ))^k = 0   (since 0 < ν(x, δ) < 1).

Hence, P(range({X_n}_{n≥1}) ∩ B(x, δ) = ∅) = 0, and so, P(range({X_n}_{n≥1}) ∩ B(x, δ) ≠ ∅) = 1 for all x ∈ D and for all δ > 0; that is, {X_n}_{n≥1} is probabilistically dense in D. Moreover, by Theorem 8, P(x ∈ cl({X_n}_{n≥1})) = 1 for all x ∈ D.
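The geometric decay (1 − ν(x, δ))^k in the proof above can be illustrated numerically. In the sketch below (all numbers illustrative), the trial points are i.i.d. uniform on D = [0, 1]², so one may take ν(x, δ) = µ(B(x, δ) ∩ D), and for a ball lying entirely inside D the hitting probability within n steps is exactly 1 − (1 − ν)^n.

```python
import numpy as np

# Empirically estimate P(range of the first n iterates hits B(x, delta))
# for i.i.d. uniform points on D = [0,1]^2, and compare with 1 - (1 - nu)^n.
rng = np.random.default_rng(0)
x, delta, n, trials = np.array([0.5, 0.5]), 0.1, 300, 2000
nu = np.pi * delta ** 2                     # mu(B(x, delta)): ball lies inside D
hits = 0
for _ in range(trials):
    pts = rng.uniform(0.0, 1.0, size=(n, 2))
    if np.any(np.sum((pts - x) ** 2, axis=1) < delta ** 2):
        hits += 1
empirical = hits / trials
theoretical = 1.0 - (1.0 - nu) ** n         # -> 1 as n grows, as in Theorem 9
```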

Following the definition in Stephens and Baritompa (1998), we say that an algorithm localizes the global minimizers of a function f : D → R if its sequence of iterates {X_n}_{n≥1} satisfies ∅ ≠ slp({X_n}_{n≥1}) ⊆ X*_f.

Theorem 10. Let D be a compact subset of R^d and suppose a stochastic global minimization algorithm applied to the function f : D → R generates the sequence of random vectors {X_n}_{n≥1} whose realizations are in the compact set D. If {X_n}_{n≥1} is probabilistically dense in D and each of the random vectors in {X_n}_{n≥1} has a continuous probability distribution, then P(x ∈ slp({X_n}_{n≥1})) = 1 for all x ∈ D and P(algorithm does not localize the global minimizers of f) = 1.

Proof. From Theorem 8, we have P(x ∈ cl({X_n}_{n≥1})) = 1 for all x ∈ D. Fix x ∈ D. Note that

P(x ∈ cl({X_n}_{n≥1})) = P(x ∈ slp({X_n}_{n≥1})) + P(x ∉ slp({X_n}_{n≥1}) and x = X_n for some integer n).

Now

P(x ∉ slp({X_n}_{n≥1}) and x = X_n for some integer n) ≤ P(x = X_n for some integer n)

= P(⋃_{n=1}^{∞} [x = X_n]) ≤ Σ_{n=1}^{∞} P(x = X_n) = 0   (since each X_n has a continuous distribution).

Hence, P(x ∈ slp({X_n}_{n≥1})) = P(x ∈ cl({X_n}_{n≥1})) = 1. In addition, if x ∈ D \ X*_f (such a point exists whenever X*_f ≠ D), then

P(algorithm does not localize the global minimizers of f)

= P(slp({X_n}_{n≥1}) ⊄ X*_f) + P(slp({X_n}_{n≥1}) = ∅) = P(slp({X_n}_{n≥1}) ⊄ X*_f)   (since D is compact)

≥ P(x ∈ slp({X_n}_{n≥1})) = 1.


Consider any GARS algorithm that satisfies the global search condition from Theorem 3. By Theorem 9, the sequence of iterates {X_n}_{n≥1} is probabilistically dense in D. Hence, when each X_n has a continuous probability distribution, Theorem 10 implies that, with probability 1, every point of D is a subsequential limit of the sequence of iterates, and so the algorithm fails to localize the global minimizers of f with probability 1. This suggests that the worst-case convergence rate of the algorithm is slow (as confirmed in the next section), since the algorithm essentially searches the whole domain to find the global minimum. However, if we focus on the sequence of best points {X*_n}_{n≥1}, the next theorem says that any subsequential limit of {X*_n}_{n≥1} is a global minimizer almost surely.

Theorem 11. Let D be a compact subset of R^d and let f : D → R be a continuous function. Suppose that a stochastic global minimization algorithm applied to f generates a sequence of iterates {X_n}_{n≥1} whose sequence of best points found so far {X*_n}_{n≥1} satisfies f(X*_n) −→ f* := inf_{x∈D} f(x) > −∞ a.s. Then slp({X*_n}_{n≥1}) ⊆ X*_f a.s.

Proof. Since f(X*_n) −→ f* a.s., there exists a set N such that P(N) = 0 and f(X*_n(ω)) −→ f* for all ω ∈ N^c. Fix ω ∈ N^c and let x ∈ D \ X*_f. We wish to show that x ∉ slp({X*_n(ω)}_{n≥1}).

From the proof of Theorem 7, there exists x* ∈ X*_f such that x* ∈ cl({X_n(ω)}_{n≥1}). Since x ∉ X*_f, we have f(x) − f(x*) > 0. Now, since f is continuous over D, it follows that there is a δ > 0 such that whenever y ∈ B(x*, δ) ∩ D, we have

|f(y) − f(x*)| ≤ (1/2)(f(x) − f(x*)), and hence f(y) ≤ f(x*) + (1/2)(f(x) − f(x*)) = (1/2)(f(x) + f(x*)).

Furthermore, since x* ∈ cl({X_n(ω)}_{n≥1}), there is an integer n̄ such that X_{n̄}(ω) ∈ B(x*, δ). Note that for any integer n ≥ n̄, we have

f(X*_n(ω)) ≤ f(X*_{n̄}(ω)) ≤ f(X_{n̄}(ω)) ≤ (1/2)(f(x) + f(x*)) < f(x),

and so lim sup_{n→∞} f(X*_n(ω)) < f(x). Suppose x ∈ slp({X*_n(ω)}_{n≥1}). Then there is a subsequence {X*_{n_k}(ω)}_{k≥1} that converges to x. Since f is continuous, it follows that f(X*_{n_k}(ω)) → f(x) as k → ∞, and so lim sup_{n→∞} f(X*_n(ω)) ≥ f(x), which yields a contradiction. Hence, x ∉ slp({X*_n(ω)}_{n≥1}). This implies that

D \ X*_f ⊆ D \ slp({X*_n(ω)}_{n≥1}),

or equivalently, slp({X*_n(ω)}_{n≥1}) ⊆ X*_f. Thus, slp({X*_n}_{n≥1}) ⊆ X*_f a.s.
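To illustrate Theorem 11, the sketch below runs blind random search, whose best-so-far values converge almost surely, on the illustrative objective f(x) = (x − 0.3)^2 over D = [0, 1]; the best point found approaches the unique global minimizer x* = 0.3, consistent with slp({X*_n}_{n≥1}) ⊆ X*_f.

```python
import random

def best_so_far(f, n_iters, seed=1):
    """Best point X*_n after n_iters iterations of blind random search
    (i.i.d. uniform draws) on D = [0, 1]."""
    rng = random.Random(seed)
    x_best = rng.random()
    for _ in range(n_iters - 1):
        y = rng.random()
        if f(y) < f(x_best):
            x_best = y
    return x_best

f = lambda x: (x - 0.3) ** 2   # illustrative objective, unique minimizer 0.3
x_star_n = best_so_far(f, 5000)
# Any subsequential limit of {X*_n} lies in the set of global minimizers,
# so for large n the best point is close to x* = 0.3.
```
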


7 Convergence Rates

In the GARS framework, we allowed for the possibility that the trial random vector iterates {Y_n}_{n≥1} are dependent. Moreover, in the convergence results for the GARS framework, the global search conditions that are needed for almost sure convergence to the global minimum are only required to hold along some subsequence of the iterations. For these reasons, a general convergence-rate analysis of the GARS framework is difficult. However, in the special case of simple random search, where the Y_n's are all independent and identically distributed random vectors, convergence-rate analyses are available (e.g., Spall (2003)). Below, we provide a simple result for the case where the Y_n's are not necessarily independent and not necessarily identically distributed, under the assumption that Global Search Condition 3 in Theorem 3 is satisfied at every iteration.

Theorem 12. Let f be a real-valued function that has a unique global minimizer x* on a set D ⊆ R^d in the sense of Theorem 2, and let f be continuous at x*. Suppose that a GARS algorithm satisfies Global Search Condition 3 from Theorem 3 at every iteration: for any z ∈ D and δ > 0, there exists 0 < ν(z, δ) < 1 such that

P[Y_n ∈ B(z, δ) ∩ D | σ(E_{n−1})] ≥ ν(z, δ) for all n ≥ 1.   (6)

Then P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x*, δ))^n.

Proof. By Theorem 3, X*_n −→ x* a.s. Moreover, the probability that the algorithm lands in the region B(x*, δ) ∩ D within n function evaluations is given by

P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n]

= 1 − P[X_1 ∉ B(x*, δ) ∩ D, X_2 ∉ B(x*, δ) ∩ D, …, X_n ∉ B(x*, δ) ∩ D]

= 1 − ∏_{i=1}^{n} P[X_i ∉ B(x*, δ) ∩ D | X_1 ∉ B(x*, δ) ∩ D, …, X_{i−1} ∉ B(x*, δ) ∩ D]

= 1 − ∏_{i=1}^{n} (1 − P[X_i ∈ B(x*, δ) ∩ D | X_1 ∉ B(x*, δ) ∩ D, …, X_{i−1} ∉ B(x*, δ) ∩ D]).

Since

P[X_i ∈ B(x*, δ) ∩ D | σ(E_{i−1})] ≥ P[Y_i ∈ B(x*, δ) ∩ D | σ(E_{i−1})] ≥ ν(x*, δ)

for all i = 1, …, n, it follows that

P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x*, δ))^n.


The above theorem provides a lower bound on the probability that the algorithm lands within a δ-neighborhood of the global minimizer within n iterations (or function evaluations). Now fix the neighborhood radius δ > 0 and the probability requirement 0 < ξ < 1. If we set n = ⌈log(1 − ξ)/log(1 − ν(x*, δ))⌉, then P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. We next determine how large this particular n is in some special cases.
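The iteration budget n = ⌈log(1 − ξ)/log(1 − ν(x*, δ))⌉ is straightforward to compute; a minimal sketch in Python (the function name is ours, for illustration):

```python
import math

def iteration_budget(xi, nu):
    """Smallest n with 1 - (1 - nu)^n >= xi: the number of iterations that
    guarantees landing in B(x*, delta) with probability at least xi, given
    the per-iteration lower bound nu = nu(x*, delta) from Theorem 12."""
    return math.ceil(math.log(1 - xi) / math.log(1 - nu))

n = iteration_budget(0.9, 0.5)   # 1 - 0.5**4 = 0.9375 >= 0.9, so n = 4
```
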

Corollary 6. Let f be a real-valued function on D = [0, 1]^d such that f* := inf_{x∈D} f(x) > −∞. Moreover, suppose that f has a unique global minimizer x* on D in the sense of Theorem 2 and that f is continuous at x*. Consider a GARS algorithm and suppose that for each n ≥ 1, Y_n has a conditional density g_n(y | σ(E_{n−1})) satisfying the following condition: h(y) := inf_{n≥1} g_n(y | σ(E_{n−1})) ≥ β for all y ∈ D, where β > 0 is a constant. By Theorem 4, X*_n −→ x* a.s. Fix the neighborhood radius 0 < δ ≤ 1/2 and the probability requirement 0 < ξ < 1. If we set

n = ⌈ log(1 − ξ) / log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1)) ⌉,   (7)

then P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ.

Proof. Fix z ∈ D and 0 < δ ≤ 1/2. For any n ≥ 1, we have

P[Y_n ∈ B(z, δ) ∩ D | σ(E_{n−1})] = ∫_{B(z,δ)∩D} g_n(y | σ(E_{n−1})) dy ≥ ∫_{B(z,δ)∩D} h(y) dy

≥ ∫_{B(z,δ)∩D} β dy = βμ(B(z, δ) ∩ D) ≥ βψ_D(δ) = (δ/2)^d βπ^{d/2}/Γ(d/2 + 1) =: ν(z, δ),

where ψ_D(δ) is defined in Theorem 4 and its value in this case is given by Proposition 1. Hence, (6) from Theorem 12 holds, and so, for the above choice of n, we get

P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ 1 − (1 − ν(x*, δ))^n ≥ ξ.
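Formula (7) can be evaluated directly. The helper below is an illustrative sketch (the function name is ours, not from the paper) that computes ν(z, δ) = (δ/2)^d βπ^{d/2}/Γ(d/2 + 1) and the resulting budget:

```python
import math

def budget_unit_cube(d, delta, beta, xi):
    """Iterations needed (formula (7)) so that a GARS algorithm whose
    conditional densities are bounded below by beta on D = [0, 1]^d lands
    within delta of the global minimizer with probability at least xi."""
    nu = beta * (delta / 2) ** d * math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return math.ceil(math.log(1 - xi) / math.log(1 - nu))

# d = 1, delta = 1/2, beta = 1: nu = (1/4) * sqrt(pi) / (sqrt(pi)/2) = 1/2,
# so four iterations suffice for xi = 0.9.
n1 = budget_unit_cube(1, 0.5, 1.0, 0.9)
```
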

Consider the special case of Corollary 6 where Y_n has the uniform distribution on the closed hypercube D = [0, 1]^d for all n ≥ 1. Then X_n ≡ Y_n for all n ≥ 1. In this case, h(y) = 1 for all y ∈ D and we can choose β = 1. Again, fix 0 < δ ≤ 1/2 and 0 < ξ < 1. If we set n equal to the value given by (7) with β = 1, then P[X_i ∈ B(x*, δ) ∩ D for some 1 ≤ i ≤ n] ≥ ξ. Hence, this gives a value of n that guarantees, with probability at least ξ, that the well-known random ("blind") search algorithm lands within a δ-neighborhood of the global minimizer. But how large is this value of n? That is, what is the complexity of n in terms of the problem dimension d?
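This guarantee for blind random search can be checked by simulation. The sketch below uses illustrative parameter values (x* = 0.5, δ = 0.25, ξ = 0.9 for d = 1, where ν(x*, δ) reduces to δ); since the bound is conservative, the empirical hit rate should comfortably exceed ξ.

```python
import math
import random

def hit_rate(x_star, delta, xi, trials=2000, seed=2):
    """Empirical probability that blind random search on D = [0, 1] enters
    B(x_star, delta) within the budget n from Corollary 6 (d = 1, beta = 1,
    where nu = (delta/2) * sqrt(pi) / Gamma(3/2) simplifies to delta)."""
    nu = (delta / 2) * math.pi ** 0.5 / math.gamma(1.5)   # equals delta
    n = math.ceil(math.log(1 - xi) / math.log(1 - nu))
    rng = random.Random(seed)
    hits = sum(
        any(abs(rng.random() - x_star) < delta for _ in range(n))
        for _ in range(trials)
    )
    return n, hits / trials

n, rate = hit_rate(x_star=0.5, delta=0.25, xi=0.9)
# The empirical rate exceeds xi = 0.9, as the (conservative) bound predicts.
```
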


Proposition 2. Consider the assumptions of Corollary 6. The value of n given in (7) grows exponentially in d.

Proof. From elementary calculus, it is easy to verify that −1/log(1 − x) ≥ 1/(2x) for all x ∈ (0, 1/2]. Now, since 0 < δ ≤ 1/2 and d ≥ 1, it follows that (δπ^{1/2}/2)^d ≤ √π/4 and Γ(d/2 + 1) ≥ √π/2, and so (δ/2)^d π^{d/2}/Γ(d/2 + 1) ≤ 1/2. Moreover, observe that 0 < β ≤ 1. (If β > 1, then g_n(y | σ(E_{n−1})) in Corollary 6 cannot be a conditional density for each n ≥ 1.) Hence, (δ/2)^d βπ^{d/2}/Γ(d/2 + 1) ≤ 1/2, and so it follows that

−1/log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1)) ≥ 2^d Γ(d/2 + 1)/(2δ^d βπ^{d/2}) ≥ (√π/(4β)) (4/√π)^d.

Multiplying the above inequality by −log(1 − ξ) > 0, it follows that the value of n in (7) satisfies

n ≥ log(1 − ξ)/log(1 − (δ/2)^d βπ^{d/2}/Γ(d/2 + 1)) ≥ −(√π/(4β)) log(1 − ξ) (4/√π)^d.

Since −(√π/(4β)) log(1 − ξ) is a strictly positive constant and 4/√π > 2, this shows that the value of n given by (7) grows exponentially in d.
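Proposition 2 can be verified numerically: the sketch below (with the illustrative choices δ = 1/2, β = 1, ξ = 0.9) computes the budget (7) for increasing d and checks it against the exponential lower bound (√π/(4β))(−log(1 − ξ))(4/√π)^d.

```python
import math

def budget(d, delta=0.5, beta=1.0, xi=0.9):
    """Value of n in (7) for D = [0, 1]^d."""
    nu = beta * (delta / 2) ** d * math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return math.ceil(math.log(1 - xi) / math.log(1 - nu))

ns = [budget(d) for d in range(1, 7)]
root_pi = math.sqrt(math.pi)
# Lower bound from Proposition 2 with beta = 1, xi = 0.9:
lower = [-(root_pi / 4) * math.log(0.1) * (4 / root_pi) ** d
         for d in range(1, 7)]
# Each budget dominates the exponential lower bound, and successive budgets
# grow roughly geometrically (a factor approaching 4/sqrt(pi) ~ 2.26).
ok = all(n >= lb for n, lb in zip(ns, lower))
```
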

Hence, this proposition shows that the convergence guarantee quickly degrades as the dimension of the problem increases. However, as pointed out by Spall (2003), the no free lunch theorems (Wolpert and Macready 1997) indicate that this simple GARS algorithm is no worse than any other algorithm when performance is averaged over the entire range of possible optimization problems. In practice, though, some algorithms are tailored to perform better than others on certain classes of problems with specific characteristics.

8 Summary

We proved some results that guarantee almost sure convergence to the global minimum for a general class of adaptive stochastic search algorithms that follow the GARS (Generalized Adaptive Random Search) framework. The GARS framework is an extension of the well-known simple ("blind") random search algorithm in which the random iterates are not necessarily independent. In the GARS framework, we allow for the possibility that a number of intermediate random elements are first generated before the trial random vector iterate is computed. Moreover, if the trial iterate falls outside the domain of the problem, then it is mapped to the domain via an absorbing transformation. By imposing some conditions on the random vector iterates and the probability distributions that generate them (i.e., the global search conditions in Theorems 1 and 3 and subsequent theorems), the convergence of a GARS algorithm that satisfies these conditions is guaranteed. In addition, the global search condition only needs to be satisfied along a subsequence of the iterations in order to guarantee convergence. This makes the results applicable to a wide range of practical stochastic global optimization algorithms, including those that perform both local and global search and those that combine stochastic and deterministic search strategies.

We also proved convergence results (Theorems 4 and 5) that involve simple requirements on the conditional densities of the trial random vector iterates that are easy to verify in practice. Moreover, in Theorem 6, we provided some simple conditions that guarantee convergence when an elliptical distribution, such as the multivariate Normal or Cauchy distribution, is used to generate the trial random vector iterates. In Section 5, we provided convergence proofs for some practical stochastic global optimization algorithms, including an evolutionary programming algorithm with greedy parent selection. In Section 6, we explored some connections with the results of Stephens and Baritompa (1998). In particular, we showed that for stochastic global minimization algorithms applied to continuous functions defined over compact domains, converging almost surely to the global minimum is equivalent to seeing the global minimum as defined by Stephens and Baritompa (1998). Moreover, we introduced the notion of a sequence of random vector iterates being probabilistically dense in the domain and showed that this is also equivalent to seeing the global minimum with probability 1 under the usual assumptions. In addition, we proved that a GARS algorithm satisfying the global search condition in Theorem 3 generates a sequence of iterates that is probabilistically dense in the domain, and consequently, the algorithm sees any point of the domain (including the global minimizers) with probability 1. Finally, in Section 7, we proved some simple results on the convergence rate of a GARS algorithm.

Acknowledgements

I would like to thank Prof. Shane Henderson from the School of Operations Research & Information

Engineering at Cornell University for some helpful comments during the early stages of this paper.

I would also like to thank the anonymous referees for their helpful comments and suggestions.


References

1. Bäck, T. Evolutionary Algorithms in Theory and Practice. Oxford University Press: New York; 1996.

2. Bäck, T., Rudolph, G., and Schwefel, H.-P. Evolutionary programming and evolution strategies: similarities and differences. In: Fogel, D.B. and Atmar, J.W. (Eds.), Proceedings of the Second Annual Conference on Evolutionary Programming. Evolutionary Programming Society: La Jolla, CA; 1993. pp. 11–22.

3. Chin, D.C. Comparative study of stochastic algorithms for system optimization based on

gradient approximations. IEEE Transactions on Systems, Man, and Cybernetics - B 1997;

27; 244–249.

4. Fang, K.-T. and Zhang, Y.-T. Generalized Multivariate Analysis. Science Press: Beijing.

Springer-Verlag: Berlin; 1990.

5. Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.

Annals of Mathematical Statistics 1952; 23(3); 462–466.

6. Maryak, J.L. and Chin, D.C. Global random optimization by simultaneous perturbation

stochastic approximation. IEEE Transactions on Automatic Control 2008; 53(3); 780–783.

7. Pintér, J.D. Global Optimization in Action. Kluwer Academic Publishers: Dordrecht; 1996.

8. Resnick, S.I. A Probability Path. Birkhäuser: Boston; 1999.

9. Solis, F.J. and Wets, R.J.-B. Minimization by random search techniques. Mathematics of

Operations Research 1981; 6(1); 19–30.

10. Spall, J.C. Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc.:

New Jersey; 2003.

11. Spall, J.C. Multivariate stochastic approximation using a simultaneous perturbation gradient

approximation. IEEE Transactions on Automatic Control 1992; 37; 332–341.

12. Stephens, C.P. and Baritompa, W. Global optimization requires global information. Journal

of Optimization Theory and Applications 1998; 96(3); 575–588.


13. Wolpert, D.H. and Macready, W.G. No free lunch theorems for optimization. IEEE Trans-

actions on Evolutionary Computation 1997; 1(1); 67–82.

14. Yao, X., Liu, Y., and Lin, G. Evolutionary programming made faster. IEEE Transactions on

Evolutionary Computation 1999; 3(2); 82–102.

15. Zabinsky, Z.B. Stochastic Adaptive Search in Global Optimization. Kluwer Academic Pub-

lishers; 2003.

16. Zabinsky, Z.B. and Smith, R.L. Pure adaptive search in global optimization. Mathematical Programming 1992; 53; 323–338.

17. Zhigljavsky, A.A. Theory of Global Random Search. Kluwer Academic Publishers: Dordrecht; 1991.

18. Törn, A. and Žilinskas, A. Global Optimization. Springer-Verlag: Berlin, Germany; 1989.
