
IOP PUBLISHING INVERSE PROBLEMS

Inverse Problems 24 (2008) 055012 (17pp) doi:10.1088/0266-5611/24/5/055012

Numerical methods for experimental design of large-scale linear ill-posed inverse problems

E Haber¹, L Horesh¹ and L Tenorio²

¹ Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
² Department of Mathematics and Computer Science, Colorado School of Mines, Golden, CO, USA

Received 2 January 2008, in final form 5 August 2008
Published 4 September 2008
Online at stacks.iop.org/IP/24/055012

Abstract

While experimental design for well-posed inverse linear problems has been well studied, covering a vast range of well-established design criteria and optimization algorithms, its ill-posed counterpart is a rather new topic. The ill-posed nature of the problem entails the incorporation of regularization techniques. The consequent non-stochastic error introduced by regularization needs to be taken into account when choosing an experimental design criterion. We discuss different ways to define an optimal design that controls both an average total error of regularized estimates and a measure of the total cost of the design. We also introduce a numerical framework that efficiently implements such designs and natively allows for the solution of large-scale problems. To illustrate the possible applications of the methodology, we consider a borehole tomography example and a two-dimensional function recovery problem.

(Some figures in this article are in colour only in the electronic version)

1. Introduction

During the past decade, data collection and processing techniques have dramatically improved. Large amounts of data are now routinely collected, and advances in optimization algorithms allow for their inversion by harnessing high computational power. In addition, recent advances in numerical PDEs and in the solution of integral equations have enabled us to better simulate complex processes. As a result, we are currently able to tackle high-dimensional inverse problems which were never considered before.

However, even when it is possible to gather and process large amounts of data, it is not always clear how such data should be collected. Quite often, data are obtained using protocols developed decades ago, protocols that may be neither optimal nor cost-effective. Such a suboptimal experiment design may lead to loss of important information and waste of valuable resources.

In some cases, a poor experiment design can be overcome by superior processing techniques (the Hubble Space Telescope being a typical example); nevertheless, by far the best way to achieve an ‘optimal’ design is by a proper design of the experimental setup. The field of experimental design has a long history. Its mathematical and statistical foundations were pioneered by R A Fisher in the early 1900s, and it is now routinely applied in the physical, biological and social sciences. However, almost all the work in the field has been developed for the over-determined, well-posed case (e.g. [1, 5, 6, 18] and references therein). The ill-posed case has remained under-researched; in fact, it is sometimes dismissed altogether. For example, in [18] we read ‘Clearly, for any reasonable inference, the number n of observations must be at least equal to k + 1’ (k being the number of unknowns). In contrast, many practical problems, including most geophysical and medical inverse problems, either have fewer observations than unknowns or are ill-posed in the usual sense. Experimental design for such problems has remained largely unexplored. For ill-posed problems, we are aware of the work of [8, 14], who used techniques borrowed from the well-posed formulation, and of the work of [19]; none of these papers provides details of a coherent and systematic approach to the experimental design of ill-posed problems. We are aware of only two papers that address this problem in a more systematic way. Our approach shares some similarities with the very recent work of Bardow [3], but we consider a more general approach to control the mean square error of regularized Tikhonov estimators for large-scale problems. Another very recent work was published by Stark [22]; his approach is based on generalizations of Backus–Gilbert resolution, and control over the mean squared error (MSE) is also considered in that formulation. Our approach focuses on the use of training models and on actual numerical implementations, especially for large-scale problems.

Many important inverse problems are large in practice; the number of parameters that need to be estimated can range from tens of thousands to millions. The computational techniques for optimal design proposed so far in the literature are not suitable for such large-scale problems. Most of the studies that have investigated the ill-posed case, as mentioned above, employed computational techniques based on stochastic optimization. Such an approach is computationally prohibitive.

In this paper, we develop an optimal experiment design (OED) methodology for ill-posed problems. This methodology can be applied to many practical inverse problems and, although only the linear case is considered here, it can be generalized to nonlinear problems. We develop mathematical tools and efficient algorithms to solve the optimization problems that arise in the experimental design. In fact, the problem is formulated in a way that allows the application of standard constrained optimization methods.

The rest of the paper is organized as follows. The basic background on experimental design for well- and ill-posed linear inverse problems is reviewed in section 2. In section 3, we propose several experimental designs for ill-posed problems and discuss issues related to the convexity of the objective functions which define these designs. In section 4, we present numerical optimization algorithms for the solution of the optimization problems that arise from this formulation. In section 5, we present two numerical examples: a borehole tomographic inversion and a two-dimensional function recovery problem. Finally, in section 6, we summarize the paper and discuss future research.

2. From classical to ill-posed OED

We begin by reviewing the basic framework of OED in the context of well-posed inverse problems of the following type. The data are modeled as

d = K(y)m + ε, (2.1)


where K(y) is an ℓ × k matrix representation of the forward operator that acts upon the model m and depends on a vector of experimental parameters y ∈ Y. The noise vector ε is assumed to be zero mean with iid entries of known variance σ². The experimental design question then reduces to the selection of a y that leads to an optimal estimate of m. A solution to this problem for a continuous vector y can be very difficult. For example, in many practical problems the matrix K(y) is not continuously differentiable in y; in other cases, where it is, its derivative may be expensive or hard to compute. This implies that typical optimization methods may not be applicable. We therefore reformulate the OED problem by discretizing the space Y. This approach has also been used for the well-posed case (e.g. [18]).

Assume that y is discretized as y1, . . . , yn, which leads to n possible experiments to choose from:

di = Ki^T m + εi   (i = 1, . . . , n),   (2.2)

where for simplicity each Ki^T is assumed to be a row vector representing a single experimental setup. In the general case, each K(yi)^T may be a matrix. The goal is to choose a number p < n of experiments that provides an optimal estimate (in a sense yet to be defined) of m. (Of course, the well-posed case considered first requires p > k.)

We begin by reviewing the approach presented in [18]. Let qi be the number of times Ki is chosen (hence 1^T q = p). Finding the least squares (LS) estimate of m based on qi independent observations of Ki^T m (i = 1, . . . , n) is equivalent to solving for the weighted LS estimate:

m̂ = arg min_m Σ_{i=1}^n qi (di − Ki^T m)² = arg min_m (d − Km)^T Q (d − Km),

where Var(di) = σ²/qi (for qi ≠ 0), Q = diag{q1, . . . , qn} and K is the n × k matrix with rows Ki^T (the reduction in the variance of di comes from averaging the observations which correspond to the same experimental unit). Assuming that K^T QK is nonsingular, we can write

m̂ = (K^T QK)^{-1} K^T Q d,   (2.3)

which is an unbiased estimator of m with covariance matrix σ²C(q)^{-1}, where

C(q) = K^T QK.   (2.4)

Since m̂ is unbiased, it is common to assess its performance using different characteristics of its covariance matrix. For example, the following optimization problem can be used to choose q:

min_q  trace C(q)^{-1}
s.t.   qi ∈ N ∪ {0},  1^T q = p.   (2.5)

This is known as an A-optimal experiment design. If, instead of the trace, the determinant or the L2 norm of C(q)^{-1} is used, then the design is known as D- or E-optimal, respectively.

Since the OED problem (2.5) is a difficult combinatorial problem, we may use instead an approximation based on the relative frequency wi = qi/n with which each Ki is chosen. The weighted LS estimate and its covariance matrix are given by (2.3) and (2.4), respectively, with w in place of q. The optimization problem (2.5) then becomes

min_w  trace C(w)^{-1}
s.t.   w ≥ 0,  1^T w = 1.   (2.6)


Given a solution w of (2.6), the integer part of nw provides an approximate solution of (2.5) which improves as n increases.

A discussion of A-, D- and E-designs can be found in [5, 18]. As noted above, these designs are defined by weighted LS solutions based on averages of the repeated observations of different experiments. We shall refer to them as cost-controlled designs to distinguish them from the designs defined in the following section. For cost-controlled designs, the total cost (the sum of all the weights of the experiments) is controlled by the constraint 1^T w = 1; the new designs control the total cost in a different way.
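For concreteness, the relaxed A-optimal problem (2.6) can be set up and minimized numerically with generic tools. The following is a minimal sketch in Python with NumPy/SciPy (the paper prescribes no implementation; the function and variable names are ours): it evaluates trace C(w)^{-1} for a small dense problem and minimizes it over the simplex with a standard constrained solver.

import numpy as np
from scipy.optimize import minimize

def a_criterion(w, K):
    # A-optimality criterion trace C(w)^{-1}, with C(w) = K^T diag(w) K (eq. (2.4))
    C = K.T @ (w[:, None] * K)
    return np.trace(np.linalg.inv(C))

rng = np.random.default_rng(0)
n, k = 40, 5                      # n candidate experiments, k unknowns
K = rng.standard_normal((n, k))   # row i plays the role of Ki^T

# Relaxed A-optimal design (2.6): minimize over w >= 0 with 1^T w = 1
w0 = np.full(n, 1.0 / n)
res = minimize(a_criterion, w0, args=(K,),
               bounds=[(0.0, None)] * n,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
print("uniform design  :", a_criterion(w0, K))
print("optimized design:", a_criterion(res.x, K))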

2.1. Sparsity-controlled designs

We now propose a new formulation that in some applications may be more appropriate than the cost-controlled designs; we refer to this formulation as sparsity-controlled design (SCD). In the applications we have in mind, the number of possible experimental units is very large but only a few are needed or can be realistically implemented. This implies that w should have only a few nonzero entries (i.e. w is sparse). The formulation and solution of this problem can be obtained by including a regularization term in (2.6) that penalizes w by using a norm that has some control over the sparsity of w.

In order to obtain a sparse solution w, one would naturally use an L0 penalty (recall that the ‘zero-norm’ ‖w‖0 is defined as the number of nonzero entries of w). However, this type of regularization leads to a difficult combinatorial problem. Instead, we use L1 regularization, which still promotes sparse solutions (e.g. sparser than an L2 penalty).

The merits of employing an L1 penalty with an L2 misfit can be found in [9]; this study also includes a comprehensive analysis of the expected sparsity properties of solutions acquired by this framework. However, the objective function of our problem is not an L2 misfit in w. In section 4, we consider L1 regularization and a heuristic approximation of the L0 approach for our OED problem. We show that the L0 heuristic may outperform the L1 design.

In summary, an A-optimal sparsity-controlled design is defined as a solution to the following optimization problem:

min_w  trace C(w)^{-1} + β‖w‖p
s.t.   w ≥ 0,   (2.7)

where 0 ≤ p ≤ 1 and β > 0 is selected so as to obtain a desired number of nonzero wi. In this paper, we will only consider p = 1 and an approximation to the p = 0 problem.

To a practitioner, a solution w of (2.7) means that all the observations that correspond to a nonzero wi should be carried out so as to provide a variance σ²/wi. This can be achieved, for instance, by adjusting the measuring instrument or extending the observation time. The estimate of m is then obtained by using weighted LS. In some cases, the experimenter may have a maximum allowable variance level; in such a case, a constraint w ≤ wmax can be added to (2.7).

Although problems (2.6) and (2.7) may seem different, it is easy to verify that a solution of (2.7) with p = 1 and an appropriate choice of β is also a solution of (2.6). It is also important to note that p = 0 may lead to a design that achieves a smaller value of trace C(w)^{-1} for the same number of nonzero entries in w. We explore this issue further in the numerical examples.


2.2. The ill-posed case

Since the designs discussed so far are based only on the covariance matrix, they are not appropriate for ill-posed problems, where estimators of m are most likely biased. In fact, the bias may be the dominant component of the error. We now turn our focus to the ill-posed case.

Let W = diag{w1, . . . , wn} again be a diagonal matrix of non-negative weights and assume that the matrix K^T WK is ill-conditioned. A regularized estimate of m can be obtained by using penalized weighted LS (Tikhonov regularization) with a smoothing penalty matrix L:

m̂ = arg min_m (d − Km)^T W(d − Km) + α‖Lm‖²,

where α > 0 is a fixed regularization parameter that controls the balance between the data misfit and the smoothness penalty. Assuming that K^T WK + αL^T L is nonsingular, the estimator is m̂ = (K^T WK + αL^T L)^{-1} K^T W d, whose bias can be written as

Bias(m̂) = E m̂ − m = −α (K^T WK + αL^T L)^{-1} L^T L m.

Since the bias is independent of the noise level, it cannot be reduced by repeated observations. The effect of the noise level is manifested in the variability of m̂ around its mean E m̂. Thus, this variability and the bias ought to be taken into account when choosing an estimator.

Define B(w,m) = ‖Bias(m̂)‖²/α² and V(w) = E‖m̂ − E m̂‖²/σ². The sum of these two error terms, suitably weighted, provides an overall measure of the expected performance of m̂; this is essentially the MSE of m̂. More precisely, the MSE of m̂ is defined as E‖m̂ − m‖², which can also be written as

MSE(m̂) = E‖m̂ − E m̂ + E m̂ − m‖² = ‖E m̂ − m‖² + E‖m̂ − E m̂‖²
        = ‖Bias(m̂)‖² + E‖m̂ − E m̂‖² = α²B(w,m) + σ²V(w).

The following lemma summarizes some of the characteristics of the Tikhonov estimate for a general symmetric matrix W and correlated noise with covariance matrix σ²Σ. The proof follows from simple repeated applications of the formula for the expected value of a quadratic form: if x is a random vector with mean μ and covariance matrix σ²Σ, then E x^T Qx = μ^T Qμ + σ² trace(QΣ). More details can be found in [16, 23].

Lemma 1. Let d = Km + ε, where K is an n × k matrix and ε is a zero mean random vector with covariance matrix σ²Σ. Let α > 0 be a fixed regularization parameter, W a symmetric matrix and L an r × k matrix such that K^T WK + αL^T L is nonsingular. Define

m̂ = arg min_m (d − Km)^T W(d − Km) + α‖Lm‖²,   (2.8)

and the matrix

C(w) = K^T WK + αL^T L.   (2.9)

Then,

(i) the Tikhonov estimate of m is

m̂ = C(w)^{-1} K^T W d;   (2.10)

(ii) its MSE is

MSE(m̂) = E‖m̂ − m‖² = α²B(W,m) + σ²V(W),   (2.11)

where

B(W,m) = ‖C(w)^{-1} L^T L m‖²,   V(W) = trace[WKC(w)^{-2}K^T W^T Σ].   (2.12)
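As a sanity check, the two error terms in (2.12) can be evaluated directly with dense linear algebra for small problems. A minimal sketch in Python/NumPy (our own code and names, not the authors' implementation):

import numpy as np

def mse_terms(w, K, L, alpha, m, Sigma=None):
    # B(W, m) and V(W) of eq. (2.12) for W = diag(w); dense, small-scale only
    n = K.shape[0]
    W = np.diag(w)
    Sigma = np.eye(n) if Sigma is None else Sigma
    C = K.T @ W @ K + alpha * (L.T @ L)              # C(w), eq. (2.9)
    Cinv = np.linalg.inv(C)
    b = Cinv @ (L.T @ (L @ m))                       # C(w)^{-1} L^T L m
    B = float(b @ b)                                 # bias term B(W, m)
    V = float(np.trace(W @ K @ Cinv @ Cinv @ K.T @ W.T @ Sigma))  # variance term V(W)
    return B, V

# The MSE of the Tikhonov estimate, eq. (2.11):
#   MSE = alpha**2 * B + sigma**2 * V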


The idea is to define optimization problems similar to (2.7) but with an objective function that measures the performance of m̂, taking into account its bias and stochastic variability. A natural choice would be the MSE; however, this measure depends on the unknown m. In the following section, we consider different ways to control an approximation of the MSE that are based on different assumptions on m.

3. Design criteria for ill-posed problems

As we have seen in section 2, to obtain an optimal design based on Tikhonov regularization estimators, the deterministic and stochastic components of the error have to be taken into account. This is done by controlling the overall MSE. We modify the designs presented in section 2 to use MSE(m̂) = α²B(w,m) + σ²V(w) as the main component of the objective function.

The problem is that the bias component B(w,m) depends on the unknown model m. There are different ways to control this term; these depend on the available information and on how conservative we are willing to be. For example, if we know a priori that ‖L^T Lm‖ ≤ M, then lemma 1 with W = diag{w1, . . . , wn} leads to the bound

B(w,m) = ‖C(w)^{-1} L^T L m‖² ≤ ‖C(w)^{-1}‖² M².

But this bound may be too conservative. We will instead consider average measures of B(w,m) that are expected to be less conservative.

(a) Average optimal design (AOD). Assuming that the model is in a bounded convex set M, we consider the average of B(w,m) over M and define the approximation

AOD:  MSE(w) = (α² / ∫_M dm) ∫_M B(w,m) dm + σ²V(w).   (3.1)

The downside of this design is that it does not give preference to ‘more reasonable’ models in M. Unless the set M is well constrained, such a design may be overly pessimistic.

(b) Bayesian optimal design (BOD). In order to assign more weight to more likely models, m is modeled as a random vector with a joint distribution function π. The average of B(w,m) is now weighted by π:

BOD:  MSE(w) = Eπ MSE(m̂) = α² Eπ B(w,m) + σ²V(w).   (3.2)

For example, if m is Gaussian N(0, Σm), then (3.2) reduces to

MSE(w) = α² trace[B(w) Σm B(w)^T] + σ²V(w),   (3.3)

with B(w) = C(w)^{-1} L^T L.

Note that the distribution π is not required for the computation of Eπ B(w,m) in (3.2); only the covariance matrix of m is needed. Hence, whenever data are available for estimating the covariance matrix, the design can be approximated. This rationale leads us to the empirical design.

(c) Empirical Bayesian design (EBD). In many practical applications, it is possible to obtain examples of plausible models. For example, in geophysical applications the Marmousi model [24] is often used to test inversion algorithms. In medical imaging, the Shepp–Logan model is frequently used as a gold standard to test different designs and inversion algorithms [21]. There are also geostatistical methods to generate realizations of a given medium from a single image (e.g. [20] and references therein). Finally, for some problems there are databases of historical data that can be used as reference.


Let m1, . . . , ms be examples of plausible models, which will henceforth be referred to as training models. As in the Bayesian optimal design, it is assumed that these models are iid samples from an unknown multivariate distribution π, only that this time we use the sample average

ÊπB(w,m) = (1/s) Σ_{i=1}^s B(w, mi),   (3.4)

which is an unbiased estimator of Eπ B(w,m). We thus define the following approximation:

EBD:  MSE(w) = α² ÊπB(w,m) + σ²V(w).   (3.5)

This type of empirical approach is commonly used in machine learning, where estimators are trained using iid samples. We have previously used a similar approach to choose regularization operators for ill-posed inverse problems [11].
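A minimal sketch of the sample-average bias term (3.4) in Python/NumPy (the function name and interface are ours); it avoids explicit inverses by solving linear systems instead:

import numpy as np

def empirical_bias(w, K, L, alpha, training_models):
    # (1/s) * sum_j || C(w)^{-1} L^T L mj ||^2, the estimate (3.4)
    C = K.T @ (w[:, None] * K) + alpha * (L.T @ L)
    total = 0.0
    for m in training_models:
        b = np.linalg.solve(C, L.T @ (L @ m))
        total += float(b @ b)
    return total / len(training_models)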

The computation of (3.4) can be further simplified when the covariance matrix of m is modeled as a function of only a small set of parameters that can be estimated from the training models. Here, we focus on the more difficult case where no such covariance model is used.

Clearly, Bayes’ theorem has not been used in the definition of the ‘Bayesian’ designs. To justify the name, consider the cost-controlled design and assume that m is Gaussian N(0, Σm) with Σm = (σ²/α)(L^T L)^{-1} (L can be defined so that L^T L is nonsingular), and that the conditional distribution of d given m is also Gaussian N(Km, σ²Σ) with Σ = (W + δI)^{-1}. Then, as δ → 0+, (3.3) converges to

MSE(w) = σ² trace[(K^T WK + αL^T L) C(w)^{-2}] = σ² trace C(w)^{-1},   (3.6)

where C(w) is defined in lemma 1. It is easy to see that (3.6) is precisely the trace of the covariance matrix of the posterior distribution of m given d; this is why the term ‘Bayesian’ is used in the name of the design. It is, in fact, an A-optimal Bayesian design [6]. (Note that the AOD is a BOD with a flat prior, and the EBD is a BOD with the empirical prior.)

These three designs lead to an optimization problem that generalizes (2.7) to the ill-posed case:

min_w  J(w) = MSE(w) + β‖w‖p
s.t.   w ≥ 0.   (3.7)

Three important remarks are in order, as follows.

(1) OED and tuning of regularization parameters. The search for appropriate weights wi can be interpreted as the usual problem of estimating regularization parameters for an ill-posed inverse problem. The method we propose determines a w that minimizes the MSE averaged over noise and model realizations. Although it is possible (and in many cases desirable) to obtain a regularization parameter adapted to a particular realization of noise and model m, experimental design is conducted before any data are available. It is therefore sensible to consider values of the regularization parameters that work well in an average sense.

(2) Selecting α. It is clear that the Tikhonov estimate of m and its MSE depend on α only through the ratio w/α, but α itself is not identifiable. It is only used to calibrate (by preliminary test runs) w to be of the order of unity. This facilitates the identification of wi that are not significantly different from zero.

(3) Convexity of the objective function. The formula for the error term V(w) is different in the well- and ill-posed cases. In the former, the variance of an individual experiment is controlled by averaging observations of the same experimental unit. In the ill-posed case, the variances of averaged observations are not used as weights; this time, the wi are just weights chosen so that the weighted Tikhonov estimator of m has a good MSE. One reason for this change is convexity. In the well-posed case, the function w → trace C(w)^{-1} is convex (see also [5]), so the optimization problem can be easily solved by standard methods. This is no longer true for ill-posed problems: in this case we would have V(w) = trace[WKC(w)^{-2}K^T], a function that can have multiple minima. Instead, the new approach yields V(w) = trace[WKC(w)^{-2}K^T W], which is better behaved. For example, in the simple case when K = L = I and α = 1, the well-posed formula gives V(w) = Σi wi/(1 + wi)². This function is obviously not convex and has multiple minima. On the other hand, the ill-posed formula reduces to V(w) = Σi wi²/(1 + wi)². This function is again not convex but is quasi-convex (i.e. it has a single minimum).

In contrast, it turns out that the estimate of B(w,m) defined for each of the three designs is a convex function of w.

Lemma 2. Let m have a prior distribution π with finite second moments. Then Eπ B(w,m) is a convex function of w.

Proof. Let μ and Σm = R^T R be, respectively, the mean and covariance matrix of m under π. Then

Eπ B(w,m) = ‖C(w)^{-1} L^T L μ‖² + trace[R L^T L C(w)^{-2} L^T L R^T]
          = ‖C(w)^{-1} L^T L μ‖² + Σ_j ‖C(w)^{-1} L^T L R^T ej‖²,   (3.8)

where {ej} is the canonical basis of R^k. Hence, it is enough to show that for any b ∈ R^k, the function B(w) = b^T C(w)^{-2} b is convex, where C(w) = K^T diag(w) K + αL^T L. We do this by showing that B(w) is convex along any line in the domain of w. Let w be a vector with non-negative entries and z a direction vector. Define the scalar function f(t) = b^T C(w + tz)^{-2} b. We show that f(t) is convex for all t such that w + tz ≥ 0.

Define x implicitly by

C(w + tz) x = [K^T diag(w + tz) K + αL^T L] x = b,   (3.9)

so that f(t) = x(t)^T x(t). Differentiation of f leads to

f′ = 2x^T x′   and   f″ = 2x′^T x′ + 2x^T x″,

while differentiation of (3.9) with respect to t yields

K^T ZKx + C(w + tz) x′ = 0   and   2K^T ZKx′ + C(w + tz) x″ = 0,   (3.10)

where Z = diag(z). Since f″ = 2‖x′‖² + 2x^T x″, showing that x^T x″ ≥ 0 is enough to prove the non-negativity of f″. Using (3.10), we obtain

x^T x″ = −2x^T C(w + tz)^{-1} K^T ZK x′ = 2x^T C(w + tz)^{-1} B C(w + tz)^{-1} B x = 2x^T G² x,

where B = K^T ZK and G = C(w + tz)^{-1} B. Since C(w + tz) is symmetric and positive definite, there is a nonsingular matrix S such that C(w + tz) = S^T S. Note that S G S^{-1} = (S^T)^{-1} B S^{-1}, which is a symmetric matrix. This means that G is similar to a symmetric matrix and therefore diagonalizable with real eigenvalues: there is a nonsingular matrix T such that G = T Λ T^{-1}, where Λ is a diagonal matrix of eigenvalues of G. It follows that G² = T Λ² T^{-1} and therefore x^T G² x ≥ 0. Hence, f″ ≥ 0 and b^T C(w)^{-2} b is a convex function. □

Despite the convexity of the bias term, the estimate of the MSE is non-convex because of the non-convexity of V(w). There are two reasons why the non-convexity of the objective function may not be critical. First, while it is desirable to find an optimal design, any improvement on a currently available one may be sufficient in practice. Second, in many ill-posed inverse problems the stochastic error V(w) is small compared to the bias component B(w,m).

4. Solving the optimization problem

The framework presented in section 3 is a new approach that can be applied to a broad range of linear and linearized optimal experimental designs of ill-posed inverse problems. Before discussing a solution strategy, we need to define numerical approximations for the estimates of the MSE.

Since the weights wi are non-negative, the non-differentiable L1-norm can be replaced by 1^T w, which leads to a large but tractable problem. However, the computation of a quadratic form or a trace involves large, dense matrix products and inverses which cannot be computed efficiently in large-scale problems. Hence, we now derive approximations of the objective function that do not require the direct computation of traces or matrix inverses.

In order to estimate V(w), we need to approximate the trace of a possibly very large matrix. Stochastic trace estimators have been successfully used for this purpose. In particular, Golub and von Matt [10] have used a trace estimation method proposed by Hutchinson [12], in which the trace of a symmetric positive definite matrix H is approximated by

trace(H) ≈ (1/s) Σ_{i=1}^s vi^T H vi,   (4.1)

where each vi is a random vector of iid entries taking the values ±1 with probability 1/2. The performance of this estimator was numerically studied in [2] with the surprising result that the best compromise between accuracy and computational cost is achieved with s = 1. Our numerical experiments confirm this result. We therefore use the following approximation:

V(w) = v^T WKC(w)^{-2}K^T W v = w^T V(w)^T V(w) w,   (4.2)

where the matrix V(w) = C(w)^{-1} K^T diag(v) (not to be confused with the scalar variance term on the left).

If the mean and covariance matrix of m are known, then the expected value of B(w,m) is given by (3.8); this still requires the computation of a norm and a trace. Nevertheless, we consider the more general case where the mean and covariance matrix of m are estimated based on iid samples m1, . . . , ms. We use the sample estimate defined for the EBD:

B(w) = (1/s) Σ_{j=1}^s mj^T B(w)^T B(w) mj,   with the matrix B(w) = C(w)^{-1} L^T L as before.   (4.3)
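To illustrate (4.1) and (4.2), the following sketch (ours; Python/NumPy) compares the single-probe Hutchinson estimate of the variance term with the exact trace on a small dense problem:

import numpy as np

rng = np.random.default_rng(1)
n, k, alpha = 60, 40, 1.0
K = rng.standard_normal((n, k))
L = np.eye(k)
w = rng.uniform(0.0, 1.0, n)

C = K.T @ (w[:, None] * K) + alpha * (L.T @ L)
Cinv = np.linalg.inv(C)
W = np.diag(w)

V_exact = np.trace(W @ K @ Cinv @ Cinv @ K.T @ W)   # V(w) = trace[WKC^{-2}K^T W]

v = rng.choice([-1.0, 1.0], size=n)                 # single +/-1 probe (s = 1)
y = Cinv @ (K.T @ (w * v))                          # V(w) w = C(w)^{-1} K^T W v
V_hutch = float(y @ y)                              # w^T V(w)^T V(w) w, eq. (4.2)

print(V_exact, V_hutch)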

Summarizing, the approximation to the optimization problem (3.7) that we solve for the case p = 1 is

min_w  J(w) = α²B(w) + σ²V(w) + β 1^T w
s.t.   w ≥ 0.   (4.4)

4.1. Numerical solution of the optimization problem

For the solution of (4.4), we use projected gradient and projected Gauss–Newton methods. This requires the computation of the gradients of B and V. We have

∇w [w^T V(w)^T V(w) w] = 2 Jv(w)^T V(w) w,
∇w [m^T B(w)^T B(w) m] = 2 Jb(w)^T B(w) m,

where

Jv(w) = ∂[V(w)w]/∂w   and   Jb(w) = ∂[B(w)m]/∂w.

We use implicit differentiation to obtain expressions for the matrices Jv and Jb. Define

rb = (K^T WK + αL^T L)^{-1} K^T WK m.

The matrix Jb is nothing but ∇w rb. Differentiating rb with respect to w leads to

K^T diag(Km) = (K^T WK + αL^T L) ∂rb/∂w + K^T diag(K rb),

which yields

Jb = C(w)^{-1} K^T diag[K(m − rb)].   (4.5)

We proceed in a similar way to obtain an expression for Jv. First, note that

∂[V(w)w]/∂w = V(w) + ∂[V(w) wfixed]/∂w,

where the second term is the derivative with the argument wfixed held constant. To compute it, write

rv = V(w) wfixed = (K^T WK + αL^T L)^{-1} K^T diag(v) wfixed.

Differentiating this equation with respect to w leads to

(K^T WK + αL^T L) ∂rv/∂w + K^T diag(K rv) = 0,

which finally yields

Jv = V(w) − C(w)^{-1} K^T diag(K rv).   (4.6)

This completes the evaluation of the derivatives of the objective function. It is important to note that neither the matrix C(w) nor its inverse is needed explicitly for the computation of the objective function or any of the gradients. Whenever a product of the form C(w)^{-1}u is needed, we simply solve the system C(w)x = u. To solve such a system, only matrix-vector products of the form C(w)v are required.
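A minimal sketch of this matrix-free pattern, assuming SciPy's conjugate gradient solver (function and variable names are ours): C(w) enters only through products with K, K^T and L, so C(w)^{-1}u is obtained without ever forming C(w).

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_C(w, K, L, alpha, u):
    # Solve C(w) x = u with C(w) = K^T diag(w) K + alpha L^T L, matrix-free via CG
    k = K.shape[1]
    matvec = lambda x: K.T @ (w * (K @ x)) + alpha * (L.T @ (L @ x))
    x, info = cg(LinearOperator((k, k), matvec=matvec), u)
    if info != 0:
        raise RuntimeError("CG did not converge")
    return x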

Having found the gradient, we can now use any gradient-based method to solve the problem. We have experimented with the projected gradient formulation, which requires only gradient evaluations, as well as with the projected Gauss–Newton method. For the Gauss–Newton method, one can approximate the Hessian of the objective function J in (4.4) by

∇²J ≈ σ² Jv^T Jv + α² Jb^T Jb.

With the Jacobians at hand, it is possible to use the method suggested by Lin and Moré [13]: the active set is (approximately) identified by the gradient projection method, and a truncated Gauss–Newton iteration is performed on the remaining variables. Again, it is important to note that the matrices Jv and Jb need not be formed explicitly; a product of either with a vector involves the solution of a system C(w)x = u. Thus, conjugate gradients can be used to compute an approximate Gauss–Newton step.
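For illustration, a bare-bones gradient projection loop of the kind described above might look as follows. This is a generic sketch, not the authors' implementation; objective_and_grad is a hypothetical callable returning J(w) of (4.4) and its gradient assembled from (4.5) and (4.6).

import numpy as np

def projected_gradient(objective_and_grad, w0, step=1.0, max_iter=500, tol=1e-8):
    # Minimize J(w) subject to w >= 0 by gradient projection with backtracking
    w = np.maximum(w0, 0.0)
    J, g = objective_and_grad(w)
    for _ in range(max_iter):
        s = step
        w_new = np.maximum(w - s * g, 0.0)       # project onto {w >= 0}
        J_new, g_new = objective_and_grad(w_new)
        while J_new > J and s > 1e-12:           # backtrack until descent
            s *= 0.5
            w_new = np.maximum(w - s * g, 0.0)
            J_new, g_new = objective_and_grad(w_new)
        if np.linalg.norm(w_new - w) <= tol * (1.0 + np.linalg.norm(w)):
            return w_new
        w, J, g = w_new, J_new, g_new
    return w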

Beyond the obvious computational benefits offered by a design framework that relies merely on matrix-vector products, an even more fundamental advantage applies. Many imaging systems, such as tomographs, employ hardware-specific computational modules or, in other cases, a black-box code module. These modules are often accessible only via matrix-vector products. Thus, the proposed methodology can easily be applied in such cases, using the existing customized computational machinery.


4.2. Approximating the L0 solution

Although it is straightforward to solve the L1 regularization problem, one can attempt to approximate the L0 solution, that is, the vector w with the least number of nonzero entries. Obtaining this solution is an intricate combinatorial problem. Nevertheless, it is possible to approximate the L0 solution by using the L1 penalty.

Write {1, . . . , n} = I0 ∪ IA, where I0 contains the indices of the zero entries of w and IA contains the rest. We write wI0 and wIA for the restrictions of w to I0 and IA, respectively. If I0 were known a priori, the L0 solution could be obtained by setting wI0 = 0 and solving the unregularized optimization problem

min_{wIA}  J(wIA) = (α²/s) Σ_{j=1}^s mj^T B(wIA)^T B(wIA) mj + σ² wIA^T V(wIA)^T V(wIA) wIA
s.t.       wIA ≥ 0,  wI0 = 0.   (4.7)

This problem does not require any regularization term because the zero set is assumed to be known. The combinatorial nature of the L0 problem emerges from the search for the set IA. Nevertheless, one can approximate IA by the corresponding index set of the L1 solution.

This idea has been explored in [5], where numerical experiments show that the approximate L0 solution may differ from, and in fact improve on, the L1 solution. In this work, we use the same approximation: we set the final weights to those that solve (4.7) with the set IA obtained from the solution of the L1 problem.
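Schematically, this two-stage heuristic reads as follows (our sketch; solve_l1_design and solve_unregularized are hypothetical stand-ins for solvers of (4.4) and of (4.7) restricted to a given support):

import numpy as np

def l0_heuristic(solve_l1_design, solve_unregularized, beta, zero_tol=1e-8):
    # Stage 1: the L1-regularized design gives an approximate support I_A
    w_l1 = solve_l1_design(beta)
    active = np.flatnonzero(w_l1 > zero_tol)
    # Stage 2: refit the weights on I_A without the sparsity penalty, eq. (4.7)
    w = np.zeros_like(w_l1)
    w[active] = solve_unregularized(active)
    return w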

5. Numerical examples

We present two numerical examples that illustrate some applications of the proposed methodology. These examples show that experimental design is potentially of great significance in practical applications.

5.1. A borehole tomography problem

We begin with a ray tomography example that is often used for illustrative purposes in geophysical inverse problems. It also serves as a point of comparison, as it has been used in [8, 19] for their experimental design work.

The objective of borehole ray tomography is to determine the slowness (inverse of velocity) of a medium. Sources and receivers are placed along boreholes and/or on the surface of the earth, and travel times from sources to receivers are recorded. The data (travel times) are modeled as

dj = ∫_{Γj} m(x) dℓ + εj   (j = 1, . . . , n),   (5.1)

where Γj is the ray path traversing between the source and the receiver. In the linearized case considered here, the ray path does not change as a function of m (see, for example, [17]). The goal of experimental design in this case is to choose the optimal placement of sources and receivers.

Figure 1 shows a sketch of the experimental setup used in our numerical simulation. The medium is represented by the square region [0, 1] × [0, 1], with boreholes covering the interval [0, 1] on each of the sides. The top represents the surface, which also comprises the interval [0, 1]. The model is discretized using 64 × 64 cells.


Figure 1. Geometry of the borehole tomography example. Sources and receivers are placed along the two boreholes (left and right sides of the square region) and on the earth's surface (top). The lines represent straight ray paths connecting sources and receivers.

Sources and receivers are to be placed anywhere in the design space Y = I1 ∪ I2 ∪ I3, where I1 = {(x, y) : x = 0, y ∈ [0, 1]}, I2 = {(x, y) : x = 1, y ∈ [0, 1]} and I3 = {(x, y) : y = 0, x ∈ [0, 1]}. We are free to choose rays that connect any point on Ik to a point on Ij (j, k = 1, 2, 3; j ≠ k). We discretize each Ii using 32 equally spaced points. This gives a total of 32² × 3 = 3072 possible experiments. Our goal is to choose the optimal placement of 500 sources and receivers. For the solution of the inverse problem, we have used the discretization of the gradient as a smoothing matrix.
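The count of candidate experiments can be reproduced by direct enumeration. A small sketch (ours) that pairs the discretized segments as described above:

import itertools
import numpy as np

pts = np.linspace(0.0, 1.0, 32)
I1 = [(0.0, y) for y in pts]       # left borehole
I2 = [(1.0, y) for y in pts]       # right borehole
I3 = [(x, 0.0) for x in pts]       # surface

# Rays connect any point on Ik to any point on Ij (j != k); counting each
# pair of segments once gives 3 * 32^2 = 3072 candidate experiments.
rays = [(p, q) for A, B in itertools.combinations([I1, I2, I3], 2)
        for p in A for q in B]
print(len(rays))                   # 3072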

To create the set of training models, we divide the Marmousi model [24] into four parts, m1, . . . , m4. The first three are used to obtain the optimal experiment and the fourth for testing its performance. The training models are shown in figure 2 and the testing model in figure 3.

The code that minimizes J was executed with different values of β. Table 1 shows the final values of B(w∗) and V(w∗) and the sparsity of the solution w∗ for several values of β. The results are shown for the L1 penalty and the L0 approximation.

Note that at the minimum of the objective function the bias term B is substantially larger than V. This confirms that methods for the optimal experimental design of ill-posed problems must take the bias into consideration, as it may play a more important role than the variance. This can be interpreted as a problem with the chosen regularization operator; in fact, we have used a similar strategy in order to choose appropriate regularization operators L for ill-posed problems [11]. Of course, the importance of the bias is noise-dependent. For this example, the noise level would need to be 30 times larger for the variance component to be of the same order as the bias term. Also note that in many cases there is a difference between the results of the L0 and L1 designs. Typically, the L0 design gave a smaller MSE estimate than the L1 design.


Figure 2. The first three training models obtained from the Marmousi model.

Figure 3. The test model m4 (left) and its estimates obtained with the non-optimal (middle) and optimal (right) designs.

In order to assess the performance of the optimal design obtained by the code, we simulate data using the test model m4 and use the optimal weights for its recovery. We compare this estimate with that obtained using weights based on a uniform placement of sources and receivers. For the sake of objectivity, an equal number of active sources and receivers was deployed for both designs; this number was determined according to the optimum obtained for β = 10^{-2}. The error ‖m̂(w) − m‖ was 1.7 × 10³ for the optimal design and 3.1 × 10³ for the other. Figure 3 shows the estimates of m4 obtained with the two designs. It is evident that the optimally designed survey yielded a substantial improvement in the recovered model.

Another important feature of our design is its controlled sparsity: the number of nonzero entries in w is reduced by increasing β. It is then natural to ask for a method to choose an ‘optimal’ number of experiments. To answer this question, the MSE is plotted as a function of the number of nonzeros in the design; the plot is shown in figure 4. The graph shows a clear L-shape, with the MSE decreasing significantly in the region ‖w‖0 ≤ 350 and without significant improvement beyond this point. This suggests that selecting the number of experiments at the corner point 350, where the MSE stabilizes, may provide a cost-effective compromise. This example also shows that there is a point of diminishing returns: resources may be invested to obtain more data, but the improvement in the MSE may not be significant.


Figure 4. MSE as a function of the sparsity (‖w‖0 = NNZ(w)) of the optimal w obtained with the L0 heuristic on the tomography simulation.

Table 1. Sparsity of the optimal w∗ and components of the MSE estimate at w∗ for the tomography simulation. The results are shown for different values of β and for the L1 and L0 designs.

β (×10^{-2})   L1: α²B (×10²)   L0: α²B (×10²)   L1: σ²V (×10^{-1})   L0: σ²V (×10^{-1})   ‖w∗‖0

1000.0         4.3              3.1              0.51                 0.98                 249
130.00         2.4              2.1              2.70                 1.50                 310
16.000         2.1              2.1              2.80                 1.70                 423
2.0000         2.1              2.1              2.90                 1.38                 491
0.2400         2.1              2.1              2.90                 3.40                 507
0.0310         2.1              2.1              2.90                 3.60                 519

5.2. Recovery of a function from noisy data

For a second example, we consider noisy observations of a sufficiently smooth function f(x1, x2) on Ω = [0, 1] × [0, 1] (e.g. f ∈ C²(Ω)):

di = f(x1^i, x2^i) + εi   (i = 1, . . . , n).   (5.2)

The goal is to recover the values of the function on a dense grid G ⊂ Ω from values observed on a smaller, sparse grid S ⊂ G. The design question is to determine S (see also [4] for discussion). This is a common practical problem that arises, for example, in air-pollution monitoring, where one wishes to set the placement of the sensors (e.g. [7] and references therein). In this type of application, a database of pollution maps from previous years is available and can be used as a source of training models. As in the tomography problem, we begin by discretizing Ω. We use a uniform grid G of ℓ × ℓ nodes and assume that sensors can be placed at any of its nodes. We write fh for the vector of values of f on G. Let W = diag(w) be the matrix of weights assigned to each node. The Tikhonov estimate of fh that uses the discrete Laplacian Δh as a smoothing penalty is defined as the minimizer of (fh − d)^T W(fh − d) + α‖Δh fh‖², where Δh is a discretization of the Laplacian (e.g. [15, 25]). The subset S of the grid is defined by the nonzero entries of w.
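A minimal sketch of this estimator, assuming a standard five-point discrete Laplacian on the ℓ × ℓ grid (our own construction; the paper does not specify the stencil, and the grid spacing is absorbed into α):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def tikhonov_grid_estimate(d, w, ell, alpha):
    # Minimize (f - d)^T W (f - d) + alpha ||Lap_h f||^2 on an ell x ell grid;
    # entries of d where w = 0 (no sensor) are ignored by the weighting.
    D = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(ell, ell))
    I = sp.identity(ell)
    Lap = sp.kron(I, D) + sp.kron(D, I)          # five-point discrete Laplacian
    C = sp.diags(w) + alpha * (Lap.T @ Lap)      # normal equations matrix
    return spsolve(C.tocsc(), w * d)             # solves C f = W d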


Figure 5. Maps of air pollution in the Ashkelon/Ashdod area.

Table 2. Same as table 1 but using the results from the function reconstruction example.

β (×10^{-2})   L1: α²B   L0: α²B   L1: σ²V   L0: σ²V   ‖w∗‖0

4000.0         34.6      9.50      0.190     0.1223    45
2000.0         18.4      5.10      0.140     0.1201    56
1000.0         11.5      3.20      0.100     0.1083    66
500.00         7.80      2.10      0.086     0.0972    77
250.00         5.20      1.40      0.074     0.0960    85
125.00         3.50      1.10      0.065     0.0949    86
62.500         2.10      0.97      0.065     0.1117    88
31.300         1.30      1.55      0.073     0.1123    93
15.600         0.74      1.05      0.077     0.1434    99
7.8100         0.51      0.55      0.082     0.1230    130
3.9100         0.39      0.40      0.086     0.1074    159
1.9500         0.34      0.33      0.090     0.1146    181
0.0980         0.31      0.30      0.092     0.1221    192
0.0490         0.30      0.28      0.094     0.1054    201

We run our algorithm using a training set of 1000 maps, randomly chosen from the available 5114 maps of daily air pollution data (NO2) for the years 2000–2007 in the Ashkelon/Ashdod region (Israel). Six of the maps are shown in figure 5. The goal is to determine the optimal location of air-pollution stations in the same region. In order to use our method, we need the variance σ² of the data. The value 0.1 of this variance was provided by the atmospheric scientists; it was obtained from the known performance characteristics of the instruments.

The results are summarized in table 2. Once again, the table shows that the bias term is more significant than the variance term V; thus, an experimental design based only on the covariance matrix would yield poor results. Also, just as in the tomography example, the L0 design gave a better MSE than the L1 design.


Table 3. MSE of the estimates of the eight test models based on the optimal and uniform designs.

Design    m1    m2    m3    m4    m5    m6    m7    m8

Optimal   0.82  0.85  0.89  0.75  0.77  0.92  0.88  0.83
Uniform   1.41  1.34  1.23  1.37  1.39  1.31  1.29  1.28

We compare our results for β = 15.6 (which yields 99 sensors) to those obtained with the commonly used design based on a uniform grid with 100 sensors. For the comparison, we use eight models that were not used as training models. We then compute the MSE of the estimates of each testing model based on our optimal design and on the uniform-grid design. The MSE for each of these models and for each of the designs is shown in table 3. These results confirm that the optimally designed experiment is superior to the uniform design.

6. Summary

We have considered the problem of experimental design for ill-posed linear inverse problems. One of the main differences from the well-posed case is the addition of the regularization that is required for dealing with the ill-posedness of the inverse problem. Consequently, we can no longer focus only on the covariance matrix of the estimates, as the bias, which often dominates the overall error, has to be taken into account.

The basic idea has been to control the mean squared error of weighted Tikhonov estimates, with the weights chosen so as to control the total cost and sparsity of the design. We have developed an efficient computational strategy for the solution of the resulting L1 and (approximate) L0 optimization problems for the weights. The formulation leads to continuously differentiable optimization problems that can be solved using conventional optimization techniques, and it can readily be applied to large-scale problems.

In this study we have only defined A-optimal sparsity-controlled designs. In the well-posed case it is straightforward to define sparsity-controlled versions of D- and E-designs, but this is more difficult in the ill-posed case. We have used A-designs because they have a natural mean squared error interpretation.

We have tested our methodology on illustrative inverse problems and demonstrated that the proposed approach can substantially improve on naive designs. In ongoing work, we apply similar ideas to the selection of optimal weights for joint inversion problems. We also intend to extend the above framework to the case of nonlinear experimental design.

Acknowledgments

EH and LH were partially supported by NSF grants DMS 0724759, CCF-0728877 and CCF-0427094, and DOE grant DE-FG02-05ER25696. The work of LT was partially funded by NSF grants DMS 0724717 and DMS 0724715. The authors thank Yuval from the Technion Israel Institute of Technology for providing the atmospheric data used in section 5.2.

References

[1] Atkinson A C and Donev A N 1992 Optimum Experimental Designs (Oxford: Oxford University Press)
[2] Bai Z, Fahey M and Golub G 1996 Some large scale matrix computation problems J. Comput. Appl. Math. 74 71–89
[3] Bardow A 2008 Optimal experimental design for ill-posed problems, the METER approach Comput. Chem. Eng. 32 115–24
[4] Bissantz N, Hohage T, Munk A and Ruymgaart F 2007 Convergence rates of general regularization methods for statistical inverse problems and applications SIAM J. Numer. Anal. 45 2610–36
[5] Boyd S and Vandenberghe L 2004 Convex Optimization (Cambridge: Cambridge University Press)
[6] Chaloner K and Verdinelli L 1995 Bayesian experimental design: a review Stat. Sci. 10 237–304
[7] Chang N B and Tseng C C 1999 Optimal design of a multi-pollutant air quality monitoring network in a metropolitan region using Kaohsiung, Taiwan as an example Environ. Monit. Assess. 57 121–48
[8] Curtis A 1999 Optimal experimental design: cross borehole tomographic example Geophys. J. Int. 136 205–15
[9] Donoho D 2006 For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution Commun. Pure Appl. Math. 59 797–829
[10] Golub G and von Matt U 1997 Tikhonov regularization for large scale problems Technical Report SCCM 4-79
[11] Haber E and Tenorio L 2003 Learning regularization functionals—a supervised training approach Inverse Problems 19 611–26
[12] Hutchinson M F 1990 A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines Commun. Stat. Simul. Comput. 19 433–50
[13] Lin C J and Moré J 1999 Newton's method for large bound-constrained optimization problems SIAM J. Optim. 9 1100–27
[14] Maurer H, Boerner D and Curtis A 2000 Design strategies for electromagnetic geophysical surveys Inverse Problems 16 1097–117
[15] Modersitzki J 2004 Numerical Methods for Image Registration (New York: Oxford University Press)
[16] O'Sullivan F 1986 A statistical perspective on ill-posed inverse problems Stat. Sci. 1 502–27
[17] Parker R L 1994 Geophysical Inverse Theory (Princeton, NJ: Princeton University Press)
[18] Pukelsheim F 1993 Optimal Design of Experiments (New York: Wiley)
[19] Routh P G, Oldenborger G and Oldenburg D W 2005 Optimal survey design using the point-spread function measure of resolution Proc. SEG (Houston, TX)
[20] Sarma P, Durlofsky L J, Aziz K and Chen W 2007 A new approach to automatic history matching using kernel PCA SPE Reservoir Simulation Symp. (Houston, TX)
[21] Shepp L A and Logan B F 1974 The Fourier reconstruction of a head section IEEE Trans. Nucl. Sci. NS-21 43–52
[22] Stark P B 2008 Generalizing resolution Inverse Problems 24 034014
[23] Tenorio L 2001 Statistical regularization of inverse problems SIAM Rev. 43 347–66
[24] Versteeg R J 1991 Analysis of the problem of the velocity model determination for seismic imaging PhD Dissertation University of Paris, France
[25] Wahba G 1990 Spline Models for Observational Data (Philadelphia: SIAM)
