+ All Categories
Home > Documents > Asymptotic analysis of the learning curve for Gaussian process regression

Asymptotic analysis of the learning curve for Gaussian process regression

Date post: 24-Jan-2017
Category:
Upload: josselin
View: 221 times
Download: 0 times
Share this document with a friend
27
Mach Learn DOI 10.1007/s10994-014-5437-0 Asymptotic analysis of the learning curve for Gaussian process regression Loic Le Gratiet · Josselin Garnier Received: 10 January 2013 / Accepted: 25 February 2014 © The Author(s) 2014 Abstract This paper deals with the learning curve in a Gaussian process regression frame- work. The learning curve describes the generalization error of the Gaussian process used for the regression. The main result is the proof of a theorem giving the generalization error for a large class of correlation kernels and for any dimension when the number of observations is large. From this theorem, we can deduce the asymptotic behavior of the generalization error when the observation error is small. The presented proof generalizes previous ones that were limited to special kernels or to small dimensions (one or two). The theoretical results are applied to a nuclear safety problem. Keywords Gaussian process regression · Asymptotic mean squared error · Learning curves · Generalization error · Convergence rate 1 Introduction Gaussian process regression is a useful tool to approximate an objective function given some of its observations (Laslett 1994). It has originally been used in geostatistics to interpolate a random field at unobserved locations (Wackernagel 2003; Berger et al. 2001; Gneiting et al. 2010), it has been developed in many areas such as environmental and atmospheric sciences. Editor: Kristian Kersting. L. Le Gratiet Université Paris Diderot, 75205 Paris Cedex 13, France L. Le Gratiet (B ) CEA, DAM, DIF, 91297 Arpajon, France e-mail: [email protected] J. Garnier Laboratoire de Probabilites et Modeles Aleatoires & Laboratoire Jacques-Louis Lions, Universite Paris Diderot, 75205 Paris Cedex 13, France 123
Transcript

Mach LearnDOI 10.1007/s10994-014-5437-0

Asymptotic analysis of the learning curve for Gaussianprocess regression

Loic Le Gratiet · Josselin Garnier

Received: 10 January 2013 / Accepted: 25 February 2014© The Author(s) 2014

Abstract This paper deals with the learning curve in a Gaussian process regression frame-work. The learning curve describes the generalization error of the Gaussian process used forthe regression. The main result is the proof of a theorem giving the generalization error for alarge class of correlation kernels and for any dimension when the number of observations islarge. From this theorem, we can deduce the asymptotic behavior of the generalization errorwhen the observation error is small. The presented proof generalizes previous ones that werelimited to special kernels or to small dimensions (one or two). The theoretical results areapplied to a nuclear safety problem.

Keywords Gaussian process regression · Asymptotic mean squared error ·Learning curves · Generalization error · Convergence rate

1 Introduction

Gaussian process regression is a useful tool to approximate an objective function given someof its observations (Laslett 1994). It has originally been used in geostatistics to interpolate arandom field at unobserved locations (Wackernagel 2003; Berger et al. 2001; Gneiting et al.2010), it has been developed in many areas such as environmental and atmospheric sciences.

Editor: Kristian Kersting.

L. Le GratietUniversité Paris Diderot, 75205 Paris Cedex 13, France

L. Le Gratiet (B)CEA, DAM, DIF, 91297 Arpajon, Francee-mail: [email protected]

J. GarnierLaboratoire de Probabilites et Modeles Aleatoires & Laboratoire Jacques-Louis Lions,Universite Paris Diderot, 75205 Paris Cedex 13, France

123

Mach Learn

This method has become very popular during the last decades to build surrogate modelsfrom noise-free observations. For example, it is widely used in the field of “computer exper-iments” to build models which surrogate an expensive computer code (Sacks et al. 1989).Then, through the fast approximation of the computer code, uncertainty quantification andsensitivity analysis can be performed with a low computational cost.

Nonetheless, for many realistic cases, we do not have direct access to the function to beapproximated but only to noisy versions of it. For example, if the objective function is theresult of an experiment, the available responses can be tainted by measurement noise. Inthat case, we can reduce the noise of the observations by repeating the experiments at thesame locations. Another example is Monte-Carlo based simulators—also called stochasticsimulators—which use Monte-Carlo or Monte-Carlo Markov Chain methods to solve a sys-tem of differential equations through its probabilistic interpretation. For such simulators, thenoise level can be tuned by the number of Monte-Carlo particles used in the procedure.

In this paper, we are interested in obtaining learning curves describing the generalizationerror—defined as the averaged mean squared error—of the Gaussian process regression asa function of the training set size (Rasmussen and Williams 2006). The problem has beenaddressed in the statistical and numerical analysis areas. For an overview, the reader is referredto (Ritter 2000b) for a numerical analysis point of view and to (Rasmussen and Williams2006) for a statistical one. In particular, in the numerical analysis literature, the authors areinterested in numerical differentiation of functions from noisy data (Ritter 2000a; Bozziniand Rossini 2003). They have found very interesting results for kernels satisfying the Sacks–Ylvisaker conditions of order r (Sacks and Ylvisaker 1981) but only valid for 1-D or 2-Dfunctions.

In the statistical literature Sollich and Hallees (2002) give accurate approximations tothe learning curve and Opper and Vivarelli (1999) and Williams and Vivarelli (2000) giveupper and lower bounds on it. Their approximations give the asymptotic value of the learningcurve (for a very large number of observations). They are based on the Woodbury–Sherman–Morrison matrix inversion lemma (Harville 1997) which holds in finite-dimensional caseswhich correspond to degenerate covariance kernels in our context. Nonetheless, classical ker-nels used in Gaussian process regression are non-degenerate and we hence are in an infinite-dimensional case and the Woodbury–Sherman–Morrison formula cannot be used directly.

To deal with asymptotics of Gaussian process learning curves for more general kernels,some authors have used other definitions of the generalization error. For example, Seeger et al(2008) present consistency results and convergence rates for cumulative log loss of Bayesianprediction. Then, their work is revisited by van der Vaart and van Zanten (2011) who suggestand study another risk which is an upper bound for the one presented in (Seeger et al (2008)).

The main result of this paper is the proof of a theorem giving the value of the Gaussianprocess regression mean squared error (MSE) for a large training set size when the observationnoise variance is proportional to the number of observations. This value is given as a functionof the eigenvalues and eigenfunctions of the covariance kernel. From this theorem, we candeduce an approximation of the learning curve for non-degenerate and degenerate kernels[which generalizes the proofs given in (Opper and Vivarelli 1999; Sollich and Halees 2002;Picheny 2009)] and for any dimension [which generalizes the proofs given in (Ritter 2000a,b;Bozzini and Rossini 2003)].

The rate of convergence of the best linear unbiased predictor (BLUP) is of practical interestsince it provides a powerful tool for decision support. Indeed, from an initial experimentaldesign set, it can predict the additional computational budget (defined as the number ofexperiments including repetitions) necessary to reach a given desired accuracy.

123

Mach Learn

The paper is organized as follows. First we present the asymptotic framework consideredin this paper in Sect. 2. Although the main results of the paper are theoretical contributions,an application is provided in order to emphasize the possible implications for real-worldproblems. Second, we present in Sect. 3 the main result of the paper which is the theoremgiving the MSE of the considered model for a large training size. This theorem is proved inSect. 4. Third, we study the rate of convergence of the generalization error when the noisevariance decreases in Sect. 5. The theoretical asymptotic rates of convergences are comparedto the obtained ones in a numerical simulations and academic examples. Furthermore, astudy on how large the training set size should be for the asymptotic formulas to agreewith the numerical ones is provided for the specific case of the Brownian motion. Finally, anindustrial application to the safety assessment of a nuclear system containing fissile materialsis considered in Sect. 6. This real case emphasizes the effectiveness of the theoretical rate ofconvergence of the BLUP since it predicts a very good approximation of the budget neededto reach a prescribed precision.

2 Generalization error for noisy observations

The general framework of the paper is given in this section. First, the mathematical for-malism on which the theoretical developments are based is presented. Then, the consideredapplication is introduced. Finally, the bridge between the theoretical developments and theapplication is given.

2.1 Asymptotic framework for the analysis of the generalization error

Let us suppose that we want to approximate an objective function x ∈ Rd → f (x) ∈ R from

noisy observations of it at points (xi )i=1,...,n with xi ∈ Rd . The points of the experimental

design set (xi )i=1,...,n are supposed to be sampled from the probability measure μ over Rd .

μ is called the design measure, it can have either a compact support (for a bounded inputparameter space domain) or unbounded support (for unbounded input parameter space). Wehence have n observations of the form zi = f (xi ) + εi and we consider that (εi )i=1,...,n areindependently sampled from the Gaussian distribution with mean zero and variance nτ :

ε ∼ N (0, nτ). (1)

with τ a positive constant. Note that the number of observations and the observation noisevariance are both controlled by n. A noise additive in the number of observations is one ofthe main assumptions of this article. Intuitively, it allows for controlling the convergence ofthe generalization error when n tends to infinity by increasing the regularization. However,this is also the main limitation of the paper since the noise variance is generally independentof the number of observations. As presented in Sect. 2.3, this assumption is suitable for theparticular cases of stochastic simulators or experiments with repetitions when the number ofsimulations or experiments is fixed. The issue of the convergence of the generalization errorfor more general cases is still an open problem.

The main idea of the Gaussian process regression is to suppose that the objective functionf (x) is a realization of a Gaussian process Z(x) with a known mean and a known covariancekernel k(x, x ′). The mean can be considered equal to zero without loss of generality. Then,denoting by zn = [ f (xi )+εi ]1≤i≤n the vector of length n containing the noisy observations,we choose as predictor the BLUP given by the equation:

f̂ (x) = k(x)T (K + nτ I )−1zn, (2)

123

Mach Learn

where k(x) = [k(x, xi )]1≤i≤n is the n-vector containing the covariances between Z(x) andZ(xi ), 1 ≤ i ≤ n, K = [k(xi , x j )]1≤i, j≤n is the n × n-matrix containing the covariancesbetween Z(xi ) and Z(x j ), 1 ≤ i, j ≤ n and I the n × n identity matrix. We note herethat the unbiasedness means that E[ f̂ (x)] = E[Z(x)] where E stands for the expectationwith respect to the distribution of the Gaussian process Z(x) and the noise ε. The BLUPminimizes the MSE which equals:

σ 2(x) = k(x, x) − k(x)T (K + nτ I )−1k(x). (3)

Indeed, if we consider a linear unbiased predictor (LUP) of the form a(x)T zn , its MSE isgiven by:

E

[(Z(x) − aT (x)Zn)2

]= k(x, x) − 2a(x)T k(x) + a(x)T (K + nτ I )a(x), (4)

where Zn = [Z(xi )+ εi ]1≤i≤n . The value of a(x) minimizing (4) is aopt(x)T = k(x)T (K +nτ I )−1. Therefore, the BLUP given by aopt(x)T zn is equal to (2) and by substituting a(x)

with aopt(x) in Eq. (4) we obtain the MSE of the BLUP given by Eq. (3).The main focus of this paper is the asymptotic value of σ 2(x) when n → +∞. From

it, we can deduce the asymptotic value of the integrated mean squared error (IMSE)—alsocalled learning curve or generalization error—when n → +∞. The IMSE is defined by:

IMSE =∫

Rd

σ 2(x) dμ(x), (5)

where μ is the design measure of the input space parameters.The obtained asymptotic value has already be mentioned in several works (Rasmussen

and Williams 2006; Ritter 2000b,a; Bozzini and Rossini 2003; Opper and Vivarelli 1999;Sollich and Halees 2002; Picheny 2009). The original contribution of this paper is a rigorousproof of this result.

2.2 Introduction to stochastic simulators

We present in this section the industrial application studied in Sect. 6.2. A stochastic simulatoris a computer code which solves a system of partial differential equations with Monte-Carlomethods. It has the particularity to provide noisy observations centered on the true solutionof the system. Stochastic simulators are widely used in the field of nuclear physics to solvetransport equations and model systems containing fissile materials (e.g. nuclear reactors,storages of fissile materials, spacecraft reactors). In this paper, we are interested in a storageof dry Plutonium(IV) oxide (PuO2) used as fuel for nuclear reactors or several spacecrafts.As the PuO2 is highly toxic, the safety assessment of such storages is of great importance.

One of the most important factors used to assess the safety of a system containing fissilematerials is the neutron multiplication coefficient usually denoted by keff. It is the aver-age number of neutrons from one fission that cause another fission. This factor models thecriticality of a chain nuclear reaction:

– keff > 1 leads to an uncontrolled chain reaction due to an increasing neutron population.– keff = 1 leads to a self-sustained chain reaction with a stable neutron population.– keff < 1 leads to a faded chain reaction due to an decreasing neutron population.

The neutron multiplication factor is evaluated using the stochastic simulator calledMORET (Fernex et al. 2005). It depends on many parameters. However, we only focushere on the following quantities:

123

Mach Learn

– dPuO2 ∈ [0.5, 4]g.cm−3, the density of the fissile powder. It is scaled to [0, 1].– dwater ∈ [0, 1]g.cm−3, the density of water between storage tubes.

We use the notation x = (dPuO2 , dwater) for the input parameters. Let us denote by(Y j (x)) j=1,...,s the output of the MORET code at point x . (Y j (x)) j=1,...,s are realizations ofindependent and identically distributed random variables centered on keff(x). They are them-selves obtained by an empirical mean of a Monte-Carlo sample of 4000 particles. From theseparticles, we can estimate the variance σ 2

ε of the observation Y j (x) by a classical empiricalestimator.

Finally we can estimate keff(x) from the following quantity:

keff,s(x) = 1

s

s∑j=1

Y j (x).

Therefore, the variance of an observation keff,s(x) equals σ 2ε /s.

2.3 Relation between the application and the considered mathematical formalism

Let us consider that we want to approximate the function x ∈ Rd → f (x) ∈ R from noisy

observations at points (xi )i=1,...,n sampled from the design measure μ and with s replicationsat each point. We hence have ns data of the form zi, j = f (xi ) + εi, j and we consider that(εi, j )i=1,...,n

j=1,...,sare independently distributed from a Gaussian distribution with mean zero

and variance σ 2ε . Then, denoting the vector of observed values by zn = (zn

i )i=1,...,n =(∑s

j=1 zi, j/s)i=1,...,n , the variance of an observation zni is σ 2

ε /s. We recognize here the output

form given in Sect. 2.2. Thus, if we consider a fixed budget T = ns, we have σ 2ε /s = nτ

with τ = σ 2ε /T and the observation noise variance is proportional to n (as presented in Sect.

2.1). It means that if we increase the number n of observations, we automatically increase theuncertainty on the observations. An observation noise variance proportional to n is natural inthe framework of experiments with repetitions or stochastic simulators. Indeed, for a fixednumber of experiments (or simulations), the user can decide to perform them in few pointswith many repetitions (in that case the noise variance will be low) or to perform them inmany points with few repetitions (in that case the noise variance will be large).

We note that increasing n with a fixed τ is an idealized asymptotic setting since it wouldrequire that the number of replications s could tend to zero while it has to be a positive integer.However, this issue can be tackled in practice since for real applications n is finite and one hasjust to take a budget T such that T ≥ n (i.e. s ≥ 1). This is a first limitation of the suggestedmethod since it cannot be used for small budget (i.e. when T < n). A second one is theassumption that s does not depend on xi . Indeed, a uniform allocation could not be optimal.In this case, finding the optimal sequence {s1, s2, . . . , sn} leading to the minimal error is ofpractical interest. However, the corresponding observation noise variance will depend on xi

which means that τi will depend on xi as well. In this case, the presenting results do nothold. Nevertheless, they can be used to provide an upper bound for the convergence of thegeneralization error by considering the worst case τ = maxi τi .

The objective of the industrial example presented in Sect. 6.2 is to determine the budgetT required to reach a prescribe accuracy ε̄. To deal with this issue, we first build a Gaussianprocess regression model from an initial budget T0 and a large number of observations n.Then, from the results on the learning curve, we deduce the budget T such that the IMSEequals ε̄.

123

Mach Learn

3 Convergence of the learning curve for Gaussian process regression

This section deals with the convergence of the BLUP when the number of observations islarge. The rate of convergence of the BLUP is evaluated through the generalization error—i.e.the IMSE—defined in (5). The main theorem of this paper follows:

Theorem 1 Let us consider Z(x) a Gaussian process with zero mean and covariance kernelk(x, x ′) ∈ C0(Rd ×R

d) and (xi )i=1,...,n an experimental design set of n independent randompoints sampled with the probability measure μ on R

d . We assume that supx∈Rd k(x, x) < ∞.According to Mercer’s theorem (Mercer 1909), we have the following representation ofk(x, x ′):

k(x, x ′) =∑p≥0

λpφp(x)φp(x ′), (6)

where (φp(x))p is an orthonormal basis of L2μ(Rd) (denoting the set of square integrable

functions) consisting of eigenfunctions of (Tμ,k f )(x) = ∫Rd k(x, x ′) f (x ′)dμ(x ′) and λp is

the nonnegative sequence of corresponding eigenvalues sorted in decreasing order. Then, fora non-degenerate kernel—i.e. when λp > 0,∀p > 0—we have the following convergence inprobability for the MSE (3) of the BLUP:

σ 2(x)n→∞−→

∑p≥0

τλp

τ + λpφp(x)2. (7)

For degenerate kernels—i.e. when only a finite number of λp are not zero—the convergenceis almost sure. We note that we have the convergences with respect to the distribution of thepoints (xi )i=1,...,n of the experimental design set.

The proof of Theorem 1 is given in Sect. 4.

Remark For non-degenerate kernels such that ||φp(x)||L∞ < ∞ uniformly in p, the con-vergence is almost sure. Some kernels such as the one of the Brownian motion satisfy thisproperty.

The following theorem gives the asymptotic value of the learning curve when n is large.

Theorem 2 Let us consider Z(x) a Gaussian process with known mean and covariance ker-nel k(x, x ′) ∈ C0(Rd ×R

d) such that supx∈Rd k(x, x) < ∞ and (xi )i=1,...,n an experimentaldesign set of n independent random points sampled with the probability measure μ on R

d .Then, for a non-degenerate kernel, we have the following convergence in probability:

IMSEn→∞−→

∑p≥0

τλp

τ + λp. (8)

For degenerate kernels, the convergence is almost sure.

Proof From Theorem 1 and the orthonormal property of the basis (φp(x))p in L2μ(R), the

proof of the theorem is straightforward by integration. We note that we can permute theintegral and the limit thanks to the dominated convergence theorem since σ 2(x) ≤ k(x, x). �

A strength of Theorem 2 is that it allows for obtaining the rate of convergence of the learn-ing curve even when the eigenvalues (λp)p≥0 are not explicit. Indeed, as presented in Sect.5.2, this rate can be deduced from the asymptotic behavior of λp for large p. Furthermore,this asymptotic behavior is known for usual kernels (fractional Brownian kernel, Matérncovariance kernel, Gaussian covariance kernel, …). However, this is also a limitation sinceit could be unknown for general covariance kernels.

123

Mach Learn

3.1 Discussion

The limit obtained is identical to the one presented in (Rasmussen and Williams 2006) Sect.7.3 Eq. (7.26) for a degenerate kernel. Furthermore, the limit in Eq. (8) corresponds to theaverage bound given for degenerate kernels in (Opper and Vivarelli 1999) in Sect. 6 Eq. (17)with the correspondence τ = σ 2/n. In particular, they prove that it is a lower bound for thegeneralization error and an upper bound for the training error. The training error is definedas the empirical mean

∑ni=1 σ 2(xi )/n where (xi )i=1,...,n are the design points. They also

note that this bound should be exact for the asymptotic n large since the sum∑n

i=1 σ 2(xi )/napproaches to the IMSE asymptotically. Moreover, they numerically observed that this boundis relevant for a Gaussian covariance kernel (Opper and Vivarelli 1999), Eq. (18) which is anon-degenerate kernel. The work of Opper and Vivarelli is also investigated in (Williams andVivarelli 2000; Sollich and Halees 2002; Picheny 2009). In particular, a proof of Theorem1 is given for degenerate kernels and the relevance of the bound is illustrated on numericalexamples using non-degenerate kernels [e.g. Gaussian covariance kernel and exponentialkernel (Rasmussen and Williams 2006)].

We note that the proof of Theorem 1 for non-degenerate kernels is of interest since theusual kernels for Gaussian process regression are non-degenerate and we will exhibit dramaticdifferences between the learning curves of degenerate and non-degenerate kernels.

4 Proof of Theorem 1

We present in this section the proof of Theorem 1. The aim is to find the asymptotic valueof the MSE σ 2(x) (3) when n tends to the infinity. The principle of the proof is to find anupper bound and a lower bound for σ 2(x) which converge to the same quantity. One of themain ideas of the proof is to use the fact that in a Gaussian process regression framework weconsider the BLUP, i.e. the one which minimizes the MSE. Therefore, for a given Gaussianprocess modeling the function f (x), any LUP has a larger MSE. Furthermore, to providea lower bound for σ 2(x), we use the result presented in Theorem 1 for degenerate kernels.Therefore, we start the proof by presenting the degenerate case.

4.1 The degenerate case

The proof in the degenerate case follows the lines of the ones given by (Opper and Vivarelli1999; Rasmussen and Williams 2006; Picheny 2009). For a degenerate kernel, the number p̄ ofnon-zero eigenvalues is finite. Let us denote Λ = diag(λi )1≤i≤ p̄, φ(x) = (φ1(x), . . . , φ p̄(x))

and Φ = (φ(x1)T , . . . , φ(xn)T )T . The MSE of the Gaussian process regression (3) is given

by:

σ 2p̄(x) = φ(x)Λφ(x)T − φ(x)ΛΦT

(ΦΛΦT + nτ I

)−1ΦΛφ(x)T .

Thanks to the Woodbury–Sherman–Morrison formula1 and according to (Opper andVivarelli 1999; Picheny 2009) the Gaussian process regression error can be written:

σ 2p̄(x) = φ(x)

(ΦT Φ

nτ+ Λ−1

)−1

φ(x)T .

1 If B is a non-singular p × p matrix, C a non-singular m × m matrix and A a m × p matrix with m, p < ∞,then (B + AC−1 A)−1 = B−1 − B−1 A(AT B−1 A + C)−1 AT B−1.

123

Mach Learn

Since p̄ is finite, by the strong law of large numbers, the p̄ × p̄ matrix ΦT /n convergesalmost surely as n → ∞. Therefore, we have the almost sure convergence:

σ 2p̄(x)

n→∞−→∑p≤ p̄

τλp

τ + λpφp(x)2. (9)

4.2 The lower bound for σ 2(x)

The objective is to find a lower bound for the MSE σ 2(x) (3) for non-degenerate kernels.If we denote by ai (x) the coefficients of the BLUP f̂ (x) associated to Z(x)—i.e. f̂ (x) =∑ni=1 ai (x)(Z(xi ) + εi ), the MSE can be written:

σ 2(x) = E

⎡⎣(

Z(x) −n∑

i=1

ai (x)(Z(xi ) + εi )

)2⎤⎦ .

Let us consider the Karhunen-Loève decomposition of Z(x) = ∑p≥0 Z p

√λpφp(x)

where (Z p)p is a sequence of independent Gaussian random variables with mean zeroand variance 1 and λp > 0 for all p ∈ N

∗. Therefore, we have the equalities E[Z p

] =0, E

[Z2

p

]= 1 and E

[Z p Zq

] = 0 when p = q . Then, the MSE equals:

σ 2(x) = E

⎡⎢⎣⎛⎝∑

p≥0

√λp

(φp(x) −

n∑i=1

ai (x)φp(xi )

)Z p

⎞⎠

2⎤⎥⎦ + nτ

n∑i=1

ai (x)2

=∑p≥0

λp

(φp(x) −

n∑i=1

ai (x)φp(xi )

)2

+ nτ

n∑i=1

ai (x)2.

Then, for a fixed p̄, the following inequality holds:

σ 2(x) ≥∑p≤ p̄

λp

(φp(x) −

n∑i=1

ai (x)φp(xi )

)2

+ nτ

n∑i=1

ai (x)2 = σ 2LU P, p̄(x). (10)

σ 2LU P, p̄(x) is the MSE of the LUP of coefficients ai (x) associated to the Gaussian process

Z p̄(x) = ∑p≤ p̄ Z p

√λpφp(x) and noisy observations with variance nτ . Let us consider

σ 2p̄(x) the MSE of the BLUP of Z p̄(x), we have the following inequality:

σ 2LU P, p̄(x) ≥ σ 2

p̄(x). (11)

Since Z p̄(x) has a degenerate kernel, the almost sure convergence given in Eq. (9) holdsfor σ 2

p̄(x). Then, considering inequalities (10) and (11) and the convergence (9), we obtain:

lim infn→∞ σ 2(x) ≥ ∑p≤ p̄ τλp/

(τ + λp

)φp(x)2. Taking the limit p̄ → ∞ gives the fol-

lowing lower bound:

lim infn→∞ σ 2(x) ≥

∑p≥0

τλp

τ + λpφp(x)2. (12)

123

Mach Learn

4.3 The upper bound for σ 2(x)

The objective is to find an upper bound for σ 2(x). Since σ 2(x) is the MSE of the BLUPassociated to Z(x), if we consider any other LUP associated to Z(x), then the correspondingMSE denoted by σ 2

LU P (x) satisfies the following inequality:

σ 2(x) ≤ σ 2LU P (x).

The idea is to find a LUP so that its MSE is a sharp upper bound of σ 2(x). We considerthe following LUP:

f̂LU P (x) = k(x)T Azn, (13)

with A the n × n matrix defined by A = L−1 + ∑qk=1(−1)k(L−1 M)k L−1 with L =

nτ I + ∑p<p∗ λp[φp(xi )φp(x j )]1≤i, j≤n, M = ∑

p≥p∗ λp[φp(xi )φp(x j )]1≤i, j≤n, q a finiteinteger and p∗ such that λp∗ < τ .

The choice of the LUP (13) is motivated by the fact that the matrix A is an approximationof the inverse of the matrix (nτ I + K ) that is tractable in the calculations. Indeed, wehave (nτ I + K ) = L + M and thus (nτ I + K )−1 = L−1(I + L−1 M)−1. Then, theterm (I + L−1 M)−1 is approximated with the sum

∑qk=1(−1)k(L−1 M)k . We note that the

condition p∗ such that λp∗ < τ is used to control the convergence of this sum when q tendsto the infinity.

The MSE of the LUP (13) is given by:

σ 2LU P (x) = k(x, x) − k(x)T (2A − A(nτ I + K )A) k(x),

and by substituting the expression of A into the previous equation we obtain:

σ 2LU P (x) = k(x, x) − k(x)T L−1k(x) −

2q+1∑i=1

(−1)i k(x)T (L−1 M)i L−1k(x). (14)

The rest of the proof consists in finding the asymptotic values of the terms present in theexpression of σ 2

LU P (x).First, we deal with the term k(x)T L−1k(x) with the following lemma proved in Appendix.

Lemma 1 Let us consider the term k(x)T L−1k(x) in Eq. (14). The following convergenceholds:

k(x)T L−1k(x)n→∞−→

∑p≤p∗

λ2p

λp + τφp(x)2 + 1

τ

∑p>p∗

λ2pφp(x)2. (15)

Second, let us consider the term∑2q+1

i=1 (−1)i k(x)T (L−1 M)i L−1k(x). We have the fol-lowing equality:

k(x)T (L−1 M)i L−1k(x) = ∑ij=0

(ij

)1

nτk(x)T

( Mnτ

) j(− L ′ M

(nτ)2

)i− jk(x)

−k(x)T( M

) j(− L ′ M

(nτ)2

)i− jL ′

(nτ)2 k(x),

(16)

with L ′ = Φp∗(

ΦTp∗Φp∗nτ

+ Λ−1)−1

ΦTp∗ = ∑

p,p′≤p∗ d(n)

p,p′ [φp(xi )φp(x j )]1≤i, j≤n and

d(n)

p,p′ =[(

ΦTp∗Φp∗nτ

+ Λ−1)−1

]

p,p′. Since q < ∞, we can obtain the convergence in

123

Mach Learn

probability of∑2q+1

i=1 (−1)i k(x)T (L−1 M)i L−1k(x) from the ones of:

k(x)T 1

n

(M

n

) j ( L ′Mn2

)i− j

k(x), (17)

and:

k(x)T(

M

n

) j ( L ′Mn2

)i− j L ′

n2 k(x), (18)

with i ≤ 2q + 1 and j ≤ i . We first study the convergence of the term (17) for i < j and theterm (18) for i ≤ j . Then, we study the convergence of (17) for i = j . We have the followinglemma proved in Appendix:

Lemma 2 For i < j we have the following convergence when n → ∞:

k(x)T 1

n

(M

n

) j ( L ′Mn2

)i− j

k(x)Pμ−→ 0, (19)

and for i ≤ j the following one holds:

k(x)T 1

n

(M

n

) j ( L ′Mn2

)i− j L ′

n2 k(x)Pμ−→ 0. (20)

We note that the convergences presented in Lemma 2 hold in probability. Then, we havethe following lemma proved in Appendix:

Lemma 3 The following convergence holds when n → ∞:

1

nk(x)T

(M

n

)i

k(x)Pμ−→

∑p>p∗

λi+2p φp(x)2. (21)

From the convergences (19) and (20) and thanks to the equality (16), we deduce thefollowing convergence when n → ∞:

k(x)T (L−1 M

)iL−1k(x) − 1

nτ i+1 k(x)T(

M

n

)i

k(x)Pμ−→ 0.

Then, using the convergence (21) we obtain when n → ∞:

k(x)T (L−1 M)i L−1k(x)Pμ−→

(1

τ

)i+1 ∑p>p∗

λi+2p φp(x)2. (22)

From the Eq. (14) and the convergences (15) and (22), we obtain the following convergencewhen n → ∞:

σ 2LU P (x)

Pμ−→∑p≤p∗

(λp − λ2

p

τ + λp

)φp(x)2 +

∑p>p∗

λpφp(x)2

+∑p>p∗

λpφp(x)22q+1∑i=0

(−1)i+1(

1

τ

)i+1

λi+1p .

123

Mach Learn

From classical results about the sum of geometric series, we have:

σ 2LU P (x)

Pμ−→∑p≥0

(λp − λ2

p

τ + λp

)φp(x)2 −

∑p>p∗

λ2p

(λpτ

)2q+1

τ + λpφp(x)2. (23)

By considering the limit q → ∞ and the inequality λp∗ < τ , we obtain the followingupper bound for σ 2(x):

lim supn→∞

σ 2(x) ≤∑p≥0

τλp

τ + λpφp(x)2. (24)

The result announced in Theorem 1 is deduced from the lower and upper bounds (12) and(24).

5 Examples of rates of convergence for the learning curve

5.1 Numerical study on the assumptions of Theorem 2

Theorem 2 gives the asymptotic value of the IMSE (5) when the number of observationsn increases. The aim of this section is to determine when the assumptions of Theorem 2hold—i.e. to find the critical number of observations n beyond which, for a given τ and agiven covariance kernel k, the sum in (8) is a sharp approximation of the IMSE. To performsuch a study, we consider a Brownian kernel k(x, y) = x + y −|x − y| with x, y ∈ [0, 1], τ ∈{0.001, 0.01, 0.1} and a uniform measure μ on [0, 1]. The eigenvalues of k are the followingones (Bronski 2003):

λp = 1

(p + 1/2)2π2 , p ∈ N.

Therefore, for a given τ , we can explicitly obtain the value of the sum presented in (8) andcompare it with an empirical estimation of the IMSE. This empirical estimation is obtainedby considering the MSE (3) built from n points randomly spread into the interval [0, 1] andby estimating the integral (5) with a numerical integration. Furthermore, for each pair (τ, n)

we repeat this procedure 100 times in order to obtain an empirical estimator and confidenceintervals for the value of the IMSE. The results of this procedure are presented in Fig. 1.

Figure 1 represents the ratio between the value IMSE for a given n and the asymptoticvalue IMSE∞ given by (8). For large n this ratio is close to one. This allows for representingthe convergence of the IMSE to its asymptotic value.

We observe in Fig. 1 that the convergence is effective for n < 100 for all values of τ . Theconvergence is robust for small values of τ : the asymptotic value (8) is a good approximationof the IMSE if n ≥ 5 for τ = 10−1, if n ≥ 20 for τ = 10−2 and if n ≥ 60 for τ = 10−3. Thiscorresponds approximately to the threshold values nτ = 0.5 for IMSE∞ = 0.1575, nτ = 0.2for IMSE∞ = 0.05 and nτ = 0.06 for IMSE∞ = 0.0.0158 ; or globally to nτ ≈ 4IMSE∞.

This highlights the relevance of the asymptotic value of the IMSE given in Theorem 2.However, in general we do not have an explicit expression for the eigenvalues of a covariancekernel. In this case, we can obtain the asymptotic expression of IMSE∞ for small τ from theasymptotic behavior of the eigenvalues (λp)p≥0 for large p. We deal with this issue in thenext subsection.

123

Mach Learn

20 40 60 80 100

01

23

45

67

τ = 0.001 ; IMSE∞ = 0.0158

n

Nor

mal

ized

IMS

E

5% and 95% C.I. Mean

20 40 60 80 100

01

23

45

67

τ = 0.01 ; IMSE∞ = 0.05

nN

orm

aliz

ed IM

SE

5% and 95% C.I. Mean

20 40 60 80 100

01

23

45

67

τ = 0.1 ; IMSE∞ = 0.1575

n

Nor

mal

ized

IMS

E

5% and 95% C.I. Mean

Fig. 1 Comparison between the IMSE for different n and the theoretical asymptotic value IMSE∞ givenby the sum (8). The ratio IMSE/IMSE∞ is plotted as a function of n for three values of τ . For each pair(τ, n) 100 approximations of IMSE are evaluated from design points randomly spread on [0, 1]. From themthe empirical mean (represented by the dashed lines) and the 5 % and 95 % confidence intervals (representedby the dotted lines) of the ratio IMSE/IMSE∞ are evaluated

5.2 Rate of convergence for some usual kernels

Theorem 2 gives the asymptotic value of the generalization error as a function of the eigen-values of the covariance kernel. However, this asymptotic value is hard to handle since theexpression of the eigenvalues is rarely known. To deal with this problem, we introduce inProposition 1 a quantity Bτ which has the same rate of convergence of the asymptotic valueof the generalization error and which is tractable for our purpose.

Proposition 1 Let us denote IMSE∞ = limn→∞ IMSE. The following inequality holds:

1

2Bτ ≤ IMSE∞ ≤ Bτ , (25)

123

Mach Learn

with Bτ = ∑p s.t. λp≤τ λp + τ#

{p s.t. λp > τ

}.

Proof The proof is directly deduced from Theorem 2 and the following inequality:

1

2hτ (x) ≤ x

x + τ≤ hτ (x),

with:

hτ (x) ={

x/τ x ≤ τ

1 x > τ.

Proposition 1 shows that the rate of convergence of the generalization error IMSE∞ as afunction of τ is equivalent to the one of Bτ . In this section, we analyze the rate of convergenceof IMSE∞ (or equivalently Bτ ) when τ is small.

In this section, we consider that the design measure μ is uniform on [0, 1]d .

Example 2 (Degenerate kernels) For degenerate kernels we have #{

p s.t. λp > 0}

< ∞.Thus, when τ → 0, we have:

∑p s.t. λp<τ

λp = 0,

from which:

Bτ ∝ τ. (26)

Therefore, the IMSE decreases with τ . We find here a classical result about Monte-Carloconvergence which gives that the variance decay is proportional to the observation noisevariance (nτ ) divided by the number of observations (n) for any dimension. Nevertheless,for non-degenerate kernels, the number of non-zero eigenvalues is infinite and we are hencein an infinite-dimensional case (contrarily to the degenerate one). We see in the followingexamples that we do not conserve the usual Monte-Carlo convergence rate in this case whichemphasizes the importance of Theorem 1 dealing with non-degenerate kernels.

Example 3 (The fractional Brownian motion) Let us consider the fractional Brownian kernelwith Hurst parameter H ∈ (0, 1):

k(x, y) = x2H + y2H − |x − y|2H . (27)

The associated Gaussian process—called fractional Brownian motion—is Hölder contin-uous with exponent H − ε,∀ε > 0. According to (Bronski 2003), we have the followingresult:

Lemma 4 The eigenvalues of the fractional Brownian motion with Hurst exponent H ∈(0, 1) satisfy the behavior

λp = νH

p2H+1 + o(

p− (2H+2)(4H+3)4H+5 +δ

), p � 1,

where δ > 0 is arbitrary, νH = sin(π H) (2H+1)

π2H+1 , and is the Euler Gamma function.

123

Mach Learn

Therefore, when τ � 1, we have:

λp < τ if p >(νH

τ

) 12H+1

.

We hence have the following approximation for Bτ :

Bτ ≈∑

p>( νH

τ

) 12H+1

νH

p2H+1 + τ(νH

τ

) 12H+1

.

Furthermore, we have:

p>( νH

τ

) 12H+1

νH

p2H+1 ≈+∞∫

( νHτ

) 12H+1

νH

x2H+1 dx = νH

2H(

νHτ

)1− 12H+1

,

from which:

Bτ ≈ CH τ 1− 12H+1 , τ � 1, (28)

where CH is a constant independent of τ .The rate of convergence for a fractional Brownian motion with Hurst parameter H is

τ 1− 12H+1 . We note that the case H = 1/2 corresponds to the classical Brownian motion. We

observe that the larger the Hurst parameter is (i.e. the more regular the Gaussian process is),the faster the convergence is. Furthermore, for H → 1 the convergence rate gets close toτ 2/3. Therefore, even for the most regular fractional Brownian motion, we are still far fromthe classical Monte-Carlo convergence rate.

Example 4 (The 1-D Matérn covariance kernel) In this example we deal with the Matérnkernel with regularity parameter ν > 0 in dimension 1:

k1D(x, x ′; ν, l) = 21−ν

Γ (ν)

(√2ν|x − x ′|

l

(√2ν|x − x ′|

l

), (29)

where Kν is the modified Bessel function (Abramowitz and Stegun 1965). The eigenvaluesof this kernel satisfy the following asymptotic behavior (Nazarov and Nikitin 2004):

λp ≈ 1

p2(ν+1/2), p � 1.

Following the guideline of the Example 3 we deduce the following asymptotic behaviorfor Bτ :

Bτ ≈ Cντ1− 1

2(ν+1/2) , τ � 1, (30)

where Cν is a constant independent of τ .This result is in agreement with the one of Ritter (2000a) who proved that for 1-dimensional

kernels satisfying the Sacks–Ylvisaker of order r conditions (where r is an integer), thegeneralization error for the best linear estimator and experimental design set strategy decays as

τ 1− 12r+2 . Indeed, for such kernels, the eigenvalues satisfy the large-p behavior λp ∝ 1/p2r+2

(Rasmussen and Williams 2006) and by following the guideline of the previous examples wefind the same convergence rate. We note that the Matérn kernel with parameter ν = r + 1/2satisfies the Sacks–Ylvisaker of order r conditions.

123

Mach Learn

Example 5 (The d-D tensorial Matérn covariance kernel) We focus here on the d-dimensionaltensorial Matérn kernel with isotropic regularity parameter ν > 1

2 . According to Pusev (2011)the eigenvalues of this kernel satisfy the asymptotics:

λp ≈ φ(p), p � 1,

where the function φ is defined by:

φ(p) = log(1 + p)2(d−1)(ν+1/2)

p2(ν+1/2).

Its inverse φ−1 satisfies:

φ−1(ε) = ε− 1

2(ν+1/2)

(log

(ε− 1

2(ν+1/2)

))d−1(1 + o(1)), ε � 1.

We hence have the approximation:

Bτ ≈ 2(ν + 1/2) − 1

φ−1 (τ )2(ν+1/2)−1

log(1 + φ−1 (τ )

)2(d−1)(ν+1/2) + τφ−1 (τ ) .

We can deduce the following rate of convergence for Bτ :

Bτ ≈ C(ν+1/2),dτ1− 1

2(ν+1/2) log (1/τ)d−1 , τ � 1, (31)

with Cν,d a constant independent of τ .

Example 6 (The d-D Gaussian covariance kernel) According to Todor, 2006 the asymptoticbehavior of the eigenvalues for a Gaussian kernel is:

λp � exp(−p

1d

).

Applying the procedure presented in the previous examples, it can be shown than the rateof convergence of the IMSE is bounded by:

Cdτ log (1/τ)d , τ � 1, (32)

with Cd a constant independent of τ .

Remark We can see from the previous examples that for smooth kernels, the convergencerate is close to τ , i.e. the classical Monte-Carlo rate.

5.3 Numerical examples

We compare the previous theoretical results on the rate of convergence of the generalizationerror with full numerical simulations. In order to observe the asymptotic convergence, wefix n = 200 and we consider 1/τ varying from 50 to 1000. The experimental design sets aresampled from a uniform measure on [0, 1] and the observation noise is nτ . To estimate theIMSE (5) we use a trapezoidal numerical integration with 4000 quadrature points over [0, 1].Furthermore, to build the convergence curves in Figs. 2 and 3 we use a linear regression withthe first value of the IMSE, an intercept fixed to zero (since the IMSE tends to 0 when τ

tends to 0) and a unique explanatory variable corresponding to the tested convergence (e.g.τ 0.1, τ log(1/τ), …).

First, we deal with the 1-D fractional Brownian kernel (27) with Hurst parameter H . We

have proved that for large n, the IMSE decays as τ 1− 12H+1 . Figure 2 compares the numerically

estimated convergences to the theoretical ones.

123

Mach Learn

200 400 600 800

0.00

0.02

0.04

0.06

1 τ

IMS

E

IMSE ~ τ0.1

IMSE ~ τ0.3

IMSE ~ τ0.5

IMSE ~ τ0.7

IMSE ~ τ0.9

200 400 600 800

0.00

00.

005

0.01

00.

015

0.02

00.

025

1 τ

IMS

E

IMSE ~ τ0.1

IMSE ~ τ0.3

IMSE ~ τ0.5

IMSE ~ τ0.64

IMSE ~ τ0.9

Fig. 2 Rate of convergence of the IMSE when the level of observation noise decreases for a fractionalBrownian motion with Hurst parameter H = 0.5 (left) and H = 0.9 (right). The number of observations isn = 200 and the observation noise variance is nτ with 1/τ varying from 50 to 1000. The triangles representthe numerically estimated IMSE, the solid line represents the theoretical convergence, and the other non-solidlines represent various convergence rates

We see in Fig. 2 that the observed rate of convergence is perfectly fitted by the theoreticalone. We note that we are far from the classical Monte-Carlo rate since we are not in anon-degenerate case.

Finally, we deal with the 2-D tensorial Matérn-5/2 kernel and the 1-D Gaussian kernel.The 1-dimensional Matérn-ν class of covariance functions k1D(t, t ′; ν, θ) is given by (29)and the 2-D tensorial Matérn-ν covariance function is given by:

123

Mach Learn

Fig. 3 Rate of convergence ofthe IMSE when the level ofobservation noise decreases for a2-D tensorial Matérn-5/2 kernelon the left hand side and for a1-D Gaussian kernel on the righthand side. The number ofobservations is n = 200 and theobservation noise variance is nτ

with 1/τ varying from 100 to1000. The triangles represent thenumerically estimated IMSE, thesolid line represents thetheoretical convergence, and theother non-solid lines representvarious convergences

200 400 600 800 1000

05

1015

20

1 τ

IMS

E

IMSE ~ log(1 τ)τ(1−1 (2ν+1))

IMSE ~ τ0.3

IMSE ~ τ0.5

IMSE ~ τ0.6

IMSE ~ τ0.8

IMSE ~ τ

200 400 600 800 1000

0.00

0.01

0.02

0.03

0.04

0.05

1 τ

IMS

E

IMSE ~ τ0.3

IMSE ~ τ0.5

IMSE ~ τ0.7

IMSE ~ τIMSE ~ τlog(1 τ)

k(x, x ′; ν, θ) = k1D(x1, x ′1; ν, θ1)k1D(x2, x ′

2; ν, θ2). (33)

Furthermore, the 1-D Gaussian kernel is defined by:

k(x, x ′; θ) = exp

(−1

2

(x − x ′)2

θ2

).

Figure 3 compares the numerically observed convergence of the IMSE to the theoreticalone when θ1 = θ2 = 0.2 for the Matérn-5/2 kernel and when θ = 0.2 for the Gaussiankernel. We see in Fig. 3 that the theoretical rate of convergence is a sharp approximation ofthe observed one.

6 Applications of the learning curve

Let us consider that we want to approximate the function x ∈ Rd → f (x) from noisy

observations at fixed points (xi )i=1,...,n , with n � 1, sampled from the design measure μ

and with s replications at each point xi . In Sect. 6.1 we present how to determine the neededbudget T = ns to achieve a prescribed precision. Then, in Sect. 6.2, we illustrate this methodon an industrial example.

123

Mach Learn

6.1 Estimation of the budget required to reach a prescribed precision

Let us consider a prescribed generalization error denoted by ε̄. The purpose of this subsectionis to determine from an initial budget T0 the budget T for which the generalization errorreaches the value ε̄.

First, we build an initial experimental design set (x traini )i=1,...,n sampled with respect to

the design measure μ and with s∗ replications at each point such that T0 = ns∗. From the s∗replications (zi, j ) j=1,...,s∗ , we can estimate the observation noise variances σ 2

ε with a classical

empirical estimator: σ̄ 2ε = ∑n

i=1∑s∗

j=1(zi, j − zni )2/(n(s∗ − 1)), zn

i = ∑s∗j=1 zi, j/s∗.

Second, we use the observations zni = (

∑s∗j=1 zi, j )/s∗ to estimate the covariance kernel

k(x, x ′). In practice, we consider a parametrized family of covariance kernels and we selectthe parameters which maximize the likelihood (Stein 1999).

Third, from Theorem 2 we can get the expression of the generalization error decay withrespect to T (denoted by IMSET ). Therefore, we just have to determine the budget T suchthat IMSET = ε̄. In practice, we will not use Theorem 2 but the asymptotic results describedin Sect. 5.2.

This strategy is applied to an industrial case in Sect. 6.2. We note that in the applicationpresented in Sect. 6.2, we have s∗ = 1. In fact, in this example the observations are themselvesobtained by an empirical mean of a Monte-Carlo sample and thus the noise variance can beestimated without processing replications.

6.2 Industrial case: MORET code

We illustrate in this section an industrial application of our results about the rate of conver-gence of the IMSE.

6.2.1 Data presentation

We use in this section the notation presented in Sect. 2.2. The outputs of the MORET codeat point xi are denoted by Y j (xi ) where j = 1, . . . , si and i = 1, . . . , n.

A large data base (Y j (xi ))i=1,...,5625, j=1,...,200 is available to us. We divide it into a trainingset and a test set. The 5625 points xi of the data base come from a 75×75 grid over [0, 1]2. Thetraining set consists of n = 100 points (x train

i )i=1,...,n extracted from the complete data baseusing a Latin Hypercube Sample (Fang et al. 2006) optimized with respect to the maximincriterion and of the first observations (Y1(x train

i ))i=1,...,100. We note that the maximin criterionaims to maximize the minimal distance (with respect to the L2-norm) between the points ofthe design. We will use the other 5525 points as a test set.

The aim of the study is—given the training set—to predict the budget needed to achievea prescribed precision for the surrogate model.

Furthermore, the observation noise variance σ 2ε is estimated by σ̄ 2

ε = 3.3 × 10−3 (seeSect. 6.1).

6.2.2 Model selection

To build the model, we consider the training set plotted in Fig. 4. It is composed of then = 100 points (x train

i )i=1,...,n which are uniformly spread on Q = [0, 1]2.Let us suppose that the response is a realization of a Gaussian process with a tensorial

Matérn-ν covariance function. The 2-D tensorial Matérn-ν covariance function k(x, x ′; ν, θ)

123

Mach Learn

Fig. 4 Initial experimentaldesign set with n = 100

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

X1

X2

is given in (33). The hyper-parameters are estimated by maximizing the concentrated Maxi-mum Likelihood (Stein 1999):

−1

2(z − m)T (σ 2 K + σ 2

ε I )−1(z − m) − 1

2det(σ 2 K + σ̄ 2

ε I ),

where K = [k(x traini , x train

j ; ν, θ)]i, j=1,...,n, I is the identity matrix, σ 2 the variance parame-

ter, m the mean of keff,s(x) and z = (Y1(x train1 ), . . . , Y1(x train

n )) the observations at points inthe training set. The mean of keff,s(x) is estimated by m = 1

100

∑100i=1 Y1(x train

i ) = 0.65.Due to the fact that the convergence rate is strongly dependent of the regularity parameter

ν, we have to perform a good estimation of this hyper-parameter to evaluate the model errordecay accurately. Note that we cannot have a closed form expression for the estimator of σ 2,it hence has to be estimated jointly with θ and ν.

Let us consider the vector of parameters φ = (ν, θ1, θ2, σ2). In order to perform the

maximization, we have first randomly generated a set of 10,000 parameters (φk)k=1,...,104

on the domain [0.5, 3] × [0.01, 2] × [0.01, 2] × [0.01, 1]. We have then selected the 150best parameters (i.e. the ones maximizing the concentrated Maximum Likelihood) and wehave started a quasi-Newton based maximization from these parameters. More specifically,we have used the BFGS method (Shanno 1970). Finally, from the results of the 150 max-imization procedures, we have selected the best parameter. We note that the quasi-Newtonbased maximizations have all converged to two parameter values, around 30 % to the actualmaximum and 70 % to another local maximum.

The estimation of the hyper-parameters are ν = 1.31, θ1 = 0.67, θ2 = 0.45 and σ 2 =0.24. This means that we have a rough surrogate model which is not differentiable andα-Hölder continuous with exponent α = 0.81. The variance of the observations is σ̄ 2

ε =3.3 × 10−3, using the same notations as Sect. 2.3, we have τ = σ̄ 2

ε /T0 with T0 = n (itcorresponds to s = 1).

The IMSE of the Gaussian process regression is IMSET0 = 1.0 × 10−3 and its empiricalmean squared error is EMSET0 = 1.2 × 10−3. To compute the empirical mean squarederror (EMSE), we use the observations (Y j (xi ))i=1,...,5525, j=1...,200 with xi = x train

k ∀k =1, . . . , 100, i = 1, . . . , 5525 and to compute the IMSE (5) (that depends only on the positionsof the training set and on the selected hyper-parameters) we use a trapezoidal numerical

123

Mach Learn

Fig. 5 Comparison betweenempirical mean squared error(EMSE) decay and theoreticalIMSE decay for n = 100 whenthe total budget T = nsincreases. The triangles representthe EMSE, the solid linerepresents the theoretical decay,the horizontal dashed linerepresents the desired accuracyand the dashed line the classicalM-C convergence. We see thatMonte-Carlo decay does notmatch the empirical MSE and itis too fast

0 10 20 30 40 50 600e

+00

2e−

044e

−04

6e−

048e

−04

1e−

03

s=T/n

EM

SE

IMSE ~ σε2 T

IMSE ~ log(T σε2) (T σε

2)(1−1 (2ν+1))

integration into a 75 × 75 grid over [0, 1]2. For s = 200, the observation variance of theoutput keff,s(x) equals σ̄ 2

ε /200 = 1.64 × 10−5 and is neglected for the estimation of theempirical error. We can see that the IMSE is close to the empirical MSE which means thatour model describes the observations accurately.

6.2.3 Convergence of the IMSE

According to (31), we have the following convergence rate for the IMSE:

IMSE ∼ log(1/τ)τ1− 1

2(ν+1/2) = log(T/σ̄ 2ε )

(T/σ̄ 2ε )

1− 12(ν+1/2)

, (34)

where the model parameter ν plays a crucial role. We can therefore expect that the IMSEdecays as (see Sect. 6.1):

IMSET = IMSET0

log(T/σ̄ 2ε )

(T/σ̄ 2ε )

1− 12(ν+1/2)

/log(T0/σ̄

2ε )

(T0/σ̄ 2ε )

1− 12(ν+1/2)

. (35)

Let us assume that we want to reach an IMSE of ε̄ = 2.0 × 10−4. According to theIMSE decay and the fact that the IMSE for the budget T0 has been estimated to be equal to1.0 × 10−3, the total budget required is T = ns = 2000, i.e. s = 20. Figure 5 compares theempirical mean squared error convergence and the predicted convergence (35) of the IMSE.

We see empirically that the EMSE of ε̄ = 2.0 × 10−4 is achieved for s = 31. This showsthat the predicted IMSE and the empirical MSE are close and that the selected kernel capturesthe regularity of the response accurately.

Let us consider the classical Monte-Carlo convergence rate σ̄ 2ε /T , which corresponds

to the convergence rate of degenerate kernels, i.e. in the finite-dimensional case. Figure 5compares the theoretical rate of convergence of the IMSE with the classical Monte-Carlo one.We see that the Monte-Carlo decay is too fast and does not represent correctly the empiricalMSE decay. If we had considered the rate of convergence IMSE ∼ σ̄ 2

ε /T , we would havereached an IMSE of ε̄ = 2.0 × 10−4 for s = 6 (which is very far from the observed values = 31).

123

Mach Learn

7 Conclusion

The main result of this paper is the proof of a theorem giving the Gaussian process regressionMSE when the number of observations is large and the observation noise variance is propor-tional to the number of observations. The proof generalizes previous ones which prove thisresult in dimension one or two or for a restricted class of covariance kernels (for degenerateones).

A first limitation of the presented results is that the noise variance generally does notdepend on the number of observations. The additive dependence of the noise variance inthe number of observations is a technical assumption which allows for controlling the con-vergence of the learning curve. However, it is natural in the framework of experiments withreplications or Monte-Carlo simulators. Deriving the presented results for the case of constantnoise is still an open problem and is of great practical interest.

The asymptotic value of the MSE is derived in terms of the eigenvalues and eigenfunctionsof the covariance function and holds for degenerate and non-degenerate kernels and for anydimension. From this theorem, we can deduce the asymptotic behavior of the generalizationerror—defined in this paper as the IMSE—as a function of the reduced observation noisevariance (it corresponds to the noise variance when the number of observations equals one).A strength of this theorem is that the rate of convergence of the generalization error canbe deduced from the one of the eigenvalues which is known for usual covariance kernels.The relevance of this rate of convergence is emphasized on a numerical study for differentkernels. However, this leads to another limitation since the presented results cannot be usedfor general covariance kernels for which the eigenvalue decay rate is unknown.

The significant differences between the rate of convergence of degenerate and non-degenerate kernels highlight the importance to prove this result for non-degenerate ker-nels. This is especially important as usual kernels for Gaussian process regression are non-degenerate.

Finally, for practical perspectives, the presented method allows for evaluating the com-putational budget required to reach a given accuracy. It has been successfully applied toa real-word problem about the safety assessment of a nuclear system. However, it is effi-cient for specific applications (e.g. stochastic simulators with a constant observation noisevariance) and when the computational budget is important. More investigations have to beperformed to deal with the cases of heterogeneous noise, noise-free simulators or for verylimited computational budget.

Acknowledgments The authors are grateful to Dr. Yann Richet of the IRSN—Institute for RadiologicalProtection and Nuclear Safety—for providing the data for the industrial case through the reDICE project.

Appendix: Proofs of the technical lemmas

Proof of Lemma 1

Let us consider the term k(x)T L−1k(x). Since p∗ < ∞, the matrix L can be written:

L = nτ I + Φp∗ΛΦTp∗ , (36)

whereΛ=diag(λi )1≤i≤p∗ , Φp∗ =(φ(x1)

T . . . φ(xn)T)T

andφ(x) = (φ1(x), . . . , φp∗(x)).Thanks to the Woodbury–Sherman–Morrison formula, the matrix L−1 is given by:

123

Mach Learn

L−1 = I

nτ− Φp∗

(ΦT

p∗Φp∗

nτ+ Λ−1

)−1ΦT

p∗

nτ. (37)

From the continuity of the inverse operator for invertible p∗× p∗ matrices and by applyingthe strong law of large numbers, we obtain the following almost sure convergence:

k(x)T L−1k(x) = 1

n∑i=1

k(x, xi )2 − 1

τ 2

p∗∑p,q=0

⎡⎣(

ΦTp∗Φp∗

nτ+ Λ−1

)−1⎤⎦

p,q

×[

1

n

n∑i=1

k(x, xi )φp(xi )

]⎡⎣1

n

n∑j=1

k(x, x j )φq(x j )

⎤⎦

n→∞−→ 1

τEμ[k(x, X)2] − 1

τ 2

p∗∑p,q=0

[(I

τ+ Λ−1

)−1]

p,q

×Eμ[k(x, X)φp(X)]Eμ[k(x, X)φq(X)],where Eμ is the expectation with respect to the design measure μ. We note that we can usethe Woodbury–Sherman–Morrison formula and the strong law of large numbers since p∗ isfinite and independent of n. Then, the orthonormal property of the basis (φp(x))p≥0 implies:

Eμ[k(x, X)2] =∑p≥0

λ2pφp(x)2, Eμ[k(x, X)φp(X)] = λpφp(x).

Therefore, we have the following almost sure convergence:

k(x)T L−1k(x)n→∞−→

∑p≤p∗

λ2p

λp + τφp(x)2 + 1

τ

∑p>p∗

λ2pφp(x)2.

Proof of Lemma 2

Let us consider k(x)T 1n

( Mn

) j(

L ′ Mn2

)i− jk(x) and i > j , we have:

k(x)T 1

n

(M

n

) j ( L ′Mn2

)i− j

k(x) =∑

p1,...,pi− j ≤p∗p′

1,...,p′i− j ≤p∗

d(n)

p1,p′1. . . d(n)

pi− j ,p′i− j

∑q1,...,qi− j >p∗m1,...,m j >p∗

S(n)q,m,

(38)

with:

S(n)q,m =

(√λm1

n

n∑r=1

k(x, xr )φm1(xr )

)(√λm j

n

n∑r=1

φm j (xr )φp′1(xr )

)

×(

λqi− j

n

n∑r=1

k(x, xr )φqi− j (xr )

n∑r=1

φpi− j (xr )φqi− j (xr )

)

×j−1∏l=1

√λml λml+1

n

n∑r=1

φml (xr )φml+1(xr )

123

Mach Learn

×i− j−1∏

l=1

λql

n

n∑r=1

φql (xr )φpl+1(xr )

n∑r=1

φql (xr )φp′l(xr ).

We consider now the term:

a(n)

q,p,p′ = λq

n

n∑r=1

φq(xr )φp(xr )1

n

n∑r=1

φp′(xr )φq(xr ), (39)

with p, p′ ≤ p∗. From Cauchy Schwarz inequality and thanks to the following inequality:

|φp(x)|2 ≤ 1

λp

∑p′≥0

λp′ |φp′(x)|2 = λ−1p k(x, x),

we obtain (using λp ≥ λp∗ ,∀p ≤ p∗ and [∑nr=1 |φq(xr )|]2 ≤ n

∑nr=1 φq(xr )

2):

∣∣∣a(n)

q,p,p′∣∣∣ ≤ σ 2λ−1

p∗λq

n

n∑r=1

φq(xr )2, ∀p, p′ ≤ p∗,

with σ 2 = supx k(x, x). Considering the expectation with respect to the distribution of pointsxr , we obtain ∀ p̄ < ∞:

⎡⎣∑

q> p̄

∣∣∣a(n)

q,p,p′∣∣∣⎤⎦ ≤ σ 2λ−1

p∗∑q> p̄

λq .

From Markov inequality, ∀δ > 0, we have:

⎛⎝∣∣∣∣∣∣∑q> p̄

a(n)

q,p,p′

∣∣∣∣∣∣> δ

⎞⎠ ≤

[∣∣∣∑q> p̄ a(n)

q,p,p′∣∣∣]

δ≤ σ 2λ−1

p∗∑

q> p̄ λq

δ. (40)

Furthermore, $\forall \delta > 0$, $\forall \bar{p} > p^*$:
\[
\mathbb{P}_\mu\!\left( \left| \sum_{q > p^*} a^{(n)}_{q,p,p'} \right| > 2\delta \right) \le \mathbb{P}_\mu\!\left( \left| \sum_{p^* < q \le \bar{p}} a^{(n)}_{q,p,p'} \right| > \delta \right) + \mathbb{P}_\mu\!\left( \left| \sum_{q > \bar{p}} a^{(n)}_{q,p,p'} \right| > \delta \right).
\]
We have, for all $q \in (p^*, \bar{p}]$: $a^{(n)}_{q,p,p'} \to a_{q,p,p'} = \lambda_q \delta_{q=p} \delta_{q=p'} = 0$ (with $\delta$ the Kronecker symbol) as $n \to \infty$, therefore:
\[
\limsup_{n \to \infty} \mathbb{P}_\mu\!\left( \left| \sum_{q > p^*} a^{(n)}_{q,p,p'} \right| > 2\delta \right) \le \frac{\sigma^2 \lambda_{p^*}^{-1} \sum_{q > \bar{p}} \lambda_q}{\delta}.
\]
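Indeed, spelling out this term-wise limit (a direct consequence of the strong law of large numbers applied to each empirical average, together with the orthonormality of the basis; added here for the reader's convenience):
\[
a^{(n)}_{q,p,p'} \; \xrightarrow[n \to \infty]{a.s.} \; \lambda_q \, \mathbb{E}_\mu\!\left[ \phi_q(X) \phi_p(X) \right] \mathbb{E}_\mu\!\left[ \phi_{p'}(X) \phi_q(X) \right] = \lambda_q \, \delta_{q=p} \, \delta_{q=p'},
\]
which vanishes since $q > p^* \ge p, p'$.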

Taking the limit $\bar{p} \to \infty$ on the right hand side, we obtain the convergence in probability of $\sum_{q > p^*} a^{(n)}_{q,p,p'}$ when $n \to \infty$:
\[
\sum_{q > p^*} \frac{\lambda_q}{n} \sum_{r=1}^{n} \phi_q(x_r) \phi_p(x_r) \; \frac{1}{n} \sum_{r=1}^{n} \phi_{p'}(x_r) \phi_q(x_r) \; \xrightarrow{\;\mathbb{P}_\mu\;} \; 0, \quad \forall p, p' \le p^*. \qquad (41)
\]
Following the same method, we obtain the convergence:
\[
\sum_{q > p^*} \frac{\lambda_q}{n} \sum_{r=1}^{n} k(x, x_r) \phi_q(x_r) \; \frac{1}{n} \sum_{r=1}^{n} \phi_p(x_r) \phi_q(x_r) \; \xrightarrow{\;\mathbb{P}_\mu\;} \; 0, \quad \forall p \le p^*. \qquad (42)
\]
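The convergence (41) can be illustrated numerically. In the sketch below, the cosine basis on $[0,1]$, the eigenvalues $\lambda_q = q^{-2}$, the truncation level $Q$ of the infinite sum, and the indices $p^*$, $p$, $p'$ are all illustrative assumptions.

```python
# Toy illustration of the convergence (41) for an assumed eigensystem on [0, 1].
import numpy as np

rng = np.random.default_rng(2)
Q, p_star, p, p_prime = 200, 2, 1, 2          # Q truncates the sum over q > p*
lam = np.array([1.0 / q**2 for q in range(1, Q + 1)])

def phi(q, x):
    """phi_1 = 1 and phi_q = sqrt(2) cos((q-1) pi x) for q >= 2 (1-based index)."""
    return np.ones_like(x) if q == 1 else np.sqrt(2) * np.cos((q - 1) * np.pi * x)

for n in (100, 1000, 10000):
    x = rng.uniform(size=n)                   # design points drawn from the uniform measure
    total = sum(lam[q - 1]
                * np.mean(phi(q, x) * phi(p, x))
                * np.mean(phi(p_prime, x) * phi(q, x))
                for q in range(p_star + 1, Q + 1))
    print(n, total)                           # decreases towards 0 as n grows
```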


Let us return to $S^{(n)}_{q,m}$. By using the Cauchy–Schwarz inequality and bounding by the constant $M$ all the terms independent of the $q_i$ and $m_i$, we obtain:
\[
\left| \sum_{q_1, \dots, q_{i-j} > p^*} S^{(n)}_{q,m} \right| \le M \prod_{l=1}^{j} \lambda_{m_l} \frac{1}{n} \sum_{r=1}^{n} \phi_{m_l}(x_r)^2
\]
\[
\times \left| \sum_{q_{i-j} > p^*} \left( \frac{\lambda_{q_{i-j}}}{n} \sum_{r=1}^{n} k(x, x_r) \phi_{q_{i-j}}(x_r) \sum_{r=1}^{n} \phi_{p_{i-j}}(x_r) \phi_{q_{i-j}}(x_r) \right) \right|
\]
\[
\times \left| \sum_{q_1, \dots, q_{i-j-1} > p^*} \prod_{l=1}^{i-j-1} \frac{\lambda_{q_l}}{n} \sum_{r=1}^{n} \phi_{q_l}(x_r) \phi_{p_{l+1}}(x_r) \sum_{r=1}^{n} \phi_{q_l}(x_r) \phi_{p'_l}(x_r) \right|.
\]
Since $\sum_{p \ge 0} \lambda_p \phi_p(x)^2 = k(x, x) \le \sigma^2$, we have the inequalities:
\[
0 \le \sum_{m_1, \dots, m_j} \prod_{l=1}^{j} \lambda_{m_l} \frac{1}{n} \sum_{r=1}^{n} \phi_{m_l}(x_r)^2 \le (\sigma^2)^j.
\]
Thus, for $i > j$ and from (41) and (42), we obtain the following convergence in probability when $n \to \infty$:
\[
\sum_{\substack{q_1, \dots, q_{i-j} > p^* \\ m_1, \dots, m_j > p^*}} S^{(n)}_{q,m} \; \xrightarrow{\;\mathbb{P}_\mu\;} \; 0.
\]
Therefore, from (38), we obtain the following convergence when $n \to \infty$:
\[
k(x)^T \frac{1}{n} \left( \frac{M}{n} \right)^j \left( \frac{L'M}{n^2} \right)^{i-j} k(x) \; \xrightarrow{\;\mathbb{P}_\mu\;} \; 0, \quad \forall i > j.
\]
Following the same guideline as previously, it can be shown that, when $n \to \infty$:
\[
k(x)^T \frac{1}{n} \left( \frac{M}{n} \right)^j \left( \frac{L'M}{n^2} \right)^{i-j} \frac{L'}{n^2} k(x) \; \xrightarrow{\;\mathbb{P}_\mu\;} \; 0, \quad \forall i \le j.
\]

Proof of Lemma 3

Let us consider, for a fixed $j \ge 1$:
\[
\frac{1}{n} k(x)^T \left( \frac{M}{n} \right)^j k(x) = \sum_{m_1, \dots, m_j > p^*} a^{(n)}_m(x),
\]
with $m = (m_1, \dots, m_j)$ and:
\[
a^{(n)}_m(x) = \left( \frac{1}{n} \sum_{r=1}^{n} k(x, x_r) \phi_{m_1}(x_r) \right) \left( \frac{1}{n} \sum_{r=1}^{n} k(x, x_r) \phi_{m_j}(x_r) \right) \prod_{l=1}^{j-1} \frac{1}{n} \sum_{r=1}^{n} \phi_{m_l}(x_r) \phi_{m_{l+1}}(x_r) \prod_{i=1}^{j} \lambda_{m_i}.
\]


From the Cauchy–Schwarz inequality, we have:
\[
\left| a^{(n)}_m(x) \right| \le \left( \frac{1}{n} \sum_{r=1}^{n} k(x, x_r)^2 \right) \prod_{i=1}^{j} \frac{1}{n} \sum_{r=1}^{n} \lambda_{m_i} \phi_{m_i}(x_r)^2 \qquad (43)
\]
\[
\le \sigma^4 \prod_{i=1}^{j} \frac{1}{n} \sum_{r=1}^{n} \lambda_{m_i} \phi_{m_i}(x_r)^2. \qquad (44)
\]
Therefore, considering the expectation with respect to the distribution of the points $(x_r)_{r=1,\dots,n}$, we have, $\forall x \in \mathbb{R}^d$:
\[
\mathbb{E}_\mu\!\left[ \left| a^{(n)}_m(x) \right| \right] \le \sigma^4 \left( \prod_{i=1}^{j} \lambda_{m_i} \right) \frac{1}{n^j} \sum_{t_1, \dots, t_j = 1}^{n} \mathbb{E}_\mu\!\left[ \phi_{m_1}(X_{t_1})^2 \cdots \phi_{m_j}(X_{t_j})^2 \right].
\]
The following inequality holds uniformly in $t_1, \dots, t_j = 1, \dots, n$:
\[
\mathbb{E}_\mu\!\left[ \prod_{i=1}^{j} \phi_{m_i}(X_{t_i})^2 \right] \le b_m,
\quad \text{where} \quad
b_m = \sum_{\substack{P \in \Pi(\{1, \dots, j\}) \\ P = \cup_{r=1}^{l} I_r}} \prod_{r=1}^{l} \mathbb{E}_\mu\!\left[ \prod_{i \in I_r} \phi_{m_i}(X)^2 \right],
\]
because the left hand side of the inequality is equal to one of the terms in the sum on the right hand side. Here $\Pi(\{1, \dots, j\})$ is the collection of all partitions of $\{1, \dots, j\}$ and $I_r \cap I_{r'} = \emptyset$, $\forall r \ne r'$. We hence have:
\[
\mathbb{E}_\mu\!\left[ \left| a^{(n)}_m(x) \right| \right] \le \sigma^4 \prod_{i=1}^{j} \lambda_{m_i} \, b_m.
\]
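For instance, for $j = 2$ the partitions of $\{1, 2\}$ are $\{\{1\}, \{2\}\}$ and $\{\{1, 2\}\}$, so that (a spelled-out case, added here for illustration):
\[
b_m = \mathbb{E}_\mu\!\left[ \phi_{m_1}(X)^2 \right] \mathbb{E}_\mu\!\left[ \phi_{m_2}(X)^2 \right] + \mathbb{E}_\mu\!\left[ \phi_{m_1}(X)^2 \phi_{m_2}(X)^2 \right] = 1 + \mathbb{E}_\mu\!\left[ \phi_{m_1}(X)^2 \phi_{m_2}(X)^2 \right],
\]
and the left hand side $\mathbb{E}_\mu\!\left[ \phi_{m_1}(X_{t_1})^2 \phi_{m_2}(X_{t_2})^2 \right]$ equals the first term when $t_1 \ne t_2$ and the second term when $t_1 = t_2$, hence it is bounded by $b_m$.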

Since $\sum_{p \ge 0} \lambda_p \phi_p(x)^2 \le \sigma^2$, we have:
\[
\sum_{m_1, \dots, m_j > p^*} \prod_{i=1}^{j} \lambda_{m_i} b_m = \sum_{m_1, \dots, m_j > p^*} \prod_{l=1}^{j} \lambda_{m_l} \sum_{\substack{P \in \Pi(\{1, \dots, j\}) \\ P = \cup_{r=1}^{l} I_r}} \prod_{r=1}^{l} \mathbb{E}_\mu\!\left[ \prod_{i \in I_r} \phi_{m_i}(X)^2 \right]
\]
\[
= \sum_{\substack{P \in \Pi(\{1, \dots, j\}) \\ P = \cup_{r=1}^{l} I_r}} \prod_{r=1}^{l} \mathbb{E}_\mu\!\left[ \prod_{i \in I_r} \sum_{m_i > p^*} \lambda_{m_i} \phi_{m_i}(X)^2 \right] \le \sigma^{2j} \, \#\{\Pi(\{1, \dots, j\})\}.
\]
Since the cardinality of the collection $\Pi(\{1, \dots, j\})$ of partitions of $\{1, \dots, j\}$ is finite, the series $\sum_{m_1, \dots, m_j > p^*} \prod_{i=1}^{j} \lambda_{m_i} b_m$ converges. Furthermore, as it is a series with non-negative terms, $\forall \varepsilon > 0$, $\exists \bar{p} > p^*$ such that:
\[
\sigma^4 \sum_{m \in M^C_{\bar{p}}} \prod_{i=1}^{j} \lambda_{m_i} b_m \le \varepsilon,
\]


where $M^C_{\bar{p}}$ denotes the complement of $M_{\bar{p}}$ in $M$, these collections of $m = (m_1, \dots, m_j)$ being defined by:
\[
M = \{ m = (m_1, \dots, m_j) \text{ such that } m_i > p^*, \; i = 1, \dots, j \},
\]
\[
M_{\bar{p}} = \{ m = (m_1, \dots, m_j) \text{ such that } p^* < m_i \le \bar{p}, \; i = 1, \dots, j \},
\]
\[
M^C_{\bar{p}} = M \setminus M_{\bar{p}}.
\]
Therefore, we have, $\forall \delta > 0$, $\forall \varepsilon > 0$, $\exists \bar{p} > 0$ such that, uniformly in $n$:
\[
\sum_{m \in M^C_{\bar{p}}} \mathbb{E}_\mu\!\left[ \left| a^{(n)}_m(x) \right| \right] \le \frac{\varepsilon \delta}{2}.
\]
Applying the Markov inequality, we obtain:
\[
\mathbb{P}_\mu\!\left( \sum_{m \in M^C_{\bar{p}}} \left| a^{(n)}_m(x) \right| > \frac{\delta}{2} \right) \le \varepsilon. \qquad (45)
\]

Furthermore, by denoting $a_m(x) = \lim_{n \to \infty} a^{(n)}_m(x)$, we have:
\[
a_m(x) = \lambda_{m_1} \lambda_{m_j} \phi_{m_1}(x) \phi_{m_j}(x) \prod_{i=1}^{j} \lambda_{m_i} \prod_{i=1}^{j-1} \delta_{m_i = m_{i+1}}, \qquad (46)
\]
and from the Cauchy–Schwarz inequality [see Eq. (44)], we have:
\[
|a_m(x)| \le \sigma^4 \prod_{i=1}^{j} \lambda_{m_i}.
\]
We hence can deduce the inequality:
\[
\sum_{m \in M^C_{\bar{p}}} |a_m(x)| \le \sigma^4 \sum_{m \in M^C_{\bar{p}}} \prod_{i=1}^{j} \lambda_{m_i}. \qquad (47)
\]
Thus, $\exists \bar{p}$ such that $\sum_{m \in M^C_{\bar{p}}} |a_m(x)| \le \frac{\delta}{2}$ for all $x \in \mathbb{R}^d$.

From the inequalities (45) and (47), we find that $\exists \bar{p}$ such that:
\[
\mathbb{P}_\mu\!\left( \left| \sum_{m \in M} a^{(n)}_m(x) - \sum_{m \in M} a_m(x) \right| > 2\delta \right) \le \varepsilon + \mathbb{P}_\mu\!\left( \left| \sum_{m \in M_{\bar{p}}} a^{(n)}_m(x) - \sum_{m \in M_{\bar{p}}} a_m(x) \right| > \delta \right).
\]
Since $M_{\bar{p}}$ is a finite set:
\[
\limsup_{n \to \infty} \mathbb{P}_\mu\!\left( \left| \sum_{m \in M_{\bar{p}}} a^{(n)}_m(x) - \sum_{m \in M_{\bar{p}}} a_m(x) \right| > \delta \right) = 0,
\]
therefore:
\[
\limsup_{n \to \infty} \mathbb{P}_\mu\!\left( \left| \sum_{m \in M} a^{(n)}_m(x) - \sum_{m \in M} a_m(x) \right| > 2\delta \right) \le \varepsilon.
\]
The previous inequality holds $\forall \varepsilon > 0$, thus we have the convergence in probability of $\sum_{m \in M} a^{(n)}_m(x)$ to $\sum_{m \in M} a_m(x)$, with [by using the limit in Eq. (46)]:
\[
\sum_{m \in M} a_m(x) = \sum_{p > p^*} \lambda_p^{j+2} \phi_p(x)^2.
\]
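This limit can also be checked numerically. The sketch below assumes that $M$ denotes the Gram matrix of the tail kernel $\sum_{m > p^*} \lambda_m \phi_m(\cdot) \phi_m(\cdot)$, as suggested by the expansion at the start of this proof (the precise definition of $M$ is given in the main text); the cosine basis, the eigenvalues $\lambda_p = p^{-2}$, the truncation level $P$ and all sizes are illustrative choices.

```python
# Toy numerical check of the Lemma 3 limit
#   (1/n) k(x)^T (M/n)^j k(x)  ->  sum_{p > p*} lambda_p^{j+2} phi_p(x)^2,
# with M assumed to be the Gram matrix of the tail kernel (see the lead-in above).
import numpy as np

rng = np.random.default_rng(3)
P, p_star, j, x0 = 60, 2, 2, 0.37
lam = np.array([1.0 / p**2 for p in range(1, P + 1)])

def phi_mat(x):
    """Columns phi_1..phi_P of the cosine basis (orthonormal for a uniform design on [0, 1])."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x)] +
                           [np.sqrt(2) * np.cos(p * np.pi * x) for p in range(1, P)])

limit = np.sum(lam[p_star:]**(j + 2) * phi_mat(x0).ravel()[p_star:]**2)

for n in (10**3, 10**4, 10**5):
    X = rng.uniform(size=n)
    Phi_X = phi_mat(X)                                 # (n, P) matrix of basis values
    k_x = (phi_mat(x0) @ (lam * Phi_X).T).ravel()      # k(x0, x_r), r = 1..n
    Psi = Phi_X[:, p_star:]                            # tail eigenfunctions at the design points
    c = Psi.T @ k_x / n                                # (1/n) sum_r k(x0, x_r) phi_m(x_r)
    G = Psi.T @ Psi / n                                # (1/n) sum_r phi_m(x_r) phi_m'(x_r)
    Lt = np.diag(lam[p_star:])
    # algebraic identity: (1/n) k^T (M/n)^j k = c^T Lt (G Lt)^{j-1} c  for M = Psi Lt Psi^T
    empirical = c @ Lt @ np.linalg.matrix_power(G @ Lt, j - 1) @ c
    print(n, empirical, limit)                         # empirical approaches the limit as n grows
```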
