
A Scalable Approach to Gradient-Enhanced Stochastic Kriging

Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡

† Hong Kong University of Science and Technology, IEDA
∗ City University of Hong Kong, MS
‡ UC Berkeley, IEOR

Table of Contents

1 Stochastic Kriging and Big n Problem

2 Markovian Covariance Functions

3 Scalable Gradient Extrapolated Stochastic Kriging

4 Conclusions

1

Stochastic Kriging and Big n Problem

Metamodeling

[Diagram: Real System → Simulation Model → Metamodel]

• Simulation models are often computationally expensive

• Metamodel: fast approximation of the simulation model

• Run the simulation at a small number of design points

• Predict responses based on the simulation outputs

2

Stochastic Kriging

• Also called Gaussian process (GP) regression

• Unknown surface is modeled as a Gaussian process:

Z(x) = β + M(x), x ∈ X ⊆ ℝᵈ

• M(x) is characterized by a covariance function k(x, y)

• Leverage spatial correlation for prediction

3

Partial Literature

• Quantification of input uncertainty

  • Barton, Nelson, and Xie (2014)

  • Xie, Nelson, and Barton (2014)

• Simulation / black-box / Bayesian optimization

  • Huang et al. (2006)

  • Sun, Hong, and Hu (2014)

  • Scott, Frazier, and Powell (2011)

  • Shahriari et al. (2016)

4

The Big n Problem

• Response surface is observed at x₁, ..., xₙ with noise:

z(xᵢ) = β + M(xᵢ) + ε(xᵢ)

• Best linear unbiased predictor (BLUP) of Z(x₀):

Ẑ(x₀) = β + Σ_M(x₀, ·) [Σ_M + Σ_ε]⁻¹ [z − β1ₙ]

• Maximum likelihood estimation:

max over (β, θ) of { −log det(Σ_M + Σ_ε) − [z − β1ₙ]ᵀ [Σ_M + Σ_ε]⁻¹ [z − β1ₙ] }

• Slow: [Σ_M + Σ_ε] ∈ ℝⁿˣⁿ and inverting it takes O(n³) time (see the sketch below)

• Numerically unstable: [Σ_M + Σ_ε] is often nearly singular

  • Especially for the popular Gaussian covariance function

• Usually run into trouble when n > 100, which can easily happen when d ≥ 3

5
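To make the cost concrete, here is a minimal Python sketch of the dense BLUP computation above; the Gaussian covariance function, its parameters, the toy response, and the noise level are illustrative choices, not taken from the talk.

```python
import numpy as np

# Minimal sketch of the "big n" bottleneck: the BLUP
#   Z_hat(x0) = beta + Sigma_M(x0, .) [Sigma_M + Sigma_eps]^(-1) (z - beta * 1_n)
# computed naively with a dense n x n covariance matrix (an O(n^3) solve).
# Gaussian covariance, kernel parameters, and the toy response are illustrative.

def gaussian_cov(X, Y, tau2=1.0, theta=2.0):
    """Gaussian (squared-exponential) covariance matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return tau2 * np.exp(-theta * d2)

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.uniform(0.0, 1.0, size=(n, d))                              # design points
beta, noise_var = 0.0, 1e-2
z = np.sin(X.sum(axis=1)) + rng.normal(0.0, np.sqrt(noise_var), n)  # noisy responses

Sigma_M = gaussian_cov(X, X)
Sigma_eps = noise_var * np.eye(n)          # intrinsic simulation noise
x0 = np.full((1, d), 0.5)                  # prediction point

# Dense O(n^3) solve: the step that becomes slow and ill-conditioned as n grows
weights = np.linalg.solve(Sigma_M + Sigma_eps, z - beta)
z_hat = beta + gaussian_cov(x0, X) @ weights
print("predicted Z(x0):", z_hat[0])
```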

Enhancing SK with Gradient Information

• The j-th run of the simulation model at xᵢ produces
  • response estimate zⱼ(xᵢ)
  • gradient estimate gⱼ(xᵢ) = (gⱼ¹(xᵢ), ..., gⱼᵈ(xᵢ)), where

gⱼʳ(xᵢ) = Gʳ(xᵢ) + δⱼʳ(xᵢ), r = 1, ..., d,

and Gʳ(xᵢ) is the true r-th partial derivative

• Predict Z(x₀) using both response estimates and gradient estimates

• Qu and Fu (2014): gradient extrapolated stochastic kriging (GESK); simple, using gradients indirectly

• Chen, Ankenman, and Nelson (2013): stochastic kriging with gradient estimators (SKG); sophisticated, using gradients directly

6

GESK (Qu and Fu 2014)

• Use gradient estimates to create "pseudo" response estimates

z̃ⱼ(x̃ᵢ) ≈ zⱼ(xᵢ) + gⱼ(xᵢ)ᵀ Δxᵢ,

where x̃ᵢ = xᵢ + Δxᵢ
  • Δxᵢ: the direction and step size of the linear extrapolation (see the sketch below)

• Predict Z(x₀) using the augmented data

(z(x₁), ..., z(xₙ), z̃(x̃₁), ..., z̃(x̃ₙ))

• The size of the covariance matrix now becomes 2n × 2n

• One could create d pseudo response estimates at each xᵢ, resulting in inverting a matrix of size (d + 1)n × (d + 1)n

• Similar problem for SKG

7
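A minimal sketch of the GESK augmentation step described above; the toy data, gradient values, and the step Δx are hypothetical placeholders.

```python
import numpy as np

# Sketch of the GESK augmentation: each design point x_i with response z_i and
# gradient estimate g_i spawns a pseudo point x_i + dx with linearly extrapolated
# response z_i + g_i^T dx. The step "dx" and the toy data are made up.

rng = np.random.default_rng(1)
n, d = 4, 2
X = rng.uniform(size=(n, d))            # original design points
z = rng.normal(size=n)                  # response estimates at X
g = rng.normal(size=(n, d))             # gradient estimates at X
dx = 0.05 * np.ones(d)                  # extrapolation direction and step size

X_tilde = X + dx                        # pseudo design points
z_tilde = z + g @ dx                    # pseudo responses via first-order Taylor

X_aug = np.vstack([X, X_tilde])         # 2n design points
z_aug = np.concatenate([z, z_tilde])    # 2n responses -> covariance matrix is 2n x 2n
print(X_aug.shape, z_aug.shape)
```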

Approximation Schemes

• Well developed in spatial statistics and machine learning

  • Banerjee et al. (2015)

  • Rasmussen and Williams (2006)

• Reduced-rank approximations emphasize long-range dependence

• Sparse approximations emphasize short-range dependence

[Figure embedded from Shahriari et al. (2016), Fig. 4: four surrogate-model posteriors (blue, with 95% credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line); the ten pseudo-inputs of the SPGP method are shown as black crosses, and the SSGP model used a basis of 80 Fourier features.]

Figure 1: Posterior means and variances. Source: Shahriari et al. (2016).

8

Approximation-free

8

Markovian Covariance Functions

Gaussian Markov Random Field (GMRF)

• M is multivariate normal with sparsity specified on Σ_M⁻¹

• A discrete model using a graph to describe Markovian structure

• Given all its neighbors, node i is conditionally independent of its non-neighbors

  • E.g., M(x₂) ⊥ (M(x₀), M(x₄)) given (M(x₁), M(x₃))

  • Σ_M⁻¹(i, j) ≠ 0 ⟺ i and j are neighbors

[Chain graph: 0 - 1 - 2 - 3 - 4]

• The sparsity can reduce the necessary computation to O(n²)

9

Disadvantages

• Has no explicit expression for the covariances

• Cannot predict at locations "off the grid":

Ẑ(x₀) = β + Σ_M(x₀, ·) [Σ_M + Σ_ε]⁻¹ [z − β1ₙ], where Σ_M(x₀, ·) is unknown

10

Markovian Covariance Function Best of Two Worlds

• Construct a class of covariance functions for which

  1. Σ_M can be inverted analytically

  2. Σ_M⁻¹ is sparse

• Explicit link between covariance function and sparsity

Definition (1-d MCF)

Let p and q be two positive continuous functions that satisfy p(x)q(y) − p(y)q(x) < 0 for all x < y. Then k(x, y) = p(x)q(y) I{x ≤ y} + p(y)q(x) I{x > y} is called a 1-d MCF.

• Brownian motion: k_BM(x, y) = x I{x ≤ y} + y I{x > y}

• Brownian bridge: k_BR(x, y) = x(1 − y) I{x ≤ y} + y(1 − x) I{x > y}

• OU process: k_OU(x, y) = eˣe⁻ʸ I{x ≤ y} + eʸe⁻ˣ I{x > y}

(These three examples are sketched in code after this slide.)

11
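The three example kernels can be written down directly from the definition. The following sketch builds each covariance matrix on an arbitrary grid and checks positive definiteness numerically; the grid and the checks are illustrative, not from the talk.

```python
import numpy as np

# The three example 1-d MCFs, written as k(x, y) = p(x)q(y) for x <= y and
# p(y)q(x) for x > y. Grid and checks are purely illustrative.

MCFS = {
    "brownian_motion": (lambda t: t,         lambda t: np.ones_like(t)),
    "brownian_bridge": (lambda t: t,         lambda t: 1.0 - t),
    "ou_process":      (lambda t: np.exp(t), lambda t: np.exp(-t)),
}

def mcf_matrix(p, q, x):
    """Covariance matrix K with K[i, j] = p(min(xi, xj)) * q(max(xi, xj))."""
    lo = np.minimum.outer(x, x)
    hi = np.maximum.outer(x, x)
    return p(lo) * q(hi)

x = np.linspace(0.1, 0.9, 6)          # design points (need not be equally spaced)
for name, (p, q) in MCFS.items():
    K = mcf_matrix(p, q, x)
    # The condition p(x)q(y) - p(y)q(x) < 0 for x < y makes K a valid covariance
    # matrix; verify symmetry and positive definiteness numerically.
    eigmin = np.linalg.eigvalsh(K).min()
    print(f"{name}: symmetric={np.allclose(K, K.T)}, min eigenvalue={eigmin:.3g}")
```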

Markovian Covariance Function

• x₁ < x₂ < ··· < xₙ need not be equally spaced; write K = [k(xᵢ, xⱼ)] (n × n) for a 1-d MCF k, and pᵢ = p(xᵢ), qᵢ = q(xᵢ)

Theorem (Ding and Z. 2018)

K⁻¹ is tridiagonal and its nonzero entries are

(K⁻¹)₁₁ = p₂ / [p₁ (p₂q₁ − p₁q₂)],

(K⁻¹)ᵢᵢ = (pᵢ₊₁qᵢ₋₁ − pᵢ₋₁qᵢ₊₁) / [(pᵢqᵢ₋₁ − pᵢ₋₁qᵢ)(pᵢ₊₁qᵢ − pᵢqᵢ₊₁)], 2 ≤ i ≤ n − 1,

(K⁻¹)ₙₙ = qₙ₋₁ / [qₙ (pₙqₙ₋₁ − pₙ₋₁qₙ)],

and

(K⁻¹)ᵢ₋₁,ᵢ = (K⁻¹)ᵢ,ᵢ₋₁ = −1 / (pᵢqᵢ₋₁ − pᵢ₋₁qᵢ), i = 2, ..., n.

(The formulas are verified numerically in the sketch below.)

12
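The theorem can be sanity-checked numerically. The sketch below assembles the tridiagonal K⁻¹ from the stated formulas for the OU example on an arbitrary irregular grid and compares K·K⁻¹ with the identity; the grid and kernel choice are illustrative.

```python
import numpy as np

# Numerical check of the tridiagonal-inverse theorem for a 1-d MCF on an
# irregular grid, using the OU example p(x) = exp(x), q(x) = exp(-x).

x = np.sort(np.array([0.05, 0.3, 0.42, 0.7, 1.1, 1.63]))   # unequally spaced
p, q = np.exp(x), np.exp(-x)
n = len(x)

# Dense covariance matrix: K[i, j] = p(min(xi, xj)) * q(max(xi, xj))
K = np.where(x[:, None] <= x[None, :], np.outer(p, q), np.outer(q, p))

# Tridiagonal inverse assembled directly from the theorem's formulas
Kinv = np.zeros((n, n))
Kinv[0, 0] = p[1] / (p[0] * (p[1] * q[0] - p[0] * q[1]))
Kinv[-1, -1] = q[-2] / (q[-1] * (p[-1] * q[-2] - p[-2] * q[-1]))
for i in range(1, n - 1):
    Kinv[i, i] = (p[i + 1] * q[i - 1] - p[i - 1] * q[i + 1]) / (
        (p[i] * q[i - 1] - p[i - 1] * q[i]) * (p[i + 1] * q[i] - p[i] * q[i + 1])
    )
for i in range(1, n):
    off = -1.0 / (p[i] * q[i - 1] - p[i - 1] * q[i])
    Kinv[i - 1, i] = Kinv[i, i - 1] = off

print("K @ Kinv == I:", np.allclose(K @ Kinv, np.eye(n)))   # expect True
```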

Reduction in Complexity

• Woodbury matrix identity:

[Σ_M + Σ_ε]⁻¹ = Σ_M⁻¹ − Σ_M⁻¹ [Σ_M⁻¹ + Σ_ε⁻¹]⁻¹ Σ_M⁻¹,

where Σ_M⁻¹ is known in closed form and both Σ_M⁻¹ and [Σ_M⁻¹ + Σ_ε⁻¹] are sparse

  • inversion: O(n²)

  • multiplications: O(n²)

  • addition: O(n²)

• It takes O(n²) time to compute the BLUP (a sketch follows this slide)

Ẑ(x₀) = β + Σ_M(x₀, ·) [Σ_M + Σ_ε]⁻¹ [z − β1ₙ], where Σ_M(x₀, ·) is known

• If the noise is negligible (Σ_ε ≈ 0), then no numerical inversion is needed and computing the BLUP is O(n)

13
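A sketch of the computation above using sparse linear algebra, assuming the OU Markovian covariance from the earlier examples and a diagonal Σ_ε with illustrative noise variances. It applies [Σ_M + Σ_ε]⁻¹ via the Woodbury form, so the dense n × n matrix is never inverted, and then checks the result against a dense solve.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Woodbury-based application of [Sigma_M + Sigma_eps]^{-1} to a vector when
# Sigma_M^{-1} is tridiagonal (known analytically) and Sigma_eps is diagonal.
# OU kernel, grid, and noise levels are illustrative choices.

def ou_precision(x):
    """Tridiagonal inverse of the OU covariance K[i, j] = exp(-|xi - xj|)."""
    p, q, n = np.exp(x), np.exp(-x), len(x)
    main = np.empty(n)
    main[0] = p[1] / (p[0] * (p[1] * q[0] - p[0] * q[1]))
    main[-1] = q[-2] / (q[-1] * (p[-1] * q[-2] - p[-2] * q[-1]))
    for i in range(1, n - 1):
        main[i] = (p[i + 1] * q[i - 1] - p[i - 1] * q[i + 1]) / (
            (p[i] * q[i - 1] - p[i - 1] * q[i]) * (p[i + 1] * q[i] - p[i] * q[i + 1]))
    off = -1.0 / (p[1:] * q[:-1] - p[:-1] * q[1:])
    return sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

rng = np.random.default_rng(2)
n = 500
x = np.cumsum(rng.uniform(0.01, 0.05, n))     # unequally spaced design points
v = rng.normal(size=n)                        # stands in for z - beta * 1_n
noise_var = rng.uniform(0.01, 0.05, n)

Sigma_M_inv = ou_precision(x)
Sigma_eps_inv = sp.diags(1.0 / noise_var, format="csc")

# Woodbury: (Sigma_M + Sigma_eps)^{-1} v
#   = Sigma_M^{-1} v - Sigma_M^{-1} (Sigma_M^{-1} + Sigma_eps^{-1})^{-1} Sigma_M^{-1} v
u = Sigma_M_inv @ v
w = u - Sigma_M_inv @ spla.spsolve(Sigma_M_inv + Sigma_eps_inv, u)

# Dense O(n^3) reference, only to check the sparse result on this small n
Sigma_M = np.exp(-np.abs(x[:, None] - x[None, :]))
Sigma_eps = np.diag(noise_var)
print("matches dense solve:", np.allclose(w, np.linalg.solve(Sigma_M + Sigma_eps, v)))
```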

Improvement in Stability

1. Σ_M can be made much better conditioned

2. Woodbury also improves numerical stability:

[Σ_M + Σ_ε]⁻¹ = Σ_M⁻¹ − Σ_M⁻¹ [Σ_M⁻¹ + Σ_ε⁻¹]⁻¹ Σ_M⁻¹

  • The diagonal entries of Σ_ε⁻¹ are often large

14

Uncertainty Quantification

15

Extension for d > 1

• Product form: k(x, y) = ∏ᵢ₌₁ᵈ kᵢ(x⁽ⁱ⁾, y⁽ⁱ⁾)

• Limitation: x₁, ..., xₙ must form a regular lattice

• Then K = ⊗ᵢ₌₁ᵈ Kᵢ and K⁻¹ = ⊗ᵢ₌₁ᵈ Kᵢ⁻¹, preserving sparsity (see the sketch below)

[3 × 3 lattice of design points: (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]

16
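A small sketch of the Kronecker structure on a 2-d lattice, using Brownian-motion factors; the two grids are arbitrary illustrative choices.

```python
import numpy as np

# Product-form MCF on a regular lattice gives a Kronecker-structured covariance
# K = kron(K1, K2), hence K^{-1} = kron(K1^{-1}, K2^{-1}). Brownian-motion
# factors and the 3 x 4 lattice are illustrative.

def bm_cov(x):
    """1-d Brownian-motion MCF: k(s, t) = min(s, t)."""
    return np.minimum.outer(x, x)

x1 = np.array([0.2, 0.5, 0.9])          # grid along dimension 1
x2 = np.array([0.1, 0.4, 0.6, 1.0])     # grid along dimension 2
K1, K2 = bm_cov(x1), bm_cov(x2)

# Full covariance over the n = 3*4 lattice points, k(x, y) = k1(x1,y1)*k2(x2,y2)
lattice = [(a, b) for a in x1 for b in x2]
K_full = np.array([[min(a1, b1) * min(a2, b2) for (b1, b2) in lattice]
                   for (a1, a2) in lattice])

print("K equals kron(K1, K2):", np.allclose(K_full, np.kron(K1, K2)))
print("K^{-1} equals kron(K1^{-1}, K2^{-1}):",
      np.allclose(np.linalg.inv(K_full),
                  np.kron(np.linalg.inv(K1), np.linalg.inv(K2))))
```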

Two-Dimensional Response Surfaces

Function Name       Expression
Three-Hump Camel    Z(x, y) = 2x² − 1.05x⁴ + x⁶/6 + xy + y²
Bohachevsky         Z(x, y) = x² + 2y² − 0.3 cos(3πx) − 0.4 cos(4πy) + 0.7

17

Prediction Accuracy

• Standardized RMSE = √( Σᵢ₌₁ᴷ [Ẑ(xᵢ) − Z(xᵢ)]² ) / √( Σᵢ₌₁ᴷ [Z(xᵢ) − (1/K) Σₕ₌₁ᴷ Z(xₕ)]² ), computed over K test points (a small helper is sketched below)

18
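A small helper matching the definition above; the test values are made up, not results from the talk.

```python
import numpy as np

# Standardized RMSE: RMSE of the predictor divided by the RMSE of the trivial
# "predict the sample mean" baseline over the K test points. A value well
# below 1 means the metamodel beats the constant predictor.

def standardized_rmse(z_true, z_pred):
    z_true, z_pred = np.asarray(z_true, float), np.asarray(z_pred, float)
    num = np.sqrt(np.sum((z_pred - z_true) ** 2))
    den = np.sqrt(np.sum((z_true - z_true.mean()) ** 2))
    return num / den

# Toy usage with made-up numbers
print(standardized_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```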

Condition Number of Σ_M + Σ_ε

• C = λ_max(K) / λ_min(K) measures "closeness to singularity"

19

Scalability Demonstration

• 4-d Griewank function (coded below):

Z(x) = Σᵢ₌₁⁴ ( x⁽ⁱ⁾ / 20 )² − 10 ∏ᵢ₌₁ᴰ cos( x⁽ⁱ⁾ / √i ) + 10

• Mean cycle time of an N-station Jackson network with D different types of arrivals (Yang et al. 2011), N = D = 4:

E[CT₁] = Σⱼ₌₁ᴺ δ₁ⱼ / { μⱼ [ 1 − ρ ( Σᵢ₌₁ᴰ αᵢδᵢⱼ/μⱼ ) / ( maxₕ Σᵢ₌₁ᴰ αᵢδᵢₕ/μₕ ) ] }

20
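For reference, here is the Griewank-type test surface as reconstructed from the slide, written as a sketch; the scaling constants follow the formula above and the evaluation points are arbitrary.

```python
import numpy as np

# 4-d Griewank-type test surface as reconstructed from the slide:
# Z(x) = sum_i (x_i / 20)^2 - 10 * prod_i cos(x_i / sqrt(i)) + 10.
# Treat the constants as a sketch of the slide's formula, not a verified spec.

def griewank4(x):
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return np.sum((x / 20.0) ** 2) - 10.0 * np.prod(np.cos(x / np.sqrt(i))) + 10.0

print(griewank4([0.0, 0.0, 0.0, 0.0]))   # = 0 at the origin
print(griewank4([5.0, -3.0, 2.5, 7.0]))
```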

Computational Efficiency

21

Scalable Gradient Extrapolated Stochastic Kriging

Enhancing Scalability of GESK with MCFs

• GESK creates an augmented set of response estimates for SK

• MCFs can be applied if the design points form a regular lattice of size n = n₁ × n₂ × ··· × n_d

• Results in 2ᵈn points in the augmented dataset

• Σ_M has size 2ᵈn × 2ᵈn, but we can leverage the Kronecker product to reduce its inversion to inverting d much smaller matrices, each of size 2nᵣ × 2nᵣ

22

Numerical Illustration

[Bar charts comparing the EIMSE of SK and GESK for n = 5⁴ = 625, n = 8⁴ = 4096, and n = 10⁴ = 10000 design points]

• 4-dimensional Griewank function

• Can manage n = 10⁴ design points

23

Conclusions

Remarks on MCFs

• Allow modeling association directly while retaining sparsity in the precision matrix

• Improve the scalability of SK so that it can be used for simulation models with a high-dimensional design space

• Reduce computational cost from O(n³) to O(n²) without approximation

  • Further reduced to O(n) if observations are noise-free

• Enhance numerical stability substantially

• Limitation: design points must form a regular lattice, though not necessarily equally spaced

24

Remarks on Gradient Enhanced SK

• GESK (Qu and Fu 2014) can easily benefit from MCFs

• But there are two issues:

  • Extrapolation error is hard to characterize

  • Each design point needs (2ᵈ − 1) pseudo response estimates: a great deal of redundancy in using gradient info

• SKG (Chen, Ankenman, and Nelson 2013) does not incur such computational overhead, but requires calculating the gradient surface of the Gaussian process (ongoing work)

25

Markovian covariances without approximation

vs.

Good approximations for all covariances

25

Page 2: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Table of Contents

1 Stochastic Kriging and Big n Problem

2 Markovian Covariance Functions

3 Scalable Gradient Extrapolated Stochastic Kriging

4 Conclusions

1

Stochastic Kriging and Big n

Problem

Metamodeling

SimModel

RealSystem

Meta-model

bull Simulation models are often computationally expensive

bull Metamodel fast approximation of simulation model

bull Run simulation at a small number of design points

bull Predict responses based on the simulation outputs

2

Stochastic Kriging

bull Also called Gaussian process (GP) regression

bull Unknown surface is modeled as a Gaussian process

Z(x) = β +M(x) x isin X sube Rd

bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction

3

Partial Literature

bull Quantification of input uncertainty

bull Barton Nelson and Xie (2014)

bull Xie Nelson and Barton (2014)

bull Simulationblack-boxBayesian optimization

bull Huang et al (2006)

bull Sun Hong and Hu (2014)

bull Scott Frazier and Powell (2011)

bull Shahriari et al (2016)

4

The Big n Problem

bull Response surface is observed at x1 xn with noise

z(xi ) = β +M(xi ) + ε(xi )

bull Best linear unbiased predictor of Z(x0)

983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull Maximum likelihood estimation

maxβθθθ

983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]

⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052

bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time

bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular

bull Especially for the popular Gaussian covariance function

bull Usually run into trouble when n gt 100 which can easily happen

when d ge 3

5

Enhancing SK with Gradient Information

bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1

j (xi ) gdj (xi ))

g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d

where G r (xi ) is the true r -th partial derivative

bull Predict Z(x0) using both response estimates and gradient estimates

bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)

simple using gradients indirectly

bull Chen Ankenman and Nelson (2013) stochastic kriging with

gradient estimators (SKG) sophisticated using gradients directly

6

GESK (Qu and Fu 2014)

bull Use gradient estimates to create ldquopseudordquo response estimates

zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi

where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation

bull Predict Z(x0) using the augmented data

(z(x1) z(xn) z(x1) z(xn))

bull The size of the covariance matrix now becomes 2n times 2n

bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n

bull Similar problem for SKG

7

Approximation Schemes

bull Well developed in spatial statistics and machine learning

bull Banerjee et al (2015)

bull Rasmussen and Williams (2006)

bull Reduced-rank approximations emphasize long-range dependences

bull Sparse approximations emphasize short-range dependences

optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search

2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie

kethxTHORN frac14 1

eth2THORNd

Ze$ iWTxsethWTHORNdW (38)

Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following

kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i

(39)

As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that

kethxx0THORN

m

Xm

ifrac14 1

e$ iWethiTHORNTxeiWethiTHORN

Tx0 (40)

where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN

As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data

3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface

Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off

The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the

Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95

credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are

shown as black crosses The SSGP model used a basis of 80 Fourier features

Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization

Vol 104 No 1 January 2016 | Proceedings of the IEEE 159

Figure 1 Posterior means and variances Source Shahriari et al (2016)

8

Approximation-free

8

Markovian Covariance Functions

Gaussian Markov Random Field (GMRF)

bull M is multivariate normal with sparsity specified on ΣΣΣminus1M

bull A discrete model using graph to describe Markovian structure

bull Given all its neighbors node i is conditionally independent of its

non-neighbors

bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1

M (i j) ∕= 0 lArrrArr i and j are neighbors

0 1 2 3 4

bull The sparsity can reduce necessary computation to O(n2)

9

Disadvantages

bull Has no explicit expression for the covariances

bull Cannot predict locations ldquooff the gridrdquo

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

10

Markovian Covariance Function Best of Two Worlds

bull Construct a class of covariance functions for which

1 ΣΣΣM can be inverted analytically

2 ΣΣΣminus1M is sparse

bull Explicit link between covariance function and sparsity

Definition (1-d MCF)

Let p and q be two positive continuous functions that satisfy

p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then

k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF

bull Brownian motion kBM(x y) = x Ixley +y Ixgty

bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty

bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty

11

Markovian Covariance Function

bull x1 xn are not necessarily equally spaced

Theorem (Ding and Z 2018)

Kminus1 is tridiagonal and its nonzero entries are

(Kminus1)ii =

983099983105983105983105983105983105983105983103

983105983105983105983105983105983105983101

p2p1(p2q1 minus p1q2)

if i = 1

pi+1qiminus1 minus piminus1qi+1

(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1

qnminus1

qn(pnqnminus1 minus pnminus1qn) if i = n

and

(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1

piqiminus1 minus piminus1qi i = 2 n

12

Reduction in Complexity

bull Woodbury matrix identity

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M983167983166983165983168known

+ ΣΣΣminus1M983167983166983165983168

sparse

983147ΣΣΣminus1

M +ΣΣΣminus1ε983167 983166983165 983168

sparse

983148minus1

ΣΣΣminus1M

bull inversion O(n2)

bull multiplications O(n2)

bull addition O(n2)

bull It takes O(n2) time to compute BLUP

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is

needed and computing BLUP is O(n)

13

Improvement in Stability

1 ΣΣΣM can be made much better conditioned

2 Woodbury also improves numerical stability

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M +ΣΣΣminus1M

983147ΣΣΣminus1

M +ΣΣΣminus1ε

983148minus1

ΣΣΣminus1M

bull The diagonal entries of ΣΣΣminus1ε are often large

14

Uncertainty Quantification

15

Extension for d gt 1

bull Product form k(x y) =983124d

i=1 ki (xi y i )

bull Limitation x1 xn must form a regular lattice

bull Then K =983121d

i=1 Ki and Kminus1 =983121d

i=1 Kminus1i preserving sparsity

(00)

(01)

(02)

(10)

(11)

(12)

(20)

(21)

(22)

16

Two-Dimensional Response Surfaces

Function Name Expression

Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6

6+ xy + y 2

Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07

17

Prediction Accuracy

bull Standardized RMSE =

983155983123K

i=1[Z(xi )minusZ(xi )]2

raquo983123K

i=1[Z(xi )minusKminus1983123K

h=1Z(xh)]

2

18

Condition Number of ΣΣΣM +ΣΣΣε

bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo

19

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 3: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Stochastic Kriging and Big n

Problem

Metamodeling

SimModel

RealSystem

Meta-model

bull Simulation models are often computationally expensive

bull Metamodel fast approximation of simulation model

bull Run simulation at a small number of design points

bull Predict responses based on the simulation outputs

2

Stochastic Kriging

bull Also called Gaussian process (GP) regression

bull Unknown surface is modeled as a Gaussian process

Z(x) = β +M(x) x isin X sube Rd

bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction

3

Partial Literature

bull Quantification of input uncertainty

bull Barton Nelson and Xie (2014)

bull Xie Nelson and Barton (2014)

bull Simulationblack-boxBayesian optimization

bull Huang et al (2006)

bull Sun Hong and Hu (2014)

bull Scott Frazier and Powell (2011)

bull Shahriari et al (2016)

4

The Big n Problem

bull Response surface is observed at x1 xn with noise

z(xi ) = β +M(xi ) + ε(xi )

bull Best linear unbiased predictor of Z(x0)

983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull Maximum likelihood estimation

maxβθθθ

983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]

⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052

bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time

bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular

bull Especially for the popular Gaussian covariance function

bull Usually run into trouble when n gt 100 which can easily happen

when d ge 3

5

Enhancing SK with Gradient Information

bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1

j (xi ) gdj (xi ))

g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d

where G r (xi ) is the true r -th partial derivative

bull Predict Z(x0) using both response estimates and gradient estimates

bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)

simple using gradients indirectly

bull Chen Ankenman and Nelson (2013) stochastic kriging with

gradient estimators (SKG) sophisticated using gradients directly

6

GESK (Qu and Fu 2014)

bull Use gradient estimates to create ldquopseudordquo response estimates

zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi

where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation

bull Predict Z(x0) using the augmented data

(z(x1) z(xn) z(x1) z(xn))

bull The size of the covariance matrix now becomes 2n times 2n

bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n

bull Similar problem for SKG

7

Approximation Schemes

bull Well developed in spatial statistics and machine learning

bull Banerjee et al (2015)

bull Rasmussen and Williams (2006)

bull Reduced-rank approximations emphasize long-range dependences

bull Sparse approximations emphasize short-range dependences

optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search

2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie

kethxTHORN frac14 1

eth2THORNd

Ze$ iWTxsethWTHORNdW (38)

Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following

kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i

(39)

As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that

kethxx0THORN

m

Xm

ifrac14 1

e$ iWethiTHORNTxeiWethiTHORN

Tx0 (40)

where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN

As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data

3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface

Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off

The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the

Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95

credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are

shown as black crosses The SSGP model used a basis of 80 Fourier features

Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization

Vol 104 No 1 January 2016 | Proceedings of the IEEE 159

Figure 1 Posterior means and variances Source Shahriari et al (2016)

8

Approximation-free

8

Markovian Covariance Functions

Gaussian Markov Random Field (GMRF)

bull M is multivariate normal with sparsity specified on ΣΣΣminus1M

bull A discrete model using graph to describe Markovian structure

bull Given all its neighbors node i is conditionally independent of its

non-neighbors

bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1

M (i j) ∕= 0 lArrrArr i and j are neighbors

0 1 2 3 4

bull The sparsity can reduce necessary computation to O(n2)

9

Disadvantages

bull Has no explicit expression for the covariances

bull Cannot predict locations ldquooff the gridrdquo

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

10

Markovian Covariance Function Best of Two Worlds

bull Construct a class of covariance functions for which

1 ΣΣΣM can be inverted analytically

2 ΣΣΣminus1M is sparse

bull Explicit link between covariance function and sparsity

Definition (1-d MCF)

Let p and q be two positive continuous functions that satisfy

p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then

k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF

bull Brownian motion kBM(x y) = x Ixley +y Ixgty

bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty

bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty

11

Markovian Covariance Function

bull x1 xn are not necessarily equally spaced

Theorem (Ding and Z 2018)

Kminus1 is tridiagonal and its nonzero entries are

(Kminus1)ii =

983099983105983105983105983105983105983105983103

983105983105983105983105983105983105983101

p2p1(p2q1 minus p1q2)

if i = 1

pi+1qiminus1 minus piminus1qi+1

(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1

qnminus1

qn(pnqnminus1 minus pnminus1qn) if i = n

and

(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1

piqiminus1 minus piminus1qi i = 2 n

12

Reduction in Complexity

bull Woodbury matrix identity

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M983167983166983165983168known

+ ΣΣΣminus1M983167983166983165983168

sparse

983147ΣΣΣminus1

M +ΣΣΣminus1ε983167 983166983165 983168

sparse

983148minus1

ΣΣΣminus1M

bull inversion O(n2)

bull multiplications O(n2)

bull addition O(n2)

bull It takes O(n2) time to compute BLUP

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is

needed and computing BLUP is O(n)

13

Improvement in Stability

1 ΣΣΣM can be made much better conditioned

2 Woodbury also improves numerical stability

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M +ΣΣΣminus1M

983147ΣΣΣminus1

M +ΣΣΣminus1ε

983148minus1

ΣΣΣminus1M

bull The diagonal entries of ΣΣΣminus1ε are often large

14

Uncertainty Quantification

15

Extension for d gt 1

bull Product form k(x y) =983124d

i=1 ki (xi y i )

bull Limitation x1 xn must form a regular lattice

bull Then K =983121d

i=1 Ki and Kminus1 =983121d

i=1 Kminus1i preserving sparsity

(00)

(01)

(02)

(10)

(11)

(12)

(20)

(21)

(22)

16

Two-Dimensional Response Surfaces

Function Name Expression

Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6

6+ xy + y 2

Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07

17

Prediction Accuracy

bull Standardized RMSE =

983155983123K

i=1[Z(xi )minusZ(xi )]2

raquo983123K

i=1[Z(xi )minusKminus1983123K

h=1Z(xh)]

2

18

Condition Number of ΣΣΣM +ΣΣΣε

bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo

19

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 4: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Metamodeling

SimModel

RealSystem

Meta-model

bull Simulation models are often computationally expensive

bull Metamodel fast approximation of simulation model

bull Run simulation at a small number of design points

bull Predict responses based on the simulation outputs

2

Stochastic Kriging

bull Also called Gaussian process (GP) regression

bull Unknown surface is modeled as a Gaussian process

Z(x) = β +M(x) x isin X sube Rd

bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction

3

Partial Literature

bull Quantification of input uncertainty

bull Barton Nelson and Xie (2014)

bull Xie Nelson and Barton (2014)

bull Simulationblack-boxBayesian optimization

bull Huang et al (2006)

bull Sun Hong and Hu (2014)

bull Scott Frazier and Powell (2011)

bull Shahriari et al (2016)

4

The Big n Problem

bull Response surface is observed at x1 xn with noise

z(xi ) = β +M(xi ) + ε(xi )

bull Best linear unbiased predictor of Z(x0)

983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull Maximum likelihood estimation

maxβθθθ

983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]

⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052

bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time

bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular

bull Especially for the popular Gaussian covariance function

bull Usually run into trouble when n gt 100 which can easily happen

when d ge 3

5

Enhancing SK with Gradient Information

bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1

j (xi ) gdj (xi ))

g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d

where G r (xi ) is the true r -th partial derivative

bull Predict Z(x0) using both response estimates and gradient estimates

bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)

simple using gradients indirectly

bull Chen Ankenman and Nelson (2013) stochastic kriging with

gradient estimators (SKG) sophisticated using gradients directly

6

GESK (Qu and Fu 2014)

bull Use gradient estimates to create ldquopseudordquo response estimates

zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi

where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation

bull Predict Z(x0) using the augmented data

(z(x1) z(xn) z(x1) z(xn))

bull The size of the covariance matrix now becomes 2n times 2n

bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n

bull Similar problem for SKG

7

Approximation Schemes

bull Well developed in spatial statistics and machine learning

bull Banerjee et al (2015)

bull Rasmussen and Williams (2006)

bull Reduced-rank approximations emphasize long-range dependences

bull Sparse approximations emphasize short-range dependences

optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search

2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie

kethxTHORN frac14 1

eth2THORNd

Ze$ iWTxsethWTHORNdW (38)

Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following

kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i

(39)

As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that

kethxx0THORN

m

Xm

ifrac14 1

e$ iWethiTHORNTxeiWethiTHORN

Tx0 (40)

where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN

As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data

3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface

Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off

The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the

Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95

credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are

shown as black crosses The SSGP model used a basis of 80 Fourier features

Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization

Vol 104 No 1 January 2016 | Proceedings of the IEEE 159

Figure 1 Posterior means and variances Source Shahriari et al (2016)

8

Approximation-free

8

Markovian Covariance Functions

Gaussian Markov Random Field (GMRF)

bull M is multivariate normal with sparsity specified on ΣΣΣminus1M

bull A discrete model using graph to describe Markovian structure

bull Given all its neighbors node i is conditionally independent of its

non-neighbors

bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1

M (i j) ∕= 0 lArrrArr i and j are neighbors

0 1 2 3 4

bull The sparsity can reduce necessary computation to O(n2)

9

Disadvantages

bull Has no explicit expression for the covariances

bull Cannot predict locations ldquooff the gridrdquo

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

10

Markovian Covariance Function Best of Two Worlds

bull Construct a class of covariance functions for which

1 ΣΣΣM can be inverted analytically

2 ΣΣΣminus1M is sparse

bull Explicit link between covariance function and sparsity

Definition (1-d MCF)

Let p and q be two positive continuous functions that satisfy

p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then

k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF

bull Brownian motion kBM(x y) = x Ixley +y Ixgty

bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty

bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty

11

Markovian Covariance Function

bull x1 xn are not necessarily equally spaced

Theorem (Ding and Z 2018)

Kminus1 is tridiagonal and its nonzero entries are

(Kminus1)ii =

983099983105983105983105983105983105983105983103

983105983105983105983105983105983105983101

p2p1(p2q1 minus p1q2)

if i = 1

pi+1qiminus1 minus piminus1qi+1

(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1

qnminus1

qn(pnqnminus1 minus pnminus1qn) if i = n

and

(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1

piqiminus1 minus piminus1qi i = 2 n

12

Reduction in Complexity

bull Woodbury matrix identity

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M983167983166983165983168known

+ ΣΣΣminus1M983167983166983165983168

sparse

983147ΣΣΣminus1

M +ΣΣΣminus1ε983167 983166983165 983168

sparse

983148minus1

ΣΣΣminus1M

bull inversion O(n2)

bull multiplications O(n2)

bull addition O(n2)

bull It takes O(n2) time to compute BLUP

983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known

[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]

bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is

needed and computing BLUP is O(n)

13

Improvement in Stability

1 ΣΣΣM can be made much better conditioned

2 Woodbury also improves numerical stability

[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1

M +ΣΣΣminus1M

983147ΣΣΣminus1

M +ΣΣΣminus1ε

983148minus1

ΣΣΣminus1M

bull The diagonal entries of ΣΣΣminus1ε are often large

14

Uncertainty Quantification

15

Extension for d gt 1

bull Product form k(x y) =983124d

i=1 ki (xi y i )

bull Limitation x1 xn must form a regular lattice

bull Then K =983121d

i=1 Ki and Kminus1 =983121d

i=1 Kminus1i preserving sparsity

(00)

(01)

(02)

(10)

(11)

(12)

(20)

(21)

(22)

16

Two-Dimensional Response Surfaces

Function Name Expression

Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6

6+ xy + y 2

Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07

17

Prediction Accuracy

bull Standardized RMSE =

983155983123K

i=1[Z(xi )minusZ(xi )]2

raquo983123K

i=1[Z(xi )minusKminus1983123K

h=1Z(xh)]

2

18

Condition Number of ΣΣΣM +ΣΣΣε

bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo

19

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25







necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 22: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Prediction Accuracy

bull Standardized RMSE =

983155983123K

i=1[Z(xi )minusZ(xi )]2

raquo983123K

i=1[Z(xi )minusKminus1983123K

h=1Z(xh)]

2

18

Condition Number of ΣΣΣM +ΣΣΣε

bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo

19

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 23: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Condition Number of ΣΣΣM +ΣΣΣε

bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo

19

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 24: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Scalability Demonstration

bull 4-d Griewank func Z(x) =9831234

i=1

Aumlx (i)

20

auml2minus 10

983124Di=1 cos

Aumlx (i)radici

auml+ 10

bull Mean cycle time of a N-station Jackson network with D different

types of arrivals (Yang et al 2011) N = D = 4

E[CT1] =N983131

j=1

δ1j

microj

iuml1minus ρ

Aring 983123D

i=1αiδijmicroj

maxh983123D

i=1αiδihmicroh

atildeograve

20

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 25: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Computational Efficiency

21

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 26: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Scalable Gradient Extrapolated

Stochastic Kriging

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 27: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Enhancing Scalability of GESK with MCFs

bull GESK creates an augmented set of response estimates for SK

bull MCFs can be applied if the design points form a regular lattice of

size n = n1 times n2 times middot middot middot nd

bull Result in 2dn points in the augmented dataset

bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product

to reduce its inversion to inverting d much smaller matrices each

having size 2nr times 2nr

22

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 28: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Numerical Illustration

SK GESK

n=54=625 =108

0

001

002

003

004

005

006

007

008

EIM

SE

SK GESK

n=84=4096 =07

0

001

002

003

004

005

006

007

008

SK GESK

n=104=10000 =06

0

001

002

003

004

005

006

007

008

bull 4-dimensional Griewank function

bull Can manage n = 104 design points

23

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 29: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Conclusions

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 30: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Remarks on MCFs

bull Allow modeling association directly while retaining sparsity in the

precision matrix

bull Improve the scalability of SK so that it can be used for simulation

models with a high-dimensional design space

bull Reduce computational cost from O(n3) to O(n2) without approx

bull Further reduce to O(n) if observations are noise-free

bull Enhance numerical stability substantially

bull Limitation design points must form a regular lattice though not

necessarily equally spaced

24

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 31: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Remarks on Gradient Enhanced SK

bull GESK (Qu and Fu 2014) can easily benefit from MCFs

bull But there are two issues

bull Extrapolation error is hard to characterize

bull Each design point needs (2d minus 1) pseudo response estimates a great

deal of redundancy in using gradient info

bull SKG (Chenn Ankenman and Nelson 2013) does not incur such

computational overhead but requires calculating the gradient

surface of the Gaussian process (on-going work)

25

Markovian covariances without approx

vs

Good approx for all covariances

25

Page 32: A Scalable Approach to Gradient-Enhanced ... - Xiaowei Zhang · A Scalable Approach to Gradient-Enhanced Stochastic Kriging Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡ †

Markovian covariances without approx

vs

Good approx for all covariances

25


Recommended