Distribution of Gaussian Process Arc Lengths

Justin D. Bewsher (University of Oxford), Alessandra Tosi (Mind Foundry, Oxford), Michael A. Osborne (University of Oxford), Stephen J. Roberts (University of Oxford)

Abstract

We present the first treatment of the arc length of the Gaussian Process (GP) with more than a single output dimension. GPs are commonly used for tasks such as trajectory modelling, where path length is a crucial quantity of interest. Previously, only paths in one dimension have been considered, with no theoretical consideration of higher dimensional problems. We fill the gap in the existing literature by deriving the moments of the arc length for a stationary GP with multiple output dimensions. A new method is used to derive the mean of a one-dimensional GP over a finite interval, by considering the distribution of the arc length integrand. This technique is used to derive an approximate distribution over the arc length of a vector-valued GP in Rⁿ by moment matching the distribution. Numerical simulations confirm our theoretical derivations.

1 INTRODUCTION

Gaussian Processes (GPs) [22] are a ubiquitous tool in machine learning. They provide a flexible non-parametric approach to non-linear data modelling. GPs have been used in a number of machine learning problems, including latent variable modelling [10], dynamical time-series modelling [20] and Bayesian optimisation [18].

At present, a gap exists in the literature; we fill that gap by providing the first analysis of the moments of the arc length of a vector-valued GP. Previous work tackles only the univariate case, making it inapplicable to important applications in, for example, medical vision (or brain imaging) [6] and path planning [15].

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

The authors believe that an understanding of the arc length properties of a GP will open up promising avenues of research. Arc length statistics have been used to analyze multivariate time series modelling [21]. In [19], the authors minimize the arc length of a deterministic curve which then implicitly defines a GP. In another related paper [7], the authors use GPs as approximations to geodesics and compute arc lengths using a naïve Monte Carlo method.

We envision the arc length as a cost function in Bayesian optimisation, as a tool in path planning problems [11] and as a way to construct meaningful features from functional data.

Consider a Euclidean space X = Rⁿ and a differentiable injective function γ : [0, T] → Rⁿ. Then the image of γ is a curve with length:

length(γ) = ∫_0^T |γ′(t)| dt.    (1)

Importantly, the length of the curve is independent of the choice of parametrization of the curve [4]. For the specific case where X = R², with the parametrization in terms of t, γ = (y(t), x(t)), we have:

length(γ) = ∫_0^T |γ′(t)| dt = ∫_0^T √(y′(t)² + x′(t)²) dt.    (2)

If we can write y = f(x), x = t, then our expression reduces to the commonly known expression for the arc length of a function:

length(γ) = s = ∫_0^T |γ′(t)| dt = ∫_0^T √(1 + f′(t)²) dt,    (3)

where we have introduced s as a shorthand for the length of our curve. In some cases the exact form of s can be computed; for more complicated curves, such as Bézier and spline curves, we must appeal to numerical methods to compute the length.

Our interest lies in considering the length of a function modelled with a GP. Intuition suggests that the length of a GP will concentrate around that of the mean function, with the statistical properties dictated by the choice of kernel and the corresponding hyperparameters.

Previous work [2] considered the derivative process f′(t), which is itself a GP. A direct calculation was performed and an exact form for the mean was obtained in terms of modified Bessel functions; a form for the variance is also presented. This result is also derived in [14, 3]. Analysis has been presented on the arc length of a high-level excursion from the mean [16].

However, other important questions have not been explored within the literature. In particular, the shape of the distribution has not been communicated, computing the arc length of a posterior GP has not been addressed, nor has anyone considered the arc length of GPs in anything other than R. These issues are addressed within this paper; we present a new derivation of the mean of a one-dimensional GP and derive the moments of a GP in Rⁿ.

The paper is structured as follows. In Section 2 we review the theory of GPs and introduce the notation necessary to deal with GPs defined on Rⁿ. In Section 3 we examine the one-dimensional case, deriving a distribution over the arc length increment before computing the mean and variance of the arc length. In Section 4 we consider the general case. A closed-form distribution is not possible for the increment, therefore we provide a moment-matched approximation that proves faithful to the true distribution. This distribution allows us to compute the corresponding moments for the arc length. Section 5 presents numerical simulations demonstrating the theoretical results. Finally we conclude with thoughts on the use of arc length priors.

2 GAUSSIAN PROCESSES

2.1 Single Output Gaussian Processes

Consider a stochastic process f : X → R. If f is a GP, with mean function µ and kernel k, we write

f ~ GP(µ, k).    (4)

We can think of a GP as an extension of the multivariate Gaussian distribution to function values and, as in the multivariate case, a GP is completely specified by its mean and covariance function. For a detailed introduction see [22]. Given a set of observations S = {x_i, y_i}_{i=1}^N with Gaussian noise σ², the posterior distribution for an unseen datum, x*, is

p(f(x*) | S, x*, σ) = N(f(x*); m(x*), C*(x*, x*)).    (5)

Letting X = [x₁, …, x_N], and defining k_{x*} = K(X, x*), the posterior mean and covariance are:

m(x*) = k_{x*}ᵀ (k(X, X) + σ²I)⁻¹ y    (6)

C*(x*, x*) = k(x*, x*) − k_{x*}ᵀ (k(X, X) + σ²I)⁻¹ k_{x*}.    (7)

The derivative of the posterior mean can be calculated:

∂m*/∂x* = (∂k(x*, X)/∂x*) (k(X, X) + σ²I)⁻¹ y,    (8)

wherever the derivative of the kernel function can be calculated. The covariance for the derivative process can likewise be derived, and hence a full distribution over the derivative process can be specified.
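As an illustrative sketch (our own, not the authors' code), Eqns (6) and (8) can be evaluated directly for a squared-exponential kernel; the data, length scale, and noise level below are arbitrary choices:

```python
import numpy as np

def se_kernel(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix k(a, b)."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def dse_kernel(a, b, ell=1.0, sf2=1.0):
    """Derivative of k(x*, x) with respect to x*, for the SE kernel."""
    return -(a[:, None] - b[None, :]) / ell**2 * se_kernel(a, b, ell, sf2)

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)
y = np.sin(X) + 0.05 * rng.standard_normal(X.size)
alpha = np.linalg.solve(se_kernel(X, X) + 0.05**2 * np.eye(X.size), y)

xs = np.array([2.5])
m = se_kernel(xs, X) @ alpha     # posterior mean, Eqn (6)
dm = dse_kernel(xs, X) @ alpha   # posterior mean derivative, Eqn (8)
print(m[0], dm[0])               # close to sin(2.5) and cos(2.5) for this data
```

The derivative in Eqn (8) is exact for the posterior mean, since the mean is a finite kernel expansion.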

Alternatively, a derivative GP can be defined for any twice-differentiable kernel in terms of our prior distribution. If f ~ GP(µ, k), then we write the derivative process as

f′ ~ GP(∂µ, ∂²k).    (9)

The autocorrelation, ρ_{f′}(τ), of the derivative process is related to the autocorrelation of the original process via the relation [13]:

ρ_{f′}(τ) = −(d²/dτ²) ρ_f(τ) = (∂²/∂x∂x′) k(x − x′),    (10)

where τ = x − x′. The variance of f′ is therefore given by σ²_{f′} = ρ_{f′}(0).

2.2 Vector Valued Gaussian Processes

The development of multi-output GPs proceeds in a manner similar to the single-output case; for a detailed review see [1]. The outputs are random variables associated with different processes evaluated at potentially different values of x. We consider a vector-valued GP:

f ~ GP(m, K),    (11)

where m ∈ R^D is the mean vector whose components {m_d(x)}_{d=1}^D are the mean functions associated with each output, and K is now a positive-definite matrix-valued function: (K(x, x′))_{d,d′} is the covariance between f_d(x) and f_{d′}(x′). Given input X, our prior over f(X) is now

f(X) ~ N(m(X), K(X, X)).    (12)

m(X) is a DN-length vector that concatenates the mean vectors for each output, and K(X, X) is an ND × ND block-partitioned matrix. In the vector-valued case the predictive equations for an unseen datum, x*, become:

m(x*) = K_{x*}ᵀ (K(X, X) + Σ)⁻¹ y    (13)

C*(x*, x*) = K(x*, x*) − K_{x*}ᵀ (K(X, X) + Σ)⁻¹ K_{x*},    (14)

where Σ is a block-diagonal matrix with the prior noise of each output along the diagonal. The problem now focuses on specifying the form of the covariance matrix K. We are interested in separable kernels of the form:

K(x, x′)_{d,d′} = k(x, x′) k_T(d, d′),    (15)

where k and k_T are themselves valid kernels. The kernel can then be specified in the form:

K(x, x′) = k(x, x′) B,    (16)

where B is a D × D matrix. For a data set X:

K(X, X) = B ⊗ k(X, X),    (17)

with ⊗ representing the Kronecker product. B specifies the degree of correlation between the outputs. Various choices of B result in what is known as the Intrinsic Model of Coregionalisation (IMC) or the Linear Model of Coregionalisation (LMC).
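Eqn (17) is direct to implement. The sketch below (our own illustration, with an arbitrary B and a squared-exponential spatial kernel) builds the DN × DN covariance with a Kronecker product:

```python
import numpy as np

def coregional_cov(B, X, ell=1.0, sf2=1.0):
    """Separable multi-output covariance, Eqn (17): K(X, X) = B ⊗ k(X, X)."""
    K = sf2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / ell**2)
    return np.kron(B, K)

B = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # output-correlation matrix, D = 2
X = np.linspace(0.0, 1.0, 5)      # N = 5 inputs
KXX = coregional_cov(B, X)
print(KXX.shape)                   # (10, 10), i.e. DN x DN
```

Since the Kronecker product of two positive semi-definite matrices is positive semi-definite, any valid B and spatial kernel yield a valid joint covariance.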

2.3 Kernel Choices

A GP prior is specified by a choice of kernel, which encodes our beliefs about the nature of our function's behaviour. In the case of infinitely differentiable functions we might choose the exponentiated quadratic; for periodic functions, the periodic kernel; and for cases where we wish to control the differentiability of our function we might select the Matérn class of kernels [22]. Samples from each kernel exhibit distinct sample-curve behaviour.

We demonstrate how the choice of kernel impacts the statistical behaviour of our arc length and relate this to the kernel hyperparameters of several popular kernels. For the vector-valued GP we show the statistical properties are related to the choice of the spatial kernel coupled with the choice of the output dependency matrix B, in particular its eigenvalues. This highlights how kernel choice affects not only the shape but the length of our functions, or conversely how knowledge of the prior curve length could be used to inform kernel selection.

3 ONE-DIMENSIONAL ARC LENGTH

First we consider the one-dimensional case, where we develop a new method to derive the expected length; an approach that can be used in the vector case. Consider a GP, f ~ GP(0, K), with a corresponding derivative process f′ ~ GP(0, ∂²K). Then the arc length is the quantity:

s = ∫_a^b √(1 + (f′)²) dt.    (18)

We are interested in computing E[s] and V[s], which require integrating s and s² against the distribution over f′. Instead of attempting to compute these quantities directly, we sidestep the problem and first determine the probability distribution over the arc length integrand (1 + (f′)²)^{1/2}.

3.1 Integrand Distribution

We present a new method for deriving the mean and variance of the arc length of a one-dimensional GP by first considering the transformation of a normally distributed variable under the non-linear map g(x) = (1 + x²)^{1/2}. Specifically, we consider the distribution of a normally distributed random variable under the transformation g:

Y = g(X) = √(1 + X²),  X ~ N(µ, σ²).    (19)

We consider the more general case where µ ≠ 0. Intuitively, we expect our distribution for Y to be a skewed Chi distribution. We are able to directly compute the probability density function for Y by considering the cumulative distribution and using the standard rules for the transformation of probability functions:

P(Y < y) = F_X(√(y² − 1) + µ) − (1 − F_X(√(y² − 1) − µ)),    (20)

where F_X is the cumulative distribution function of X; details in the Supplementary Material. The probability density function (pdf) of Y is obtained by taking the derivative of P(Y < y) with respect to y; further details in the Supplementary Material:

p_Y(y) = (1/(√(2π) σ)) [ exp(−(√(y² − 1) + µ)²/(2σ²)) + exp(−(√(y² − 1) − µ)²/(2σ²)) ] · y/√(y² − 1).    (21)

This probability distribution is valid for y > 1 and a straightforward calculation shows that ∫_{y∈Y} p_Y(y) dy = 1. Computation of the expectation of the integrand distribution can now be done in closed form; the process is outlined in the Supplementary Material. The final expression is:

E[y] = (1/(√(2π) σ)) exp(−µ²/(2σ²)) Σ_{l=0}^∞ [Γ(l + 1/2)/(2l)!] (µ/σ²)^{2l} U(l + 1/2, l + 2, 1/(2σ²)).    (22)

Here Γ(n) is the gamma function and U(a, b, z) is the confluent hypergeometric function of the second kind, defined by the integral expression:

U(a, b, z) = (1/Γ(a)) ∫_0^∞ exp(−zt) t^{a−1} (1 + t)^{b−a−1} dt.    (23)

A similar process allows us to derive an exact expression for E_{p_Y}[y²] and hence V_{p_Y}[y]. Figure 1 shows draws of g(X), overlaid with p_Y(y) for a range of µ and σ.

Figure 1: Histogram of samples from √(1 + X²), where X ~ N(µ, σ²), overlaid with the corresponding distribution. We display the effects of varying µ and σ.

3.2 Arc Length Statistics

Having derived expressions for the arc length integrand distribution we are able to evaluate the moments of the arc length. Specifically, we consider a zero-mean GP with kernel K:

f ~ GP(0, K).    (24)

The derivative process is a GP [22] defined by:

f′ ~ GP(0, ∂²K).    (25)

We take the expectation of the arc length; noting that the integrand is non-negative, by Fubini's theorem [5] we can interchange the expectation and the integral:

E[s] = E[ ∫_0^T √(1 + (f′)²) dt ]    (26)

     = ∫_0^T E[ √(1 + (f′)²) ] dt.    (27)

The variance of f′ is given by σ²_{f′} = ρ_{f′}(0). At each point along the integral the expectation of the integrand is the same, therefore we arrive at:

E[s] = ∫_0^T [ (1/(√(2π) σ_{f′})) Γ(1/2) U(1/2, 2, 1/(2σ²_{f′})) ] dt    (28)

     = T (1/(√(2π) σ_{f′})) Γ(1/2) U(1/2, 2, 1/(2σ²_{f′})),    (29)

where we have used the expectation of the integrand for the zero-mean case. Using identities for the confluent hypergeometric function we can rewrite the mean as:

E[s] = [T exp(1/(4σ²_{f′})) / (2√(2π) σ_{f′})] [ K₀(1/(4σ²_{f′})) + K₁(1/(4σ²_{f′})) ],    (30)

where K_i is the modified Bessel function of the second kind of order i. For a posterior distribution of the arc length, given data observations, we would use Eqn 22 along with the posterior derivative mean, µ_{f′} = ∂m*/∂x*, and the variance function of the posterior GP, σ²_{f′}, to compute the expected length:

E[s] = Σ_{l=0}^∞ [Γ(l + 1/2)/(2l)!] ∫_0^T (1/(√(2π) σ_{f′})) exp(−µ²_{f′}/(2σ²_{f′})) (µ_{f′}/σ²_{f′})^{2l} U(l + 1/2, l + 2, 1/(2σ²_{f′})) dt,    (31)

where µ_{f′} and σ_{f′} depend on t. We have derived a closed-form expression for the mean of the arc length of a one-dimensional zero-mean GP, reproducing the original result from [2] whilst providing a way to compute the arc length mean of a GP posterior distribution. The variance involves the computation of the second moment, a calculation involving the bivariate form of the integrand distribution; we do not derive that in this paper. An alternate derivation is reported in [2].

3.2.1 Kernel Derivatives

The value of the mean arc length is determined solely by the derivative variance, σ²_{f′}. For stationary kernels, k(x, x′) = k(x − x′), this equates to:

σ²_{f′} = (∂²/∂x∂x′) k(x − x′) |_{x=x′}.    (32)

Table 1 summarises, for a range of common kernels [22], the derivative process variance in terms of their hyperparameters. The effect of the choice of hyperparameters on the expected length is shown in Figure 2.

Table 1: Derivative process variance, σ²_{f′}, in terms of kernel hyperparameters for a range of common kernels. In each case σ² is the output (signal) variance hyperparameter and λ is the input length scale hyperparameter.

Squared Exponential: σ²/λ²
Matérn, ν = 3/2: 3σ²/λ²
Matérn, ν = 5/2: 5σ²/(3λ²)
Rational Quadratic: σ²/λ²
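The entries of Table 1 can be verified mechanically: for a stationary kernel, Eqn (32) gives σ²_{f′} = −k″(0), which a central finite difference recovers. A small sketch (our own; σ² and λ values are arbitrary):

```python
import numpy as np

def deriv_variance(k, h=1e-4):
    """Eqn (32) for a stationary kernel k(r): sigma_f'^2 = -k''(0)."""
    return -(k(h) - 2.0 * k(0.0) + k(-h)) / h**2

sf2, lam = 2.0, 0.7
se  = lambda r: sf2 * np.exp(-0.5 * r**2 / lam**2)
m32 = lambda r: sf2 * (1 + np.sqrt(3) * abs(r) / lam) * np.exp(-np.sqrt(3) * abs(r) / lam)
m52 = lambda r: (sf2 * (1 + np.sqrt(5) * abs(r) / lam + 5 * r**2 / (3 * lam**2))
                 * np.exp(-np.sqrt(5) * abs(r) / lam))

print(deriv_variance(se),  sf2 / lam**2)            # Squared Exponential
print(deriv_variance(m32), 3 * sf2 / lam**2)        # Matérn 3/2
print(deriv_variance(m52), 5 * sf2 / (3 * lam**2))  # Matérn 5/2
```

Note that the Matérn 1/2 (exponential) kernel does not appear in the table: it is not differentiable at the origin, and its sample paths have infinite arc length.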

Figure 2: Values of the expected arc length (colour shading) for various values of the SE kernel parameters. The plot shows a heat map of the log of the arc length to show sufficient detail. The length is dominated by the input scale parameter.

4 MULTI-DIMENSIONAL ARC LENGTH

In this section we present the first treatment of the arc length of a GP in more than one output dimension. We present an approximation to the arc length integrand distribution and use this to compute the moments of the arc length. For the vector case, we now consider a vector GP and its corresponding derivative process:

f ~ GP(0, K),  f′ ~ GP(0, ∂²K),    (33)

where K = B ⊗ k, with a coregionalisation matrix B and a stationary kernel k. The arc length for the vector case is given by:

s = ∫_a^b |f′| dt.    (34)

As we did in the one-dimensional case, we first consider the distribution of the arc length integrand |f′| and then use this to derive the moments of the arc length itself.

4.1 Integrand Distribution

We are interested in the distribution over the arc length. Ultimately we are interested in R³; however, the theory we present is valid for any Rⁿ. We consider the random variable W, defined by:

W = |x| = (xᵀx)^{1/2} = √(Σ_{i=1}^n x_i²),    (35)

x ~ N(µ, Σ),    (36)

with x, µ ∈ Rⁿ and Σ ∈ R^{n×n} a full-rank covariance matrix. This is the square root of the sum of squares of correlated normal variables. It is well known that the sum of squares of independent identically distributed normal variables is Chi-squared distributed and that the corresponding square root is Chi distributed [9]. At first glance it seems that we should easily be able to identify this transformed distribution; however, the full covariance between the elements of x hinders the derivation of a straightforward distribution.

Substantial work has been done on the distribution of quadratic forms, Q(x) = xᵀAx [12], where x is the n × 1 normal vector defined previously and A is a symmetric n × n matrix. It is possible to write:

Q(x) = xᵀAx = Σ_{i=1}^n λ_i (U_i + b_i)²,    (37)

where the U_i are i.i.d. normal variables with zero mean and unit variance, the λ_i are the eigenvalues of Σ^{1/2}AΣ^{1/2}, and b_i is the ith component of b = Pᵀ Σ^{−1/2} µ, with P a matrix that diagonalises Σ^{1/2}AΣ^{1/2}.

Observing the summation of the quadratic form in Eqn 37, we see that our distribution is a weighted sum of Chi-squared variables. Unfortunately, there exists no simple closed-form solution for this distribution; however, it is possible to express this distribution via power series of Laguerre polynomials, and some approximations have been used [12].

We note that a Chi-squared variable is gamma distributed for the case where the shape parameter is ν/2 and the scale factor is 2. Therefore we will approximate Q(x) = xᵀx with a single gamma random variable by moment matching the first two moments. The mean and variance of Q(x) are given by:

E[Q(x)] = tr(Σ) + µᵀµ,    (38)

V[Q(x)] = 2 tr(ΣΣ) + 4 µᵀΣµ,    (39)

where tr(·) denotes the trace of a matrix. The pdf of a gamma distribution with shape k_G and scale θ_G is given by

p_G(x; k_G, θ_G) = x^{k_G−1} exp(−x/θ_G) / (θ_G^{k_G} Γ(k_G)).    (40)

The first two moments are:

µ_G = k_G θ_G,  σ²_G = k_G θ²_G.    (41)

Solving for k_G and θ_G:

k_G = µ²_G/σ²_G,  θ_G = σ²_G/µ_G.    (42)

Equating moments, we set µ_G = E[Q(x)] and σ²_G = V[Q(x)]. Thus, Q is approximated as a gamma random variable and we write Q(x) ~ Gamma(k_G, θ_G).
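A small numerical check of Eqns (38), (39) and (42) (our own sketch; the covariance and mean below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # arbitrary full-rank covariance
mu = rng.standard_normal(n)

# Eqns (38)-(39): moments of Q(x) = x^T x for x ~ N(mu, Sigma)
mean_Q = np.trace(Sigma) + mu @ mu
var_Q = 2 * np.trace(Sigma @ Sigma) + 4 * mu @ Sigma @ mu

# Eqn (42): matched gamma shape and scale
k_G = mean_Q**2 / var_Q
theta_G = var_Q / mean_Q

x = rng.multivariate_normal(mu, Sigma, size=200_000)
Q = np.einsum('ij,ij->i', x, x)
print(Q.mean(), mean_Q)   # sample vs theoretical mean
print(Q.var(), var_Q)     # sample vs theoretical variance
```

By construction the matched gamma reproduces both moments exactly; only the higher moments of Q are approximated.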

Now we are in a position to consider the quantity √Q. Here we use the fact that if a random variable Q ~ Gamma(k_G, θ_G), then the random variable W = √Q is a Nakagami random variable, W ~ Nakagami(m, Ω), with parameters given by m = k_G and Ω = k_G θ_G. The Nakagami distribution [8] is:

p_Nak(x; m, Ω) = (2m^m/(Γ(m) Ω^m)) x^{2m−1} exp(−(m/Ω) x²).    (43)

Using the values for k_G and θ_G obtained via our moment-matched approximation and transforming to the Nakagami distribution, we say √Q is approximated by a Nakagami distribution with parameters:

m = µ²_G/σ²_G,  Ω = µ_G.    (44)

In terms of our original distribution x ~ N(µ, Σ), we therefore have W = √(xᵀx) ~ Nakagami(m, Ω), with:

m = [tr(Σ) + µᵀµ]² / (2 tr(ΣΣ) + 4 µᵀΣµ),  Ω = tr(Σ) + µᵀµ.    (45)

The mean and variance are:

E[W] = (Γ(m + 1/2)/Γ(m)) (Ω/m)^{1/2}    (46)

V[W] = Ω ( 1 − (1/m) (Γ(m + 1/2)/Γ(m))² ).    (47)

The method we have used to derive the distribution of the arc length integrand is summarised in Eqn 48:

N(µ, Σ)  →[Q(x), approximate]→  Gamma(k_G, θ_G)  →[√Q, exact]→  Nakagami(m, Ω).    (48)
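Putting the pipeline of Eqn (48) together (our own sketch), the Nakagami mean of Eqn (46) can be compared against a Monte Carlo estimate of E[|x|]; the µ and Σ below are arbitrary examples:

```python
import numpy as np
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(2)
Sigma = np.diag([1.0, 0.5, 0.25])          # example covariance (distinct eigenvalues)
mu = np.array([0.3, -0.2, 0.1])

# Eqn (45): Nakagami parameters from the moment-matched gamma
Omega = np.trace(Sigma) + mu @ mu
m = Omega**2 / (2 * np.trace(Sigma @ Sigma) + 4 * mu @ Sigma @ mu)

# Eqn (46): approximate E[|x|]
mean_approx = gamma_fn(m + 0.5) / gamma_fn(m) * np.sqrt(Omega / m)

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
norms = np.linalg.norm(samples, axis=1)
print(mean_approx, norms.mean())   # approximation vs Monte Carlo
```

The approximation is not exact for |x| itself, since only the first two moments of Q = |x|² are matched, but in examples like this the error in the mean is small.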

Numerical samples of Q(x) and √Q(x), and the pdfs of the corresponding gamma and Nakagami distributions, are shown in Figure 3 for d = 3. The approximating distributions are reasonable for a range of µ and Σ.

The gamma approximation to the quadratic form is exact when all the eigenvalues of the covariance are identical; in that case we have only a single gamma random variable.

4.2 Arc Length Statistics

We are now in a position to consider the arc length directly. Taking the expectation of the arc length, recalling that expectation is a linear operator, and using Fubini's theorem:

E[s] = ∫_0^T E[ (f′ᵀf′)^{1/2} ] dt.    (49)

Recalling the form of our kernel as K(x, x′) = B ⊗ k(x, x′), the infinitesimal distribution of f′ is constant with respect to t, with covariance given by:

Σ_{f′} = B ⊗ (∂²/∂x∂x′) k(x, x′) |_{x=x′} = B σ²_{f′}.    (50)

Therefore the expected arc length is:

E[s] ≈ T (Γ(m_{f′} + 1/2)/Γ(m_{f′})) (Ω_{f′}/m_{f′})^{1/2},    (51)

with

m_{f′} = [tr(Σ_{f′})]² / (2 tr(Σ_{f′}Σ_{f′})),  Ω_{f′} = tr(Σ_{f′}),    (52)

where we have used the Nakagami approximation to the arc length integrand to evaluate the mean. As in the one-dimensional case, the expected length of the GP is determined solely by the choice of kernel and the length of the interval. The calculation of the variance requires the second moment:

E[s²] = ∫∫ E[ |f′_{t₁}| |f′_{t₂}| ] dt₁ dt₂.    (53)

Figure 3: Samples from Q(x) and √Q(x) overlaid with the approximating gamma and Nakagami distributions; x ~ N(0, Σ) in the top row, and x ~ N(µ, Σ) in the bottom row, with µ and Σ randomly generated. Similar plots are obtained for different values of µ and Σ. The gamma and Nakagami distributions provide a reasonable approximation to the shape of the distribution, whilst capturing the true mean and variance.

Making use of the Nakagami approximation to our integrand, we need the mixed moment of two correlated Nakagami variables. Let us write |f′_{t₁}| ≈ W₁, |f′_{t₂}| ≈ W₂, with W₁ ~ Nakagami(m_{f′}, Ω_{f′}) and W₂ ~ Nakagami(m_{f′}, Ω_{f′}). The mixed moment of two correlated Nakagami variables with the same parameters is given by [17]:

E[W₁ⁿ W₂ˡ] = (Ω/m)^{(n+l)/2} [Γ(m + n/2) Γ(m + l/2) / Γ(m)²] ₂F₁(−n/2, −l/2; m; ρ(τ)),    (54)

where ρ(τ) is the correlation between the gamma variables from which the Nakagami distribution was derived, and ₂F₁ is the hypergeometric function:

₂F₁(a, b; c; z) = Σ_{n=0}^∞ [(a)_n (b)_n / (c)_n] (zⁿ/n!),    (55)

with the Pochhammer symbol (q)_n defined as (q)_n = q(q + 1)···(q + n − 1) and (q)₀ = 1.

The second moment can now be expressed as a power series in ρ:

E[s²] = (Ω/m) ([Γ(m + 1/2)]²/[Γ(m)]²) Σ_{n=0}^∞ [(−1/2)_n (−1/2)_n / (m)_n] (1/n!) ∫_0^T ∫_0^T ρ(t₁ − t₂)ⁿ dt₁ dt₂.    (56)

We derive the correlation function:

ρ(t − t′) = [ (∂²/∂t∂t′) k(t, t′) ]² / σ⁴_{f′}.    (57)

Eqn 56 can be solved numerically (noting that the two-dimensional integral is readily tackled using traditional methods of quadrature) and the variance is then computed by V[s] = E[s²] − E[s]².
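For the stationary prior, Eqn (51) can be checked directly by Monte Carlo, since E[s] = T·E[|f′|] with f′ ~ N(0, Σ_{f′}). A sketch (ours, with an arbitrary coregionalisation matrix B and an assumed derivative variance):

```python
import numpy as np
from scipy.special import gamma as gamma_fn

# Assumed example: coregionalisation matrix B and derivative variance sigma_f'^2
B = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
sig2_fp = 3.0                     # e.g. a Matérn-3/2 kernel with unit sf2 and lam
S = B * sig2_fp                   # Sigma_f', Eqn (50)

# Eqns (51)-(52): approximate expected arc length over [0, T]
Omega = np.trace(S)
m = Omega**2 / (2 * np.trace(S @ S))
T = 1.0
Es = T * gamma_fn(m + 0.5) / gamma_fn(m) * np.sqrt(Omega / m)

# Monte Carlo: E[s] = T * E[|f'|] for the stationary zero-mean prior
rng = np.random.default_rng(3)
speeds = np.linalg.norm(rng.multivariate_normal(np.zeros(3), S, size=200_000), axis=1)
print(Es, T * speeds.mean())
```

The residual discrepancy is the Nakagami moment-matching error, which is small when the eigenvalues of B are of comparable size.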

4.3 Arc Length Posterior

The moments of the arc length of a GP posterior follow a similar derivation. The posterior mean of the arc length is:

E[s] ≈ ∫_0^T (Γ(m_{f′} + 1/2)/Γ(m_{f′})) (Ω_{f′}/m_{f′})^{1/2} dt,    (58)

where m_{f′} and Ω_{f′} are the Nakagami parameters, which now depend on the mean and covariance functions of the GP posterior and are themselves functions of t. This expression is no longer tractable in closed form and requires a numerical integration (which, again, can be efficiently approximated with quadrature). The posterior second moment is given by:

E[s²] ≈ Σ_{n=0}^∞ [(−1/2)_n (−1/2)_n / n!] ∫_0^T ∫_0^T (Ω₁/m₁)^{1/2} (Ω₂/m₂)^{1/2} [Γ(m₁ + 1/2) Γ(m₂ + 1/2) / (Γ(m₁) Γ(m₂) (m₂)_n)] ρ(|t₁ − t₂|)ⁿ dt₁ dt₂,    (59)

where m_i and Ω_i again depend on the mean and covariance functions of the GP posterior and are evaluated at t_i.

5 SIMULATIONS

In this section we generate samples from our GP prior and compute the arc length, focusing on the vector case. We show the effect of the kernel choice and the fidelity of our theoretical results. To generate our curves we specify a zero-mean GP kernel, K = B ⊗ k(t, t′), with fixed B, and we use the Matérn kernel with ν = 3/2, which we call the M32 kernel:

k(t, t′) = σ² (1 + √3 ||t − t′||/λ) exp(−√3 ||t − t′||/λ).    (60)

We draw a sample f_i = (x_i, y_i, z_i) evaluated at evenly spaced t. The arc length of the GP draw is then computed numerically.

Unit variance and length scale parameters are chosen and the arc length is computed over the interval t = [0, 1]. Figure 4 shows the sample lengths, the theoretical mean and variance, and the Nakagami distribution of a single arc length integrand. Our theoretical results are close to the numerically generated values. The plot of the Nakagami distribution demonstrates the wide variance of an individual arc length integrand with respect to the overall variance. We see that the integration over the input domain has a 'shrinking' effect on the overall variance when compared to the individual variance. No estimation methods are required to calculate the arc length statistics: our approximated equations are closed form (for the mean) and a quadrature problem (for the variance).

6 CONCLUSION

In this paper we derive the moments of the arc length of a vector-valued GP. To the best of the authors' knowledge, this is the first treatment of the arc length in more than one dimension. The increment distribution was approximated via its moments by a Nakagami distribution, which provides a closed form for the mean, Eqn 51, and an expression for the second moment, Eqn 56, of the arc length. Importantly, we are also able to derive the first, Eqn 58, and second, Eqn 59, moments of the arc length of a GP posterior, conditioned on observations of the function.

Figure 4: Histogram of GP lengths. The theoretical and empirical means are shown, with the corresponding variance. The Nakagami distribution of the integrand is also shown. We can see that the integral over integrands has the effect of shrinking the variance relative to a single integrand.

The moments were shown to depend on the choice of kernel, the hyperparameters and the length of the interval. Numerical experiments confirmed the fidelity of our approximation to the arc length integrand and the arc length moments. We also provide a visual understanding of the distribution.

We see knowledge of the arc length as a valuable tool which will allow us to encode more information into our prior over kernel choices. The explicit relation between the arc length moments and the kernel hyperparameters allows us to use prior information to better initialize and constrain our models, in particular in cases where lengths correspond to interpretable quantities, such as a path trajectory.

Potential avenues of future research include analysis of the non-stationarity of curves, curve minimisation problems, and generating curves of a given length. Furthermore, we see potential application in Bayesian optimization, as a path planning tool and for constructing interpretable features from functional data.

Acknowledgements

AT and MO are grateful for the support of funding from the Korea Institute of Energy Technology Evaluation and Planning (KETEP).

The authors are grateful for initial conversations with Tom Gunter, who highlighted the gap in the literature.

References

[1] M. A. Alvarez, L. Rosasco, and N. D. Lawrence, Kernels for Vector-Valued Functions: a Review, Now Publishers Inc, 2012.

[2] R. Barakat and E. Baumann, Mean and Variance of the Arc Length of a Gaussian Process on a Finite Interval, International Journal of Control, 12 (1970), pp. 377–383.

[3] S. Corrsin and O. M. Phillips, Contour Length and Surface Area of Multiple-Valued Random Variables, Journal of the Society for Industrial and Applied Mathematics, 9 (1961), pp. 395–404.

[4] M. P. do Carmo, Differential Geometry of Curves and Surfaces, Pearson, 1976.

[5] G. Fubini, Sugli Integrali Multipli, vol. 5, 1907.

[6] S. Hauberg, M. Schober, M. Liptrot, P. Hennig, and A. Feragen, A random Riemannian metric for probabilistic shortest-path tractography, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9349 (2015), pp. 597–604.

[7] P. Hennig and S. Hauberg, Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics, 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, (2014).

[8] W. C. Hoffman, Statistical Methods in Radio Wave Propagation, 1958, pp. 3–36.

[9] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, vol. 1, Wiley, 2nd ed., 1994.

[10] N. D. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, Journal of Machine Learning Research, 6 (2005), pp. 1783–1816.

[11] R. Marchant and F. Ramos, Bayesian optimisation for informative continuous path planning, in IEEE International Conference on Robotics and Automation (ICRA), 2014.

[12] A. M. Mathai and S. B. Provost, Quadratic Forms in Random Variables: Theory and Applications, Marcel Dekker Inc., 1992.

[13] D. Middleton, An Introduction to Statistical Communication Theory, 1960.

[14] I. Miller and J. E. Freund, Expected Arc Length of a Gaussian Process on a Finite Interval, Journal of the Royal Statistical Society. Series B (Methodology), 18 (1956), pp. 257–258.

[15] M. Moll and L. Kavraki, Path planning for minimal energy curves of constant length, IEEE International Conference on Robotics and Automation (ICRA), 3 (2004), pp. 2826–2831.

[16] V. P. Nosko, On the Distribution of the Arc Length of a High-Level Excursion of a Stationary Gaussian Process, Society for Industrial and Applied Mathematics, (1985), pp. 521–523.

[17] J. Reig, L. Rubio, and N. Cardona, Bivariate Nakagami-m distribution with arbitrary fading parameters, Electronics Letters, 38 (2002), pp. 1715–1717.

[18] J. Snoek, H. Larochelle, and R. Adams, Practical Bayesian Optimization of Machine Learning Algorithms, in Advances in Neural Information Processing Systems 25, 2012, pp. 2960–2968.

[19] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence, Metrics for Probabilistic Geometries, Uncertainty in Artificial Intelligence, (2014), p. 800.

[20] J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian Process Dynamical Models, Neural Information Processing Systems (NIPS), 18 (2005), p. 3.

[21] T. D. Wickramarachchi, C. Gallagher, and R. Lund, Arc length asymptotics for multivariate time series, Applied Stochastic Models in Business and Industry, 31 (2015), pp. 264–281.

[22] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006.


Supplementary Material

Density of $Y = \sqrt{1 + X^2}$

Cumulative distribution of $Y$, where $X \sim \mathcal{N}(\mu, \sigma^2)$ and $F_X$ denotes the cdf of the centred variable $X - \mu \sim \mathcal{N}(0, \sigma^2)$, so that $F_X(-a) = 1 - F_X(a)$:

\begin{align}
P(Y < y) &= P\left(|X| < \sqrt{y^2 - 1}\right) \tag{61}\\
&= P\left(-\sqrt{y^2 - 1} - \mu < X - \mu < \sqrt{y^2 - 1} - \mu\right) \tag{62}\\
&= F_X\!\left(\sqrt{y^2 - 1} - \mu\right) - F_X\!\left(-\sqrt{y^2 - 1} - \mu\right) \tag{63}\\
&= F_X\!\left(\sqrt{y^2 - 1} + \mu\right) - \left(1 - F_X\!\left(\sqrt{y^2 - 1} - \mu\right)\right). \tag{64}
\end{align}

Probability density of Y :

p

Y

(y) =d

dyP (Y < y) (65)

=d

dy

hF

X

(py

2 � 1 + µ)� (1� F

X

(py

2 � 1� µ))i

(66)

=1p2⇡�

"exp

� (py

2 � 1 + µ)2

2�2

!+ exp

� (p

y

2 � 1� µ)2

2�2

!#yp

y

2 � 1.
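The distribution function of Eqn (64) and the density of Eqn (66) can be sanity-checked numerically. The snippet below is our own check, not code from the paper, and the parameter values are illustrative: it compares the closed-form cdf against Monte Carlo samples of $Y = \sqrt{1 + X^2}$ with $X \sim \mathcal{N}(\mu, \sigma^2)$, and the density against a finite-difference derivative of the cdf.

```python
import math
import numpy as np

def cdf_Y(y, mu, sigma):
    """P(Y < y) for Y = sqrt(1 + X^2), X ~ N(mu, sigma^2), as in Eqn (64).
    F0 is the cdf of the centred Gaussian N(0, sigma^2)."""
    F0 = lambda x: 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))
    r = math.sqrt(y**2 - 1.0)
    return F0(r + mu) - (1.0 - F0(r - mu))

def pdf_Y(y, mu, sigma):
    """Density of Y, as in Eqn (66)."""
    r = math.sqrt(y**2 - 1.0)
    c = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return c * (math.exp(-(r + mu)**2 / (2 * sigma**2))
                + math.exp(-(r - mu)**2 / (2 * sigma**2))) * y / r

mu, sigma = 0.7, 1.5
rng = np.random.default_rng(0)
y_samples = np.sqrt(1.0 + rng.normal(mu, sigma, 1_000_000)**2)

# Closed-form cdf vs empirical cdf at a few evaluation points.
for y0 in (1.2, 2.0, 5.0):
    print(y0, cdf_Y(y0, mu, sigma), (y_samples < y0).mean())

# The density should match a central finite difference of the cdf.
h = 1e-6
fd = (cdf_Y(2.0 + h, mu, sigma) - cdf_Y(2.0 - h, mu, sigma)) / (2 * h)
print(pdf_Y(2.0, mu, sigma), fd)
```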

Mean of Y :

EpY (y)

[y] =

Z 1

1

y p

Y

(y)dy (67)

=1p2⇡�

Z 1

1

"exp

� (py

2 � 1 + µ)2

2�2

!+ exp

� (py

2 � 1� µ)2

2�2

!#y

2(y2 � 1)�12 dy. (68)

At first glance this looks to be an intractable integral; however, with the change of variables $y = (x^2 + 1)^{1/2}$ and by expanding the exponential cross terms, we arrive at:

\begin{equation}
\mathbb{E}[y] = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\mu^2}{2\sigma^2}\right)\sum_{l=0}^{\infty}\frac{\Gamma\!\left(l + \frac{1}{2}\right)}{(2l)!}\left(\frac{\mu}{\sigma^2}\right)^{2l} U\!\left(l + \frac{1}{2},\, l + 2,\, \frac{1}{2\sigma^2}\right), \tag{69}
\end{equation}

where $U$ is the confluent hypergeometric function of the second kind.

