Data Analysis and Probabilistic Inference
Gaussian Processes
Recommended reading: Rasmussen & Williams, Chapters 1, 2, 4, 5; Deisenroth & Ng (2015) [3]
Marc Deisenroth
Department of Computing, Imperial College London
February 22, 2017
http://www.gaussianprocess.org/
Problem Setting
[Figure: observed data points, f(x) vs. x]
Objective
For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f)$ that explains the data.
Probabilistic regression problem
Recap from CO-496: Bayesian Linear Regression
§ Linear Regression Model:
$$f(x) = \phi(x)^\top w, \quad w \sim \mathcal{N}(0, \Sigma_p)$$
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
§ Integrating out the parameters when predicting leads to a distribution over functions:
$$p(f(x_*)\,|\,x_*, X, y) = \int p(f(x_*)\,|\,x_*, w)\, p(w\,|\,X, y)\, dw = \mathcal{N}\big(\mu(x_*), \sigma^2(x_*)\big)$$
$$\mu(x_*) = \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y$$
$$\sigma^2(x_*) = \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*$$
$$K = \Phi^\top \Sigma_p \Phi$$
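As a concrete illustration of these predictive equations, here is a minimal numpy sketch (not from the slides; the feature map, toy data, and all variable names are illustrative assumptions):

import numpy as np

def blr_predict(X, y, x_star, phi, Sigma_p, sigma_n):
    # Bayesian linear regression predictive mean and variance at a test input x_star
    Phi = np.stack([phi(x) for x in X], axis=1)          # feature matrix of training inputs, shape (M, N)
    phi_s = phi(x_star)                                  # features of the test input, shape (M,)
    K = Phi.T @ Sigma_p @ Phi                            # K = Phi^T Sigma_p Phi
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(y)))   # (K + sigma_n^2 I)^{-1}
    mean = phi_s @ Sigma_p @ Phi @ A @ y
    var = phi_s @ Sigma_p @ phi_s - phi_s @ Sigma_p @ Phi @ A @ Phi.T @ Sigma_p @ phi_s
    return mean, var

# Illustrative usage with polynomial features and three noisy observations
phi = lambda x: np.array([1.0, x, x**2])
X = np.array([-1.0, 0.0, 1.5])
y = np.array([0.2, -0.1, 1.3])
mean, var = blr_predict(X, y, x_star=0.5, phi=phi, Sigma_p=np.eye(3), sigma_n=0.1)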
Sampling from the Prior over Functions
Consider a linear regression setting
$$y = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
[Figure: prior samples of (a, b) in the a-b plane]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
[Figure: posterior samples of (a, b) in the a-b plane]
Fitting Nonlinear Functions
§ Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features
§ Example: Radial-basis-function (RBF) network
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad w_i \sim \mathcal{N}(0, \sigma_p^2)$$
where
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$
for given "centers" $\mu_i$
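A small numpy sketch of such an RBF network with weights drawn from the prior (the centers, grid, and number of samples are illustrative assumptions, chosen to mirror the following illustration):

import numpy as np

def rbf_features(x, centers):
    # phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs x
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)   # shape (len(x), n)

centers = np.linspace(-5, 3, 25)      # 25 centers, linearly spaced in [-5, 3]
x_grid = np.linspace(-5, 5, 200)
Phi = rbf_features(x_grid, centers)   # shape (200, 25)

# Sample functions f(x) = sum_i w_i phi_i(x) with weights w ~ N(0, I)
rng = np.random.default_rng(0)
W = rng.standard_normal((25, 5))      # five independent weight vectors
f_samples = Phi @ W                   # each column is one sampled function on x_grid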
Illustration: Fitting a Radial Basis Function Network
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$
[Figure: the basis functions φ_i(x) plotted over x]
§ Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$
Samples from the RBF Prior
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad p(w) = \mathcal{N}(0, I)$$
[Figure: function samples f(x) drawn from the RBF prior]
Samples from the RBF Posterior
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad p(w\,|\,X, y) = \mathcal{N}(m_N, S_N)$$
[Figure: function samples f(x) drawn from the RBF posterior]
RBF Posterior
[Figure: RBF posterior mean and uncertainty, f(x) vs. x]
Limitations
[Figure: RBF posterior, f(x) vs. x, with no basis functions to the right of the data]
§ Feature engineering
§ Finite number of features:
  § Above: Without basis functions on the right, we cannot express any variability of the function
  § Ideally: Add more (infinitely many) basis functions
Approach
§ Instead of sampling parameters, which induce a distribution over functions, sample functions directly
Make assumptions on the distribution of functions
§ Intuition: a function is an infinitely long vector of function values
Make assumptions on the distribution of function values
Gaussian Process
§ We will place a distribution $p(f)$ on functions $f$
§ Informally, a function can be considered an infinitely long vector of function values $f = [f_1, f_2, f_3, \dots]$
§ A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Definition: A Gaussian process (GP) is a collection of random variables $f_1, f_2, \dots$, any finite number of which is Gaussian distributed.
§ A Gaussian distribution is specified by a mean vector $\mu$ and a covariance matrix $\Sigma$
§ A Gaussian process is specified by a mean function $m(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$
Covariance Function
§ The covariance function (kernel) is symmetric and positive semi-definite
§ It allows us to compute covariances between (unknown) function values by just looking at the corresponding inputs:
$$\mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$$
GP Regression as a Bayesian Inference Problem
Objective
For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, find a (posterior) distribution over functions $p(f\,|\,X, y)$ that explains the data
Training data: X, y. Bayes' theorem yields
$$p(f\,|\,X, y) = \frac{p(y\,|\,f, X)\, p(f)}{p(y\,|\,X)}$$
Prior: $p(f) = \mathcal{GP}(m, k)$. Specify mean function $m$ and kernel $k$.
Likelihood (noise model): $p(y\,|\,f, X) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$
Marginal likelihood (evidence): $p(y\,|\,X) = \int p(y\,|\,f(X))\, p(f\,|\,X)\, df$
Posterior: $p(f\,|\,y, X) = \mathcal{GP}(m_\text{post}, k_\text{post})$
Prior over Functions
§ Treat a function as a long vector of function values: $f = [f_1, f_2, \dots]$
Look at a distribution over function values $f_i = f(x_i)$
§ Consider a finite number $N$ of function values $f$ and all other (infinitely many) function values $\tilde{f}$. Informally:
$$p(f, \tilde{f}) = \mathcal{N}\left(\begin{bmatrix} \mu_f \\ \mu_{\tilde{f}} \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f\tilde{f}} \\ \Sigma_{\tilde{f}f} & \Sigma_{\tilde{f}\tilde{f}} \end{bmatrix}\right)$$
where $\Sigma_{\tilde{f}\tilde{f}} \in \mathbb{R}^{m\times m}$ and $\Sigma_{f\tilde{f}} \in \mathbb{R}^{N\times m}$, $m \to \infty$.
§ $\Sigma_{ff}^{(i,j)} = \mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$
§ Key property: The marginal remains finite
$$p(f) = \int p(f, \tilde{f})\, d\tilde{f} = \mathcal{N}\big(\mu_f, \Sigma_{ff}\big)$$
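A short numpy sketch of this key property in action: evaluate the kernel on a finite set of inputs and sample the corresponding finite marginal N(0, Sigma_ff) with a Cholesky factor (the Gaussian kernel, the input grid, and the jitter value are illustrative assumptions):

import numpy as np

def gauss_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^2 / ell^2) for 1-D inputs
    d2 = (X1[:, None] - X2[None, :])**2
    return sigma_f**2 * np.exp(-d2 / ell**2)

x_grid = np.linspace(-5, 5, 100)                               # a finite set of inputs
Sigma_ff = gauss_kernel(x_grid, x_grid)                        # finite marginal covariance
L = np.linalg.cholesky(Sigma_ff + 1e-8 * np.eye(len(x_grid)))  # jitter for numerical stability

rng = np.random.default_rng(1)
f_prior = L @ rng.standard_normal((len(x_grid), 3))            # three samples from p(f) = N(0, Sigma_ff)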
Training and Test Marginal
§ In practice, we always have finite training and test inputs $x_\text{train}$, $x_\text{test}$.
§ Define $f_* := f_\text{test}$, $f := f_\text{train}$.
§ Then, we obtain the finite marginal
$$p(f, f_*) = \int p(f, f_*, f_\text{other})\, df_\text{other} = \mathcal{N}\left(\begin{bmatrix} \mu_f \\ \mu_* \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f*} \\ \Sigma_{*f} & \Sigma_{**} \end{bmatrix}\right)$$
GP Regression as a Bayesian Inference Problem (ctd.)
Posterior over functions (with training data X, y):
$$p(f\,|\,X, y) = \frac{p(y\,|\,f, X)\, p(f\,|\,X)}{p(y\,|\,X)}$$
Using the properties of Gaussians, we obtain
$$p(y\,|\,f, X)\, p(f\,|\,X) = \mathcal{N}\big(y\,|\,f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X)\,|\,m(X), K\big)$$
$$= Z\, \mathcal{N}\big(f(X)\,\big|\, \underbrace{m(X) + K(K + \sigma_n^2 I)^{-1}(y - m(X))}_{\text{posterior mean}},\ \underbrace{K - K(K + \sigma_n^2 I)^{-1}K}_{\text{posterior covariance}}\big)$$
$$K = k(X, X)$$
Marginal likelihood:
$$Z = p(y\,|\,X) = \int p(y\,|\,f, X)\, p(f\,|\,X)\, df = \mathcal{N}\big(y\,|\,m(X), K + \sigma_n^2 I\big)$$
GP Predictions (1)
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
§ Objective: Find $p(f(X_*)\,|\,X, y)$ for training data $X, y$ and test inputs $X_*$.
§ GP prior: $p(f\,|\,X) = \mathcal{N}\big(m(X), K\big)$
§ Gaussian likelihood: $p(y\,|\,f(X)) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$
§ With $f \sim \mathcal{GP}$ it follows that $f, f_*$ are jointly Gaussian distributed:
$$p(f, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
§ Due to the Gaussian likelihood, we also get ($f$ is unobserved)
$$p(y, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
GP Predictions (2)
Prior:
$$p(y, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
Posterior predictive distribution $p(f_*\,|\,X, y, X_*)$ at test inputs $X_*$, obtained by Gaussian conditioning:
$$p(f_*\,|\,X, y, X_*) = \mathcal{N}\big(\mathbb{E}[f_*\,|\,X, y, X_*],\ \mathbb{V}[f_*\,|\,X, y, X_*]\big)$$
$$\mathbb{E}[f_*\,|\,X, y, X_*] = m_\text{post}(X_*) = \underbrace{m(X_*)}_{\text{prior mean}} + k(X_*, X)(K + \sigma_n^2 I)^{-1}(y - m(X))$$
$$\mathbb{V}[f_*\,|\,X, y, X_*] = k_\text{post}(X_*, X_*) = \underbrace{k(X_*, X_*)}_{\text{prior variance}} - k(X_*, X)(K + \sigma_n^2 I)^{-1} k(X, X_*)$$
From now on: Set the prior mean function $m \equiv 0$
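A compact numpy sketch of these prediction equations with a zero prior mean (the kernel, the solver choice, and the toy data are illustrative assumptions):

import numpy as np

def gp_predict(X, y, X_star, kernel, sigma_n):
    # Posterior predictive mean and covariance of f_* for a zero-mean GP prior
    K = kernel(X, X)                       # k(X, X)
    K_s = kernel(X, X_star)                # k(X, X_*)
    K_ss = kernel(X_star, X_star)          # k(X_*, X_*)
    Ky = K + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    alpha = np.linalg.solve(Ky, y)         # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                   # posterior mean
    V = np.linalg.solve(Ky, K_s)           # (K + sigma_n^2 I)^{-1} k(X, X_*)
    cov = K_ss - K_s.T @ V                 # posterior covariance
    return mean, cov

# Illustrative usage with a Gaussian kernel on 1-D inputs
kernel = lambda A, B: np.exp(-(A[:, None] - B[None, :])**2)
X = np.array([-3.0, -1.0, 0.0, 2.0])
y = np.sin(X)
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 50), kernel, sigma_n=0.1)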
Illustration: Inference with Gaussian Processes
[Figure: samples from the GP prior over functions, f(x) vs. x]
Prior belief about the function
Predictive (marginal) mean and variance:
$$\mathbb{E}[f(x_*)\,|\,x_*, \emptyset] = m(x_*) = 0$$
$$\mathbb{V}[f(x_*)\,|\,x_*, \emptyset] = \sigma^2(x_*) = k(x_*, x_*)$$
Illustration: Inference with Gaussian Processes
[Figure: GP posterior over functions after observing data, f(x) vs. x]
Posterior belief about the function
Predictive (marginal) mean and variance:
$$\mathbb{E}[f(x_*)\,|\,x_*, X, y] = m(x_*) = k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} y$$
$$\mathbb{V}[f(x_*)\,|\,x_*, X, y] = \sigma^2(x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} k(X, x_*)$$
Covariance Function
§ A Gaussian process is fully specified by a mean function $m$ and a kernel/covariance function $k$
§ The covariance function (kernel) is symmetric and positive semi-definite
§ The covariance function encodes high-level structural assumptions about the latent function $f$ (e.g., smoothness, differentiability, periodicity)
Gaussian Covariance Function
$$k_\text{Gauss}(x_i, x_j) = \sigma_f^2 \exp\big(-(x_i - x_j)^\top (x_i - x_j)/\ell^2\big)$$
§ $\sigma_f$: Amplitude of the latent function
§ $\ell$: Length scale. How far do we have to move in input space before the function value changes significantly? Smoothness parameter
§ Assumption on latent function: smooth (infinitely differentiable)
Length-Scales
Length scales determine how wiggly the function is and how much information we can transfer to other function values
[Figure: GP fits to the same data with different length-scales, f(x) vs. x]
Matérn Covariance Function
$$k_{\text{Mat},3/2}(x_i, x_j) = \sigma_f^2 \left(1 + \frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right)$$
§ $\sigma_f$: Amplitude of the latent function
§ $\ell$: Length scale. How far do we have to move in input space before the function value changes significantly?
§ Assumption on latent function: once differentiable
Periodic Covariance Function
$$k_\text{per}(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{2\sin^2\big(\tfrac{\kappa(x_i - x_j)}{2}\big)}{\ell^2}\right) = k_\text{Gauss}\big(u(x_i), u(x_j)\big), \quad u(x) = \begin{bmatrix} \cos(\kappa x) \\ \sin(\kappa x) \end{bmatrix}$$
$\kappa$: Periodicity parameter
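A numpy sketch of the three covariance functions above for 1-D inputs (the parameterizations mirror the slides' definitions; the default hyper-parameter values are illustrative assumptions):

import numpy as np

def k_gauss(X1, X2, ell=1.0, sigma_f=1.0):
    d2 = (X1[:, None] - X2[None, :])**2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def k_matern32(X1, X2, ell=1.0, sigma_f=1.0):
    a = np.sqrt(3.0) * np.abs(X1[:, None] - X2[None, :]) / ell
    return sigma_f**2 * (1.0 + a) * np.exp(-a)

def k_periodic(X1, X2, ell=1.0, sigma_f=1.0, kappa=1.0):
    # related to the Gaussian kernel applied to u(x) = [cos(kappa x), sin(kappa x)]
    d = X1[:, None] - X2[None, :]
    return sigma_f**2 * np.exp(-2.0 * np.sin(kappa * d / 2.0)**2 / ell**2)

x = np.linspace(-3, 3, 5)
K = k_matern32(x, x)   # a symmetric, positive semi-definite Gram matrix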
Meta-Parameters of a GP
The GP possesses a set of hyper-parameters:
§ Parameters of the mean function
§ Hyper-parameters of the covariance function (e.g., length-scales and signal variance)
§ Likelihood parameters (e.g., noise variance $\sigma_n^2$)
Train a GP to find a good set of hyper-parameters
Model selection to find good mean and covariance functions (can also be automated: Automatic Statistician (Lloyd et al., 2014))
Gaussian Process Training: Hyper-Parameters
GP Training: Find good GP hyper-parameters $\theta$ (kernel and mean function parameters)
[Figure: graphical model with inputs $x_i$, latent $f$, observations $y_i$ (plate over $N$), hyper-parameters $\theta$ and $\sigma_n$]
§ Place a prior $p(\theta)$ on hyper-parameters
§ Posterior over hyper-parameters:
$$p(\theta\,|\,X, y) = \frac{p(\theta)\, p(y\,|\,X, \theta)}{p(y\,|\,X)}, \qquad p(y\,|\,X, \theta) = \int p(y\,|\,f(X))\, p(f\,|\,X, \theta)\, df$$
§ Choose hyper-parameters $\theta_*$, such that
$$\theta_* \in \arg\max_\theta\ \log p(\theta) + \log p(y\,|\,X, \theta)$$
Maximize the marginal likelihood if $p(\theta) = \mathcal{U}$ (uniform prior)
Training via Marginal Likelihood Maximization
GP Training: Maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy $f$ has been integrated out). Also called maximum likelihood type-II.
Marginal likelihood:
$$p(y\,|\,X, \theta) = \int p(y\,|\,f(X))\, p(f\,|\,X, \theta)\, df = \int \mathcal{N}\big(y\,|\,f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X)\,|\,0, K\big)\, df = \mathcal{N}\big(y\,|\,0, K + \sigma_n^2 I\big)$$
Learning the GP hyper-parameters:
$$\theta_* \in \arg\max_\theta\ \log p(y\,|\,X, \theta)$$
$$\log p(y\,|\,X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \quad K_\theta := K + \sigma_n^2 I$$
Training via Marginal Likelihood Maximization
Log-marginal likelihood:
$$\log p(y\,|\,X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \quad K_\theta := K + \sigma_n^2 I$$
§ Automatic trade-off between data fit and model complexity
§ Gradient-based optimization of hyper-parameters $\theta$:
$$\frac{\partial \log p(y\,|\,X, \theta)}{\partial \theta_i} = \tfrac{1}{2} y^\top K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta_i} K_\theta^{-1} y - \tfrac{1}{2}\,\mathrm{tr}\Big(K_\theta^{-1}\frac{\partial K_\theta}{\partial \theta_i}\Big) = \tfrac{1}{2}\,\mathrm{tr}\Big((\alpha\alpha^\top - K_\theta^{-1})\frac{\partial K_\theta}{\partial \theta_i}\Big), \quad \alpha := K_\theta^{-1} y$$
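A numpy sketch of the log-marginal likelihood and the trace form of its gradient for a single hyper-parameter, here the log length-scale of a Gaussian kernel (this particular kernel and parameterization are illustrative assumptions):

import numpy as np

def lml_and_grad(X, y, log_ell, sigma_f, sigma_n):
    # log p(y|X, theta) and its derivative w.r.t. log(ell) for a Gaussian kernel
    ell = np.exp(log_ell)
    d2 = (X[:, None] - X[None, :])**2
    K = sigma_f**2 * np.exp(-d2 / ell**2)
    K_theta = K + sigma_n**2 * np.eye(len(X))              # K_theta = K + sigma_n^2 I
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_theta^{-1} y
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L)))                    # equals -0.5 * log|K_theta|
           - 0.5 * len(X) * np.log(2.0 * np.pi))
    dK = K * (2.0 * d2 / ell**2)                           # dK_theta / d(log ell)
    K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)
    return lml, grad

X = np.linspace(-5, 5, 30)
y = np.sin(X) + 0.1 * np.random.default_rng(2).standard_normal(30)
lml, grad = lml_and_grad(X, y, log_ell=0.0, sigma_f=1.0, sigma_n=0.1)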
Example: Training Data
[Figure: training data, y vs. x]
Example: Marginal Likelihood Contour
[Figure: log-marginal likelihood contours over log-noise and log-length-scales, N = 20]
Example: Exploring the Modes (1)
[Figure: GP fit corresponding to one mode of the log-marginal likelihood, y vs. x]
Example: Exploring the Modes (2)
[Figure: GP fit corresponding to another mode of the log-marginal likelihood, y vs. x]
Marginal Likelihood (1)-(9)
[Figures: log-marginal likelihood contours over log-noise and log-length-scales for growing training set sizes N = 2, 3, 5, 10, 15, 20, 50, 100, 200]
Marginal Likelihood and Parameter Learning
§ The marginal likelihood is non-convex
§ In particular in the very-small-data regime, a GP can end up in three different modes when optimizing the hyper-parameters:
  § Overfitting (unlikely, but possible)
  § Underfitting (everything is considered noise)
  § Good fit
§ Re-start the hyper-parameter optimization from random initializations to mitigate the problem
§ With increasing data set size the GP typically ends up in the "good-fit" mode. Overfitting (indicator: small length-scales and small noise variance) is very unlikely.
§ Ideally, we would integrate the hyper-parameters out. Why can we not do this easily?
Model Selection—Mean Function and Kernel
§ Assume we have a finite set of models $M_i$, each one specifying a mean function $m_i$ and a kernel $k_i$. How do we find the best one?
§ Some options:
  § BIC, AIC (see CO-496)
  § Compare marginal likelihood values (assuming a uniform prior on the set of models)
Example
[Figures: GP fits to the same data with four different kernels and their log-marginal likelihoods: constant kernel, LML = -1.1073; linear kernel, LML = -1.0065; Matérn kernel, LML = -0.8625; Gaussian kernel, LML = -0.69308]
§ Four different kernels (mean function fixed to $m \equiv 0$)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood values for each (optimized) model
Application Areas
[Figure: example GP model over angle (rad) and angular velocity (rad/s)]
§ Reinforcement learning and robotics: model value functions and/or dynamics with GPs
§ Bayesian optimization (experimental design): model unknown utility functions with GPs
§ Geostatistics: spatial modeling (e.g., landscapes, resources)
§ Sensor networks
§ Time-series modeling and forecasting
Limitations of Gaussian Processes
Computational and memory complexity (training set size $N$):
§ Training scales in $\mathcal{O}(N^3)$
§ Prediction (variances) scales in $\mathcal{O}(N^2)$
§ Memory requirement: $\mathcal{O}(ND + N^2)$
Practical limit: $N \approx 10{,}000$
Tips and Tricks for Practitioners
§ To set initial hyper-parameters, use domain knowledge if possible.
§ Standardize the input data and set the initial length-scales $\ell$ to $\approx 0.5$.
§ Standardize the targets $y$ and set the initial signal variance to $\sigma_f \approx 1$.
§ Often useful: Set the initial noise level relatively high (e.g., $\sigma_n \approx 0.5 \times \sigma_f$), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.
§ When optimizing hyper-parameters, use random restarts or other tricks to avoid local optima.
§ Mitigate the problem of numerical instability (Cholesky decomposition of $K + \sigma_n^2 I$) by penalizing high signal-to-noise ratios $\sigma_f / \sigma_n$
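A minimal numpy sketch of these initialization heuristics (the function name and the constants simply mirror the bullet points above; they are a starting point, not a fixed recipe):

import numpy as np

def standardize_and_init(X, y):
    # Standardize inputs and targets, then return heuristic initial hyper-parameters
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_std = (y - y.mean()) / y.std()
    ell0 = 0.5                  # initial length-scale for standardized inputs
    sigma_f0 = 1.0              # initial signal amplitude for standardized targets
    sigma_n0 = 0.5 * sigma_f0   # relatively high initial noise level
    return X_std, y_std, ell0, sigma_f0, sigma_n0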
Appendix
The Gaussian Distribution
$$p(x\,|\,\mu, \Sigma) = (2\pi)^{-\frac{D}{2}} |\Sigma|^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\big)$$
§ Mean vector $\mu$: average of the data
§ Covariance matrix $\Sigma$: spread of the data
[Figures: 1-D Gaussian density p(x) with mean and 95% confidence bound; 2-D Gaussian density p(x, y)]
Sampling from a Multivariate Gaussian
Objective
Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.
However, we only have access to a random number generator that can sample $x$ from $\mathcal{N}(0, I)$...
Exploit that affine transformations $y = Ax + b$ of a Gaussian random variable $x$ remain Gaussian
§ Mean: $\mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b$
§ Covariance: $\mathbb{V}_x[Ax + b] = A\,\mathbb{V}_x[x]\,A^\top$
1. Find conditions for A, b to match the mean of y
2. Find conditions for A, b to match the covariance of y
Sampling from a Multivariate Gaussian (2)
Objective
Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.
x = randn(D,1);            % Sample x ~ N(0, I)
y = chol(Sigma)'*x + mu;   % Scale x and add offset
Here chol(Sigma) is the Cholesky factor $L$, such that $L^\top L = \Sigma$.
Therefore, the mean and covariance of $y$ are
$$\mathbb{E}[y] = \bar{y} = \mathbb{E}[L^\top x + \mu] = L^\top \mathbb{E}[x] + \mu = \mu$$
$$\mathrm{Cov}[y] = \mathbb{E}[(y - \bar{y})(y - \bar{y})^\top] = \mathbb{E}[L^\top x x^\top L] = L^\top \mathbb{E}[x x^\top] L = L^\top L = \Sigma$$
Conditional
[Figure: joint density p(x, y), an observation y, and the conditional p(x | y)]
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)$$
$$p(x\,|\,y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big)$$
$$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$$
$$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
The conditional $p(x\,|\,y)$ is also Gaussian. Computationally convenient.
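A small numpy sketch of this conditioning rule (the example joint distribution is an illustrative assumption):

import numpy as np

def gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    # Mean and covariance of p(x | y = y_obs) for a joint Gaussian
    mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)
    S_cond = S_xx - S_xy @ np.linalg.solve(S_yy, S_xy.T)
    return mu_cond, S_cond

# Illustrative usage with a 2-D joint (1-D x, 1-D y)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx, S_xy, S_yy = np.array([[1.0]]), np.array([[0.6]]), np.array([[2.0]])
mu_c, S_c = gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs=np.array([2.0]))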
Marginal
[Figure: joint density p(x, y) and the marginal p(x)]
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)$$
Marginal distribution:
$$p(x) = \int p(x, y)\, dy = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$$
§ The marginal of a joint Gaussian distribution is Gaussian
§ Intuitively: ignore (integrate out) everything you are not interested in
The Gaussian Distribution in the Limit
Consider the joint Gaussian distribution $p(x, \bar{x})$, where $x \in \mathbb{R}^D$ and $\bar{x} \in \mathbb{R}^k$, $k \to \infty$, are random variables. Then
$$p(x, \bar{x}) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_{\bar{x}} \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{x\bar{x}} \\ \Sigma_{\bar{x}x} & \Sigma_{\bar{x}\bar{x}} \end{bmatrix}\right)$$
where $\Sigma_{\bar{x}\bar{x}} \in \mathbb{R}^{k\times k}$ and $\Sigma_{x\bar{x}} \in \mathbb{R}^{D\times k}$, $k \to \infty$.
However, the marginal remains finite:
$$p(x) = \int p(x, \bar{x})\, d\bar{x} = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$$
where we integrate out an infinite number of random variables $\bar{x}_i$.
Marginal and Conditional in the Limit
§ In practice, we consider finite training and test data $x_\text{train}$, $x_\text{test}$
§ Then, $x = \{x_\text{train}, x_\text{test}, x_\text{other}\}$ ($x_\text{other}$ plays the role of $\bar{x}$ from the previous slide)
$$p(x) = \mathcal{N}\left(\begin{bmatrix} \mu_\text{train} \\ \mu_\text{test} \\ \mu_\text{other} \end{bmatrix}, \begin{bmatrix} \Sigma_\text{train} & \Sigma_\text{train,test} & \Sigma_\text{train,other} \\ \Sigma_\text{test,train} & \Sigma_\text{test} & \Sigma_\text{test,other} \\ \Sigma_\text{other,train} & \Sigma_\text{other,test} & \Sigma_\text{other} \end{bmatrix}\right)$$
$$p(x_\text{train}, x_\text{test}) = \int p(x_\text{train}, x_\text{test}, x_\text{other})\, dx_\text{other}$$
$$p(x_\text{test}\,|\,x_\text{train}) = \mathcal{N}\big(\mu_*, \Sigma_*\big)$$
$$\mu_* = \mu_\text{test} + \Sigma_\text{test,train}\Sigma_\text{train}^{-1}(x_\text{train} - \mu_\text{train})$$
$$\Sigma_* = \Sigma_\text{test} - \Sigma_\text{test,train}\Sigma_\text{train}^{-1}\Sigma_\text{train,test}$$
Gaussian Process Training: Hierarchical Inference
§ Level-1 inference (posterior on $f$):
$$p(f\,|\,X, y, \theta) = \frac{p(y\,|\,X, f)\, p(f\,|\,X, \theta)}{p(y\,|\,X, \theta)}, \qquad p(y\,|\,X, \theta) = \int p(y\,|\,f, X)\, p(f\,|\,X, \theta)\, df$$
§ Level-2 inference (posterior on $\theta$):
$$p(\theta\,|\,X, y) = \frac{p(y\,|\,X, \theta)\, p(\theta)}{p(y\,|\,X)}$$
[Figure: graphical model with inputs $x_i$, latent $f$, observations $y_i$ (plate over $N$), hyper-parameters $\theta$ and $\sigma_n$]
GP as the Limit of an Infinite RBF Network
Consider the universal function approximator
$$f(x) = \sum_{i\in\mathbb{Z}} \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} \gamma_n \exp\left(-\frac{\big(x - (i + \tfrac{n}{N})\big)^2}{\lambda^2}\right), \quad x \in \mathbb{R},\ \lambda \in \mathbb{R}_+$$
with $\gamma_n \sim \mathcal{N}(0, 1)$ (random weights)
Gaussian-shaped basis functions (with variance $\lambda^2/2$) everywhere on the real axis
$$f(x) = \sum_{i\in\mathbb{Z}} \int_i^{i+1} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) ds = \int_{-\infty}^{\infty} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) ds$$
§ Mean: $\mathbb{E}[f(x)] = 0$
§ Covariance: $\mathrm{Cov}[f(x), f(x')] = \theta_1^2 \exp\left(-\frac{(x - x')^2}{2\lambda^2}\right)$ for suitable $\theta_1^2$
GP with mean 0 and Gaussian covariance function
References I
[1] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[2] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, pages 2618-2626, 2012.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508-1524, March 2009.
[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865-1871, 2012.
[6] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In Advances in Neural Information Processing Systems, pages 3156-3164. Curran Associates, Inc., 2013.
[7] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian Process Model Based Predictive Control. In Proceedings of the 2004 American Control Conference (ACC 2004), pages 2214-2219, Boston, MA, USA, June-July 2004.
[8] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235-284, February 2008.
[9] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1-11, 2014.
[10] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 109-120. IEEE Computer Society, 2008.
[11] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939-1960, 2005.
[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[13] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984), February 2013.