Data Analysis and Probabilistic Inference
Gaussian Processes
Recommended reading: Rasmussen & Williams, Chapters 1, 2, 4, 5; Deisenroth & Ng (2015) [3]
Marc Deisenroth
Department of Computing, Imperial College London
February 22, 2017
http://www.gaussianprocess.org/
Problem Setting
[Figure: observed data points, f(x) vs. x]
Objective
For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f)$ that explains the data.
Probabilistic regression problem
Recap from CO-496: Bayesian Linear Regression
§ Linear Regression Model:
$$f(x) = \phi(x)^\top w, \quad w \sim \mathcal{N}(0, \Sigma_p)$$
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
§ Integrating out the parameters when predicting leads to a distribution over functions:
$$p(f(x_*)\,|\,x_*, X, y) = \int p(f(x_*)\,|\,x_*, w)\, p(w\,|\,X, y)\, dw = \mathcal{N}\big(\mu(x_*), \sigma^2(x_*)\big)$$
$$\mu(x_*) = \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y$$
$$\sigma^2(x_*) = \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*$$
$$K = \Phi^\top \Sigma_p \Phi$$
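As a concrete illustration of these predictive equations, here is a minimal numpy sketch (not from the slides; the feature map, toy data, and all variable names are illustrative assumptions):

import numpy as np

def blr_predict(X, y, x_star, phi, Sigma_p, sigma_n):
    # Bayesian linear regression predictive mean and variance at a test input x_star
    Phi = np.stack([phi(x) for x in X], axis=1)          # feature matrix of training inputs, shape (M, N)
    phi_s = phi(x_star)                                  # features of the test input, shape (M,)
    K = Phi.T @ Sigma_p @ Phi                            # K = Phi^T Sigma_p Phi
    A = np.linalg.inv(K + sigma_n**2 * np.eye(len(y)))   # (K + sigma_n^2 I)^{-1}
    mean = phi_s @ Sigma_p @ Phi @ A @ y
    var = phi_s @ Sigma_p @ phi_s - phi_s @ Sigma_p @ Phi @ A @ Phi.T @ Sigma_p @ phi_s
    return mean, var

# Illustrative usage with polynomial features and three noisy observations
phi = lambda x: np.array([1.0, x, x**2])
X = np.array([-1.0, 0.0, 1.5])
y = np.array([0.2, -0.1, 1.3])
mean, var = blr_predict(X, y, x_star=0.5, phi=phi, Sigma_p=np.eye(3), sigma_n=0.1)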
Sampling from the Prior over Functions
Consider a linear regression setting
$$y = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
[Figure: prior samples of (a, b) in the a-b plane]
Sampling from the Posterior over Functions
Consider a linear regression setting
$$y = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$$
[Figure: posterior samples of (a, b) in the a-b plane]
Fitting Nonlinear Functions
§ Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features
§ Example: Radial-basis-function (RBF) network
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad w_i \sim \mathcal{N}(0, \sigma_p^2)$$
where
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$
for given "centers" $\mu_i$
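A small numpy sketch of such an RBF network with weights drawn from the prior (the centers, grid, and number of samples are illustrative assumptions, chosen to mirror the following illustration):

import numpy as np

def rbf_features(x, centers):
    # phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs x
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)   # shape (len(x), n)

centers = np.linspace(-5, 3, 25)      # 25 centers, linearly spaced in [-5, 3]
x_grid = np.linspace(-5, 5, 200)
Phi = rbf_features(x_grid, centers)   # shape (200, 25)

# Sample functions f(x) = sum_i w_i phi_i(x) with weights w ~ N(0, I)
rng = np.random.default_rng(0)
W = rng.standard_normal((25, 5))      # five independent weight vectors
f_samples = Phi @ W                   # each column is one sampled function on x_grid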
Illustration: Fitting a Radial Basis Function Network
$$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$$
[Figure: the basis functions φ_i(x) plotted over x]
§ Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$
Samples from the RBF Prior
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad p(w) = \mathcal{N}(0, I)$$
[Figure: function samples f(x) drawn from the RBF prior]
Samples from the RBF Posterior
$$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \quad p(w\,|\,X, y) = \mathcal{N}(m_N, S_N)$$
[Figure: function samples f(x) drawn from the RBF posterior]
RBF Posterior
[Figure: RBF posterior mean and uncertainty, f(x) vs. x]
Limitations
[Figure: RBF posterior, f(x) vs. x, with no basis functions to the right of the data]
§ Feature engineering
§ Finite number of features:
  § Above: Without basis functions on the right, we cannot express any variability of the function
  § Ideally: Add more (infinitely many) basis functions
Approach
§ Instead of sampling parameters, which induce a distribution over functions, sample functions directly
Make assumptions on the distribution of functions
§ Intuition: a function is an infinitely long vector of function values
Make assumptions on the distribution of function values
Gaussian Process
§ We will place a distribution $p(f)$ on functions $f$
§ Informally, a function can be considered an infinitely long vector of function values $f = [f_1, f_2, f_3, \dots]$
§ A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Definition: A Gaussian process (GP) is a collection of random variables $f_1, f_2, \dots$, any finite number of which is Gaussian distributed.
§ A Gaussian distribution is specified by a mean vector $\mu$ and a covariance matrix $\Sigma$
§ A Gaussian process is specified by a mean function $m(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$
Covariance Function
§ The covariance function (kernel) is symmetric and positive semi-definite
§ It allows us to compute covariances between (unknown) function values by just looking at the corresponding inputs:
$$\mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$$
GP Regression as a Bayesian Inference Problem
Objective
For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, find a (posterior) distribution over functions $p(f\,|\,X, y)$ that explains the data
Training data: X, y. Bayes' theorem yields
$$p(f\,|\,X, y) = \frac{p(y\,|\,f, X)\, p(f)}{p(y\,|\,X)}$$
Prior: $p(f) = \mathcal{GP}(m, k)$. Specify mean function $m$ and kernel $k$.
Likelihood (noise model): $p(y\,|\,f, X) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$
Marginal likelihood (evidence): $p(y\,|\,X) = \int p(y\,|\,f(X))\, p(f\,|\,X)\, df$
Posterior: $p(f\,|\,y, X) = \mathcal{GP}(m_\text{post}, k_\text{post})$
Prior over Functions
§ Treat a function as a long vector of function values: $f = [f_1, f_2, \dots]$
Look at a distribution over function values $f_i = f(x_i)$
§ Consider a finite number $N$ of function values $f$ and all other (infinitely many) function values $\tilde{f}$. Informally:
$$p(f, \tilde{f}) = \mathcal{N}\left(\begin{bmatrix} \mu_f \\ \mu_{\tilde{f}} \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f\tilde{f}} \\ \Sigma_{\tilde{f}f} & \Sigma_{\tilde{f}\tilde{f}} \end{bmatrix}\right)$$
where $\Sigma_{\tilde{f}\tilde{f}} \in \mathbb{R}^{m\times m}$ and $\Sigma_{f\tilde{f}} \in \mathbb{R}^{N\times m}$, $m \to \infty$.
§ $\Sigma_{ff}^{(i,j)} = \mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$
§ Key property: The marginal remains finite
$$p(f) = \int p(f, \tilde{f})\, d\tilde{f} = \mathcal{N}\big(\mu_f, \Sigma_{ff}\big)$$
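A short numpy sketch of this key property in action: evaluate the kernel on a finite set of inputs and sample the corresponding finite marginal N(0, Sigma_ff) with a Cholesky factor (the Gaussian kernel, the input grid, and the jitter value are illustrative assumptions):

import numpy as np

def gauss_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^2 / ell^2) for 1-D inputs
    d2 = (X1[:, None] - X2[None, :])**2
    return sigma_f**2 * np.exp(-d2 / ell**2)

x_grid = np.linspace(-5, 5, 100)                               # a finite set of inputs
Sigma_ff = gauss_kernel(x_grid, x_grid)                        # finite marginal covariance
L = np.linalg.cholesky(Sigma_ff + 1e-8 * np.eye(len(x_grid)))  # jitter for numerical stability

rng = np.random.default_rng(1)
f_prior = L @ rng.standard_normal((len(x_grid), 3))            # three samples from p(f) = N(0, Sigma_ff)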
Training and Test Marginal
§ In practice, we always have finite training and test inputs $x_\text{train}$, $x_\text{test}$.
§ Define $f_* := f_\text{test}$, $f := f_\text{train}$.
§ Then, we obtain the finite marginal
$$p(f, f_*) = \int p(f, f_*, f_\text{other})\, df_\text{other} = \mathcal{N}\left(\begin{bmatrix} \mu_f \\ \mu_* \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f*} \\ \Sigma_{*f} & \Sigma_{**} \end{bmatrix}\right)$$
GP Regression as a Bayesian Inference Problem (ctd.)
Posterior over functions (with training data X, y):
$$p(f\,|\,X, y) = \frac{p(y\,|\,f, X)\, p(f\,|\,X)}{p(y\,|\,X)}$$
Using the properties of Gaussians, we obtain
$$p(y\,|\,f, X)\, p(f\,|\,X) = \mathcal{N}\big(y\,|\,f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X)\,|\,m(X), K\big)$$
$$= Z\, \mathcal{N}\big(f(X)\,\big|\, \underbrace{m(X) + K(K + \sigma_n^2 I)^{-1}(y - m(X))}_{\text{posterior mean}},\ \underbrace{K - K(K + \sigma_n^2 I)^{-1}K}_{\text{posterior covariance}}\big)$$
$$K = k(X, X)$$
Marginal likelihood:
$$Z = p(y\,|\,X) = \int p(y\,|\,f, X)\, p(f\,|\,X)\, df = \mathcal{N}\big(y\,|\,m(X), K + \sigma_n^2 I\big)$$
GP Predictions (1)
$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
§ Objective: Find $p(f(X_*)\,|\,X, y)$ for training data $X, y$ and test inputs $X_*$.
§ GP prior: $p(f\,|\,X) = \mathcal{N}\big(m(X), K\big)$
§ Gaussian likelihood: $p(y\,|\,f(X)) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$
§ With $f \sim \mathcal{GP}$ it follows that $f, f_*$ are jointly Gaussian distributed:
$$p(f, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
§ Due to the Gaussian likelihood, we also get ($f$ is unobserved)
$$p(y, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
GP Predictions (2)
Prior:
$$p(y, f_*\,|\,X, X_*) = \mathcal{N}\left(\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix}\right)$$
Posterior predictive distribution $p(f_*\,|\,X, y, X_*)$ at test inputs $X_*$, obtained by Gaussian conditioning:
$$p(f_*\,|\,X, y, X_*) = \mathcal{N}\big(\mathbb{E}[f_*\,|\,X, y, X_*],\ \mathbb{V}[f_*\,|\,X, y, X_*]\big)$$
$$\mathbb{E}[f_*\,|\,X, y, X_*] = m_\text{post}(X_*) = \underbrace{m(X_*)}_{\text{prior mean}} + k(X_*, X)(K + \sigma_n^2 I)^{-1}(y - m(X))$$
$$\mathbb{V}[f_*\,|\,X, y, X_*] = k_\text{post}(X_*, X_*) = \underbrace{k(X_*, X_*)}_{\text{prior variance}} - k(X_*, X)(K + \sigma_n^2 I)^{-1} k(X, X_*)$$
From now on: Set the prior mean function $m \equiv 0$
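A compact numpy sketch of these prediction equations with a zero prior mean (the kernel, the solver choice, and the toy data are illustrative assumptions):

import numpy as np

def gp_predict(X, y, X_star, kernel, sigma_n):
    # Posterior predictive mean and covariance of f_* for a zero-mean GP prior
    K = kernel(X, X)                       # k(X, X)
    K_s = kernel(X, X_star)                # k(X, X_*)
    K_ss = kernel(X_star, X_star)          # k(X_*, X_*)
    Ky = K + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    alpha = np.linalg.solve(Ky, y)         # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                   # posterior mean
    V = np.linalg.solve(Ky, K_s)           # (K + sigma_n^2 I)^{-1} k(X, X_*)
    cov = K_ss - K_s.T @ V                 # posterior covariance
    return mean, cov

# Illustrative usage with a Gaussian kernel on 1-D inputs
kernel = lambda A, B: np.exp(-(A[:, None] - B[None, :])**2)
X = np.array([-3.0, -1.0, 0.0, 2.0])
y = np.sin(X)
mean, cov = gp_predict(X, y, np.linspace(-5, 5, 50), kernel, sigma_n=0.1)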
Illustration: Inference with Gaussian Processes
[Figure: samples from the GP prior over functions, f(x) vs. x]
Prior belief about the function
Predictive (marginal) mean and variance:
$$\mathbb{E}[f(x_*)\,|\,x_*, \emptyset] = m(x_*) = 0$$
$$\mathbb{V}[f(x_*)\,|\,x_*, \emptyset] = \sigma^2(x_*) = k(x_*, x_*)$$
Illustration: Inference with Gaussian Processes
[Figure: GP posterior over functions after observing data, f(x) vs. x]
Posterior belief about the function
Predictive (marginal) mean and variance:
$$\mathbb{E}[f(x_*)\,|\,x_*, X, y] = m(x_*) = k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} y$$
$$\mathbb{V}[f(x_*)\,|\,x_*, X, y] = \sigma^2(x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} k(X, x_*)$$
Covariance Function
§ A Gaussian process is fully specified by a mean function $m$ and a kernel/covariance function $k$
§ The covariance function (kernel) is symmetric and positive semi-definite
§ The covariance function encodes high-level structural assumptions about the latent function $f$ (e.g., smoothness, differentiability, periodicity)
Gaussian Covariance Function
$$k_\text{Gauss}(x_i, x_j) = \sigma_f^2 \exp\big(-(x_i - x_j)^\top (x_i - x_j)/\ell^2\big)$$
§ $\sigma_f$: Amplitude of the latent function
§ $\ell$: Length scale. How far do we have to move in input space before the function value changes significantly? Smoothness parameter
§ Assumption on latent function: smooth (infinitely differentiable)
Length-Scales
Length scales determine how wiggly the function is and how much information we can transfer to other function values
[Figure: GP fits to the same data with different length-scales, f(x) vs. x]
Matérn Covariance Function
$$k_{\text{Mat},3/2}(x_i, x_j) = \sigma_f^2 \left(1 + \frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right)$$
§ $\sigma_f$: Amplitude of the latent function
§ $\ell$: Length scale. How far do we have to move in input space before the function value changes significantly?
§ Assumption on latent function: once differentiable
Periodic Covariance Function
$$k_\text{per}(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{2\sin^2\big(\tfrac{\kappa(x_i - x_j)}{2}\big)}{\ell^2}\right) = k_\text{Gauss}\big(u(x_i), u(x_j)\big), \quad u(x) = \begin{bmatrix} \cos(\kappa x) \\ \sin(\kappa x) \end{bmatrix}$$
$\kappa$: Periodicity parameter
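A numpy sketch of the three covariance functions above for 1-D inputs (the parameterizations mirror the slides' definitions; the default hyper-parameter values are illustrative assumptions):

import numpy as np

def k_gauss(X1, X2, ell=1.0, sigma_f=1.0):
    d2 = (X1[:, None] - X2[None, :])**2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def k_matern32(X1, X2, ell=1.0, sigma_f=1.0):
    a = np.sqrt(3.0) * np.abs(X1[:, None] - X2[None, :]) / ell
    return sigma_f**2 * (1.0 + a) * np.exp(-a)

def k_periodic(X1, X2, ell=1.0, sigma_f=1.0, kappa=1.0):
    # related to the Gaussian kernel applied to u(x) = [cos(kappa x), sin(kappa x)]
    d = X1[:, None] - X2[None, :]
    return sigma_f**2 * np.exp(-2.0 * np.sin(kappa * d / 2.0)**2 / ell**2)

x = np.linspace(-3, 3, 5)
K = k_matern32(x, x)   # a symmetric, positive semi-definite Gram matrix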
Meta-Parameters of a GP
The GP possesses a set of hyper-parameters:
§ Parameters of the mean function
§ Hyper-parameters of the covariance function (e.g., length-scales and signal variance)
§ Likelihood parameters (e.g., noise variance $\sigma_n^2$)
Train a GP to find a good set of hyper-parameters
Model selection to find good mean and covariance functions (can also be automated: Automatic Statistician (Lloyd et al., 2014))
Gaussian Process Training: Hyper-Parameters
GP Training: Find good GP hyper-parameters $\theta$ (kernel and mean function parameters)
[Figure: graphical model with inputs $x_i$, latent $f$, observations $y_i$ (plate over $N$), hyper-parameters $\theta$ and $\sigma_n$]
§ Place a prior $p(\theta)$ on hyper-parameters
§ Posterior over hyper-parameters:
$$p(\theta\,|\,X, y) = \frac{p(\theta)\, p(y\,|\,X, \theta)}{p(y\,|\,X)}, \qquad p(y\,|\,X, \theta) = \int p(y\,|\,f(X))\, p(f\,|\,X, \theta)\, df$$
§ Choose hyper-parameters $\theta_*$, such that
$$\theta_* \in \arg\max_\theta\ \log p(\theta) + \log p(y\,|\,X, \theta)$$
Maximize the marginal likelihood if $p(\theta) = \mathcal{U}$ (uniform prior)
Training via Marginal Likelihood Maximization
GP Training: Maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy $f$ has been integrated out). Also called maximum likelihood type-II.
Marginal likelihood:
$$p(y\,|\,X, \theta) = \int p(y\,|\,f(X))\, p(f\,|\,X, \theta)\, df = \int \mathcal{N}\big(y\,|\,f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X)\,|\,0, K\big)\, df = \mathcal{N}\big(y\,|\,0, K + \sigma_n^2 I\big)$$
Learning the GP hyper-parameters:
$$\theta_* \in \arg\max_\theta\ \log p(y\,|\,X, \theta)$$
$$\log p(y\,|\,X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \quad K_\theta := K + \sigma_n^2 I$$
Training via Marginal Likelihood Maximization
Log-marginal likelihood:
$$\log p(y\,|\,X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \quad K_\theta := K + \sigma_n^2 I$$
§ Automatic trade-off between data fit and model complexity
§ Gradient-based optimization of hyper-parameters $\theta$:
$$\frac{\partial \log p(y\,|\,X, \theta)}{\partial \theta_i} = \tfrac{1}{2} y^\top K_\theta^{-1} \frac{\partial K_\theta}{\partial \theta_i} K_\theta^{-1} y - \tfrac{1}{2}\,\mathrm{tr}\Big(K_\theta^{-1}\frac{\partial K_\theta}{\partial \theta_i}\Big) = \tfrac{1}{2}\,\mathrm{tr}\Big((\alpha\alpha^\top - K_\theta^{-1})\frac{\partial K_\theta}{\partial \theta_i}\Big), \quad \alpha := K_\theta^{-1} y$$
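A numpy sketch of the log-marginal likelihood and the trace form of its gradient for a single hyper-parameter, here the log length-scale of a Gaussian kernel (this particular kernel and parameterization are illustrative assumptions):

import numpy as np

def lml_and_grad(X, y, log_ell, sigma_f, sigma_n):
    # log p(y|X, theta) and its derivative w.r.t. log(ell) for a Gaussian kernel
    ell = np.exp(log_ell)
    d2 = (X[:, None] - X[None, :])**2
    K = sigma_f**2 * np.exp(-d2 / ell**2)
    K_theta = K + sigma_n**2 * np.eye(len(X))              # K_theta = K + sigma_n^2 I
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # alpha = K_theta^{-1} y
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L)))                    # equals -0.5 * log|K_theta|
           - 0.5 * len(X) * np.log(2.0 * np.pi))
    dK = K * (2.0 * d2 / ell**2)                           # dK_theta / d(log ell)
    K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)
    return lml, grad

X = np.linspace(-5, 5, 30)
y = np.sin(X) + 0.1 * np.random.default_rng(2).standard_normal(30)
lml, grad = lml_and_grad(X, y, log_ell=0.0, sigma_f=1.0, sigma_n=0.1)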
Example: Training Data
[Figure: training data, y vs. x]
Example: Marginal Likelihood Contour
[Figure: log-marginal likelihood contours over log-noise and log-length-scales, N = 20]
Example: Exploring the Modes (1)
[Figure: GP fit corresponding to one mode of the log-marginal likelihood, y vs. x]
Example: Exploring the Modes (2)
[Figure: GP fit corresponding to another mode of the log-marginal likelihood, y vs. x]
Marginal Likelihood (1)-(9)
[Figures: log-marginal likelihood contours over log-noise and log-length-scales for growing training set sizes N = 2, 3, 5, 10, 15, 20, 50, 100, 200]
Marginal Likelihood and Parameter Learning
§ The marginal likelihood is non-convex
§ In particular in the very-small-data regime, a GP can end up in three different modes when optimizing the hyper-parameters:
  § Overfitting (unlikely, but possible)
  § Underfitting (everything is considered noise)
  § Good fit
§ Re-start the hyper-parameter optimization from random initializations to mitigate the problem
§ With increasing data set size the GP typically ends up in the "good-fit" mode. Overfitting (indicator: small length-scales and small noise variance) is very unlikely.
§ Ideally, we would integrate the hyper-parameters out. Why can we not do this easily?
Model Selection—Mean Function and Kernel
§ Assume we have a finite set of models $M_i$, each one specifying a mean function $m_i$ and a kernel $k_i$. How do we find the best one?
§ Some options:
  § BIC, AIC (see CO-496)
  § Compare marginal likelihood values (assuming a uniform prior on the set of models)
Example
[Figures: GP fits to the same data with four different kernels and their log-marginal likelihoods: constant kernel, LML = -1.1073; linear kernel, LML = -1.0065; Matérn kernel, LML = -0.8625; Gaussian kernel, LML = -0.69308]
§ Four different kernels (mean function fixed to $m \equiv 0$)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood values for each (optimized) model
Application Areas
[Figure: example GP model over angle (rad) and angular velocity (rad/s)]
§ Reinforcement learning and robotics: model value functions and/or dynamics with GPs
§ Bayesian optimization (experimental design): model unknown utility functions with GPs
§ Geostatistics: spatial modeling (e.g., landscapes, resources)
§ Sensor networks
§ Time-series modeling and forecasting
Limitations of Gaussian Processes
Computational and memory complexity (training set size $N$):
§ Training scales in $\mathcal{O}(N^3)$
§ Prediction (variances) scales in $\mathcal{O}(N^2)$
§ Memory requirement: $\mathcal{O}(ND + N^2)$
Practical limit: $N \approx 10{,}000$
Tips and Tricks for Practitioners
§ To set initial hyper-parameters, use domain knowledge if possible.
§ Standardize the input data and set the initial length-scales $\ell$ to $\approx 0.5$.
§ Standardize the targets $y$ and set the initial signal variance to $\sigma_f \approx 1$.
§ Often useful: Set the initial noise level relatively high (e.g., $\sigma_n \approx 0.5 \times \sigma_f$), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.
§ When optimizing hyper-parameters, use random restarts or other tricks to avoid local optima.
§ Mitigate the problem of numerical instability (Cholesky decomposition of $K + \sigma_n^2 I$) by penalizing high signal-to-noise ratios $\sigma_f / \sigma_n$
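A minimal numpy sketch of these initialization heuristics (the function name and the constants simply mirror the bullet points above; they are a starting point, not a fixed recipe):

import numpy as np

def standardize_and_init(X, y):
    # Standardize inputs and targets, then return heuristic initial hyper-parameters
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_std = (y - y.mean()) / y.std()
    ell0 = 0.5                  # initial length-scale for standardized inputs
    sigma_f0 = 1.0              # initial signal amplitude for standardized targets
    sigma_n0 = 0.5 * sigma_f0   # relatively high initial noise level
    return X_std, y_std, ell0, sigma_f0, sigma_n0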
Appendix
The Gaussian Distribution
$$p(x\,|\,\mu, \Sigma) = (2\pi)^{-\frac{D}{2}} |\Sigma|^{-\frac{1}{2}} \exp\big(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\big)$$
§ Mean vector $\mu$: average of the data
§ Covariance matrix $\Sigma$: spread of the data
[Figures: 1-D Gaussian density p(x) with mean and 95% confidence bound; 2-D Gaussian density p(x, y)]
Sampling from a Multivariate Gaussian
Objective
Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.
However, we only have access to a random number generator that can sample $x$ from $\mathcal{N}(0, I)$...
Exploit that affine transformations $y = Ax + b$ of a Gaussian random variable $x$ remain Gaussian
§ Mean: $\mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b$
§ Covariance: $\mathbb{V}_x[Ax + b] = A\,\mathbb{V}_x[x]\,A^\top$
1. Find conditions for A, b to match the mean of y
2. Find conditions for A, b to match the covariance of y
Sampling from a Multivariate Gaussian (2)
Objective
Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.
x = randn(D,1);            % Sample x ~ N(0, I)
y = chol(Sigma)'*x + mu;   % Scale x and add offset
Here chol(Sigma) is the Cholesky factor $L$, such that $L^\top L = \Sigma$.
Therefore, the mean and covariance of $y$ are
$$\mathbb{E}[y] = \bar{y} = \mathbb{E}[L^\top x + \mu] = L^\top \mathbb{E}[x] + \mu = \mu$$
$$\mathrm{Cov}[y] = \mathbb{E}[(y - \bar{y})(y - \bar{y})^\top] = \mathbb{E}[L^\top x x^\top L] = L^\top \mathbb{E}[x x^\top] L = L^\top L = \Sigma$$
Conditional
[Figure: joint density p(x, y), an observation y, and the conditional p(x | y)]
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)$$
$$p(x\,|\,y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big)$$
$$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$$
$$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$$
The conditional $p(x\,|\,y)$ is also Gaussian. Computationally convenient.
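A small numpy sketch of this conditioning rule (the example joint distribution is an illustrative assumption):

import numpy as np

def gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    # Mean and covariance of p(x | y = y_obs) for a joint Gaussian
    mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)
    S_cond = S_xx - S_xy @ np.linalg.solve(S_yy, S_xy.T)
    return mu_cond, S_cond

# Illustrative usage with a 2-D joint (1-D x, 1-D y)
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx, S_xy, S_yy = np.array([[1.0]]), np.array([[0.6]]), np.array([[2.0]])
mu_c, S_c = gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs=np.array([2.0]))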
Marginal
[Figure: joint density p(x, y) and the marginal p(x)]
$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)$$
Marginal distribution:
$$p(x) = \int p(x, y)\, dy = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$$
§ The marginal of a joint Gaussian distribution is Gaussian
§ Intuitively: ignore (integrate out) everything you are not interested in
The Gaussian Distribution in the Limit
Consider the joint Gaussian distribution $p(x, \bar{x})$, where $x \in \mathbb{R}^D$ and $\bar{x} \in \mathbb{R}^k$, $k \to \infty$, are random variables. Then
$$p(x, \bar{x}) = \mathcal{N}\left(\begin{bmatrix} \mu_x \\ \mu_{\bar{x}} \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{x\bar{x}} \\ \Sigma_{\bar{x}x} & \Sigma_{\bar{x}\bar{x}} \end{bmatrix}\right)$$
where $\Sigma_{\bar{x}\bar{x}} \in \mathbb{R}^{k\times k}$ and $\Sigma_{x\bar{x}} \in \mathbb{R}^{D\times k}$, $k \to \infty$.
However, the marginal remains finite:
$$p(x) = \int p(x, \bar{x})\, d\bar{x} = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$$
where we integrate out an infinite number of random variables $\bar{x}_i$.
Marginal and Conditional in the Limit
§ In practice, we consider finite training and test data $x_\text{train}$, $x_\text{test}$
§ Then, $x = \{x_\text{train}, x_\text{test}, x_\text{other}\}$ ($x_\text{other}$ plays the role of $\bar{x}$ from the previous slide)
$$p(x) = \mathcal{N}\left(\begin{bmatrix} \mu_\text{train} \\ \mu_\text{test} \\ \mu_\text{other} \end{bmatrix}, \begin{bmatrix} \Sigma_\text{train} & \Sigma_\text{train,test} & \Sigma_\text{train,other} \\ \Sigma_\text{test,train} & \Sigma_\text{test} & \Sigma_\text{test,other} \\ \Sigma_\text{other,train} & \Sigma_\text{other,test} & \Sigma_\text{other} \end{bmatrix}\right)$$
$$p(x_\text{train}, x_\text{test}) = \int p(x_\text{train}, x_\text{test}, x_\text{other})\, dx_\text{other}$$
$$p(x_\text{test}\,|\,x_\text{train}) = \mathcal{N}\big(\mu_*, \Sigma_*\big)$$
$$\mu_* = \mu_\text{test} + \Sigma_\text{test,train}\Sigma_\text{train}^{-1}(x_\text{train} - \mu_\text{train})$$
$$\Sigma_* = \Sigma_\text{test} - \Sigma_\text{test,train}\Sigma_\text{train}^{-1}\Sigma_\text{train,test}$$
Gaussian Process Training: Hierarchical Inference
§ Level-1 inference (posterior on $f$):
$$p(f\,|\,X, y, \theta) = \frac{p(y\,|\,X, f)\, p(f\,|\,X, \theta)}{p(y\,|\,X, \theta)}, \qquad p(y\,|\,X, \theta) = \int p(y\,|\,f, X)\, p(f\,|\,X, \theta)\, df$$
§ Level-2 inference (posterior on $\theta$):
$$p(\theta\,|\,X, y) = \frac{p(y\,|\,X, \theta)\, p(\theta)}{p(y\,|\,X)}$$
[Figure: graphical model with inputs $x_i$, latent $f$, observations $y_i$ (plate over $N$), hyper-parameters $\theta$ and $\sigma_n$]
GP as the Limit of an Infinite RBF Network
Consider the universal function approximator
$$f(x) = \sum_{i\in\mathbb{Z}} \lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} \gamma_n \exp\left(-\frac{\big(x - (i + \tfrac{n}{N})\big)^2}{\lambda^2}\right), \quad x \in \mathbb{R},\ \lambda \in \mathbb{R}_+$$
with $\gamma_n \sim \mathcal{N}(0, 1)$ (random weights)
Gaussian-shaped basis functions (with variance $\lambda^2/2$) everywhere on the real axis
$$f(x) = \sum_{i\in\mathbb{Z}} \int_i^{i+1} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) ds = \int_{-\infty}^{\infty} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) ds$$
§ Mean: $\mathbb{E}[f(x)] = 0$
§ Covariance: $\mathrm{Cov}[f(x), f(x')] = \theta_1^2 \exp\left(-\frac{(x - x')^2}{2\lambda^2}\right)$ for suitable $\theta_1^2$
GP with mean 0 and Gaussian covariance function
References I
[1] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[2] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, pages 2618-2626, 2012.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9):1508-1524, March 2009.
[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865-1871, 2012.
[6] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In Advances in Neural Information Processing Systems, pages 3156-3164. Curran Associates, Inc., 2013.
[7] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian Process Model Based Predictive Control. In Proceedings of the 2004 American Control Conference (ACC 2004), pages 2214-2219, Boston, MA, USA, June-July 2004.
[8] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235-284, February 2008.
[9] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1-11, 2014.
[10] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 109-120. IEEE Computer Society, 2008.
[11] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939-1960, 2005.
[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[13] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984), February 2013.