Introduction to Gaussian Process Regression
Hanna M. Wallach
January 25, 2005
Outline
Regression: weight-space view
Regression: function-space view (Gaussian processes)
Weight-space and function-space correspondence
Making predictions
Model selection: hyperparameters
Supervised Learning: Regression (1)
[Figure: underlying function and noisy data, with training data marked; axes: input x vs. output f(x).]
Assume an underlying process which generates “clean” data.
Goal: recover underlying process from noisy observed data.
Supervised Learning: Regression (2)
Training data are $D = \{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \dots, n\}$.
Each input is a vector $\mathbf{x}$ of dimension $d$.
Each target is a real-valued scalar $y = f(\mathbf{x}) + \text{noise}$.
Collect the inputs in a $d \times n$ matrix $X$, and the targets in a vector $\mathbf{y}$: $D = \{X, \mathbf{y}\}$.
We wish to infer $f_*$ for an unseen input $\mathbf{x}_*$, using $P(f_* \mid \mathbf{x}_*, D)$.
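As a concrete sketch of this setup (Python/NumPy; the sine function and the noise level 0.1 are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "clean" underlying function (an assumption for illustration).
    return np.sin(2 * np.pi * x)

n, d = 20, 1
X = rng.uniform(-1, 1, size=(d, n))           # inputs collected as columns of a d x n matrix
y = f(X[0]) + rng.normal(0, 0.1, size=n)      # noisy targets y^(i) = f(x^(i)) + noise
```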
Gaussian Process Models: Inference in Function Space
[Figures: samples from the prior and samples from the posterior; axes: input x vs. output f(x).]
A Gaussian process defines a distribution over functions.
Inference takes place directly in function space.
Part I
Regression: The Weight-Space View
Bayesian Linear Regression (1)
[Figure: training data; axes: input x vs. output f(x).]
Assuming noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the linear regression model is:
$f(\mathbf{x} \mid \mathbf{w}) = \mathbf{x}^{\top}\mathbf{w}, \quad y = f + \epsilon.$
Bayesian Linear Regression (2)
The likelihood of the parameters is:
$P(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X^{\top}\mathbf{w}, \sigma^2 I).$
Assume a Gaussian prior over the parameters:
$P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \Sigma_p).$
Apply Bayes' theorem to obtain the posterior:
$P(\mathbf{w} \mid \mathbf{y}, X) \propto P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w}).$
Bayesian Linear Regression (3)
The posterior distribution over $\mathbf{w}$ is:
$P(\mathbf{w} \mid \mathbf{y}, X) = \mathcal{N}\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \quad \text{where } A = \Sigma_p^{-1} + \tfrac{1}{\sigma^2} X X^{\top}.$
The predictive distribution is:
$P(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int f(\mathbf{x}_* \mid \mathbf{w})\, P(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\left(\tfrac{1}{\sigma^2}\, \mathbf{x}_*^{\top} A^{-1} X \mathbf{y},\; \mathbf{x}_*^{\top} A^{-1} \mathbf{x}_*\right).$
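These two formulas translate directly into a short NumPy sketch (the function and variable names are my own; inputs are stored as the columns of a $d \times n$ matrix $X$, as above):

```python
import numpy as np

def blr_posterior_predictive(X, y, x_star, Sigma_p, sigma2):
    """Bayesian linear regression: posterior over w and predictive at x_star.

    X: d x n training inputs (one column per input), y: length-n targets,
    x_star: length-d test input, Sigma_p: prior covariance, sigma2: noise variance.
    """
    A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2   # A = Sigma_p^{-1} + XX^T / sigma^2
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2                   # posterior mean of w
    f_mean = x_star @ w_mean                          # predictive mean: x_*^T A^{-1} X y / sigma^2
    f_var = x_star @ A_inv @ x_star                   # predictive variance: x_*^T A^{-1} x_*
    return w_mean, f_mean, f_var
```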
Increasing Expressiveness
Use a set of basis functions $\Phi(\mathbf{x})$ to project a $d$-dimensional input $\mathbf{x}$ into an $m$-dimensional feature space, e.g. $\Phi(x) = (1, x, x^2, \dots)$.
$P(f_* \mid \mathbf{x}_*, X, \mathbf{y})$ can be expressed in terms of inner products in feature space.
We can now use the kernel trick, as sketched below.
How many basis functions should we use?
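To make the kernel trick concrete, here is a minimal sketch of the equivalent kernel in feature space (the polynomial basis and the names are my own choices):

```python
import numpy as np

def phi(x, m=4):
    # Polynomial basis Phi(x) = (1, x, x^2, ..., x^{m-1}); one possible choice.
    return np.array([x**j for j in range(m)])

def k(x_p, x_q, Sigma_p):
    # Equivalent kernel: an inner product in feature space, weighted by the prior.
    return phi(x_p) @ Sigma_p @ phi(x_q)
```

Once everything is written in terms of $k$, the feature map never needs to be evaluated explicitly, which is what lets $m$ grow large.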
Part II
Regression: The Function-Space View
Gaussian Processes: Definition
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Consistency: if the GP specifies $(y^{(1)}, y^{(2)}) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, then it must also specify $y^{(1)} \sim \mathcal{N}(\mu_1, \Sigma_{11})$.
A GP is completely specified by a mean function and a positive definite covariance function.
Gaussian Processes: A Distribution over Functions
e.g. choose the mean function to be zero, and the covariance function:
$K_{p,q} = \mathrm{Cov}(f(\mathbf{x}^{(p)}), f(\mathbf{x}^{(q)})) = K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}).$
For any set of inputs $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$ we may compute $K$, which defines a joint distribution over function values:
$f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)}) \sim \mathcal{N}(\mathbf{0}, K).$
A GP therefore specifies a distribution over functions.
Gaussian Processes: Simple Example
We can obtain a GP from the Bayesian linear regression model:
$f(\mathbf{x}) = \mathbf{x}^{\top}\mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$.
The mean function is given by:
$\mathbb{E}[f(\mathbf{x})] = \mathbf{x}^{\top}\mathbb{E}[\mathbf{w}] = 0.$
The covariance function is given by:
$\mathbb{E}[f(\mathbf{x})\,f(\mathbf{x}')] = \mathbf{x}^{\top}\mathbb{E}[\mathbf{w}\mathbf{w}^{\top}]\,\mathbf{x}' = \mathbf{x}^{\top}\Sigma_p\, \mathbf{x}'.$
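As a quick Monte Carlo check of this induced covariance (my own sketch; the particular $\Sigma_p$ and test inputs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_p = np.array([[1.0, 0.2], [0.2, 0.5]])
x1, x2 = np.array([1.0, 0.5]), np.array([-0.3, 2.0])

w = rng.multivariate_normal(np.zeros(2), Sigma_p, size=200_000)
f1, f2 = w @ x1, w @ x2
print(np.mean(f1 * f2))   # Monte Carlo estimate of E[f(x1) f(x2)]
print(x1 @ Sigma_p @ x2)  # analytic covariance x1^T Sigma_p x2
```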
The Covariance Function
Specifies the covariance between pairs of random variables.
e.g. the squared exponential covariance function:
$\mathrm{Cov}(f(\mathbf{x}^{(p)}), f(\mathbf{x}^{(q)})) = K(\mathbf{x}^{(p)}, \mathbf{x}^{(q)}) = \exp\left(-\tfrac{1}{2}\,|\mathbf{x}^{(p)} - \mathbf{x}^{(q)}|^2\right).$
[Figure: $K(x^{(p)} = 5, x^{(q)})$ as a function of $x^{(q)}$.]
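A minimal vectorized implementation of this covariance function (a sketch; the name is mine, and inputs are stored one per row, the more common array layout in code):

```python
import numpy as np

def sq_exp_kernel(Xp, Xq):
    """Squared exponential kernel: K[p, q] = exp(-0.5 * |x^(p) - x^(q)|^2).

    Xp: (n, d) and Xq: (m, d) arrays of inputs, one row per input.
    """
    sq_dists = np.sum((Xp[:, None, :] - Xq[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists)
```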
Gaussian Process Prior
Given a set of inputs $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}$, we may draw samples $f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)})$ from the GP prior:
$f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)}) \sim \mathcal{N}(\mathbf{0}, K).$
Four samples:
[Figure: four samples from the prior; axes: input x vs. output f(x).]
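Drawing such samples is straightforward (a self-contained sketch for 1-D inputs; the small jitter added to the diagonal is a standard numerical safeguard, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)                          # inputs x^(1), ..., x^(n)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # squared exponential covariance
K += 1e-10 * np.eye(len(x))                         # jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=4)  # four draws from the prior
```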
Posterior: Noise-Free Observations (1)
Given noise-free training data:
$D = \{(\mathbf{x}^{(i)}, f^{(i)}) \mid i = 1, \dots, n\} = \{X, \mathbf{f}\}.$
We want to make predictions $\mathbf{f}_*$ at test points $X_*$.
According to the GP prior, the joint distribution of $\mathbf{f}$ and $\mathbf{f}_*$ is:
$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right).$
Posterior: Noise-Free Observations (2)
Conditioning $\{X_*, \mathbf{f}_*\}$ on $D = \{X, \mathbf{f}\}$ gives the posterior.
This restricts the prior to contain only those functions which agree with $D$.
The posterior $P(\mathbf{f}_* \mid X_*, X, \mathbf{f})$ is Gaussian, with:
$\boldsymbol{\mu} = K(X_*, X)\, K(X, X)^{-1}\, \mathbf{f}$, and
$\Sigma = K(X_*, X_*) - K(X_*, X)\, K(X, X)^{-1}\, K(X, X_*).$
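In code, the conditioning is a direct transcription of these formulas (a sketch; K is any covariance function, such as sq_exp_kernel above, and in practice one would add jitter or use a Cholesky factorization):

```python
import numpy as np

def gp_posterior_noise_free(K, X, f, X_star):
    """Noise-free GP posterior at test inputs X_star given observations (X, f).

    K: covariance function mapping two (n, d) arrays to an (n, m) matrix.
    """
    K_xx = K(X, X)
    K_sx = K(X_star, X)
    mu = K_sx @ np.linalg.solve(K_xx, f)                             # K(X*, X) K(X, X)^{-1} f
    Sigma = K(X_star, X_star) - K_sx @ np.linalg.solve(K_xx, K_sx.T) # posterior covariance
    return mu, Sigma
```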
Posterior: Noise-Free Observations (3)
[Figure: samples from the posterior; axes: input x vs. output f(x).]
All samples agree with the observations $D = \{X, \mathbf{f}\}$.
The variance is greatest in regions with few training points.
Prediction: Noisy Observations
Typically we have noisy observations:
$D = \{X, \mathbf{y}\}$, where $\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}$.
Assume additive noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$.
Conditioning on $D = \{X, \mathbf{y}\}$ gives a Gaussian with:
$\boldsymbol{\mu} = K(X_*, X)\, [K(X, X) + \sigma^2 I]^{-1}\, \mathbf{y}$, and
$\Sigma = K(X_*, X_*) - K(X_*, X)\, [K(X, X) + \sigma^2 I]^{-1}\, K(X, X_*).$
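The noisy case only changes the matrix being inverted (a sketch, in the same style as the noise-free version above):

```python
import numpy as np

def gp_predict(K, X, y, X_star, sigma2):
    """GP predictive distribution under additive noise of variance sigma2."""
    Ky = K(X, X) + sigma2 * np.eye(len(X))        # K(X, X) + sigma^2 I
    K_sx = K(X_star, X)
    mu = K_sx @ np.linalg.solve(Ky, y)
    Sigma = K(X_star, X_star) - K_sx @ np.linalg.solve(Ky, K_sx.T)
    return mu, Sigma
```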
Model Selection: Hyperparameters
e.g. the ARD covariance function:
$k(x^{(p)}, x^{(q)}) = \exp\left(-\tfrac{1}{2\theta^2}\,(x^{(p)} - x^{(q)})^2\right).$
How best to choose $\theta$?
[Figures: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; axes: input x vs. output f(x).]
Model Selection: Optimizing Marginal Likelihood (1)
In the absence of a strong prior $P(\theta)$, the posterior for the hyperparameter $\theta$ is proportional to the marginal likelihood:
$P(\theta \mid X, \mathbf{y}) \propto P(\mathbf{y} \mid X, \theta).$
Choose $\theta$ to optimize the marginal log-likelihood:
$\log P(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\log\left|K(X, X) + \sigma^2 I\right| - \tfrac{1}{2}\,\mathbf{y}^{\top}\left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y} - \tfrac{n}{2}\log 2\pi.$
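A direct transcription of this objective (a sketch for 1-D inputs with the ARD-style kernel from the previous slide; the Cholesky factorization is the standard stable way to get both the determinant and the solve):

```python
import numpy as np

def log_marginal_likelihood(theta, x, y, sigma2):
    # ARD-style squared exponential kernel with length-scale theta (1-D inputs).
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / theta**2)
    Ky = K + sigma2 * np.eye(len(x))
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-np.sum(np.log(np.diag(L)))                  # -0.5 log |Ky|
            - 0.5 * y @ alpha                            # -0.5 y^T Ky^{-1} y
            - 0.5 * len(x) * np.log(2 * np.pi))          # -n/2 log 2 pi
```

$\theta_{\mathrm{ML}}$ can then be found by maximizing this quantity, e.g. over a grid of $\theta$ values or by minimizing its negative with a generic optimizer.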
Model Selection: Optimizing Marginal Likelihood (2)
$\theta_{\mathrm{ML}} = 0.3255$:
[Figure: samples from the posterior, θ = 0.3255; axes: input x vs. output f(x).]
Using $\theta_{\mathrm{ML}}$ is an approximation to the fully Bayesian approach of integrating over all values of $\theta$, weighted by their posterior.
References
1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/
2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.
3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/