Course in Bayesian Optimization
Javier González (most of today's slides by Neil Lawrence)
University of Sheffield, Sheffield, UK
27th October 2015
Many thanks to
I Mauricio Álvarez.
I Cristian Guarnizo.
I Neil Lawrence, University of Sheffield.
I Zhenwen Dai, University of Sheffield.
I Machine Learning group, University of Sheffield.
I Philipp Hennig, Max Planck Institute.
I Michael Osborne, University of Oxford.
Outline of the Course
I Lecture 1: Uncertainty and Gaussian Processes.
I Lecture 2: Introduction to Bayesian (probabilistic) optimization.
I Lecture 3: Advanced topics in Bayesian Optimization.
Outline of the Course
I Lab 1: Introduction to GPy.
I Lab 2: Introduction to GPyOpt.
I Lab 3: Advanced GPyOpt.
I Day 4: Projects + presentations.
Points of the day
I What is machine learning?
I What is uncertainty? What types are there?
I How does uncertainty play a role in the learning process?
I Gaussian processes as models to handle uncertainty.
What does it mean to learn?
The human learning process?
Cancerous Tumor?
What is Machine Learning?

data + model = prediction

I data: observations; could be actively or passively acquired (meta-data).
I model: assumptions, based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.
I prediction: an action to be taken, a categorization, or a quality score.
Historical Perspective
I A data driven approach to Artificial Intelligence.
I Inspired by attempts to model the brain (the connectionists).
I A community that transcended traditional boundaries (psychology, statistical physics, signal processing).
I Led to an approach that dominates in the modern data-rich world.
Two Dominant Approaches
I Machine Learning as Optimization:
I Formulate your learning problem as an optimization problem.
I Typically intractable, so minimize a relaxed version of the cost function.
I Prove characteristics of the resulting solution.
I Machine Learning as Probabilistic Modelling:
I Formulate your learning problem as a probabilistic model.
I Relate variables through probability distributions.
I If Bayesian, treat parameters with probability distributions.
I Required integrals often intractable: use approximations (MCMC, variational, etc.).
Modelling Assumptions
I Modelling assumptions are included either as:
I a regularizer (optimization) or
I in the probability distribution (probabilistic approach).
I Typical assumptions: sparsity, smoothness.
Applications of Machine Learning
Handwriting Recognition: recognising handwritten characters. For example LeNet, http://bit.ly/d26fwK.
Friend Identification: suggesting friends on social networks, https://www.facebook.com/help/501283333222485.
Ranking: learning the relative skills of online game players, e.g. the TrueSkill system, http://research.microsoft.com/en-us/projects/trueskill/, and http://www.netflixprize.com/.
Internet Search: for example ad click-through rate prediction, http://bit.ly/a7XLH4.
News Personalisation: for example Zite, http://www.zite.com/.
Game Play Learning: for example, learning to play Go, http://bit.ly/cV77zM.
Learning is Optimization
y = mx + c

(Figure sequence: data points in the (x, y) plane, with x and y ranging over [0, 5], together with a line y = mx + c; successive frames annotate the slope m and the intercept c as the line is adjusted to the data.)
y = mx + c

point 1: x = 1, y = 3 ⇒ 3 = m + c
point 2: x = 3, y = 1 ⇒ 1 = 3m + c
point 3: x = 2, y = 2.5 ⇒ 2.5 = 2m + c
y = mx + c + \epsilon

point 1: x = 1, y = 3 ⇒ 3 = m + c + \epsilon_1
point 2: x = 3, y = 1 ⇒ 1 = 3m + c + \epsilon_2
point 3: x = 2, y = 2.5 ⇒ 2.5 = 2m + c + \epsilon_3
Regression Revisited
I We introduce an error function of the form

E(m, c) = \sum_{i=1}^{n} (y_i - m x_i - c)^2

I Minimize the error function with respect to m and c.
Mathematical Interpretation
I What is the mathematical interpretation?
I There is a cost function.
I It expresses mismatch between your prediction and reality.

E(m, c) = \sum_{i=1}^{n} (y_i - m x_i - c)^2

I This is known as the sum of squares error.
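As a concrete illustration (a minimal NumPy sketch, not from the slides), the sum of squares error can be evaluated directly; the three data points are the ones from the worked example above:

```python
import numpy as np

# The three observations from the worked example: (1, 3), (3, 1), (2, 2.5).
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

def sum_of_squares_error(m, c):
    """E(m, c) = sum_i (y_i - m*x_i - c)^2."""
    residuals = y - m * x - c
    return float(np.sum(residuals ** 2))

# A line close to the data has a much smaller error than an arbitrary one.
print(sum_of_squares_error(0.0, 0.0))   # poor fit
print(sum_of_squares_error(-1.0, 4.0))  # near the least squares solution
```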
Learning is Optimization
I Learning is minimization of the cost function.
I At the minima the gradient is zero.
I Coordinate descent: find the gradient in each coordinate and set it to zero.

For m:

\frac{dE(m)}{dm} = -2 \sum_{i=1}^{n} x_i (y_i - m x_i - c)

Setting the gradient to zero,

0 = -2 \sum_{i=1}^{n} x_i y_i + 2 \sum_{i=1}^{n} m x_i^2 + 2 \sum_{i=1}^{n} c x_i

so

m = \frac{\sum_{i=1}^{n} (y_i - c) x_i}{\sum_{i=1}^{n} x_i^2}

For c:

\frac{dE(c)}{dc} = -2 \sum_{i=1}^{n} (y_i - m x_i - c)

Setting the gradient to zero,

0 = -2 \sum_{i=1}^{n} y_i + 2 \sum_{i=1}^{n} m x_i + 2 n c

so

c = \frac{\sum_{i=1}^{n} (y_i - m x_i)}{n}
Fixed Point Updates
Worked example:

c^* = \frac{\sum_{i=1}^{n} (y_i - m^* x_i)}{n}, \quad
m^* = \frac{\sum_{i=1}^{n} x_i (y_i - c^*)}{\sum_{i=1}^{n} x_i^2}, \quad
\sigma^{2*} = \frac{\sum_{i=1}^{n} (y_i - m^* x_i - c^*)^2}{n}
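The fixed point updates can be iterated directly; a minimal sketch (my own, not the lecture's code) on the three points from the earlier example:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Alternate the fixed point updates from an arbitrary starting guess.
m, c = 0.0, 0.0
for _ in range(100):
    c = np.mean(y - m * x)                    # c* given the current m
    m = np.sum(x * (y - c)) / np.sum(x ** 2)  # m* given the current c

sigma2 = np.mean((y - m * x - c) ** 2)        # noise variance at the optimum
print(m, c, sigma2)  # converges to the least squares solution
```

Each update is exact in its own coordinate, so the pair (m, c) zig-zags towards the joint minimum, which is what the contour plots on the next slides show.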
Coordinate Descent

(Figure sequence: contour plots of E(m, c) in the (m, c) plane showing the zig-zag path of the coordinate descent updates across iterations 1 to 30; later frames zoom in around the minimum as the estimates converge.)
Important Concepts Not Covered
I Optimization methods.
I Second order methods: conjugate gradient, quasi-Newton and Newton.
I Effective heuristics such as momentum, CMA, etc.
I Local vs global solutions (Bayesian optimization!).
Learning is probabilistic modeling
Machine Learning and Probability
I The world is an uncertain place.
Epistemic uncertainty: uncertainty arising through lack of knowledge. (What colour socks is that person wearing?)
Aleatoric uncertainty: uncertainty arising through an underlying stochastic system. (Where will a sheet of paper fall if I drop it?)
Probability: A Framework to Characterise Uncertainty
I We need a framework to characterise the uncertainty.
I In this course we make use of probability theory to characterise uncertainty.
Richard Price
I Welsh philosopher and essay writer.
I Edited Thomas Bayes's essay, which contained the foundations of Bayesian philosophy.

Figure: Richard Price, 1723–1791. (source Wikipedia)
Laplace
I French mathematician and astronomer.
Figure: Pierre-Simon Laplace, 1749–1827. (source Wikipedia)
Probabilistic Interpretation
I Quadratic error functions can be seen as Gaussian noise models [1, 2].
I Imagine we are seeing data given by

y(x_i) = m x_i + c + \epsilon

where \epsilon is Gaussian noise with standard deviation \sigma,

\epsilon \sim N(0, \sigma^2).
Noise Corrupted Mapping
I This implies that

y_i \sim N(m x_i + c, \sigma^2)

I which we also write as

p(y_i | m, c, \sigma^2) = N(y_i \mid m x_i + c, \sigma^2)
Gaussian Likelihood
I If the noise is sampled independently for each data point from the same density, we have

p(y | m, c, \sigma^2) = \prod_{i=1}^{n} N(y_i \mid m x_i + c, \sigma^2)

I This is an i.i.d. assumption about the noise.
I Writing the functional form, we have

p(y | m, c, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - m x_i - c)^2}{2\sigma^2}\right)

I Dropping the constant in front and combining the exponentials,

p(y | m, c, \sigma^2) \propto \exp\left(-\sum_{i=1}^{n} \frac{(y_i - m x_i - c)^2}{2\sigma^2}\right)
Gaussian Log Likelihood
I Taking the logarithm of the likelihood above,

\log p(y | m, c, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - m x_i - c)^2 + \text{const}

I so the negative log likelihood is

-\log p(y | m, c, \sigma^2) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - m x_i - c)^2 + \text{const} = \frac{1}{2\sigma^2} E(m, c) + \text{const}
Probabilistic Interpretation of the Error Function
I The probabilistic interpretation of the error function is the negative log likelihood.
I Minimizing the error function is equivalent to maximizing the log likelihood.
I Maximizing the log likelihood is equivalent to maximizing the likelihood, because log is monotonic.
I Probabilistic interpretation: minimizing the error function is equivalent to maximum likelihood with respect to the parameters.
Sample Based Approximation Implies i.i.d.
I The log likelihood is

L(\theta) = \log P(y | \theta)

I If the likelihood is independent over the individual data points,

P(y | \theta) = \prod_{i=1}^{n} P(y_i | \theta)

I This is equivalent to the assumption that the data is independent and identically distributed, known as i.i.d.
I Now the log likelihood is

L(\theta) = \sum_{i=1}^{n} \log P(y_i | \theta)

I We take the negative log likelihood to recover the sum of squares error.
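A quick numeric check of this equivalence (a sketch on synthetic data, not from the slides): for a fixed \sigma^2, the negative log likelihood differs from the sum of squares error only by a scale and a constant, so the least squares fit also minimizes it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 * x + 0.5 + rng.normal(0.0, 0.3, size=x.shape)

# Least squares fit via the normal equations.
X = np.column_stack([x, np.ones_like(x)])
m_ls, c_ls = np.linalg.lstsq(X, y, rcond=None)[0]

def neg_log_likelihood(m, c, sigma2):
    """-log p(y|m, c, sigma2) for the i.i.d. Gaussian noise model."""
    n = len(y)
    sse = np.sum((y - m * x - c) ** 2)
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + sse / (2.0 * sigma2)

# Perturbing the least squares solution can only increase the NLL,
# since the NLL is a monotonic function of the sum of squares error.
base = neg_log_likelihood(m_ls, c_ls, 0.09)
assert neg_log_likelihood(m_ls + 0.1, c_ls, 0.09) > base
assert neg_log_likelihood(m_ls, c_ls - 0.1, 0.09) > base
```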
Bayesian perspective
Underdetermined System

What about two unknowns and one observation?

y_1 = m x_1 + c

Can compute m given c:

m = \frac{y_1 - c}{x_1}

For example:
c = 1.75 ⇒ m = 1.25
c = −0.777 ⇒ m = 3.78
c = −4.01 ⇒ m = 7.01
c = −0.718 ⇒ m = 3.72
c = 2.45 ⇒ m = 0.545
c = −0.657 ⇒ m = 3.66
c = −3.13 ⇒ m = 6.13
c = −1.47 ⇒ m = 4.47

Assume

c \sim N(0, 4),

and we find a distribution of solutions.

(Figure: each choice of c gives a line through the single observation; sampling c traces out a family of lines.)
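This sampling argument is easy to reproduce (a sketch using x_1 = 1, y_1 = 3, which is consistent with the numbers listed above):

```python
import numpy as np

x1, y1 = 1.0, 3.0   # the single observation

# Prior on the intercept: c ~ N(0, 4), i.e. standard deviation 2.
rng = np.random.default_rng(1)
c_samples = rng.normal(0.0, 2.0, size=1000)

# Each sampled c determines the slope exactly: m = (y1 - c) / x1.
m_samples = (y1 - c_samples) / x1

# Every (m, c) pair solves the single equation exactly, so instead of
# a point estimate we obtain a whole distribution over lines.
assert np.allclose(m_samples * x1 + c_samples, y1)
```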
Bayesian Approach
I The likelihood for the regression example has the form

p(y | w, \sigma^2) = \prod_{i=1}^{n} N(y_i \mid w^\top \phi_i, \sigma^2).

I The suggestion so far was to maximize this likelihood with respect to w.
I This can be done with gradient based optimization of the log likelihood.
I Alternative approach: integration across w.
I Consider the expected value of the likelihood under a range of potential w's.
I This is known as the Bayesian approach.
Note on the Term Bayesian
I We will use Bayes' rule to invert probabilities in the Bayesian approach.
I Bayesian is not named after Bayes' rule (a very common confusion).
I The term Bayesian refers to the treatment of the parameters as stochastic variables.
I This approach was proposed by Laplace and Bayes independently.
I For early statisticians this was very controversial (Fisher et al.).
Bayesian Controversy
I The Bayesian controversy relates to treating epistemic uncertainty as aleatoric uncertainty.
I An analogy:
I Before a football match, the uncertainty about the result is aleatoric.
I If I watch a recorded match without knowing the result, the uncertainty is epistemic.
Simple Bayesian Inference

posterior = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}

I Four components:
1. Prior distribution: represents belief about parameter values before seeing data.
2. Likelihood: gives the relation between parameters and data.
3. Posterior distribution: represents updated belief about parameters after data is observed.
4. Marginal likelihood: represents an assessment of the quality of the model. Can be compared with other models (likelihood/prior combinations). Ratios of marginal likelihoods are known as Bayes factors.
Recap
I We can see the learning process as an optimization problem.
I We can also see the learning process as probabilistic modelling (which also depends on parameters that need to be optimised).
I The Bayesian framework allows us to handle 'epistemic' uncertainty in systems.
I Examples only for linear functions, so far.
Generalizations: what if we want to use a more expressive model (a non-linear function)?
I ML as optimization: regularization in RKHSs (kernel methods).
I Bayesian perspective: Gaussian processes.
Both approaches are related, just as standard regression and Bayesian regression are.
More important for us: Gaussian processes.
Basis Functions: Nonlinear Regression
I Problem with linear regression: x may not be linearly related to y.
I Potential solution: create a feature space. Define \phi(x), where \phi(\cdot) is a nonlinear function of x.
I The model for the target is a linear combination of these nonlinear functions:

f(x) = \sum_{j=1}^{K} w_j \phi_j(x)   (1)
Quadratic Basis
I Basis functions can be global. E.g. the quadratic basis:

[1, x, x^2]

Figure: A quadratic basis: \phi(x) = 1, \phi(x) = x and \phi(x) = x^2, plotted for x \in [-1, 1].
Functions Derived from the Quadratic Basis

f(x) = w_1 + w_2 x + w_3 x^2

Figure: Functions from the quadratic basis with weights (w_1, w_2, w_3) = (0.87466, −0.38835, −2.0058), (−0.35908, 1.2274, −0.32825) and (−1.5638, −0.73577, 1.6861).
Radial Basis Functions
I Or they can be local. E.g. the radial (or Gaussian) basis:

\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{\ell^2}\right)

Figure: Radial basis functions \phi_1(x) = e^{-2(x+1)^2}, \phi_2(x) = e^{-2x^2} and \phi_3(x) = e^{-2(x-1)^2}.
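The radial basis takes a couple of lines to compute; a sketch (the centres −1, 0, 1 with \ell^2 = 1/2 reproduce the three functions in the figure):

```python
import numpy as np

def rbf_basis(x, centres, ell2=0.5):
    """phi_j(x) = exp(-(x - mu_j)^2 / ell^2), one column per centre mu_j."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-((x - mu) ** 2) / ell2)

# Evaluate the slide's three basis functions at their own centres;
# ell2 = 0.5 gives phi_j(x) = exp(-2 (x - mu_j)^2).
centres = [-1.0, 0.0, 1.0]
Phi = rbf_basis(np.array(centres), centres)
# Each basis function equals 1 at its own centre and decays away from it.
```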
Functions Derived from the Radial Basis

f(x) = w_1 e^{-2(x+1)^2} + w_2 e^{-2x^2} + w_3 e^{-2(x-1)^2}

Figure: Functions from the radial basis with weights (w_1, w_2, w_3) = (−0.47518, −0.18924, −1.8183), (0.50596, −0.046315, 0.26813) and (0.07179, 1.3591, 0.50604).
Probabilistic Model with Basis Functions
I Define a general function:

f(x_i) = w^\top \phi(x_i)

I Corrupt it with independent noise:

y(x_i) = f(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)

I This implies the following likelihood:

p(y | w, \sigma^2) = \prod_{i=1}^{n} N(y_i \mid w^\top \phi(x_i), \sigma^2)
Multivariate Regression Likelihood
I Noise corrupted data point:

y_i = w^\top x_{i,:} + \epsilon_i

I Multivariate regression likelihood:

p(y | X, w) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_{i,:})^2\right)

I Now use a multivariate Gaussian prior:

p(w) = \frac{1}{(2\pi\alpha)^{p/2}} \exp\left(-\frac{1}{2\alpha} w^\top w\right)
Posterior Density
I Once again we want to know the posterior:

p(w | y, X) \propto p(y | X, w) \, p(w)

I And we can compute it by completing the square:

\log p(w | y, X) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} y_i^2 + \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i x_{i,:}^\top w - \frac{1}{2\sigma^2} \sum_{i=1}^{n} w^\top x_{i,:} x_{i,:}^\top w - \frac{1}{2\alpha} w^\top w + \text{const.}
Computing the Posterior
I By inspection we extract the inverse covariance
log p(w|y,X) = − 12σ2
n∑i=1
y2i +1σ2
n∑i=1
yix>i,:w
− 12σ2
n∑i=1
w>xi,:x>i,:w −1
2αw>w + const.
I Completing the square allows us to compute the mean.
Computing the Posterior

I By inspection we extract the inverse covariance

    log p(w|y, X) = −(1/(2σ²)) y⊤y + (1/σ²) y⊤Xw − (1/(2σ²)) w⊤X⊤Xw − (1/(2α)) w⊤w + const.

I Completing the square allows us to compute the mean.
Computing the Posterior

I By inspection we extract the inverse covariance

    log p(w|y, X) = −(1/(2σ²)) y⊤y + (1/σ²) y⊤Xw − (1/2) w⊤[σ⁻²X⊤X + α⁻¹I]w + const.

I Completing the square allows us to compute the mean.
Making Predictions

I Giving a Gaussian density

    p(w|y, X) = N(w|μ_w, C_w)
    C_w = [σ⁻²X⊤X + α⁻¹I]⁻¹
    μ_w = C_w σ⁻²X⊤y

I Posterior is combined with 'test data' likelihood to make future predictions:

    p(y*|x*, X, y) = ∫ p(y*|x*, w) p(w|X, y) dw
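The posterior mean and covariance above can be sketched in a few lines of numpy; the data, noise variance, and prior variance below are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Minimal sketch of the Bayesian linear-regression posterior
# C_w = [sigma^{-2} X^T X + alpha^{-1} I]^{-1}, mu_w = C_w sigma^{-2} X^T y
# on synthetic data (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 20, 2
sigma2, alpha = 0.01, 1.0

X = rng.standard_normal((n, p))
w_true = np.array([0.5, -1.0])            # assumed "true" weights
y = X @ w_true + np.sqrt(sigma2) * rng.standard_normal(n)

C_w = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)  # posterior covariance
mu_w = C_w @ (X.T @ y) / sigma2                            # posterior mean

print(mu_w)  # close to w_true when the noise is small
```

With small noise and a weak prior the posterior mean is close to the maximum likelihood solution, as the next slide notes.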
Bayesian vs Maximum Likelihood

I Note the similarity between posterior mean

    μ_w = (σ⁻²X⊤X + α⁻¹I)⁻¹ σ⁻²X⊤y

I and maximum likelihood solution

    ŵ = (X⊤X)⁻¹X⊤y
Marginal Likelihood

I In some sense the real model is now the marginal likelihood.
I Marginalization of w follows the sum rule:

    p(y|X) = ∫ p(y|X, w) p(w) dw

giving

    p(y|X) = N(y|0, αXX⊤ + σ²I)

I Often the integral is intractable.
I Leads to variational approximations, MCMC (Michael Betancourt, Mark Girolami), Laplace approximation (Håvard Rue).
I For the case of Gaussians it's trivial!!
Marginal Likelihood

I Can compute the marginal likelihood as:

    p(y|X, α, σ) = N(y|0, αXX⊤ + σ²I)

I Or if we use a basis set we have

    p(y|X, α, σ) = N(y|0, αΦΦ⊤ + σ²I)

I This Gaussian is no longer i.i.d. across data, and this is where things get interesting.
Marginal Likelihood

I The marginal likelihood can also be computed; it has the form:

    p(y|X, σ², α) = (2π)^(−n/2) |K|^(−1/2) exp( −(1/2) y⊤K⁻¹y )

where K = αΦΦ⊤ + σ²I.

I So it is a zero mean n-dimensional Gaussian with covariance matrix K.
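Evaluating this marginal likelihood is direct once K is formed; the sketch below uses a synthetic design matrix Φ and illustrative hyperparameters.

```python
import numpy as np

# Sketch: evaluate log N(y|0, K) with K = alpha * Phi Phi^T + sigma^2 I.
# Phi, alpha, sigma2 and the data are synthetic placeholders.
rng = np.random.default_rng(1)
n, m = 10, 3
alpha, sigma2 = 1.0, 0.01

Phi = rng.standard_normal((n, m))
K = alpha * Phi @ Phi.T + sigma2 * np.eye(n)
y = rng.multivariate_normal(np.zeros(n), K)   # a sample from the model

sign, logdet = np.linalg.slogdet(K)           # stable log-determinant
log_lik = -0.5 * logdet - 0.5 * y @ np.linalg.solve(K, y) - 0.5 * n * np.log(2 * np.pi)
print(log_lik)
```

`slogdet` plus `solve` avoids forming K⁻¹ explicitly, which matters for numerical stability when K is nearly singular.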
Sampling a Function

Multi-variate Gaussians

I We will consider a Gaussian with a particular structure of covariance matrix.
I Generate a single sample from this 25 dimensional Gaussian distribution, f = [f₁, f₂, ..., f₂₅].
I We will plot these points against their index.
Gaussian Distribution Sample

(a) A 25 dimensional correlated random variable (values plotted against index). (b) Colormap showing correlations between dimensions.

Figure: A sample from a 25 dimensional Gaussian distribution.
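A sample like the one in the figure can be drawn with a short numpy sketch; the covariance structure and length scale below are assumptions for illustration.

```python
import numpy as np

# Sketch: build an exponentiated-quadratic covariance over 25 indexed
# points and draw one correlated sample via a Cholesky factor.
rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 25)
alpha, ell = 1.0, 0.3                     # assumed hyperparameters

K = alpha * np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * ell**2))
K += 1e-6 * np.eye(25)                    # jitter for numerical stability
f = np.linalg.cholesky(K) @ rng.standard_normal(25)
print(f.shape)  # (25,)
```

Plotting f against its index reproduces a smooth correlated sample of the kind shown above.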
Computing the Expected Output

I Given the posterior for the parameters, how can we compute the expected output at a given location?
I Output of model at location xᵢ is given by

    f(xᵢ; w) = φᵢ⊤w

I We want the expected output under the posterior density, p(w|y, X, σ², α).
I Mean of mapping function will be given by

    ⟨f(xᵢ; w)⟩_p(w|y,X,σ²,α) = φᵢ⊤ ⟨w⟩_p(w|y,X,σ²,α) = φᵢ⊤ μ_w
Variance of Expected Output

I Variance of model at location xᵢ is given by

    var(f(xᵢ; w)) = ⟨(f(xᵢ; w))²⟩ − ⟨f(xᵢ; w)⟩²
                  = φᵢ⊤⟨ww⊤⟩φᵢ − φᵢ⊤⟨w⟩⟨w⟩⊤φᵢ
                  = φᵢ⊤ C_w φᵢ

where all these expectations are taken under the posterior density, p(w|y, X, σ², α).
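The predictive mean φᵢ⊤μ_w and variance φᵢ⊤C_wφᵢ are cheap once the posterior is computed; this sketch uses a synthetic design matrix and assumed hyperparameters.

```python
import numpy as np

# Sketch: expected output and its variance at a test input under the
# posterior over w (all data and hyperparameters are assumed values).
rng = np.random.default_rng(7)
n, m = 30, 3
sigma2, alpha = 0.01, 1.0
Phi = rng.standard_normal((n, m))
y = Phi @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.standard_normal(n)

C_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(m) / alpha)
mu_w = C_w @ Phi.T @ y / sigma2

phi_i = rng.standard_normal(m)   # basis vector at a hypothetical test input
mean_i = phi_i @ mu_w            # <f(x_i; w)> = phi_i^T mu_w
var_i = phi_i @ C_w @ phi_i      # var(f(x_i; w)) = phi_i^T C_w phi_i
print(mean_i, var_i)
```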
Book
Olympic Marathon Data

I Gold medal times for Olympic Marathon since 1896.
I Marathons before 1924 didn't have a standardised distance.
I Present results using pace per km.
I In 1904 Marathon was badly organised leading to very slow times.

Image from Wikimedia Commons http://bit.ly/16kMKHQ
Olympic Marathon Data

Figure: Olympic Marathon Data (y, pace min/km against x, year).
Olympics Data analysis

I Use Bayesian approach on Olympics data with polynomials.
I Choose a prior w ∼ N(0, αI) with α = 1.
I Choose noise variance σ² = 0.01.
Sampling the Prior

I Always useful to perform a 'sanity check' and sample from the prior before observing the data.
I Since y = Φw + ε we just need to sample

    w ∼ N(0, αI)
    ε ∼ N(0, σ²I)

with α = 1 and σ² = 0.01.
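The prior sanity check above takes only a few lines; the RBF basis centres and length scale in this sketch are assumptions, not values from the lecture.

```python
import numpy as np

# Sketch: sample y = Phi w + eps from the prior, with w ~ N(0, alpha I)
# and eps ~ N(0, sigma^2 I). The basis centres are hypothetical.
rng = np.random.default_rng(3)
alpha, sigma2 = 1.0, 0.01
x = np.linspace(-2, 2, 50)
centres = np.array([-1.0, 0.0, 1.0])       # assumed RBF centres

Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2))  # ell = 1 assumed
w = rng.normal(0.0, np.sqrt(alpha), size=3)
y = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=50)
print(y.shape)
```

Repeating the draw shows the spread of functions the prior considers plausible before any data arrive.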
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 0, training error -1.8774, validation error -0.13132, σ² = 0.302, σ = 0.549.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 1, training error -15.325, validation error 2.5863, σ² = 0.0733, σ = 0.271.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 2, training error -17.579, validation error -8.4831, σ² = 0.0578, σ = 0.240.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 3, training error -18.064, validation error 11.27, σ² = 0.0549, σ = 0.234.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 4, training error -18.245, validation error 232.92, σ² = 0.0539, σ = 0.232.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 5, training error -20.471, validation error 9898.1, σ² = 0.0426, σ = 0.207.
Recall: Validation Set for Maximum Likelihood

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 6, training error -22.881, validation error 67775, σ² = 0.0331, σ = 0.182.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 0, training error 29.757, validation error -0.29243, σ² = 0.302, σ = 0.550.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 1, training error 14.942, validation error 4.4027, σ² = 0.0762, σ = 0.276.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 2, training error 9.7206, validation error -8.6623, σ² = 0.0580, σ = 0.241.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 3, training error 10.416, validation error -6.4726, σ² = 0.0555, σ = 0.236.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 4, training error 11.34, validation error -8.431, σ² = 0.0555, σ = 0.236.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 5, training error 11.986, validation error -10.483, σ² = 0.0551, σ = 0.235.
Validation Set

Figure: Left: fit to data; Right: model error against polynomial order. Polynomial order 6, training error 12.369, validation error -3.3823, σ² = 0.0537, σ = 0.232.
Regularized Mean

I Validation fit here based on mean solution for w only.
I For Bayesian solution

    μ_w = [σ⁻²Φ⊤Φ + α⁻¹I]⁻¹ σ⁻²Φ⊤y

instead of

    w* = [Φ⊤Φ]⁻¹Φ⊤y

I The two are equivalent when α → ∞.
I Equivalent to a prior for w with infinite variance.
I In other cases α⁻¹I regularizes the system (keeps parameters smaller).
Sampling the Posterior

I Now check samples by extracting w from the posterior.
I Now for y = Φw + ε we need

    w ∼ N(μ_w, C_w)    with C_w = [σ⁻²Φ⊤Φ + α⁻¹I]⁻¹ and μ_w = C_w σ⁻²Φ⊤y
    ε ∼ N(0, σ²I)

with α = 1 and σ² = 0.01.
Direct Construction of Covariance Matrix

Use matrix notation to write function,

    f(xᵢ; w) = ∑ₖ₌₁ᵐ wₖ φₖ(xᵢ)

computed at training data gives a vector

    f = Φw.

    w ∼ N(0, αI)

w and f are only related by an inner product.

    Φ ∈ ℝ^(n×m)
Expectations

I We have ⟨f⟩ = Φ⟨w⟩.
I Prior mean of w was zero giving

    ⟨f⟩ = 0.

I Prior covariance of f is

    K = ⟨ff⊤⟩ − ⟨f⟩⟨f⟩⊤

    ⟨ff⊤⟩ = Φ⟨ww⊤⟩Φ⊤,

giving

    K = αΦΦ⊤.

We use ⟨·⟩ to denote expectations under prior distributions.
Covariance between Two Points

I The prior covariance between two points xᵢ and xⱼ is

    k(xᵢ, xⱼ) = α φ(xᵢ)⊤ φ(xⱼ),

or in sum notation

    k(xᵢ, xⱼ) = α ∑ₖ₌₁ᵐ φₖ(xᵢ) φₖ(xⱼ)

I For the radial basis used this gives

    k(xᵢ, xⱼ) = α ∑ₖ₌₁ᵐ exp( −(|xᵢ − μₖ|² + |xⱼ − μₖ|²) / (2ℓ²) ).
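The identity k(xᵢ, xⱼ) = α φ(xᵢ)⊤φ(xⱼ) means all pairwise covariances are one matrix product, K = αΦΦ⊤; this sketch checks it numerically with assumed basis centres and length scale.

```python
import numpy as np

# Sketch: basis-function covariance evaluated pairwise equals alpha * Phi Phi^T.
# Inputs, centres and hyperparameters are illustrative assumptions.
x = np.linspace(-2, 2, 8)
centres = np.array([-1.0, 0.0, 1.0])
alpha, ell = 1.0, 1.0

Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * ell**2))  # phi_k(x_i)
K = alpha * Phi @ Phi.T        # all k(x_i, x_j) in one product
print(K.shape)  # (8, 8)
```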
Covariance Functions and Mercer Kernels

I Mercer kernels and covariance functions are similar.
I The kernel perspective does not make a probabilistic interpretation of the covariance function.
I Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.
More on Mercer Kernels

Let X be a metric space and K : X × X → ℝ a continuous and symmetric function. If we assume that K is positive definite, that is, for any set x = {x₁, ..., xₙ} ⊂ X the n × n matrix K with components

    Kᵢⱼ = K(xᵢ, xⱼ)

is positive semi-definite, then K is a Mercer kernel.
More on Mercer Kernels

Mercer's Theorem (1909): Let K : X × X → ℝ be a Mercer kernel. Let λⱼ be the j-th eigenvalue of L_K and {φⱼ}ⱼ≥₁ the corresponding eigenfunctions. Then, for all x, x′ ∈ X

    K(x, x′) = ∑ⱼ₌₁^∞ λⱼ φⱼ(x) φⱼ(x′)

where the convergence is absolute (for each (x, x′) ∈ X × X) and uniform (on X × X).

By using a kernel directly we are using a basis of functions implicitly (possibly with infinitely many elements: Bayesian non-parametrics).
Prediction with Correlated Gaussians

I Prediction of f* from f requires multivariate conditional density.
I Multivariate conditional density is also Gaussian:

    p(f*|f) = N(f*|μ, Σ)
    μ = K₍*,f₎ K₍f,f₎⁻¹ f
    Σ = K₍*,*₎ − K₍*,f₎ K₍f,f₎⁻¹ K₍f,*₎

I Here the covariance of the joint density is given by

    K = [ K₍f,f₎  K₍f,*₎
          K₍*,f₎  K₍*,*₎ ]
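The conditional mean and covariance above are the core GP prediction equations; this sketch implements them for one test point, with an exponentiated-quadratic covariance and illustrative inputs and observations.

```python
import numpy as np

# Sketch of the Gaussian conditional: mu = K_{*,f} K_{f,f}^{-1} f,
# Sigma = K_{*,*} - K_{*,f} K_{f,f}^{-1} K_{f,*}.
# Inputs, observations and hyperparameters are assumptions.
def k(a, b, alpha=1.0, ell=0.5):
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell**2))

x = np.array([-1.0, 0.0, 1.0])           # training inputs
xs = np.array([0.5])                     # test input
f = np.array([0.3, -0.1, 0.6])           # observed function values

Kff = k(x, x) + 1e-8 * np.eye(3)         # jitter for stability
Ksf = k(xs, x)
mu = Ksf @ np.linalg.solve(Kff, f)
Sigma = k(xs, xs) - Ksf @ np.linalg.solve(Kff, Ksf.T)
print(mu, Sigma)
```

Note the posterior variance never exceeds the prior variance α, since the data can only reduce uncertainty.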
Constructing Covariance Functions
I Sum of two covariances is also a covariance function.
k(x, x′) = k1(x, x′) + k2(x, x′)
Constructing Covariance Functions
I Product of two covariances is also a covariance function.
k(x, x′) = k1(x, x′)k2(x, x′)
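Both closure properties can be checked numerically: sums of covariance matrices and elementwise (Schur) products of covariance matrices stay positive semi-definite. The two example kernels below are assumptions chosen for illustration.

```python
import numpy as np

# Numerical sanity check: sum and elementwise product of valid covariance
# matrices remain positive semi-definite (up to floating-point error).
x = np.linspace(-1, 1, 10)
K1 = np.exp(-((x[:, None] - x[None, :]) ** 2))   # exponentiated quadratic
K2 = np.outer(x, x) + 1.0                         # linear kernel plus bias

K_sum = K1 + K2
K_prod = K1 * K2                                  # Schur (elementwise) product
print(np.min(np.linalg.eigvalsh(K_sum)), np.min(np.linalg.eigvalsh(K_prod)))
```

The product case is the Schur product theorem; it underlies kernels built by multiplying simpler ones.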
Multiply by Deterministic Function

I If f(x) is a Gaussian process.
I g(x) is a deterministic function.
I h(x) = f(x) g(x)
I Then

    k_h(x, x′) = g(x) k_f(x, x′) g(x′)

where k_h is the covariance for h(·) and k_f is the covariance for f(·).
Covariance Functions

MLP Covariance Function

    k(x, x′) = α asin( (w x⊤x′ + b) / (√(w x⊤x + b + 1) √(w x′⊤x′ + b + 1)) )

I Based on infinite neural network model.
I w = 40, b = 4.
Covariance Functions

Linear Covariance Function

    k(x, x′) = α x⊤x′

I Bayesian linear regression.
I α = 1.
Covariance Functions
Where did this covariance matrix come from?

Ornstein-Uhlenbeck (stationary Gauss-Markov) covariance function

    k(x, x′) = α exp( −|x − x′| / (2ℓ²) )

I In one dimension arises from a stochastic differential equation: Brownian motion in a parabolic tube.
I In higher dimension a Fourier filter of the form 1/(π(1 + x²)).
Covariance Functions
Where did this covariance matrix come from?

Markov Process

    k(t, t′) = α min(t, t′)

I Covariance matrix is built using the inputs to the function, t.
Covariance Functions
Where did this covariance matrix come from?

Matérn 5/2 Covariance Function

    k(x, x′) = α (1 + √5 r + (5/3) r²) exp(−√5 r),    where r = ‖x − x′‖₂ / ℓ

I Matérn 5/2 is a twice differentiable covariance.
I Matérn family constructed with Student-t filters in Fourier space.
Covariance Functions

RBF Basis Functions

    k(x, x′) = α φ(x)⊤φ(x′)
    φₖ(x) = exp( −‖x − μₖ‖₂² / ℓ² )
    μ = {−1, 0, 1}
Covariance Functions
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

    k(x, x′) = α exp( −‖x − x′‖₂² / (2ℓ²) )

I Covariance matrix is built using the inputs to the function, x.
I For the example above it was based on Euclidean distance.
I The covariance function is also known as a kernel.
Gaussian Process Interpolation

Figure: Real example: BACCO (see e.g. [3]). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).
Gaussian Noise

I Gaussian noise model,

    p(yᵢ|fᵢ) = N(yᵢ|fᵢ, σ²)

where σ² is the variance of the noise.

I Equivalent to a covariance function of the form

    k(xᵢ, xⱼ) = δᵢ,ⱼ σ²

where δᵢ,ⱼ is the Kronecker delta function.

I Additive nature of Gaussians means we can simply add this term to existing covariance matrices.
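In code, adding Gaussian observation noise is literally adding σ²I (the Kronecker-delta covariance) to the noise-free covariance; the kernel and noise level here are assumed for illustration.

```python
import numpy as np

# Sketch: Gaussian noise as a diagonal addition to the covariance.
x = np.linspace(0, 1, 5)
Kf = np.exp(-((x[:, None] - x[None, :]) ** 2))   # noise-free covariance
sigma2 = 0.1                                      # assumed noise variance
Ky = Kf + sigma2 * np.eye(5)                      # covariance of noisy y
print(np.diag(Ky) - np.diag(Kf))                  # each diagonal entry grows by sigma2
```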
Gaussian Process Regression

Figure: Examples include WiFi localization, C14 calibration curve.
I Learning with Gaussian processes allows us to characterize the uncertainty of the problem.
I We can choose a basis of functions or directly select a kernel (equivalent, but better to choose the kernel: non-parametric models).
I Given a covariance (prior), how do we select the right parameters?
I Back to optimization...
Learning Covariance Parameters
Can we determine covariance parameters from the data?

    N(y|0, K) = (2π)^(−n/2) |K|^(−1/2) exp( −y⊤K⁻¹y / 2 )

The parameters are inside the covariance function (matrix).

    kᵢ,ⱼ = k(xᵢ, xⱼ; θ)
Learning Covariance Parameters
Can we determine covariance parameters from the data?

    log N(y|0, K) = −(1/2) log|K| − y⊤K⁻¹y / 2 − (n/2) log 2π

The parameters are inside the covariance function (matrix).

    kᵢ,ⱼ = k(xᵢ, xⱼ; θ)
Learning Covariance Parameters
Can we determine covariance parameters from the data?

    E(θ) = (1/2) log|K| + y⊤K⁻¹y / 2

The parameters are inside the covariance function (matrix).

    kᵢ,ⱼ = k(xᵢ, xⱼ; θ)
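The objective E(θ) can be evaluated directly and minimised, here by a simple grid over length scales; the data and fixed hyperparameters are illustrative assumptions.

```python
import numpy as np

# Sketch: evaluate E(theta) = (1/2) log|K| + y^T K^{-1} y / 2 over a grid
# of length scales for synthetic data; the grid minimiser is the fit.
rng = np.random.default_rng(5)
x = np.linspace(-2, 2, 30)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(30)   # assumed data

def E(ell, sigma2=0.01, alpha=1.0):
    K = alpha * np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * ell**2))
    K += sigma2 * np.eye(len(x))                    # noise on the diagonal
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * logdet + 0.5 * y @ np.linalg.solve(K, y)

ells = np.logspace(-1, 1, 25)
best = ells[np.argmin([E(l) for l in ells])]
print(best)
```

In practice gradient-based optimisers are used instead of a grid, but the objective is the same.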
Learning Covariance Parameters
Can we determine length scales and noise levels from the data?

Figure: Left: y(x) against x. Right: E(θ) against length scale ℓ.

    E(θ) = (1/2) log|K| + y⊤K⁻¹y / 2
Limitations of Gaussian Processes

I Inference is O(n³) due to the matrix inverse (in practice use Cholesky).
I Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).
I The widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!!).
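The Cholesky route mentioned above replaces the explicit inverse with a factorisation K = LL⊤ and two triangular solves; the kernel, jitter, and data in this sketch are assumptions.

```python
import numpy as np

# Sketch: compute K^{-1} y via Cholesky factorisation (the O(n^3) step)
# rather than an explicit matrix inverse; data are synthetic.
rng = np.random.default_rng(6)
x = np.linspace(0, 1, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1) + 1e-6 * np.eye(50)  # jitter
y = rng.standard_normal(50)

L = np.linalg.cholesky(K)                                  # K = L L^T
alpha_vec = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
print(alpha_vec.shape)
```

Besides stability, the factor L is reused for the log-determinant (2·∑ log Lᵢᵢ) and for drawing samples, so one factorisation serves the whole GP computation.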
Conclusions

I Machine learning has focussed on prediction.
I Two main approaches: optimize an objective, or model probabilistically.
I Both approaches require optimizing parameters.
I Gaussian processes: fundamental models to deal with uncertainty in complex scenarios.
Tomorrow

I Global optimization.
I Parameter tuning in Machine Learning as a global optimization problem.
I Can we automate the parameter choice of Machine Learning algorithms?
I Yes! Bayesian Optimization.
References I

[1] Carl Friedrich Gauss. Theoria motus corporum coelestium. Perthes et Besser, Hamburg.

[2] Pierre Simon Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, présentés à l'Académie Royale des Sciences, par divers savans, & lus dans ses assemblées 6, pages 621–656, 1774. Translated in [5].

[3] Jeremy Oakley and Anthony O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

[4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[5] Stephen M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.
Outline: Introduction · Learning as optimization · Multivariate Bayesian Linear Regression · Mercer Kernels · Two Point Marginals · Constructing Covariance · GP Interpolation · GP Regression · Parameter Optimization · GP Limitations