ADVANCED MACHINE LEARNING
Non-linear regression techniques
Part - II
Regression Algorithms in this Course
Support vector regression (Support Vector Machine)
Relevance vector regression (Relevance Vector Machine)
Boosting – random projections / random Gaussians (gradient boosting)
Random forest
Gaussian process regression (Gaussian Process)
Locally weighted projected regression (not covered – replaced by one hour to answer questions about the mini-project)
Regression Algorithms in this Course
Random forest, Gaussian Process, Gaussian process regression
Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables y and x by building a linear model of the form:

$$ y = f(x) = w^T x = \sum_{i=1}^{N} w_i x_i $$

If one assumes that the observed values of y differ from f(x) by an additive noise that follows a zero-mean Gaussian distribution (such an assumption consists of putting a prior distribution over the noise), then:

$$ y = w^T x + \epsilon, \quad \text{with } \epsilon \sim \mathcal{N}(0, \sigma^2) $$

Where have we seen this before?
Probabilistic Regression

Training set of M pairs of data points: $X = \left\{ x^i, y^i \right\}_{i=1}^{M}$

Likelihood of the regressive model $y = w^T x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, assuming the data points are independently and identically distributed (i.i.d.):

$$ p(\mathbf{y} \mid X, w) = \prod_{i=1}^{M} p\left(y^i \mid x^i, w\right) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left(y^i - w^T x^i\right)^2}{2\sigma^2} \right) $$

Parameters of the model: $w$, $\sigma$.
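As a minimal numerical sketch of the i.i.d. Gaussian likelihood above (the function and variable names here are my own, not from the slides):

```python
import numpy as np

# log p(y | X, w) = sum_i [ -0.5 log(2 pi sigma^2) - (y^i - w^T x^i)^2 / (2 sigma^2) ]
def log_likelihood(w, X, y, sigma):
    # X: (N, M) matrix with the M training inputs as columns, y: (M,) targets
    residuals = y - w @ X
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - residuals**2 / (2 * sigma**2))

# Toy data generated by the model itself: the true w should score a higher
# log-likelihood than a perturbed one.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
w_true = np.array([1.0, -0.5])
y = w_true @ X + 0.1 * rng.normal(size=50)
```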
Probabilistic Regression

Training set of M pairs of data points: $X = \left\{ x^i, y^i \right\}_{i=1}^{M}$; likelihood $p(\mathbf{y} \mid X, w)$ of the regressive model $y = w^T x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Prior model on the distribution of the parameter w:

$$ p(w) = \mathcal{N}(0, \Sigma_w) \propto \exp\left( -\frac{1}{2} w^T \Sigma_w^{-1} w \right) $$

Hyperparameters (given by the user): $\Sigma_w$. Parameters of the model: $w$, $\sigma$.
Probabilistic Regression

Estimate the conditional distribution of $w$ given the data using Bayes' rule:

$$ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad p(w \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, w)\, p(w)}{p(\mathbf{y} \mid X)} $$

(the marginal likelihood $p(\mathbf{y} \mid X)$ can be dropped, as it is not a function of $w$)

Prior on $w$: $p(w) = \mathcal{N}(0, \Sigma_w) \propto \exp\left( -\frac{1}{2} w^T \Sigma_w^{-1} w \right)$

The posterior distribution on $w$ is Gaussian:

$$ p(w \mid X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} \left( \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} \right)^{-1} X \mathbf{y},\; \left( \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} \right)^{-1} \right) $$
$$ p(w \mid X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} \left( \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} \right)^{-1} X \mathbf{y},\; \left( \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} \right)^{-1} \right) $$

The posterior distribution on $w$ is Gaussian: the conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia).
Probabilistic Regression
The expectation over the posterior distribution gives the best estimate of $w$:

$$ E\left\{ p(w \mid X, \mathbf{y}) \right\} = \frac{1}{\sigma^2} A^{-1} X \mathbf{y}, \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} $$

This is called the maximum a posteriori (MAP) estimate of $w$ (for a Gaussian posterior, the mean and the mode coincide).
Probabilistic Regression
We can now compute the posterior distribution on y at a query point x:

$$ p(y \mid x, X, \mathbf{y}) = \int p(y \mid x, w)\, p(w \mid X, \mathbf{y})\, dw $$

$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x \right), \qquad \text{with } A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} $$
Probabilistic Regression
$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x \right), \qquad \text{with } A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} $$

($x$: testing point; $X$: training datapoints)

The estimate of $y$ given a test point $x$ is given by:

$$ y^* = E\left\{ p(y \mid x, X, \mathbf{y}) \right\} = \frac{1}{\sigma^2} x^T A^{-1} X \mathbf{y} $$
Probabilistic Regression
$$ E\left\{ p(y \mid x, X, \mathbf{y}) \right\} = \frac{1}{\sigma^2} x^T A^{-1} X \mathbf{y} $$

The variance gives a measure of the uncertainty of the prediction:

$$ \operatorname{var}\left\{ p(y \mid x, X, \mathbf{y}) \right\} = x^T A^{-1} x, \qquad \text{with } A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} $$
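The predictive mean and variance above can be sketched directly in NumPy (a toy illustration under my own naming and data choices, not the course's code):

```python
import numpy as np

# A = (1/sigma^2) X X^T + Sigma_w^{-1}
# E{y|x} = (1/sigma^2) x^T A^{-1} X y,   var{y|x} = x^T A^{-1} x
def predictive(x_star, X, y, sigma, Sigma_w):
    A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
    A_inv = np.linalg.inv(A)
    mean = x_star @ A_inv @ X @ y / sigma**2
    var = x_star @ A_inv @ x_star
    return mean, var

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 100))        # M = 100 training inputs in R^2, as columns
w_true = np.array([0.7, -1.2])
y = w_true @ X + 0.05 * rng.normal(size=100)
mean, var = predictive(np.array([1.0, 1.0]), X, y, 0.05, np.eye(2))
# with 100 points, mean should be close to w_true^T x = -0.5, and var small but positive
```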
MACHINE LEARNING – 2012
Gaussian Process Regression

How to extend the simple linear Bayesian regressive model

$$ y = w^T x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) $$

for nonlinear regression?
Gaussian Process Regression
Distribution over functions
Gaussian Process Regression

To extend the simple linear Bayesian regressive model $y = w^T x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ for nonlinear regression, send $x$ through a non-linear transformation $\phi(x)$. The predictive distribution

$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x \right), \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1} $$

becomes

$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} \phi(x)^T A^{-1} \Phi\, \mathbf{y},\; \phi(x)^T A^{-1} \phi(x) \right), \qquad \text{with } A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1} $$

where $\Phi$ collects the transformed training inputs $\phi(x^i)$.
Gaussian Process Regression
Again, a Gaussian distribution.
$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2} \phi(x)^T A^{-1} \Phi\, \mathbf{y},\; \phi(x)^T A^{-1} \phi(x) \right), \qquad \text{with } A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1} $$
Gaussian Process Regression
Define the kernel as the inner product in feature space:

$$ k(x, x') = \phi(x)^T \Sigma_w\, \phi(x') $$

Rewriting the predictive distribution in terms of the kernel (see supplement for steps):

$$ p(y \mid x, X, \mathbf{y}) = \mathcal{N}\left( k(x, X) \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y},\; k(x, x) - k(x, X) \left[ K(X, X) + \sigma^2 I \right]^{-1} k(X, x) \right) $$
Gaussian Process Regression
With the kernel $k(x, x') = \phi(x)^T \Sigma_w\, \phi(x')$, the predictive mean is a weighted sum of kernel evaluations at the M training points (see supplement for steps):

$$ y^* = E\left\{ y \mid x, X, \mathbf{y} \right\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \qquad \text{with } \alpha = \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y} $$
Gaussian Process Regression
$$ y^* = E\left\{ y \mid x, X, \mathbf{y} \right\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \qquad \text{with } \alpha = \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y} $$

In general $\alpha_i \neq 0$: all datapoints are used in the computation!
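The predictive mean above can be sketched in a few lines (a toy, assumption-laden implementation: the kernel width, names, and data are my own choices, not from the slides):

```python
import numpy as np

# alpha = [K(X,X) + sigma^2 I]^{-1} y,   y*(x) = sum_i alpha_i k(x, x^i)
def rbf(a, b, l=0.5):
    return np.exp(-np.sum((a - b)**2) / l)

def gpr_predict(x_star, X, y, sigma, l=0.5):
    M = len(X)
    K = np.array([[rbf(X[i], X[j], l) for j in range(M)] for i in range(M)])
    alpha = np.linalg.solve(K + sigma**2 * np.eye(M), y)
    k_star = np.array([rbf(x_star, X[i], l) for i in range(M)])
    return k_star @ alpha            # every training point contributes

X = np.linspace(-3, 3, 30)[:, None]  # 30 one-dimensional training inputs
y = np.sin(X[:, 0])
near = gpr_predict(np.array([1.0]), X, y, 0.05)   # tracks sin(1) near the data
far = gpr_predict(np.array([50.0]), X, y, 0.05)   # reverts to the zero mean
```

The prediction at a far-away test point decays to 0, which anticipates the zero-mean prior discussion later in these slides.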
Gaussian Process Regression
$$ y^* = E\left\{ y \mid x, X, \mathbf{y} \right\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \qquad \text{with } \alpha = \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y} $$

The kernel and its hyperparameters are given by the user. These can be optimized through maximum likelihood over the marginal likelihood; see the class supplement.

[Figure: fits with an RBF kernel of width 0.1 vs. width 0.5]
Gaussian Process Regression

Sensitivity to the choice of kernel width (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$$ k(x, x') = e^{-\frac{\left\| x - x' \right\|^2}{l}} $$

[Figure: fit with kernel width $l = 0.1$]
Gaussian Process Regression

[Figure: fit with kernel width $l = 0.5$, same data and kernel, illustrating the sensitivity to the choice of lengthscale]
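To make the lengthscale sensitivity concrete, here is a small sketch (all values and names are my own choices, not the figures from the slides): with a constant target and sparse inputs, a narrow kernel lets the prediction fall back toward the zero-mean prior between training points, while a wide kernel still interpolates.

```python
import numpy as np

# GPR predictive mean with k(x, x') = exp(-||x - x'||^2 / l)
def gpr_mean(x_star, X, y, sigma, l):
    K = np.exp(-(X[:, None] - X[None, :])**2 / l)
    alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)
    return np.exp(-(x_star - X)**2 / l) @ alpha

X = np.arange(0.0, 5.0, 1.0)            # sparse 1-D training inputs
y = np.ones_like(X)                     # constant target y = 1
narrow = gpr_mean(0.5, X, y, 0.1, 0.1)  # dips well below 1 between points
wide = gpr_mean(0.5, X, y, 0.1, 2.0)    # stays close to 1
```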
Gaussian Process Regression
$$ y^* = E\left\{ y \mid x, X, \mathbf{y} \right\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \qquad \text{with } \alpha = \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y} $$

The value of the noise $\sigma$ needs to be pre-set by hand. The larger the noise, the more uncertainty (the noise is $\leq 1$):

$$ \operatorname{cov}\left\{ p(y \mid x, X, \mathbf{y}) \right\} = K(x, x) - K(x, X) \left[ K(X, X) + \sigma^2 I \right]^{-1} K(X, x) $$

[Figure: predictive uncertainty for $\sigma = 0.05$ vs. $\sigma = 0.01$]
Gaussian Process Regression
[Figure: low noise, $\sigma = 0.05$]
Gaussian Process Regression
[Figure: high noise, $\sigma = 0.2$]
Gaussian Process Regression
$$ y^* = E\left\{ y \mid x, X, \mathbf{y} \right\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \qquad \text{with } \alpha = \left[ K(X, X) + \sigma^2 I \right]^{-1} \mathbf{y} $$

The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints.

Gibbs' non-stationary covariance function (the length-scale is a function of $x$):

$$ k(x, x') = \prod_{i=1}^{N} \left( \frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2} \right)^{\!1/2} \exp\left( -\sum_{i=1}^{N} \frac{\left( x_i - x_i' \right)^2}{l_i(x)^2 + l_i(x')^2} \right) $$
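Gibbs' covariance function above can be sketched as follows; the lengthscale function `l` here is a made-up example of my own (shorter lengthscale where data is assumed denser), not one from the slides:

```python
import numpy as np

# Gibbs' non-stationary kernel for inputs in R^N, with a user-supplied
# per-dimension lengthscale function l(x).
def gibbs_kernel(x, x_prime, l):
    lx, lxp = l(x), l(x_prime)                 # per-dimension lengthscales
    prefactor = np.prod(np.sqrt(2 * lx * lxp / (lx**2 + lxp**2)))
    exponent = np.sum((x - x_prime)**2 / (lx**2 + lxp**2))
    return prefactor * np.exp(-exponent)

# Hypothetical lengthscale: short near the origin, longer far away.
l = lambda x: 0.1 + 0.5 * np.abs(x)
x = np.array([0.2])
# k(x, x) = 1, and the kernel is symmetric in its two arguments
```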
Gaussian Process Regression
Linear model: $y = w^T x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Non-linear model: $y = w^T \phi(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Both models follow a zero-mean Gaussian distribution:

$$ E\{y\} = E\left\{ w^T x + \epsilon \right\} = E\{w\}^T x + E\{\epsilon\} = 0 $$

(and likewise with $\phi(x)$ in place of $x$). Both therefore predict $y = 0$ away from the datapoints! SVR, in contrast, predicts $y = b$ away from the datapoints (see exercise session).
Examples of application of GPR
Kronander, K. and Billard, A. (2013) Learning Compliant Manipulation through Kinesthetic and Tactile Human-Robot
Interaction. IEEE Transactions on Haptics. 10.1109/TOH.2013.54.
Striking a match is a task that requires careful control of the interaction force: one must push hard enough to light the match, but not so hard as to break it.
Examples of application of GPR
The stiffness profile is encoded as a function of time using GPR. The shaded area corresponds to the striking phase. We see that the stiffness must be decreased just before entering into contact, and again during contact when the match lights up. Stiffness can increase again when the robot moves back into free space.
Examples of application of GPR
Building a 3D model of an object from tactile information can be useful to guide manipulation of the object when it is no longer visible.
Examples of application of GPR
Point on the surface: $x \in \mathbb{R}^3$. Distance to the surface: $y$, with $y = 0$ on the surface, $y > 0$ outside the surface, and $y < 0$ inside the object.

Learn a mapping $y = f(x)$ with GPR to determine how far one is from the surface.
Examples of application of GPR
Normal to the surface: $z$. Learn a mapping $z = g(x)$ with GPR to determine the normal to the surface (this requires 3 GPRs, one for each coordinate of the vector $z$).
The distance and normal to the surfaces can be used in an optimization
framework to determine the optimal posture of robot fingers on the object.
Examples of application of GPR
GPR can be used to model the shape of objects. Top: 3D points sampled either from a camera or from tactile sensing. Bottom: 3D shape reconstructed by GPR. The arrows represent the predicted normals at the surface (El Khoury, S., Li, M., and Billard, A. (2013). On the generation of a variety of grasps. Robotics and Autonomous Systems, 61(12):1335–34).
Regression Algorithms in this Course
Support vector regression (Support Vector Machine)
Relevance vector regression (Relevance Vector Machine)
Boosting – random projections / random Gaussians (gradient boosting)
Random forest
Gaussian process regression (Gaussian Process)
Locally weighted projected regression
Regression Algorithms in this Course
Relevance Vector Machine
Boosting – random Gaussians
Gradient boosting
Gradient Boosting
1. Choose some regressive technique (any we have seen so far).
2. Apply boosting to train and combine the set of estimates $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_m$.
Gradient Boosting
3. Aggregate to get the final estimate:

$$ \hat{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i $$
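The steps above can be sketched as follows. This follows the standard squared-loss gradient-boosting recipe, where each weak learner (here a hand-rolled regression stump, my own choice) is fitted to the residuals of the running ensemble and the scaled stage outputs are summed; all names and values are assumptions for illustration.

```python
import numpy as np

def fit_stump(X, r):
    # weak learner: best single-threshold piecewise-constant fit to residuals r
    best = None
    for t in np.unique(X):
        left, right = r[X <= t], r[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(X <= t, left.mean(), right.mean())
        err = np.sum((r - pred)**2)
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x: np.where(x <= t, lv, rv)

def gradient_boost(X, y, m=50, lr=0.5):
    ensemble, residual = [], y.copy()
    for _ in range(m):
        f = fit_stump(X, residual)
        ensemble.append(f)
        residual = residual - lr * f(X)   # squared loss: negative gradient = residual
    return lambda x: lr * sum(f(x) for f in ensemble)

X = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * X)
model = gradient_boost(X, y)
# the training error shrinks as more weak learners are combined
```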
Gradient Boosting
Typical example of a dataset with imbalanced data. The data was hand-drawn: there are more datapoints in regions where the hand slowed down. As a result, the fit is very good in these regions and less good in the others.

[Figure: gradient boosting using squared loss; the marked region has a high density of datapoints]
Gradient Boosting
Better results, i.e. a smoother fit, are obtained when increasing the number of functions used for the fit.

[Figure: gradient boosting using squared loss with twice as many functions]
Summary
We have seen a few different techniques to perform non-linear regression in machine learning.

The techniques differ in their algorithms and in their number of hyperparameters.

Some techniques (GP, RVR) provide a measure of the uncertainty of the model, which can be used to determine when the inference is trustworthy.

Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few models).

Other techniques (GP) are meant to provide a very accurate estimate of the data, at the cost of retaining all datapoints for retrieval.