Outline
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Neural Networks 1
Neural Networks 2
Continuous Latent Variables
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Mixture Models and EM 1
Mixture Models and EM 2
Sequential Data 1
Sequential Data 2
Combining Models
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science
The Australian National University
Canberra
February – June 2017
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part IX
Neural Networks 1
Basis Functions
The basis functions play a crucial role in the algorithms explored so far.
Number and parameters of the basis functions fixed before learning starts (e.g. Linear Regression and Linear Classification).
Number of basis functions fixed, but the parameters of the basis functions are adaptive (e.g. Neural Networks).
Centre the basis functions on the data and select a subset of them in the training phase (e.g. Support Vector Machines, Relevance Vector Machines).
Outline
The functional form of the network model (including a special parametrisation of the basis functions).
How to determine the network parameters within the maximum likelihood framework (the solution of a nonlinear optimisation problem).
Error backpropagation: efficiently evaluate the derivatives of the log likelihood function with respect to the network parameters.
Various approaches to regularise neural networks.
Feed-forward Network Functions
Same goal as before: decompose the target $t$ as
$$ t(x) = y(x, w) + \epsilon(x) $$
where $\epsilon(x)$ is the residual error.
(Generalised) Linear Model:
$$ y(x, w) = f\left( \sum_{j=0}^{M} w_j \phi_j(x) \right) $$
where $\phi = (\phi_0, \dots, \phi_M)^T$ are the fixed basis functions and $w = (w_0, \dots, w_M)^T$ are the model parameters.
For regression: $f(\cdot)$ is the identity function.
For classification: $f(\cdot)$ is a nonlinear activation function.
Goal: let $\phi_j(x)$ depend on parameters, and then adjust these parameters together with $w$.
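To make the contrast with adaptive basis functions concrete, here is a minimal Python/NumPy sketch (my own example, not from the lecture) of a generalised linear model with a fixed polynomial basis: only the weights $w$ are learned, the basis functions themselves never change.

```python
import numpy as np

def poly_basis(x, M=3):
    """Fixed polynomial basis phi_j(x) = x**j for j = 0..M (phi_0 = 1 acts as the bias)."""
    return np.stack([x**j for j in range(M + 1)], axis=1)        # shape (N, M+1)

def glm_predict(x, w, f=lambda a: a):
    """y(x, w) = f( sum_j w_j phi_j(x) ); f is the identity for regression."""
    return f(poly_basis(x, M=len(w) - 1) @ w)

# Least-squares fit of the weights only; the basis stays fixed.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)            # noisy targets
Phi = poly_basis(x)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w, glm_predict(np.array([0.5]), w))
```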
Feed-forward Network Functions
Goal: let $\phi_j(x)$ depend on parameters, and then adjust these parameters together with $w$.
There are many ways to do this.
Neural networks use basis functions which follow the same form as the (generalised) linear model:
each basis function is itself a nonlinear function of an adaptive linear combination of the inputs.
[Figure: two-layer feed-forward network with inputs $x_0, \dots, x_D$, hidden units $z_0, \dots, z_M$, outputs $y_1, \dots, y_K$, and weights $w^{(1)}$, $w^{(2)}$.]
Functional Transformations
Construct $M$ linear combinations of the input variables $x_1, \dots, x_D$ in the form
$$ \underbrace{a_j}_{\text{activations}} = \sum_{i=1}^{D} \underbrace{w^{(1)}_{ji}}_{\text{weights}} x_i + \underbrace{w^{(1)}_{j0}}_{\text{bias}} \qquad j = 1, \dots, M $$
Apply a differentiable, nonlinear activation function $h(\cdot)$ to get the output of the hidden units
$$ z_j = h(a_j) $$
$h(\cdot)$ is typically sigmoidal or tanh.
Functional Transformations
The outputs of the hidden units are again linearly combined
$$ a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0} \qquad k = 1, \dots, K $$
Apply again a differentiable, nonlinear activation function $g(\cdot)$ to get the network outputs $y_k$
$$ y_k = g(a_k) $$
Functional Transformations
The activation function $g(\cdot)$ is determined by the nature of the data and the distribution of the target variables.
For standard regression: $g(\cdot)$ is the identity function, so that $y_k = a_k$.
For multiple binary classification: $g(\cdot)$ is a logistic sigmoid function
$$ y_k = \sigma(a_k) = \frac{1}{1 + \exp(-a_k)} $$
Functional Transformations
Combine all transformations into one formula
$$ y_k(x, w) = g\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right) $$
where $w$ contains all weight and bias parameters.
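As a sanity check of this formula, here is a minimal NumPy sketch of the forward pass of a two-layer network with tanh hidden units (names, shapes and the random example data are my own assumptions):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, g=lambda a: a, h=np.tanh):
    """y_k = g( sum_j W2[k, j] * h( sum_i W1[j, i] x_i + b1[j] ) + b2[k] )."""
    a_hidden = W1 @ x + b1          # activations a_j, shape (M,)
    z = h(a_hidden)                 # hidden unit outputs z_j
    a_out = W2 @ z + b2             # output activations a_k, shape (K,)
    return g(a_out)                 # network outputs y_k

# Tiny example: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)
print(forward(x, W1, b1, W2, b2))                                       # regression: identity output
print(forward(x, W1, b1, W2, b2, g=lambda a: 1 / (1 + np.exp(-a))))     # classification: sigmoid output
```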
Functional Transformations
As before, the biases can be absorbed into the weights by introducing an extra input $x_0 = 1$ and a hidden unit $z_0 = 1$:
$$ y_k(x, w) = g\left( \sum_{j=0}^{M} w^{(2)}_{kj}\, h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right) $$
where $w$ now contains all weight and bias parameters.
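A quick numerical check of the bias-absorption trick (again my own sketch): prepending a constant 1 to the input and to the hidden layer, and folding the biases into an extra weight column, gives exactly the same outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)

# Explicit biases.
y_explicit = W2 @ np.tanh(W1 @ x + b1) + b2

# Biases absorbed: x_0 = 1 and z_0 = 1 carry the bias weights.
W1_aug = np.hstack([b1[:, None], W1])        # column 0 holds w^(1)_{j0}
W2_aug = np.hstack([b2[:, None], W2])        # column 0 holds w^(2)_{k0}
x_aug = np.concatenate([[1.0], x])
z_aug = np.concatenate([[1.0], np.tanh(W1_aug @ x_aug)])
y_absorbed = W2_aug @ z_aug

print(np.allclose(y_explicit, y_absorbed))   # True: identical outputs
```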
Comparison to Perceptron
A neural network looks like a multilayer perceptron.
But the perceptron's nonlinear activation function was a step function, which is neither smooth nor differentiable:
$$ f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases} $$
The activation functions $h(\cdot)$ and $g(\cdot)$ of a neural network are smooth and differentiable.
Linear Activation Functions
If all activation functions are linear, then there exists an equivalent network without hidden units (the composition of linear functions is itself a linear function).
But if the number of hidden units is smaller than the number of input or output units, the resulting linear mapping is not the most general one: the network performs a dimensionality reduction.
This is related to Principal Component Analysis (which comes later in the lecture).
Generally, neural networks use nonlinear activation functions, as the goal is to approximate a nonlinear mapping from the input space to the outputs.
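A minimal sketch of the first point (my own toy example): composing two purely linear layers collapses to a single linear map whose weight matrix is the product of the two, and whose rank is limited by the number of hidden units.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, K = 5, 3, 4                                    # M < D, M < K: a rank-limited bottleneck
W1 = rng.standard_normal((M, D))                     # first linear layer (no nonlinearity)
W2 = rng.standard_normal((K, M))                     # second linear layer
x = rng.standard_normal(D)

y_two_layers = W2 @ (W1 @ x)                         # network with a linear "hidden" layer
W_equivalent = W2 @ W1                               # single equivalent weight matrix
print(np.allclose(y_two_layers, W_equivalent @ x))   # True
print(np.linalg.matrix_rank(W_equivalent))           # at most M = 3: not the most general K x D map
```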
Extensions of the Neural Network Architecture
Add more hidden layers.
Some units may not be fully connected to the next layer.
Some links may skip over one or several subsequent layer(s).
All of this is still within the framework of feed-forward networks.
Note: if information is also allowed to flow backwards, we get a graph with cycles. This is called a recurrent neural network, which can exhibit very different dynamical behaviour (e.g. it may have state, or exhibit chaos). Not considered further here.
[Figure: feed-forward network with skip-layer connections; inputs $x_1, x_2$, hidden units $z_1, z_2, z_3$, outputs $y_1, y_2$.]
Neural Networks as Universal Function Approximators
Feed-forward neural networks are universal approximators.
Example: a two-layer neural network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided it has enough hidden units.
This holds for a wide range of hidden unit activation functions, but NOT for polynomials.
Remaining big question: where do we get the appropriate settings for the weights from? In other words, how do we learn the weights from training examples?
Approximation Capabilities of Neural Networks
Neural network approximating
$$ f(x) = x^2 $$
[Figure] Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval $(-1, 1)$. Red: resulting network output. Dashed: outputs of the three hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating
$$ f(x) = \sin(x) $$
[Figure] Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval $(-1, 1)$. Red: resulting network output. Dashed: outputs of the three hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating
$$ f(x) = |x| $$
[Figure] Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval $(-1, 1)$. Red: resulting network output. Dashed: outputs of the three hidden units.
Approximation Capabilities of Neural Networks
Neural network approximating the Heaviside function
$$ f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
[Figure] Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval $(-1, 1)$. Red: resulting network output. Dashed: outputs of the three hidden units.
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions.
[Figure] $z = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ for $(w_0, w_1, w_2) = (0.0, 1.0, 0.1)$
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions.
[Figure] $z = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ for $(w_0, w_1, w_2) = (0.0, 0.1, 1.0)$
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions.
[Figure] $z = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ for $(w_0, w_1, w_2) = (0.0, -0.5, 0.5)$
Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions.
[Figure] $z = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ for $(w_0, w_1, w_2) = (10.0, -0.5, 0.5)$
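To see how the weights reposition and reorient such a basis function, here is a small sketch (my own, using matplotlib) that evaluates $z = \sigma(w_0 + w_1 x_1 + w_2 x_2)$ on a grid for the four parameter settings shown above:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid_basis(x1, x2, w0, w1, w2):
    """One hidden unit viewed as a basis function over the input plane."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

x1, x2 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
settings = [(0.0, 1.0, 0.1), (0.0, 0.1, 1.0), (0.0, -0.5, 0.5), (10.0, -0.5, 0.5)]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (w0, w1, w2) in zip(axes, settings):
    ax.contourf(x1, x2, sigmoid_basis(x1, x2, w0, w1, w2), levels=20)
    ax.set_title(f"(w0, w1, w2) = ({w0}, {w1}, {w2})")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
plt.tight_layout()
plt.show()
```

Changing $(w_1, w_2)$ rotates the sigmoidal ridge, while $w_0$ shifts it away from the origin.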
Approximation Capabilities of Neural Networks
Neural network for two-class classification: 2 inputs, 2 hidden units with tanh activation functions, and 1 output with a logistic sigmoid activation function.
[Figure] Red: $y = 0.5$ decision boundary of the network. Dashed blue: $z = 0.5$ contours of the hidden units. Green: optimal decision boundary computed from the known data distribution.
Weight-space Symmetries
Given a set of weights $w$: this fixes a mapping from the input space to the output space.
Does there exist another set of weights realising the same mapping?
Assume a tanh activation function for the hidden units. tanh is an odd function: $\tanh(-a) = -\tanh(a)$.
Change the sign of all the weights feeding into a particular hidden unit and of all the weights leading out of it: the mapping stays the same.
Weight-space Symmetries
With $M$ hidden units there are therefore $2^M$ equivalent weight vectors.
Furthermore, exchanging all of the weights going into and out of one hidden unit with the corresponding weights of another hidden unit also leaves the mapping unchanged: $M!$ symmetries.
Overall weight-space symmetry factor: $M!\, 2^M$
M        1    2    3     4      5      6       7
M! 2^M   2    8    48    384    3840   46080   645120
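A minimal numerical check of both symmetries (my own sketch): flipping the signs of all weights into and out of one tanh hidden unit, or swapping two hidden units wholesale, leaves the network output unchanged; the last line reproduces the $M!\,2^M$ counts from the table.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
W2, b2 = rng.standard_normal((K, M)), rng.standard_normal(K)
x = rng.standard_normal(D)

def forward(W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

y = forward(W1, b1, W2, b2)

# Sign-flip symmetry: negate all weights into and out of hidden unit 0.
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0], b1f[0], W2f[:, 0] = -W1f[0], -b1f[0], -W2f[:, 0]
print(np.allclose(y, forward(W1f, b1f, W2f, b2)))                     # True

# Interchange symmetry: swap hidden units 1 and 2 (rows of W1/b1, columns of W2).
perm = [0, 2, 1, 3]
print(np.allclose(y, forward(W1[perm], b1[perm], W2[:, perm], b2)))   # True

# Number of equivalent weight vectors for M = 1..7.
print([factorial(M) * 2**M for M in range(1, 8)])                     # [2, 8, 48, 384, 3840, 46080, 645120]
```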
Parameter Optimisation
Assume the error $E(w)$ is a smooth function of the weights.
Its smallest value will occur at a critical point for which
$$ \nabla E(w) = 0. $$
This could be a minimum, a maximum, or a saddle point.
Furthermore, because of the symmetries in weight space, there are many other critical points with the same value of the error (a factor of $M!\, 2^M$ from the weight-space symmetries alone).
Parameter Optimisation
Definition (Global Minimum)
A point w∗ for which the error E(w∗) is smaller than any othererror E(w).
Definition (Local Minimum)
A point w∗ for which the error E(w∗) is smaller than any othererror E(w) in some neighbourhood of w∗.
[Figure: error surface $E(w)$ over weight space, with a local minimum $w_A$, the global minimum $w_B$, and the gradient $\nabla E$ evaluated at a point $w_C$.]
Parameter Optimisation
Finding the global minimum is difficult in general (one would have to check everywhere), unless the error function comes from a special class (e.g. smooth convex functions have only one minimum).
Error functions of neural networks are not convex (the weight-space symmetries already rule this out!).
But finding a local minimum might be sufficient.
Use iterative methods with a weight vector update $\Delta w^{(\tau)}$ to find a local minimum:
$$ w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)} $$
Local Quadratic Approximation
Around a minimum $w^*$ we can approximate (the linear term vanishes because $\nabla E(w^*) = 0$)
$$ E(w) \simeq E(w^*) + \tfrac{1}{2} (w - w^*)^T H (w - w^*), $$
where the Hessian $H$ is evaluated at $w^*$.
Using a set $\{u_i\}$ of orthonormal eigenvectors of $H$,
$$ H u_i = \lambda_i u_i, $$
to expand
$$ w - w^* = \sum_i \alpha_i u_i, $$
we get
$$ E(w) = E(w^*) + \tfrac{1}{2} \sum_i \lambda_i \alpha_i^2. $$
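A small numerical sketch of this expansion (my own toy error function): diagonalise the Hessian at the minimum and compare the true error with the quadratic approximation expressed through the eigenvalues $\lambda_i$ and coefficients $\alpha_i$.

```python
import numpy as np

# A toy error function with minimum at w* = 0; its Hessian there is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
def E(w):
    return 0.5 * w @ A @ w + 0.1 * np.sum(w**4)     # quartic term makes it non-quadratic

w_star = np.zeros(2)
H = A                                               # Hessian of E at w* (the quartic term contributes nothing there)
lam, U = np.linalg.eigh(H)                          # eigenvalues lambda_i, orthonormal eigenvectors u_i (columns)

w = w_star + np.array([0.1, -0.05])                 # a point near the minimum
alpha = U.T @ (w - w_star)                          # coefficients alpha_i in w - w* = sum_i alpha_i u_i

E_quad = E(w_star) + 0.5 * np.sum(lam * alpha**2)   # E(w*) + 1/2 sum_i lambda_i alpha_i^2
print(E(w), E_quad)                                 # nearly equal, since w is close to w*
```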
Local Quadratic Approximation
Around a minimum $w^*$ we can approximate
$$ E(w) = E(w^*) + \tfrac{1}{2} \sum_i \lambda_i \alpha_i^2. $$
[Figure: elliptical contours of constant $E(w)$ in the neighbourhood of the minimum $w^\star$; the axes are aligned with the Hessian eigenvectors $u_1, u_2$ and have lengths proportional to $\lambda_1^{-1/2}, \lambda_2^{-1/2}$.]
Local Quadratic Approximation
At a minimum $w^*$, the Hessian $H$ evaluated at $w^*$ must be positive definite.
Gradient Information Improves Performance
The Hessian is symmetric and contains $W(W+1)/2$ independent entries, where $W$ is the total number of weights in the network.
Without gradient information, we would need to gather these $O(W^2)$ pieces of information from $O(W^2)$ function evaluations, each costing $O(W)$ steps: $O(W^3)$ overall.
The gradient $\nabla E$ provides $W$ pieces of information at once. We still need $O(W)$ gradient evaluations, each costing $O(W)$ steps, so the overall order is now $O(W^2)$.
Gradient Descent Optimisation
Batch processing: update the weight vector with a small step in the direction of the negative gradient
$$ w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)}) $$
where $\eta$ is the learning rate.
After each step, re-evaluate the gradient $\nabla E(w^{(\tau)})$.
Gradient Descent has problems in 'long valleys'.
Gradient Descent Optimisation
Gradient Descent has problems in ’long valleys’.
[Figure] Example of the zig-zag behaviour of the gradient descent algorithm.
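A minimal sketch of this behaviour (my own example): plain gradient descent on the elongated quadratic valley $E(w) = \tfrac{1}{2}(w_1^2 + 100\, w_2^2)$ oscillates across the steep direction while only creeping along the shallow one.

```python
import numpy as np

# Elongated quadratic "valley": steep in w2, shallow in w1.
def E(w):
    return 0.5 * (w[0]**2 + 100.0 * w[1]**2)

def grad_E(w):
    return np.array([w[0], 100.0 * w[1]])

w = np.array([10.0, 1.0])
eta = 0.019                        # close to the stability limit 2/100 for the steep direction
for tau in range(20):
    w = w - eta * grad_E(w)        # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
    if tau % 5 == 0:
        print(tau, w, E(w))        # w[1] flips sign almost every step: zig-zag
```

With $\eta$ just below the stability limit $2/\lambda_{\max}$, the steep coordinate overshoots and changes sign at every step while the shallow coordinate decays very slowly.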
Nonlinear Optimisation
Use Conjugate Gradient Descent instead of Gradient Descent to avoid the zig-zag behaviour.
Use Newton's method, which also calculates the inverse Hessian in each iteration (but inverting the Hessian is usually costly).
Use quasi-Newton methods (e.g. BFGS), which build up an estimate of the inverse Hessian while iterating.
Run the algorithm from a set of different starting points to find the smallest local minimum.
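As an illustration of the last two points, here is a hedged sketch using SciPy's general-purpose optimiser (the toy error function and the restart strategy are my own; `scipy.optimize.minimize` with `method="BFGS"` maintains an approximation of the inverse Hessian internally):

```python
import numpy as np
from scipy.optimize import minimize

# A non-convex toy "error function" with several local minima.
def E(w):
    return np.sum(w**4 - 8.0 * w**2 + 3.0 * w)

def grad_E(w):
    return 4.0 * w**3 - 16.0 * w + 3.0

rng = np.random.default_rng(4)
best = None
for restart in range(10):                            # multiple random restarts
    w0 = rng.uniform(-3.0, 3.0, size=2)               # random starting point
    res = minimize(E, w0, jac=grad_E, method="BFGS")  # quasi-Newton optimisation
    if best is None or res.fun < best.fun:
        best = res                                     # keep the smallest local minimum found

print(best.x, best.fun)
```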
Nonlinear Optimisation
Remaining big problem: the error function is defined over the whole training set. Therefore, we need to process the whole training set for each calculation of the gradient $\nabla E(w^{(\tau)})$.
If the error function is a sum of errors for each data point,
$$ E(w) = \sum_{n=1}^{N} E_n(w), $$
we can use on-line gradient descent (also called sequential gradient descent or stochastic gradient descent), updating the weights using one data point at a time:
$$ w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w^{(\tau)}). $$
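A minimal sketch of the on-line update (my own example, using a linear least-squares error so that $\nabla E_n$ is easy to write down; the same loop applies to a neural network once backpropagation supplies $\nabla E_n$):

```python
import numpy as np

# Sum-of-squares error E(w) = sum_n E_n(w) with E_n(w) = 0.5 * (w @ x_n - t_n)**2.
rng = np.random.default_rng(5)
N, D = 200, 3
X = rng.standard_normal((N, D))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.standard_normal(N)

w = np.zeros(D)
eta = 0.05
for epoch in range(20):
    for n in rng.permutation(N):                  # visit the data points in random order
        grad_En = (w @ X[n] - t[n]) * X[n]        # gradient of E_n at the current w
        w = w - eta * grad_En                     # w^(tau+1) = w^(tau) - eta * grad E_n(w^(tau))

print(w)                                          # close to w_true
```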