
Introduction to Statistical Machine Learning
©2016 Ong & Walder
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Neural Networks 1
Neural Networks 2
Continuous Latent Variables
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Mixture Models and EM 1
Mixture Models and EM 2
Sequential Data 1
Sequential Data 2
Combining Models

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder
Machine Learning Research Group
Data61 | CSIRO
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2017

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part IX

Neural Networks 1


Basis Functions

The basis functions play a crucial role in the algorithms explored so far.
Number and parameters of basis functions fixed before learning starts (e.g. Linear Regression and Linear Classification).
Number of basis functions fixed, parameters of the basis functions are adaptive (e.g. Neural Networks).
Center basis functions on the data, select a subset of basis functions in the training phase (e.g. Support Vector Machines, Relevance Vector Machines).


Outline

The functional form of the network model (including special parametrisation of the basis functions).
How to determine the network parameters within the maximum likelihood framework? (Solution of a nonlinear optimisation problem.)
Error backpropagation: efficiently evaluate the derivatives of the log likelihood function with respect to the network parameters.
Various approaches to regularise neural networks.


Feed-forward Network Functions

Same goal as before: decompose the target t as

t(x) = y(x, w) + ε(x)

where ε(x) is the residual error.

(Generalised) Linear Model:

y(x, w) = f( Σ_{j=0}^{M} w_j φ_j(x) )

where φ = (φ_0, . . . , φ_M)^T is the fixed model basis and w = (w_0, . . . , w_M)^T are the model parameters.
For regression: f(·) is the identity function.
For classification: f(·) is a nonlinear activation function.
Goal: let φ_j(x) depend on parameters, and then adjust these parameters together with w.


Feed-forward Network Functions

Goal: let φ_j(x) depend on parameters, and then adjust these parameters together with w.
Many ways to do this.
Neural networks use basis functions which follow the same form as the (generalised) linear model.
Each basis function is itself a nonlinear function of an adaptive linear combination of the inputs.

[Figure: two-layer feed-forward network with inputs x_0, x_1, . . . , x_D, hidden units z_0, z_1, . . . , z_M, and outputs y_1, . . . , y_K; first-layer weights w^(1) (e.g. w^(1)_{MD}) and second-layer weights w^(2) (e.g. w^(2)_{KM}, w^(2)_{10}).]


Functional Transformations

Construct M linear combinations of the input variables x_1, . . . , x_D in the form

a_j = Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0},   j = 1, . . . , M

where the a_j are called activations, the w^(1)_{ji} are the weights, and the w^(1)_{j0} are the biases.

Apply a differentiable, nonlinear activation function h(·) to get the output of the hidden units

z_j = h(a_j)

h(·) is typically sigmoidal or tanh.


Functional Transformations

The outputs of the hidden units are again linearly combined:

a_k = Σ_{j=1}^{M} w^(2)_{kj} z_j + w^(2)_{k0},   k = 1, . . . , K

Apply again a differentiable, nonlinear activation function g(·) to get the network outputs y_k:

y_k = g(a_k)


Functional Transformations

The activation function g(·) is determined by the nature of the data and the distribution of the target variables.
For standard regression: g(·) is the identity function, so that y_k = a_k.
For multiple binary classification: g(·) is a logistic sigmoid function

y_k = σ(a_k) = 1 / (1 + exp(−a_k))


Functional Transformations

Combine all transformations into one formula:

y_k(x, w) = g( Σ_{j=1}^{M} w^(2)_{kj} h( Σ_{i=1}^{D} w^(1)_{ji} x_i + w^(1)_{j0} ) + w^(2)_{k0} )

where w contains all weight and bias parameters.


Functional Transformations

As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and a hidden unit z_0 = 1:

y_k(x, w) = g( Σ_{j=0}^{M} w^(2)_{kj} h( Σ_{i=0}^{D} w^(1)_{ji} x_i ) )

where w now contains all weight and bias parameters.
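As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this forward pass with tanh hidden units and identity outputs; the function name, array shapes and bias-absorption layout are my own assumptions.

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer feed-forward network y_k(x, w) with absorbed biases.

    x  : (D,)     input vector (without the constant 1)
    W1 : (M, D+1) first-layer weights, column 0 holds the biases w^(1)_j0
    W2 : (K, M+1) second-layer weights, column 0 holds the biases w^(2)_k0
    """
    x_tilde = np.concatenate(([1.0], x))      # x_0 = 1 absorbs the first-layer bias
    a = W1 @ x_tilde                          # hidden activations a_j
    z = np.concatenate(([1.0], h(a)))         # z_0 = 1 absorbs the second-layer bias
    return g(W2 @ z)                          # network outputs y_k

# Tiny example: D = 2 inputs, M = 3 hidden units, K = 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # (M, D+1)
W2 = rng.normal(size=(1, 4))   # (K, M+1)
print(forward(np.array([0.5, -1.0]), W1, W2))
```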


Comparison to Perceptron

A neural network looks like a multilayer perceptron.
But the perceptron's nonlinear activation function was a step function. Not smooth. Not differentiable.

f(a) = { +1,  a ≥ 0
       { −1,  a < 0

The activation functions h(·) and g(·) of a neural network are smooth and differentiable.


Linear Activation Functions

If all activation functions are linear functions, then there exists an equivalent network without hidden units. (Composition of linear functions is a linear function.)
But if the number of hidden units in this case is smaller than the number of input or output units, the resulting linear functions are not the most general.
Dimensionality reduction.
Principal Component Analysis (comes later in the lecture).
Generally, most neural networks use nonlinear activation functions, as the goal is to approximate a nonlinear mapping from the input space to the outputs.
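A quick NumPy check of this claim, as an illustrative sketch rather than anything from the slides: composing two linear layers gives exactly the single matrix W2 W1, whose rank is limited by the number of hidden units.

```python
import numpy as np

# With linear activations, a two-layer network y = W2 (W1 x) is the same map as W2 @ W1.
rng = np.random.default_rng(1)
D, M, K = 5, 2, 4                 # M < min(D, K): a rank-limited linear map
W1 = rng.normal(size=(M, D))      # first linear layer
W2 = rng.normal(size=(K, M))      # second linear layer
x = rng.normal(size=D)

two_layer = W2 @ (W1 @ x)         # "network" with linear hidden units
one_layer = (W2 @ W1) @ x         # equivalent network without hidden units
print(np.allclose(two_layer, one_layer))   # True
print(np.linalg.matrix_rank(W2 @ W1))      # at most M = 2: not the most general K x D map
```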


Extensions of the Neural Network Architecture

Add more hidden layers.
Some units may not be fully connected to the next layer.
Some links may skip over one or several subsequent layer(s).
Still in the framework of feed-forward networks.
Note: If information is allowed to flow backwards as well, we get a graph with cycles. This is called a recurrent neural network, which can exhibit a very different dynamical behaviour (e.g. may have state, may exhibit chaos). Not further considered here.

[Figure: example feed-forward network with inputs x_1, x_2, hidden units z_1, z_2, z_3, and outputs y_1, y_2.]


Neural Networks as Universal Function Approximators

Feed-forward neural networks are universal approximators.
Example: A two-layer neural network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy if it has enough hidden units.
This holds for a wide range of hidden unit activation functions, but NOT for polynomials.
Remaining big question: Where do we get the appropriate settings for the weights from? In other words, how do we learn the weights from training examples?


Approximation Capabilities of Neural Networks

Neural network approximating

f(x) = x^2

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.
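A rough sketch of how such a fit could be reproduced, assuming NumPy and SciPy are available; the parameterisation, optimiser and random seed are illustrative choices of mine, not taken from the slides.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)         # 50 points from (-1, 1)
t = x ** 2                                  # target f(x) = x^2

M = 3                                       # hidden units with tanh activations

def unpack(w):
    # parameter vector w = [w1 (M), b1 (M), w2 (M), b2 (1)]
    return w[:M], w[M:2*M], w[2*M:3*M], w[3*M]

def predict(w, x):
    w1, b1, w2, b2 = unpack(w)
    z = np.tanh(np.outer(x, w1) + b1)       # (N, M) hidden unit outputs
    return z @ w2 + b2                      # linear output unit

def sse(w):
    return 0.5 * np.sum((predict(w, x) - t) ** 2)

w0 = rng.normal(scale=0.5, size=3 * M + 1)
res = minimize(sse, w0, method="BFGS")      # gradient approximated by finite differences
print("final sum-of-squares error:", res.fun)
```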


Approximation Capabilities of Neural Networks

Neural network approximating

f(x) = sin(x)

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.


Approximation Capabilities of Neural Networks

Neural network approximating

f(x) = |x|

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.


Approximation Capabilities of Neural Networks

Neural network approximating the Heaviside function

f(x) = { 1,  x ≥ 0
       { 0,  x < 0

Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting output. Dashed: output of the hidden units.


Variable Basis Functions in a Neural Network

Hidden layer nodes represent parametrised basis functions.

[Surface plot of z over (x1, x2) ∈ [−10, 10] × [−10, 10].]

z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0, 1.0, 0.1)
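For illustration, a small NumPy snippet (not from the slides) that evaluates one such sigmoidal basis function on the plotted grid; changing the weights reorients and shifts the ridge, which is exactly what the following plots vary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One hidden unit as a parametrised basis function over the input plane,
# evaluated on the same grid as the plots (weights of the first plot).
w0, w1, w2 = 0.0, 1.0, 0.1
x1, x2 = np.meshgrid(np.linspace(-10, 10, 101), np.linspace(-10, 10, 101))
z = sigmoid(w0 + w1 * x1 + w2 * x2)   # adjusting (w0, w1, w2) changes orientation and offset
print(z.shape, z.min(), z.max())
```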


Variable Basis Functions in a Neural Network

Hidden layer nodes represent parametrised basis functions.


z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0, 0.1, 1.0)


Variable Basis Functions in a Neural Network

Hidden layer nodes represent parametrised basis functions.


z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (0.0,−0.5, 0.5)


Variable Basis Functions in a Neural Network

Hidden layer nodes represent parametrised basis functions.


z = σ(w0 + w1x1 + w2x2) for (w0,w1,w2) = (10.0,−0.5, 0.5)


Approximation Capabilities of Neural Networks

Neural network for two-class classification.
2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.


Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: optimal decision boundary from the known data distribution.


Weight-space Symmetries

Given a set of weights w. This fixes a mapping from the input space to the output space.
Does there exist another set of weights realising the same mapping?
Assume tanh activation functions for the hidden units. As tanh is an odd function, tanh(−a) = −tanh(a).
Change the sign of all inputs to a hidden unit and of all outputs of this hidden unit: the mapping stays the same.
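A small NumPy check of this sign-flip symmetry; an illustrative sketch in which the bias absorption and array shapes are my own assumptions.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer tanh network with absorbed biases (x_0 = z_0 = 1)."""
    z = np.concatenate(([1.0], np.tanh(W1 @ np.concatenate(([1.0], x)))))
    return W2 @ z

rng = np.random.default_rng(2)
W1 = rng.normal(size=(4, 3))   # M = 4 hidden units, D = 2 inputs
W2 = rng.normal(size=(2, 5))   # K = 2 outputs
x = rng.normal(size=2)

# Flip the sign of all weights into hidden unit j, and of the weights out of it.
j = 1
W1_flip = W1.copy(); W1_flip[j, :] *= -1.0        # inputs to hidden unit j
W2_flip = W2.copy(); W2_flip[:, j + 1] *= -1.0    # outputs of hidden unit j (+1 skips z_0)
print(np.allclose(forward(x, W1, W2), forward(x, W1_flip, W2_flip)))   # True
```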


Weight-space Symmetries

M hidden units, therefore 2^M equivalent weight vectors.
Furthermore, exchange all of the weights going into and out of a hidden unit with the corresponding weights of another hidden unit: the mapping stays the same. This gives M! symmetries.
Overall weight space symmetry factor: M! 2^M

M        1    2    3     4      5       6       7
M! 2^M   2    8    48    384    3840    46080   645120
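For reference, the factors in the table can be recomputed with a few lines of Python (an illustrative check, not part of the slides):

```python
from math import factorial

# Reproduce the table of weight-space symmetry factors M! * 2^M.
for M in range(1, 8):
    print(M, factorial(M) * 2 ** M)
```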


Parameter Optimisation

Assume the error E(w) is a smooth function of the weights.
The smallest value will occur at a critical point for which

∇E(w) = 0.

This could be a minimum, maximum, or saddle point.
Furthermore, because of the symmetry in weight space, there are at least M! 2^M other critical points with the same value for the error.


Parameter Optimisation

Definition (Global Minimum)

A point w∗ for which the error E(w∗) is smaller than any other error E(w).

Definition (Local Minimum)

A point w∗ for which the error E(w∗) is smaller than any other error E(w) in some neighbourhood of w∗.

[Figure: error surface E(w) over weight space (w_1, w_2), with points w_A, w_B, w_C and the gradient ∇E.]


Parameter Optimisation

Finding the global minimum is difficult in general (one would have to check everywhere) unless the error function comes from a special class (e.g. smooth convex functions have only one minimum).
Error functions for neural networks are not convex (symmetries!).
But finding a local minimum might be sufficient.
Use iterative methods with a weight vector update ∆w(τ) to find a local minimum:

w(τ+1) = w(τ) + ∆w(τ)


Local Quadratic Approximation

Around a minimum w∗ we can approximate

E(w) ≈ E(w∗) + 1/2 (w − w∗)^T H (w − w∗),

where the Hessian H is evaluated at w∗.

Using a set {u_i} of orthonormal eigenvectors of H,

H u_i = λ_i u_i,

to expand

w − w∗ = Σ_i α_i u_i,

we get

E(w) = E(w∗) + 1/2 Σ_i λ_i α_i^2.
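An illustrative NumPy check of this expansion (not from the slides), using a randomly generated symmetric positive definite Hessian:

```python
import numpy as np

# For a symmetric H, 1/2 (w - w*)^T H (w - w*) equals 1/2 * sum_i lambda_i * alpha_i^2.
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4 * np.eye(4)            # symmetric positive definite Hessian
w_star = rng.normal(size=4)
w = rng.normal(size=4)

lam, U = np.linalg.eigh(H)             # columns of U are orthonormal eigenvectors u_i
alpha = U.T @ (w - w_star)             # coefficients alpha_i of w - w* in that basis

quadratic = 0.5 * (w - w_star) @ H @ (w - w_star)
expansion = 0.5 * np.sum(lam * alpha ** 2)
print(np.allclose(quadratic, expansion))   # True
```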


Local Quadratic Approximation

Around a minimum w∗ we can approximate

E(w) = E(w∗) + 1/2 Σ_i λ_i α_i^2.

[Figure: elliptical contours of constant E(w) around w⋆ in the (w_1, w_2) plane, with axes along the eigenvectors u_1, u_2 and lengths proportional to λ_1^{−1/2}, λ_2^{−1/2}.]


Local Quadratic Approximation

Around a minimum w∗, the Hessian H must be positive definite when evaluated at w∗.


Gradient Information Improves Performance

The Hessian is symmetric and contains W(W + 1)/2 independent entries, where W is the total number of weights in the network.
If nothing else is known, we need to gather these O(W^2) pieces of information from function evaluations, each requiring O(W) steps, giving order O(W^3).
The gradient ∇E provides W pieces of information at once. We still need O(W) steps, but the order is now O(W^2).


Gradient Descent Optimisation

Batch processing: update the weight vector with a small step in the direction of the negative gradient

w(τ+1) = w(τ) − η∇E(w(τ))

where η is the learning rate.
After each step, re-evaluate the gradient ∇E(w(τ)) again.
Gradient descent has problems in 'long valleys'.
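A minimal sketch of this batch update on a toy quadratic error with an elongated valley; the error function, step size, and step count below are illustrative assumptions, not from the slides.

```python
import numpy as np

def batch_gradient_descent(grad_E, w0, eta, n_steps=100):
    """w^(tau+1) = w^(tau) - eta * grad E(w^(tau)), re-evaluating the gradient each step."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)
    return w

# Toy example: E(w) = 1/2 w^T A w with a long, narrow valley (curvatures 1 and 50).
A = np.diag([1.0, 50.0])
grad_E = lambda w: A @ w
# eta must stay below 2/50 for stability; the iterate oscillates along the steep
# direction while creeping along the shallow one (the zig-zag behaviour).
print(batch_gradient_descent(grad_E, w0=[1.0, 1.0], eta=0.035))
```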


Gradient Descent Optimisation

Gradient Descent has problems in ’long valleys’.

[Figure: example of the zig-zag behaviour of the gradient descent algorithm.]


Nonlinear Optimisation

Use conjugate gradient descent instead of gradient descent to avoid the zig-zag behaviour.
Use the Newton method, which also calculates the inverse Hessian in each iteration (but inverting the Hessian is usually costly).
Use quasi-Newton methods (e.g. BFGS), which calculate an estimate of the inverse Hessian while iterating.
Run the algorithm from a set of starting points to find the smallest local minimum.
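As an illustrative sketch (the error function below is a stand-in of mine, not from the lecture), a quasi-Newton optimiser restarted from several random points, keeping the smallest local minimum found:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in non-convex "error function" with several local minima.
def E(w):
    return np.sum(w ** 4 - 8 * w ** 2 + 3 * np.sin(5 * w))

rng = np.random.default_rng(4)
best = None
for _ in range(10):                              # several random starting points
    w0 = rng.uniform(-3, 3, size=2)
    res = minimize(E, w0, method="BFGS")         # quasi-Newton: estimates the inverse Hessian
    if best is None or res.fun < best.fun:
        best = res                               # keep the smallest local minimum found
print(best.fun, best.x)
```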


Nonlinear Optimisation

Remaining big problem: the error function is defined over the whole training set. Therefore, we need to process the whole training set for each calculation of the gradient ∇E(w(τ)).
If the error function is a sum of errors for each data point

E(w) = Σ_{n=1}^{N} E_n(w)

we can use on-line gradient descent (also called sequential gradient descent or stochastic gradient descent), updating the weights by one data point at a time:

w(τ+1) = w(τ) − η∇En(w(τ)).
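A minimal sketch of on-line (stochastic) gradient descent for an error that decomposes over data points; the linear model and sum-of-squares per-point error E_n are illustrative choices, not taken from the slides.

```python
import numpy as np

# On-line / stochastic gradient descent for E(w) = sum_n E_n(w),
# here with E_n(w) = 1/2 (w^T x_n - t_n)^2 for a simple linear model.
rng = np.random.default_rng(5)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
eta = 0.05
for epoch in range(20):
    for n in rng.permutation(N):                 # one data point at a time
        grad_n = (w @ X[n] - t[n]) * X[n]        # gradient of E_n(w)
        w = w - eta * grad_n                     # w^(tau+1) = w^(tau) - eta * grad E_n
print(w)                                         # close to w_true
```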

