Bayesian Inference in Machine Learning
ACM Compute 2013
Tutorial Contents
I. Introduction to Machine Learning
II. Bayesian Inference
  1. Maximum Likelihood Estimation
  2. Bayesian Approach
  3. Estimation of Posterior Distribution
III. Machine Learning using Bayesian Inference
  1. Clustering
  2. Latent Variables
  3. Expectation Maximization Algorithm
  4. Variational Methods
Introduction to Machine Learning
Machine Learning is a subject in Artificial Intelligence, at the intersection of Computer Science and Statistics, dealing with:
- Finding non-obvious patterns in data
- Learning complex relationships among variables
- Predicting the outcomes when some variables undergo changes
Common Tasks in Machine Learning
Clustering: Identifying natural groups (clusters) in the data and assigning each data point to a particular cluster.
Classification: Assigning new observations to a set of predefined categories (classes) by learning from a training data set.
Prediction: Estimating the likelihood of purchase, probability of failure, odds ratio of a win, etc.
Pattern Mining: Extracting non-obvious patterns and rules having significant support and confidence from data.
Dimensionality Reduction: Reducing the number of variables by finding a subset (features) which contains most of the relevant information (useful for visualization).
Forecasting: Predicting future values in a time series.
Machine Learning Algorithms
- K-Means Algorithm
- Decision Trees
- Support Vector Machines
- Neural Networks
- Logistic Regression
- Bayesian Networks
- Apriori Algorithm
- Principal Component Analysis
- Gibbs Sampling
Machine Learning Algorithms
Decision Trees for Classification: Divide and Conquer Approach
[Figure: scatter plot of labeled points in the (x, y) plane, recursively partitioned by the split lines x = x1, x = x2, x = x3, y = y1, and y = y2]
Machine Learning Algorithms
Decision Trees for Classification: Divide and Conquer Approach
[Figure: decision tree with a root split on X < X1 vs. X > X1, second-level splits on Y < Y1 / Y > Y1 and Y < Y2 / Y > Y2, third-level splits on X < X2 / X > X2 and X < X3 / X > X3, and leaf labels Red, Red, Blue, Red, Blue]
Classification of Machine Learning Algorithms
Based on Data Type
1. Supervised: when there are labeled data sets for training (e.g., Classification, Regression)
2. Unsupervised: no labeled data sets for training (e.g., Clustering)

Based on Model Type
1. Parametric: a finite number of parameters describing the model (e.g., regression coefficients)
2. Non-Parametric: an effectively infinite number of parameters describing the model; makes no assumption on the distribution of the data (e.g., rank ordering)
Tools for Machine Learning
Open Source
1. R
2. Mahout (for Big Data)
3. Weka
4. SciKit (Python)
Licensed
1. SAS
2. SPSS
3. KXEN
Maximum Likelihood Estimation
Workhorse of Machine Learning
The objective is to estimate the model parameters that maximize the probability of the observed data given a model.

MLE Steps:
- Data: i.i.d. observations $x_1, x_2, \ldots, x_N$
- Choose the model parameters $\theta$ which maximize the Likelihood Function $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$

$\hat{\theta}_{MLE} = \arg\max_\theta \log L(\theta) = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta)$
MLE Example using R
Synthetic Data Set:
- Linear model: $y = \beta_0 + \beta_1 x + \epsilon$, with $\beta_0 = 10.0$, $\beta_1 = 2.0$
- Noise: $\epsilon \sim N(0, 25.0)$
MLE Example using R
MLE Solution, case when $\sigma$ is known:
$\hat{\beta}_0 = 9.8342$, $\hat{\beta}_1 = 1.9653$

MLE Solution, case when $\sigma$ is unknown:
$\hat{\beta}_0 = 6.7631$, $\hat{\beta}_1 = 1.9993$, $\hat{\sigma}^2 = 43.61$
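A minimal R sketch of this example (the sample size, seed, and range of x are assumptions; the estimates above came from the original data set, so the numbers produced below will differ):

# Generate synthetic data: y = 10 + 2x + noise, noise ~ N(0, sd = 5)
set.seed(42)                      # assumed seed, for reproducibility
n <- 100                          # assumed sample size
x <- runif(n, 0, 10)              # assumed range for x
y <- 10.0 + 2.0 * x + rnorm(n, mean = 0, sd = 5)

# Negative log-likelihood of the linear-Gaussian model
negloglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- exp(par[3])  # log-scale keeps sigma > 0
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

# Maximize the likelihood by minimizing the negative log-likelihood
fit <- optim(c(0, 0, 0), negloglik)
c(beta0 = fit$par[1], beta1 = fit$par[2], sigma2 = exp(fit$par[3])^2)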
Maximum Likelihood Estimation
Limitations:
- Point estimates for model parameters
- Needs regularization to avoid model overfit
- Difficult to incorporate domain knowledge
- No provision for incremental updates in online/sequential learning
Bayesian Methods
- Bayesian Methods provide a framework to reason coherently about the world in the presence of uncertainty
- Based on a theorem by Rev. Thomas Bayes (1701-1761)
- Independently discovered and popularized by Laplace (1749-1827)
- The core approach in Bayesian Methods is to:
  - Start with a Belief (Hypothesis) about a problem and a Prior Degree of Belief
  - Update the Degree of Belief in the Hypothesis as more Evidence (Data) gathers
- Different from the Frequentist approach, where probability is a measure of the proportion of outcomes
- Analogous to how human beings learn about the world
Bayesian Approach
- In the Bayesian approach, probability is considered as a Degree of Belief
- Starting point: a Hypothesis representing existing knowledge or belief
- Update the Hypothesis in the event of observing new data, using Bayes' Theorem
Bayes' Theorem:

$P(\text{Hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Hypothesis}) \cdot P(\text{Hypothesis})}{P(\text{Data})}$

Posterior $\propto$ Likelihood $\times$ Prior
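As a quick illustration of the update (a hypothetical numeric example, not from the original slides): take a prior $P(H) = 0.3$, a likelihood $P(D \mid H) = 0.8$, and $P(D \mid \neg H) = 0.4$. Then

$P(D) = 0.8 \times 0.3 + 0.4 \times 0.7 = 0.52, \qquad P(H \mid D) = \frac{0.8 \times 0.3}{0.52} \approx 0.46$

so observing D raises the degree of belief in H from 0.30 to about 0.46.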
Bayesian Approach
- Prediction for a new data point $x_{new}$ is made by using the updated posterior distribution:

$p(x_{new} \mid \text{Data}) = \int p(x_{new} \mid \theta)\, p(\theta \mid \text{Data})\, d\theta$
Bayesian Approach
Advantages:
- A distribution over parameters, capturing uncertainty in parameter estimation
- Marginalization over the distribution of parameters avoids overfitting; no need for regularization
- A framework to incorporate domain knowledge through the Prior distribution
- A framework to address model uncertainties
- A natural framework for online/sequential learning
- Modeling can start with very little data
Challenges in Implementation
- Estimation of the Posterior distribution is not trivial, since P(Data) often cannot be computed
- Choosing the right prior
- Computationally more intensive
Choosing the Right Prior
General Guidelines in choosing a Prior:
1. Justify assumptions and evaluate their plausibility in view of what is known
2. Explore the sensitivity of the results of the analysis to the assumptions on the Prior
Different Types of Priors:
1. Conjugate Priors
2. Non-informative Priors
3. Jeffreys Prior
4. Subjective Priors
Conjugate Priors
- Data: i.i.d. observations $x_1, x_2, \ldots, x_N$
- Model: $x_i \mid \theta \sim N(\theta, \sigma^2)$, with $\sigma^2$ known
- Prior: $\theta \sim N(\mu_0, \tau_0^2)$
- Posterior:

$p(\theta \mid x) \propto \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right) \times \frac{1}{\sqrt{2\pi}\,\tau_0}\exp\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right)$
Conjugate Priors
After rearranging (completing the square in $\theta$):

$p(\theta \mid x) \propto \exp\left(-\frac{1}{2}\left[\frac{1}{\tau_0^2} + \frac{N}{\sigma^2}\right]\theta^2 + \left[\frac{\mu_0}{\tau_0^2} + \frac{N\bar{x}}{\sigma^2}\right]\theta\right)$

which is again the kernel of a Normal distribution in $\theta$.
Conjugate Priors
: ~ ,
/ /
/
1
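A minimal R sketch of this conjugate normal-normal update (the observations and prior values below are assumptions for illustration):

# Normal likelihood with known sigma, Normal prior on the mean theta
x     <- c(4.2, 5.1, 6.3, 4.8, 5.6)   # assumed observations
sigma <- 1.0                           # known noise s.d. (assumed)
mu0   <- 0.0; tau0 <- 2.0              # prior mean and s.d. (assumed)

N      <- length(x)
prec_N <- 1 / tau0^2 + N / sigma^2                         # posterior precision
mu_N   <- (mu0 / tau0^2 + N * mean(x) / sigma^2) / prec_N  # posterior mean
c(posterior_mean = mu_N, posterior_sd = sqrt(1 / prec_N))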
Conjugate Priors: Posterior Mean

$\mu_N = \frac{\mu_0/\tau_0^2 + N\bar{x}/\sigma^2}{1/\tau_0^2 + N/\sigma^2}$

- A weighted average of the sample mean and the prior mean
- The component having less uncertainty gets more weight
- As N increases, the weight of the prior mean decreases
Conjugate Priors: Posterior Precision

$\frac{1}{\tau_N^2} = \frac{1}{\tau_0^2} + \frac{N}{\sigma^2}$

- The sum of the precision of the sample mean and the prior precision
- Always greater than the prior precision, even with poor quality data
- For large N, dominated by the precision of the sample mean
Non-informative Priors
- A prior which contains no information about $\theta$
- Used when no or minimal prior information is available
- Often such priors are Improper Priors (infinite mass)
- Priors being improper is not an issue per se, as long as the posteriors are well-defined densities
- Often a Uniform distribution is used as a non-informative prior
Jeffreys Prior
Jeffreys (1961) suggested the following prior, which is commonly used:

$p(\theta) \propto \sqrt{\det I(\theta)}$

where $I(\theta)$ is the Fisher Information Matrix:

$I_{ij}(\theta) = -E\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right]$
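A standard worked example (not spelled out on the slide): for a Bernoulli likelihood with parameter $\theta$, the Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is

$p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$

i.e., a Beta(1/2, 1/2) distribution, which happens to be proper here even though Jeffreys priors are often improper.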
Computation of Posterior Distribution
Maximum A Posteriori Estimation
Monte Carlo Simulations
Variational Methods
Maximum A Posteriori (MAP) Estimation
- Find the parameter values which maximize the Posterior distribution:

$\hat{\theta}_{MAP} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, p(D \mid \theta)\, p(\theta)$

- This amounts to maximizing the numerator of the Bayes formula, since $p(D)$ does not depend on $\theta$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

$y_i = \beta^T x_i + \epsilon_i, \quad i = 1, \ldots, N, \qquad \epsilon_i \sim N(0, \sigma^2)$

- The $x_i$ are treated as deterministic (exogenous) variables
- The model can be rewritten as

$p(y_i \mid x_i, \beta, \sigma^2) = N(y_i;\ \beta^T x_i,\ \sigma^2)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

Likelihood function:

$p(y \mid X, \beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\,(y - X\beta)^T(y - X\beta)\right)$

Prior distribution for $\beta$: $p(\beta) = N(\beta;\ 0,\ \tau^2 I)$, i.e.

$p(\beta) \propto \exp\left(-\frac{1}{2\tau^2}\,\beta^T\beta\right)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

Posterior distribution:

$p(\beta \mid y, X) \propto \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right) \exp\left(-\frac{1}{2\tau^2}\beta^T\beta\right)$

Rearranging terms, this is again a Gaussian density in $\beta$, of the form

$\exp\left(-\frac{1}{2}(\beta - \mu_N)^T \Sigma_N^{-1}(\beta - \mu_N)\right)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

MAP estimation of $\beta$: set $\frac{\partial}{\partial \beta} \log p(\beta \mid y, X) = 0$:

$\frac{1}{\sigma^2}\, X^T (y - X\beta) - \frac{1}{\tau^2}\, \beta = 0$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

MAP estimate of $\beta$ (Bayesian Shrinkage):

$\hat{\beta}_{MAP} = \left(X^T X + \frac{\sigma^2}{\tau^2}\, I\right)^{-1} X^T y$

The prior shrinks the coefficient estimates toward zero, as in ridge regression.
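A minimal R sketch of this MAP estimate, reusing x and y from the earlier MLE sketch; the prior variance tau2 is an assumption:

# MAP (ridge) estimate: beta_hat = (X'X + (sigma^2/tau^2) I)^{-1} X'y
X      <- cbind(1, x)          # design matrix with intercept column
sigma2 <- 25.0                 # noise variance, from the synthetic model
tau2   <- 100.0                # prior variance on beta (assumed)
lambda <- sigma2 / tau2
beta_map <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
beta_map                       # shrunk toward zero relative to the MLE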
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

Data:
- Atmospheric CO2 data collected by the Mauna Loa Observatory (1959-2012)
- Temperature Anomaly data collected by NASA and NOAA (1880-2012)

The task is to predict the likely temperature rise when CO2 levels reach 500 ppm, from the current 400 ppm.
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

Hypothesis: Temperature Anomaly $\propto$ CO2 Concentration

Domain Knowledge:
- Doubling the CO2 concentration would result in a temperature increase of 1-4 °C
- During the period 1960-2000, the CO2 concentration increased by 52 ppm and the temperature increased by 0.44 °C
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

- From the data for the period 1960-1980: Temp. Anomaly $\approx -2.3793 + 0.0073 \times$ CO2, with standard errors of roughly 0.9924 and 0.0030 for the intercept and slope
- Use this as prior information
- Use the data for the period 1981-2007 as training data
- Predict the Temperature Anomaly for the period 2008-2012
Maximum A Posteriori Estimation (MAP)
Prior distribution:

$p(\beta) = N(\beta;\ \mu_0, \Sigma_0) \propto \exp\left(-\frac{1}{2}(\beta - \mu_0)^T \Sigma_0^{-1} (\beta - \mu_0)\right)$

$\mu_0 = (-3.27,\ 0.01)^T, \qquad \Sigma_0 = \begin{pmatrix} 9 \times 10^{0} & 0 \\ 0 & 81 \times 10^{-6} \end{pmatrix}$
Maximum A Posteriori Estimation (MAP)
[Figure: the prior distribution over the regression coefficients]
Maximum A Posteriori Estimation (MAP)
[Figure: two panels comparing predictions, one using the Prior and one using the MLE]
Maximum A Posteriori Estimation (MAP)
[Figure: prediction using MAP]
Monte-Carlo Simulations
- Often the Posterior distribution is not analytically tractable:
  - No expressions for the mean, variance, etc.
  - No closed form for the marginal distributions
- Solution:
  - Draw samples from the Posterior
  - Using the samples, compute the mean, variance, confidence intervals, etc.
  - Use Markov Chain Monte-Carlo to generate the samples
Monte-Carlo Simulations
Markov Chain Monte-Carlo (MCMC) Simulations
- Let $\theta = (\theta_1, \ldots, \theta_M)$ be the parameters for which the posterior distribution needs to be computed
- Consider the case where the parameters are discrete, with K states for each $\theta_i$, $i = 1, \ldots, M$
- Set up a Markov Process with $K^M$ states and transition probability $T(\theta^{(t+1)} \mid \theta^{(t)})$ such that the steady state corresponds to the Posterior $p(\theta \mid D)$
Monte-Carlo Simulations: Markov Chain Monte-Carlo
1. Metropolis-Hastings Algorithm
  i. Let $\theta^{(t)}$ be the state of the system at time t
  ii. Generate a candidate state $\theta^*$ for time t+1 by drawing from a proposal distribution $q(\theta^* \mid \theta^{(t)})$
  iii. Accept the proposed move, setting $\theta^{(t+1)} = \theta^*$, with probability

$A = \min\left(1,\ \frac{p(\theta^* \mid D)\, q(\theta^{(t)} \mid \theta^*)}{p(\theta^{(t)} \mid D)\, q(\theta^* \mid \theta^{(t)})}\right)$

  iv. If the proposal is rejected, $\theta^{(t+1)} = \theta^{(t)}$
  v. Continue until the distribution converges to a steady state
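A minimal R sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (the target density, proposal width, and chain length are assumptions; with a symmetric proposal the q terms cancel):

# Target: log of an unnormalized posterior density (assumed standard normal here)
log_target <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)

metropolis_hastings <- function(n_iter = 10000, step = 1.0, theta0 = 0) {
  chain <- numeric(n_iter)
  theta <- theta0
  for (t in 1:n_iter) {
    proposal <- rnorm(1, mean = theta, sd = step)      # symmetric random-walk proposal
    log_A <- log_target(proposal) - log_target(theta)  # q terms cancel for symmetric q
    if (log(runif(1)) < log_A) theta <- proposal       # accept with probability min(1, A)
    chain[t] <- theta                                  # on rejection, keep the current state
  }
  chain
}

samples <- metropolis_hastings()
mean(samples); var(samples)   # should approximate 0 and 1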
Monte-Carlo Simulations: Markov Chain Monte-Carlo
1. Metropolis-Hastings Algorithm
  i. A hill-climbing type of algorithm
  ii. Very generic; can be used for most posterior distributions
  iii. The proposal distribution needs to be chosen carefully to avoid:
    - a large number of rejections
    - slow convergence to the steady state
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
  i. Start with an initial state $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_M^{(0)})$
  ii. At each step, update the components one by one by drawing from a distribution conditional on the most recent values of the rest of the components:

$\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_M^{(t)})$
$\theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_M^{(t)})$
$\vdots$
$\theta_M^{(t+1)} \sim p(\theta_M \mid \theta_1^{(t+1)}, \ldots, \theta_{M-1}^{(t+1)})$

  iii. After M steps, all the components of the parameter will be updated
  iv. Continue until the distribution converges to a steady state
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
  i. A very efficient algorithm, since there are no rejections
  ii. Commonly used for practical applications
  iii. The conditional distributions should be known
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
Example: Posterior ~ Bivariate Normal

$p(x;\ \mu, \Sigma) = \frac{1}{2\pi\sqrt{\det\Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

$x = (x_1, x_2), \quad \mu = (\mu_1, \mu_2), \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
Example: Posterior ~ Bivariate Normal

Conditional density:

$p(x_1 \mid x_2) = N\!\left(x_1;\ \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2),\ (1 - \rho^2)\,\sigma_1^2\right)$

and similarly for $p(x_2 \mid x_1)$.
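A minimal R sketch of Gibbs sampling from this bivariate normal, alternating draws from the two exact conditionals (the parameter values are assumptions):

# Gibbs sampler for a bivariate normal using the exact conditionals
mu1 <- 0; mu2 <- 0; s1 <- 1; s2 <- 1; rho <- 0.8   # assumed parameters
n_iter <- 10000
x1 <- numeric(n_iter); x2 <- numeric(n_iter)
x2_cur <- 0                                         # initial state
for (t in 1:n_iter) {
  # x1 | x2 ~ N(mu1 + rho*(s1/s2)*(x2 - mu2), (1 - rho^2)*s1^2)
  x1_cur <- rnorm(1, mu1 + rho * s1 / s2 * (x2_cur - mu2), sqrt(1 - rho^2) * s1)
  # x2 | x1 ~ N(mu2 + rho*(s2/s1)*(x1 - mu1), (1 - rho^2)*s2^2)
  x2_cur <- rnorm(1, mu2 + rho * s2 / s1 * (x1_cur - mu1), sqrt(1 - rho^2) * s2)
  x1[t] <- x1_cur; x2[t] <- x2_cur
}
cor(x1, x2)   # should approximate rho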
Machine Learning Using Bayesian Inference
Machine Learning using Bayesian Inference
Example: Clustering
- Consider a set of N points $x_1, \ldots, x_N$ in D dimensions
- The goal is to partition the data set into K clusters such that distances between points within a cluster are small compared to distances between points in different clusters
- Let $\mu_k$, $k = 1, \ldots, K$, be D-dimensional vectors representing the centers of each cluster
- Let $r_{nk}$ be the indicator function: $r_{nk} = 1$ if point n is in cluster k, $r_{nk} = 0$ otherwise
Clustering
- Define an objective function (distortion function) representing the sum of squared distances of each data point to the center of the cluster it belongs to:

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2$

- The task is to find the $r_{nk}$ and $\mu_k$ which minimize $J$
Clustering
- Iterative procedure for finding $r_{nk}$ and $\mu_k$ (a sketch in R follows below):
  - Start with initial values for $\mu_k$
  - Minimize $J$ w.r.t. $r_{nk}$, keeping $\mu_k$ fixed (assign each point to its nearest center)
  - Minimize $J$ w.r.t. $\mu_k$, keeping $r_{nk}$ fixed (move each center to the mean of its assigned points)
- This is the K-Means algorithm
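A minimal R sketch of this two-step iteration (the data set and K are assumptions; base R also provides kmeans() for production use):

# K-means: alternate assignment and mean-update steps to minimize J
set.seed(1)
X <- rbind(matrix(rnorm(100, 0), ncol = 2),      # assumed synthetic data:
           matrix(rnorm(100, 4), ncol = 2))      # two well-separated blobs
K <- 2
mu <- X[sample(nrow(X), K), , drop = FALSE]      # initialize centers at random points
repeat {
  # Step 1: minimize J w.r.t. r_nk (assign each point to the nearest center)
  d <- sapply(1:K, function(k) rowSums((X - matrix(mu[k, ], nrow(X), 2, byrow = TRUE))^2))
  z <- max.col(-d)                               # index of the nearest center
  # Step 2: minimize J w.r.t. mu_k (move each center to the mean of its points)
  mu_new <- t(sapply(1:K, function(k) colMeans(X[z == k, , drop = FALSE])))
  if (max(abs(mu_new - mu)) < 1e-8) break        # stop when the centers no longer move
  mu <- mu_new
}
mu   # estimated cluster centers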
Gaussian Mixture Model

$p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1$

where the $\pi_k$ are the mixing coefficients and $\mu_k$, $\Sigma_k$ are the component means and covariances.
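A minimal R sketch of a one-dimensional version of this density (all parameter values are assumptions):

# 1-D Gaussian mixture density: p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)
pi_k    <- c(0.4, 0.6)           # mixing coefficients, sum to 1 (assumed)
mu_k    <- c(-2, 3)              # component means (assumed)
sigma_k <- c(1.0, 1.5)           # component standard deviations (assumed)

dmix <- function(x) {
  rowSums(sapply(1:2, function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k])))
}
dmix(c(-2, 0, 3))   # mixture density evaluated at a few points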
Gaussian Mixture Model
Maximum Likelihood Estimate: Computational Issues
- Presence of singularities:
  - Let $\Sigma_k = \sigma_k^2 I$, and suppose one of the components has its mean exactly at a data point, $\mu_j = x_n$. This term contributes

$N(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}}\,\frac{1}{\sigma_j}$

  - During maximization of the likelihood, this term becomes singular as $\sigma_j \to 0$
- Why does this issue not arise for a single Gaussian distribution?
Gaussian Mixture Model
Maximum Likelihood Estimate: Computational Issues
Exercise: Why does the singularity issue not arise for a single Gaussian distribution?
Expectation Maximization (EM) Algorithm
- Dempster et al. (1977)
- An elegant method for finding the MLE for models with latent variables
- Taking the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ w.r.t. $\mu_k$ and equating it to zero gives

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

where $\gamma(z_{nk})$ is the responsibility of component k for point n.
Expectation Maximization (EM) Algorithm
- Taking the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ w.r.t. $\Sigma_k$ and equating it to zero gives

$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^T$

- This does not yield a closed-form solution, since $\gamma(z_{nk})$ depends on $\pi$, $\mu$, and $\Sigma$
Expectation Maximization (EM) Algorithm
Iterative Solution:
1. Initialize the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, and evaluate the initial value of the log likelihood
2. E Step: evaluate the responsibilities using the current parameter values:

$\gamma(z_{nk}) = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(x_n \mid \mu_j, \Sigma_j)}$
EM Algorithm in a Bayesian Setting
- Suppose we know the values of the latent variables $Z$ in addition to the observed data $X$
- Consider the problem of maximizing the likelihood of the complete data set $\{X, Z\}$:

$p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[\pi_k\, N(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}}$
EM Algorithm in a Bayesian Setting
- Log likelihood of the complete data set $\{X, Z\}$:

$\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[\ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right]$

- The logarithm acts directly on the Normal distribution, giving a simpler equation for the MLE
- For maximization w.r.t. $\mu_k$ and $\Sigma_k$, the expression is a sum of K independent terms
- For maximization w.r.t. $\pi_k$ there is a coupling, since $\sum_k \pi_k = 1$; using a Lagrange multiplier gives

$\pi_k = \frac{1}{N} \sum_{n=1}^{N} z_{nk}$
EM Algorithm in a Bayesian Setting
- The log likelihood of the complete data set $\{X, Z\}$ is easy to maximize
- However, $Z$ is usually unknown; take the expectation of the complete-data log likelihood using the posterior distribution of $Z$:

$E_Z\!\left[\ln p(X, Z \mid \pi, \mu, \Sigma)\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[\ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right]$
EM Algorithm in a Bayesian Setting
1. Initialize the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, and evaluate the initial value of the log likelihood
2. E Step: use these values to compute the responsibilities $\gamma(z_{nk})$
3. M Step: keeping the responsibilities fixed, maximize $E_Z[\ln p(X, Z \mid \pi, \mu, \Sigma)]$ w.r.t. $\pi_k$, $\mu_k$, and $\Sigma_k$:

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N}$

4. Alternate E and M steps until the log likelihood converges (a sketch in R follows below)
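A minimal R sketch of these EM updates for a one-dimensional, two-component mixture (the synthetic data and initialization are assumptions):

# EM for a 1-D Gaussian mixture with K = 2 components
set.seed(7)
x <- c(rnorm(150, -2, 1), rnorm(250, 3, 1.5))           # assumed synthetic data
K <- 2; N <- length(x)
pi_k <- rep(1/K, K); mu_k <- c(-1, 1); s_k <- c(1, 1)   # initialization (assumed)

for (iter in 1:200) {
  # E step: responsibilities resp[n, k]
  dens <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu_k[k], s_k[k]))
  resp <- dens / rowSums(dens)
  # M step: update parameters using the effective counts N_k
  N_k  <- colSums(resp)
  mu_k <- colSums(resp * x) / N_k
  s_k  <- sqrt(colSums(resp * (x - matrix(mu_k, N, K, byrow = TRUE))^2) / N_k)
  pi_k <- N_k / N
}
rbind(pi = pi_k, mu = mu_k, sigma = s_k)   # fitted mixture parameters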
General Form of the EM Algorithm in a Bayesian Setting
- For any choice of the distribution $q(Z)$, the following decomposition holds:

$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + KL(q\,\|\,p)$

$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}, \qquad KL(q\,\|\,p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}$

- $KL(q\,\|\,p)$ is the Kullback-Leibler divergence between $q(Z)$ and the posterior $p(Z \mid X, \theta)$
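A one-line check of this decomposition (standard, not spelled out on the slide): writing $p(X, Z \mid \theta) = p(Z \mid X, \theta)\, p(X \mid \theta)$ inside $\mathcal{L}(q, \theta)$,

$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)\, p(X \mid \theta)}{q(Z)} = \ln p(X \mid \theta) - KL(q\,\|\,p)$

since $\sum_Z q(Z) = 1$. The E step sets $q(Z) = p(Z \mid X, \theta^{old})$, driving the KL term to zero; the M step then maximizes $\mathcal{L}(q, \theta)$ w.r.t. $\theta$.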
General Form of the EM Algorithm in a Bayesian Setting
[Figure: graphical interpretation of the EM Algorithm]
References
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer.
2. Giovanni Petris, Dynamic Linear Models with R, Springer.
Email: [email protected]
Twitter: @hari_koduvely