Bayesian Inference in Machine Learning
ACM Compute 2013
Tutorial Contents
I. Introduction to Machine Learning
II. Bayesian Inference
  1. Maximum Likelihood Estimation
  2. Bayesian Approach
  3. Estimation of Posterior Distribution
III. Machine Learning using Bayesian Inference
  1. Clustering
  2. Latent Variables
  3. Expectation Maximization Algorithm
  4. Variational Methods
Introduction to Machine Learning
Machine Learning is a subject in Artificial Intelligence, at the intersection of Computer Science and Statistics, dealing with:
- Finding non-obvious patterns in data
- Learning complex relationships among variables
- Predicting the outcomes when some variables undergo changes
Common Tasks in Machine Learning
Clustering: Identifying natural groups (clusters) in the data and assigning each data point to a particular cluster.
Classification: Assigning new observations to a set of predefined categories (classes) by learning from a training data set.
Prediction: Estimating the likelihood of purchase, probability of failure, odds ratio of a win, etc.
Pattern Mining: Extracting non-obvious patterns and rules having significant support and confidence from data.
Dimensionality Reduction: Reducing the number of variables by finding a subset (features) which contains most of the relevant information (useful for visualization).
Forecasting: Predicting future values in a time series.
Machine Learning Algorithms
- K-Means Algorithm
- Decision Trees
- Support Vector Machines
- Neural Networks
- Logistic Regression
- Bayesian Networks
- Apriori Algorithm
- Principal Component Analysis
- Gibbs Sampling
Machine Learning Algorithms
Decision Trees for Classification: Divide and Conquer Approach
[Figure: scatter plot of labeled points in the (x, y) plane, recursively partitioned by the split lines x = x1, x = x2, x = x3, y = y1, and y = y2]
Machine Learning Algorithms
Decision Trees for Classification: Divide and Conquer Approach
[Figure: decision tree with a root split on X < X1 vs. X > X1, second-level splits on Y < Y1 / Y > Y1 and Y < Y2 / Y > Y2, third-level splits on X < X2 / X > X2 and X < X3 / X > X3, and leaf labels Red, Red, Blue, Red, Blue]
Classification of Machine Learning Algorithms
Based on Data Type
1. Supervised: when there are labeled data sets for training (e.g., Classification, Regression)
2. Unsupervised: no labeled data sets for training (e.g., Clustering)

Based on Model Type
1. Parametric: a finite number of parameters describing the model (e.g., regression coefficients)
2. Non-Parametric: an effectively infinite number of parameters describing the model; makes no assumption on the distribution of the data (e.g., rank ordering)
Tools for Machine Learning
Open Source
1. R
2. Mahout (for Big Data)
3. Weka
4. SciKit (Python)
Licensed
1. SAS
2. SPSS
3. KXEN
Maximum Likelihood Estimation
Workhorse of Machine Learning
The objective is to estimate the model parameters that maximize the probability of the observed data given a model.

MLE Steps:
- Data: i.i.d. observations $x_1, x_2, \ldots, x_N$
- Choose the model parameters $\theta$ which maximize the Likelihood Function $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$

$\hat{\theta}_{MLE} = \arg\max_\theta \log L(\theta) = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid \theta)$
MLE Example using R
Synthetic Data Set:
- Linear model: $y = \beta_0 + \beta_1 x + \epsilon$, with $\beta_0 = 10.0$, $\beta_1 = 2.0$
- Noise: $\epsilon \sim N(0, 25.0)$
MLE Example using R
MLE Solution, case when $\sigma$ is known:
$\hat{\beta}_0 = 9.8342$, $\hat{\beta}_1 = 1.9653$

MLE Solution, case when $\sigma$ is unknown:
$\hat{\beta}_0 = 6.7631$, $\hat{\beta}_1 = 1.9993$, $\hat{\sigma}^2 = 43.61$
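A minimal R sketch of this example (the sample size, seed, and range of x are assumptions; the estimates above came from the original data set, so the numbers produced below will differ):

# Generate synthetic data: y = 10 + 2x + noise, noise ~ N(0, sd = 5)
set.seed(42)                      # assumed seed, for reproducibility
n <- 100                          # assumed sample size
x <- runif(n, 0, 10)              # assumed range for x
y <- 10.0 + 2.0 * x + rnorm(n, mean = 0, sd = 5)

# Negative log-likelihood of the linear-Gaussian model
negloglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- exp(par[3])  # log-scale keeps sigma > 0
  -sum(dnorm(y, mean = beta0 + beta1 * x, sd = sigma, log = TRUE))
}

# Maximize the likelihood by minimizing the negative log-likelihood
fit <- optim(c(0, 0, 0), negloglik)
c(beta0 = fit$par[1], beta1 = fit$par[2], sigma2 = exp(fit$par[3])^2)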
Maximum Likelihood Estimation
Limitations:
- Point estimates for model parameters
- Needs regularization to avoid model overfit
- Difficult to incorporate domain knowledge
- No provision for incremental updates in online/sequential learning
Bayesian Methods
- Bayesian Methods provide a framework to reason coherently about the world in the presence of uncertainty
- Based on a theorem by Rev. Thomas Bayes (1701-1761)
- Independently discovered and popularized by Laplace (1749-1827)
- The core approach in Bayesian Methods is to:
  - Start with a Belief (Hypothesis) about a problem and a Prior Degree of Belief
  - Update the Degree of Belief in the Hypothesis as more Evidence (Data) gathers
- Different from the Frequentist approach, where probability is a measure of the proportion of outcomes
- Analogous to how human beings learn about the world
Bayesian Approach
- In the Bayesian approach, probability is considered as a Degree of Belief
- Starting point: a Hypothesis representing existing knowledge or belief
- Update the Hypothesis in the event of observing new data, using Bayes' Theorem
Bayes' Theorem:

$P(\text{Hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Hypothesis}) \cdot P(\text{Hypothesis})}{P(\text{Data})}$

Posterior $\propto$ Likelihood $\times$ Prior
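As a quick illustration of the update (a hypothetical numeric example, not from the original slides): take a prior $P(H) = 0.3$, a likelihood $P(D \mid H) = 0.8$, and $P(D \mid \neg H) = 0.4$. Then

$P(D) = 0.8 \times 0.3 + 0.4 \times 0.7 = 0.52, \qquad P(H \mid D) = \frac{0.8 \times 0.3}{0.52} \approx 0.46$

so observing D raises the degree of belief in H from 0.30 to about 0.46.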
Bayesian Approach
- Prediction for a new data point $x_{new}$ is made by using the updated posterior distribution:

$p(x_{new} \mid \text{Data}) = \int p(x_{new} \mid \theta)\, p(\theta \mid \text{Data})\, d\theta$
Bayesian Approach
Advantages:
- A distribution over parameters, capturing uncertainty in parameter estimation
- Marginalization over the distribution of parameters avoids overfitting; no need for regularization
- A framework to incorporate domain knowledge through the Prior distribution
- A framework to address model uncertainties
- A natural framework for online/sequential learning
- Modeling can start with very little data
Challenges in Implementation
- Estimation of the Posterior distribution is not trivial, since P(Data) often cannot be computed
- Choosing the right prior
- Computationally more intensive
Choosing the Right Prior
General Guidelines in choosing a Prior:
1. Justify assumptions and evaluate their plausibility in view of what is known
2. Explore the sensitivity of the results of the analysis to the assumptions on the Prior
Different Types of Priors:
1. Conjugate Priors
2. Non-informative Priors
3. Jeffreys Prior
4. Subjective Priors
Conjugate Priors
- Data: i.i.d. observations $x_1, x_2, \ldots, x_N$
- Model: $x_i \mid \theta \sim N(\theta, \sigma^2)$, with $\sigma^2$ known
- Prior: $\theta \sim N(\mu_0, \tau_0^2)$
- Posterior:

$p(\theta \mid x) \propto \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right) \times \frac{1}{\sqrt{2\pi}\,\tau_0}\exp\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right)$
Conjugate Priors
After rearranging (completing the square in $\theta$):

$p(\theta \mid x) \propto \exp\left(-\frac{1}{2}\left[\frac{1}{\tau_0^2} + \frac{N}{\sigma^2}\right]\theta^2 + \left[\frac{\mu_0}{\tau_0^2} + \frac{N\bar{x}}{\sigma^2}\right]\theta\right)$

which is again the kernel of a Normal distribution in $\theta$.
Conjugate Priors
: ~ ,
/ /
/
1
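A minimal R sketch of this conjugate normal-normal update (the observations and prior values below are assumptions for illustration):

# Normal likelihood with known sigma, Normal prior on the mean theta
x     <- c(4.2, 5.1, 6.3, 4.8, 5.6)   # assumed observations
sigma <- 1.0                           # known noise s.d. (assumed)
mu0   <- 0.0; tau0 <- 2.0              # prior mean and s.d. (assumed)

N      <- length(x)
prec_N <- 1 / tau0^2 + N / sigma^2                         # posterior precision
mu_N   <- (mu0 / tau0^2 + N * mean(x) / sigma^2) / prec_N  # posterior mean
c(posterior_mean = mu_N, posterior_sd = sqrt(1 / prec_N))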
Conjugate Priors: Posterior Mean

$\mu_N = \frac{\mu_0/\tau_0^2 + N\bar{x}/\sigma^2}{1/\tau_0^2 + N/\sigma^2}$

- A weighted average of the sample mean and the prior mean
- The component having less uncertainty gets more weight
- As N increases, the weight of the prior mean decreases
Conjugate Priors: Posterior Precision

$\frac{1}{\tau_N^2} = \frac{1}{\tau_0^2} + \frac{N}{\sigma^2}$

- The sum of the precision of the sample mean and the prior precision
- Always greater than the prior precision, even with poor quality data
- For large N, dominated by the precision of the sample mean
Non-informative Priors
- A prior which contains no information about $\theta$
- Used when no or minimal prior information is available
- Often such priors are Improper Priors (infinite mass)
- Priors being improper is not an issue per se, as long as the posteriors are well-defined densities
- Often a Uniform distribution is used as a non-informative prior
Jeffreys Prior
Jeffreys (1961) suggested the following prior, which is commonly used:

$p(\theta) \propto \sqrt{\det I(\theta)}$

where $I(\theta)$ is the Fisher Information Matrix:

$I_{ij}(\theta) = -E\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}\right]$
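A standard worked example (not spelled out on the slide): for a Bernoulli likelihood with parameter $\theta$, the Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is

$p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$

i.e., a Beta(1/2, 1/2) distribution, which happens to be proper here even though Jeffreys priors are often improper.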
Computation of Posterior Distribution
Maximum A Posteriori Estimation
Monte Carlo Simulations
Variational Methods
Maximum A Posteriori (MAP) Estimation
- Find the parameter values which maximize the Posterior distribution:

$\hat{\theta}_{MAP} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, p(D \mid \theta)\, p(\theta)$

- This amounts to maximizing the numerator of the Bayes formula, since $p(D)$ does not depend on $\theta$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

$y_i = \beta^T x_i + \epsilon_i, \quad i = 1, \ldots, N, \qquad \epsilon_i \sim N(0, \sigma^2)$

- The $x_i$ are treated as deterministic (exogenous) variables
- The model can be rewritten as

$p(y_i \mid x_i, \beta, \sigma^2) = N(y_i;\ \beta^T x_i,\ \sigma^2)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

Likelihood function:

$p(y \mid X, \beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\,(y - X\beta)^T(y - X\beta)\right)$

Prior distribution for $\beta$: $p(\beta) = N(\beta;\ 0,\ \tau^2 I)$, i.e.

$p(\beta) \propto \exp\left(-\frac{1}{2\tau^2}\,\beta^T\beta\right)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

Posterior distribution:

$p(\beta \mid y, X) \propto \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right) \exp\left(-\frac{1}{2\tau^2}\beta^T\beta\right)$

Rearranging terms, this is again a Gaussian density in $\beta$, of the form

$\exp\left(-\frac{1}{2}(\beta - \mu_N)^T \Sigma_N^{-1}(\beta - \mu_N)\right)$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

MAP estimation of $\beta$: set $\frac{\partial}{\partial \beta} \log p(\beta \mid y, X) = 0$:

$\frac{1}{\sigma^2}\, X^T (y - X\beta) - \frac{1}{\tau^2}\, \beta = 0$
Maximum A Posteriori Estimation (MAP)
Example: Linear Regression

MAP estimate of $\beta$ (Bayesian Shrinkage):

$\hat{\beta}_{MAP} = \left(X^T X + \frac{\sigma^2}{\tau^2}\, I\right)^{-1} X^T y$

The prior shrinks the coefficient estimates toward zero, as in ridge regression.
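A minimal R sketch of this MAP estimate, reusing x and y from the earlier MLE sketch; the prior variance tau2 is an assumption:

# MAP (ridge) estimate: beta_hat = (X'X + (sigma^2/tau^2) I)^{-1} X'y
X      <- cbind(1, x)          # design matrix with intercept column
sigma2 <- 25.0                 # noise variance, from the synthetic model
tau2   <- 100.0                # prior variance on beta (assumed)
lambda <- sigma2 / tau2
beta_map <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
beta_map                       # shrunk toward zero relative to the MLE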
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

Data:
- Atmospheric CO2 data collected by the Mauna Loa Observatory (1959-2012)
- Temperature Anomaly data collected by NASA and NOAA (1880-2012)

The task is to predict the likely temperature rise when CO2 levels reach 500 ppm, from the current 400 ppm.
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

Hypothesis: Temperature Anomaly $\propto$ CO2 Concentration

Domain Knowledge:
- Doubling the CO2 concentration would result in a temperature increase of 1-4 °C
- During the period 1960-2000, the CO2 concentration increased by 52 ppm and the temperature increased by 0.44 °C
Maximum A Posteriori Estimation (MAP)
Example: Prediction of Global Warming by Greenhouse Gases

- From the data for the period 1960-1980: Temp. Anomaly $\approx -2.3793 + 0.0073 \times$ CO2, with standard errors of roughly 0.9924 and 0.0030 for the intercept and slope
- Use this as prior information
- Use the data for the period 1981-2007 as training data
- Predict the Temperature Anomaly for the period 2008-2012
Maximum A Posteriori Estimation (MAP)
Prior distribution:

$p(\beta) = N(\beta;\ \mu_0, \Sigma_0) \propto \exp\left(-\frac{1}{2}(\beta - \mu_0)^T \Sigma_0^{-1} (\beta - \mu_0)\right)$

$\mu_0 = (-3.27,\ 0.01)^T, \qquad \Sigma_0 = \begin{pmatrix} 9 \times 10^{0} & 0 \\ 0 & 81 \times 10^{-6} \end{pmatrix}$
Maximum A Posteriori Estimation (MAP)
[Figure: the prior distribution over the regression coefficients]
Maximum A Posteriori Estimation (MAP)
[Figure: two panels comparing predictions, one using the Prior and one using the MLE]
Maximum A Posteriori Estimation (MAP)
[Figure: prediction using MAP]
Monte-Carlo Simulations
- Often the Posterior distribution is not analytically tractable:
  - No expressions for the mean, variance, etc.
  - No closed form for the marginal distributions
- Solution:
  - Draw samples from the Posterior
  - Using the samples, compute the mean, variance, confidence intervals, etc.
  - Use Markov Chain Monte-Carlo to generate the samples
Monte-Carlo Simulations
Markov Chain Monte-Carlo (MCMC) Simulations
- Let $\theta = (\theta_1, \ldots, \theta_M)$ be the parameters for which the posterior distribution needs to be computed
- Consider the case where the parameters are discrete, with K states for each $\theta_i$, $i = 1, \ldots, M$
- Set up a Markov Process with $K^M$ states and transition probability $T(\theta^{(t+1)} \mid \theta^{(t)})$ such that the steady state corresponds to the Posterior $p(\theta \mid D)$
Monte-Carlo Simulations: Markov Chain Monte-Carlo
1. Metropolis-Hastings Algorithm
  i. Let $\theta^{(t)}$ be the state of the system at time t
  ii. Generate a candidate state $\theta^*$ for time t+1 by drawing from a proposal distribution $q(\theta^* \mid \theta^{(t)})$
  iii. Accept the proposed move, setting $\theta^{(t+1)} = \theta^*$, with probability

$A = \min\left(1,\ \frac{p(\theta^* \mid D)\, q(\theta^{(t)} \mid \theta^*)}{p(\theta^{(t)} \mid D)\, q(\theta^* \mid \theta^{(t)})}\right)$

  iv. If the proposal is rejected, $\theta^{(t+1)} = \theta^{(t)}$
  v. Continue until the distribution converges to a steady state
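A minimal R sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal (the target density, proposal width, and chain length are assumptions; with a symmetric proposal the q terms cancel):

# Target: log of an unnormalized posterior density (assumed standard normal here)
log_target <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)

metropolis_hastings <- function(n_iter = 10000, step = 1.0, theta0 = 0) {
  chain <- numeric(n_iter)
  theta <- theta0
  for (t in 1:n_iter) {
    proposal <- rnorm(1, mean = theta, sd = step)      # symmetric random-walk proposal
    log_A <- log_target(proposal) - log_target(theta)  # q terms cancel for symmetric q
    if (log(runif(1)) < log_A) theta <- proposal       # accept with probability min(1, A)
    chain[t] <- theta                                  # on rejection, keep the current state
  }
  chain
}

samples <- metropolis_hastings()
mean(samples); var(samples)   # should approximate 0 and 1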
Monte-Carlo Simulations: Markov Chain Monte-Carlo
1. Metropolis-Hastings Algorithm
  i. A hill-climbing type of algorithm
  ii. Very generic; can be used for most posterior distributions
  iii. The proposal distribution needs to be chosen carefully to avoid:
    - a large number of rejections
    - slow convergence to the steady state
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
  i. Start with an initial state $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_M^{(0)})$
  ii. At each step, update the components one by one by drawing from a distribution conditional on the most recent values of the rest of the components:

$\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_M^{(t)})$
$\theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_M^{(t)})$
$\vdots$
$\theta_M^{(t+1)} \sim p(\theta_M \mid \theta_1^{(t+1)}, \ldots, \theta_{M-1}^{(t+1)})$

  iii. After M steps, all the components of the parameter will be updated
  iv. Continue until the distribution converges to a steady state
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
  i. A very efficient algorithm, since there are no rejections
  ii. Commonly used for practical applications
  iii. The conditional distributions should be known
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
Example: Posterior ~ Bivariate Normal

$p(x;\ \mu, \Sigma) = \frac{1}{2\pi\sqrt{\det\Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

$x = (x_1, x_2), \quad \mu = (\mu_1, \mu_2), \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$
Monte-Carlo Simulations: Markov Chain Monte-Carlo
2. Gibbs Sampling
Example: Posterior ~ Bivariate Normal

Conditional density:

$p(x_1 \mid x_2) = N\!\left(x_1;\ \mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2),\ (1 - \rho^2)\,\sigma_1^2\right)$

and similarly for $p(x_2 \mid x_1)$.
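A minimal R sketch of Gibbs sampling from this bivariate normal, alternating draws from the two exact conditionals (the parameter values are assumptions):

# Gibbs sampler for a bivariate normal using the exact conditionals
mu1 <- 0; mu2 <- 0; s1 <- 1; s2 <- 1; rho <- 0.8   # assumed parameters
n_iter <- 10000
x1 <- numeric(n_iter); x2 <- numeric(n_iter)
x2_cur <- 0                                         # initial state
for (t in 1:n_iter) {
  # x1 | x2 ~ N(mu1 + rho*(s1/s2)*(x2 - mu2), (1 - rho^2)*s1^2)
  x1_cur <- rnorm(1, mu1 + rho * s1 / s2 * (x2_cur - mu2), sqrt(1 - rho^2) * s1)
  # x2 | x1 ~ N(mu2 + rho*(s2/s1)*(x1 - mu1), (1 - rho^2)*s2^2)
  x2_cur <- rnorm(1, mu2 + rho * s2 / s1 * (x1_cur - mu1), sqrt(1 - rho^2) * s2)
  x1[t] <- x1_cur; x2[t] <- x2_cur
}
cor(x1, x2)   # should approximate rho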
Machine Learning Using Bayesian Inference
Machine Learning using Bayesian Inference
Example: Clustering
- Consider a set of N points $x_1, \ldots, x_N$ in D dimensions
- The goal is to partition the data set into K clusters such that distances between points within a cluster are small compared to distances between points in different clusters
- Let $\mu_k$, $k = 1, \ldots, K$, be D-dimensional vectors representing the centers of each cluster
- Let $r_{nk}$ be the indicator function: $r_{nk} = 1$ if point n is in cluster k, $r_{nk} = 0$ otherwise
Clustering
- Define an objective function (distortion function) representing the sum of squared distances of each data point to the center of the cluster it belongs to:

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|x_n - \mu_k\|^2$

- The task is to find the $r_{nk}$ and $\mu_k$ which minimize $J$
Clustering
- Iterative procedure for finding $r_{nk}$ and $\mu_k$ (a sketch in R follows below):
  - Start with initial values for $\mu_k$
  - Minimize $J$ w.r.t. $r_{nk}$, keeping $\mu_k$ fixed (assign each point to its nearest center)
  - Minimize $J$ w.r.t. $\mu_k$, keeping $r_{nk}$ fixed (move each center to the mean of its assigned points)
- This is the K-Means algorithm
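A minimal R sketch of this two-step iteration (the data set and K are assumptions; base R also provides kmeans() for production use):

# K-means: alternate assignment and mean-update steps to minimize J
set.seed(1)
X <- rbind(matrix(rnorm(100, 0), ncol = 2),      # assumed synthetic data:
           matrix(rnorm(100, 4), ncol = 2))      # two well-separated blobs
K <- 2
mu <- X[sample(nrow(X), K), , drop = FALSE]      # initialize centers at random points
repeat {
  # Step 1: minimize J w.r.t. r_nk (assign each point to the nearest center)
  d <- sapply(1:K, function(k) rowSums((X - matrix(mu[k, ], nrow(X), 2, byrow = TRUE))^2))
  z <- max.col(-d)                               # index of the nearest center
  # Step 2: minimize J w.r.t. mu_k (move each center to the mean of its points)
  mu_new <- t(sapply(1:K, function(k) colMeans(X[z == k, , drop = FALSE])))
  if (max(abs(mu_new - mu)) < 1e-8) break        # stop when the centers no longer move
  mu <- mu_new
}
mu   # estimated cluster centers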
Gaussian Mixture Model

$p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1$

where the $\pi_k$ are the mixing coefficients and $\mu_k$, $\Sigma_k$ are the component means and covariances.
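A minimal R sketch of a one-dimensional version of this density (all parameter values are assumptions):

# 1-D Gaussian mixture density: p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)
pi_k    <- c(0.4, 0.6)           # mixing coefficients, sum to 1 (assumed)
mu_k    <- c(-2, 3)              # component means (assumed)
sigma_k <- c(1.0, 1.5)           # component standard deviations (assumed)

dmix <- function(x) {
  rowSums(sapply(1:2, function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k])))
}
dmix(c(-2, 0, 3))   # mixture density evaluated at a few points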
Gaussian Mixture Model
Maximum Likelihood Estimate: Computational Issues
- Presence of singularities:
  - Let $\Sigma_k = \sigma_k^2 I$, and suppose one of the components has its mean exactly at a data point, $\mu_j = x_n$. This term contributes

$N(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi)^{1/2}}\,\frac{1}{\sigma_j}$

  - During maximization of the likelihood, this term becomes singular as $\sigma_j \to 0$
- Why does this issue not arise for a single Gaussian distribution?
Gaussian Mixture Model
Maximum Likelihood Estimate: Computational Issues
Exercise: Why does the singularity issue not arise for a single Gaussian distribution?
Expectation Maximization (EM) Algorithm
- Dempster et al. (1977)
- An elegant method for finding the MLE for models with latent variables
- Taking the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ w.r.t. $\mu_k$ and equating it to zero gives

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$

where $\gamma(z_{nk})$ is the responsibility of component k for point n.
Expectation Maximization (EM) Algorithm
- Taking the derivative of $\ln p(X \mid \pi, \mu, \Sigma)$ w.r.t. $\Sigma_k$ and equating it to zero gives

$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^T$

- This does not yield a closed-form solution, since $\gamma(z_{nk})$ depends on $\pi$, $\mu$, and $\Sigma$
Expectation Maximization (EM) Algorithm
Iterative Solution:
1. Initialize the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, and evaluate the initial value of the log likelihood
2. E Step: evaluate the responsibilities using the current parameter values:

$\gamma(z_{nk}) = \frac{\pi_k\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(x_n \mid \mu_j, \Sigma_j)}$
EM Algorithm in a Bayesian Setting
- Suppose we know the values of the latent variables $Z$ in addition to the observed data $X$
- Consider the problem of maximizing the likelihood of the complete data set $\{X, Z\}$:

$p(X, Z \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[\pi_k\, N(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}}$
EM Algorithm in a Bayesian Setting
- Log likelihood of the complete data set $\{X, Z\}$:

$\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[\ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right]$

- The logarithm acts directly on the Normal distribution, giving a simpler equation for the MLE
- For maximization w.r.t. $\mu_k$ and $\Sigma_k$, the expression is a sum of K independent terms
- For maximization w.r.t. $\pi_k$ there is a coupling, since $\sum_k \pi_k = 1$; using a Lagrange multiplier gives

$\pi_k = \frac{1}{N} \sum_{n=1}^{N} z_{nk}$
EM Algorithm in a Bayesian Setting
- The log likelihood of the complete data set $\{X, Z\}$ is easy to maximize
- However, $Z$ is usually unknown; take the expectation of the complete-data log likelihood using the posterior distribution of $Z$:

$E_Z\!\left[\ln p(X, Z \mid \pi, \mu, \Sigma)\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left[\ln \pi_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right]$
EM Algorithm in a Bayesian Setting
1. Initialize the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, and evaluate the initial value of the log likelihood
2. E Step: use these values to compute the responsibilities $\gamma(z_{nk})$
3. M Step: keeping the responsibilities fixed, maximize $E_Z[\ln p(X, Z \mid \pi, \mu, \Sigma)]$ w.r.t. $\pi_k$, $\mu_k$, and $\Sigma_k$:

$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\,(x_n - \mu_k)(x_n - \mu_k)^T, \qquad \pi_k = \frac{N_k}{N}$

4. Alternate E and M steps until the log likelihood converges (a sketch in R follows below)
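A minimal R sketch of these EM updates for a one-dimensional, two-component mixture (the synthetic data and initialization are assumptions):

# EM for a 1-D Gaussian mixture with K = 2 components
set.seed(7)
x <- c(rnorm(150, -2, 1), rnorm(250, 3, 1.5))           # assumed synthetic data
K <- 2; N <- length(x)
pi_k <- rep(1/K, K); mu_k <- c(-1, 1); s_k <- c(1, 1)   # initialization (assumed)

for (iter in 1:200) {
  # E step: responsibilities resp[n, k]
  dens <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu_k[k], s_k[k]))
  resp <- dens / rowSums(dens)
  # M step: update parameters using the effective counts N_k
  N_k  <- colSums(resp)
  mu_k <- colSums(resp * x) / N_k
  s_k  <- sqrt(colSums(resp * (x - matrix(mu_k, N, K, byrow = TRUE))^2) / N_k)
  pi_k <- N_k / N
}
rbind(pi = pi_k, mu = mu_k, sigma = s_k)   # fitted mixture parameters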
General Form of the EM Algorithm in a Bayesian Setting
- For any choice of the distribution $q(Z)$, the following decomposition holds:

$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + KL(q\,\|\,p)$

$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}, \qquad KL(q\,\|\,p) = -\sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}$

- $KL(q\,\|\,p)$ is the Kullback-Leibler divergence between $q(Z)$ and the posterior $p(Z \mid X, \theta)$
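A one-line check of this decomposition (standard, not spelled out on the slide): writing $p(X, Z \mid \theta) = p(Z \mid X, \theta)\, p(X \mid \theta)$ inside $\mathcal{L}(q, \theta)$,

$\mathcal{L}(q, \theta) = \sum_Z q(Z) \ln \frac{p(Z \mid X, \theta)\, p(X \mid \theta)}{q(Z)} = \ln p(X \mid \theta) - KL(q\,\|\,p)$

since $\sum_Z q(Z) = 1$. The E step sets $q(Z) = p(Z \mid X, \theta^{old})$, driving the KL term to zero; the M step then maximizes $\mathcal{L}(q, \theta)$ w.r.t. $\theta$.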
General Form of the EM Algorithm in a Bayesian Setting
[Figure: graphical interpretation of the EM Algorithm]
References
1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer.
2. Giovanni Petris, Dynamic Linear Models with R, Springer.
Email: [email protected]
Twitter: @hari_koduvely