Statistical Inference
Parametric Inference
Maximum Likelihood Inference
Exponential Families
Expectation Maximization (EM)
Bayesian Inference
Statistical Decision Theory
Statistical Inference
Statistics aims at retrieving the “causes” (e.g., parameters of a pdf)
from the observations (effects)
Statistical inference problems can thus be seen as Inverse Problems
As a result of this perspective, in the eighteenth century (at the time of Bayes and
Laplace) Statistics was often called Inverse Probability
[Diagram: Probability goes from causes to effects; Statistics goes from observations (effects) back to causes.]
Parametric Inference
Consider the parametric model {p(g|f) : f ∈ Θ}, where
Θ is the parameter space and f is the parameter
The problem of inference reduces to the estimation of f from the observation g; i.e., computing an estimate f̂ = f̂(g)
Parameters of interest and nuisance parameters
Sometimes we are only interested in some function ψ = T(f)
that depends only on part of f
Let f = (ψ, λ):
ψ - parameter of interest;
λ - nuisance parameter
Example:
Parametric Inference (theoretical limits)
The Cramér-Rao Lower Bound (CRLB)
Under appropriate regularity conditions, the covariance matrix of any
unbiased estimator f̂ satisfies
    cov(f̂) ≥ I(f)^{-1}   (i.e., cov(f̂) − I(f)^{-1} is positive semidefinite)
where I(f) is the Fisher information matrix given by
    I(f) = E[ (∂ log p(g|f)/∂f)(∂ log p(g|f)/∂f)^T ] = −E[ ∂² log p(g|f)/∂f ∂f^T ]
An unbiased estimator that attains the CRLB may be found iff
    ∂ log p(g|f)/∂f = I(f) ( h(g) − f )
for some function h. The estimator is f̂ = h(g).
CRLB for the general Gaussian case
Example: Parameter of a signal in white noise
Example: Known signal in unknown white noise
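A minimal numerical companion (my own illustrative setup, not taken from the slides): for n IID N(μ, σ²) observations the Fisher information for μ is n/σ², so the CRLB is σ²/n, and the sample mean attains it. The sketch below checks this by Monte Carlo.

```python
import numpy as np

# Illustrative sketch (not from the slides): the sample mean of n IID
# N(mu, sigma^2) samples is unbiased, and its variance equals the CRLB
# sigma^2 / n, so it attains the Cramer-Rao lower bound.
rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 1.5, 50, 20000   # arbitrary illustrative values

samples = rng.normal(mu, sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)                # one estimate per trial

crlb = sigma**2 / n                          # Fisher information is n / sigma^2
print("empirical variance of the estimator:", mu_hat.var())
print("CRLB (sigma^2 / n):                 ", crlb)
```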
Maximum Likelihood Method
L(f) = p(g|f), viewed as a function of f for a fixed observation g, is the likelihood function
If p(g|f) > 0 for all f, we can use the log-likelihood l(f) = log p(g|f); the ML estimate is f̂_ML = arg max_f L(f)
Example (Bernoulli)
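The slide's own derivation is not reproduced here; as a hedged sketch, for g₁, ..., g_n IID Bernoulli(f) the log-likelihood is s log f + (n − s) log(1 − f) with s = Σ_i g_i, and it is maximized at f̂_ML = s/n:

```python
import numpy as np

# Sketch under the usual Bernoulli setup: g_1..g_n IID Bernoulli(f).
# The log-likelihood is s*log(f) + (n - s)*log(1 - f), with s = sum(g_i);
# its maximizer is the sample proportion s / n.
rng = np.random.default_rng(1)
f_true, n = 0.3, 200                       # illustrative values
g = rng.binomial(1, f_true, size=n)

s = g.sum()
f_ml = s / n                               # closed-form MLE
print("ML estimate:", f_ml)

# Numerical check: the log-likelihood is maximized at s / n.
grid = np.linspace(0.01, 0.99, 991)
loglik = s * np.log(grid) + (n - s) * np.log(1 - grid)
print("grid maximizer:", grid[np.argmax(loglik)])
```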
Maximum Likelihood
Example (Uniform)
Maximum Likelihood
Example (Gaussian)
Let g₁, ..., g_n be IID N(μ, σ²). The ML estimates are
    μ̂_ML = (1/n) Σ_i g_i                     (sample mean)
    σ̂²_ML = (1/n) Σ_i (g_i − μ̂_ML)²          (sample variance)
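A minimal numpy sketch of these two estimates (illustrative data):

```python
import numpy as np

# Sketch: for IID N(mu, sigma^2) data the ML estimates are the sample
# mean and the (biased, 1/n) sample variance, as stated above.
rng = np.random.default_rng(2)
g = rng.normal(loc=5.0, scale=2.0, size=1000)   # illustrative data

mu_ml = g.mean()
var_ml = ((g - mu_ml) ** 2).mean()              # 1/n, not 1/(n-1)
print(mu_ml, var_ml)
```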
Maximum Likelihood
Example (Multivariate Gaussian)
Let g₁, ..., g_n be IID N(μ, C). The ML estimates are
    μ̂_ML = (1/n) Σ_i g_i                                   (sample mean)
    Ĉ_ML = (1/n) Σ_i (g_i − μ̂_ML)(g_i − μ̂_ML)^T             (sample covariance)
Maximum Likelihood (linear observation model)
Example: Linear observation in Gaussian noise
    g = A f + w,   w ~ N(0, σ² I),   A full column rank
    f̂_ML = arg max_f p(g|f) = (A^T A)^{-1} A^T g
Example: Linear observation in Gaussian noise (cont.)
• The MLE is equivalent to the least-squares estimate (LSE) using the Euclidean (ℓ2) norm:
    f̂ = arg min_f ||g − A f||²
• If A has full column rank, (A^T A)^{-1} A^T is given by the Moore-Penrose pseudo-inverse A⁺
• A (A^T A)^{-1} A^T is a projection matrix onto the range of A (seen, e.g., through the SVD of A)
• If the noise is zero-mean but not Gaussian, the Best Linear Unbiased
Estimator (BLUE) is still given by f̂ = (A^T A)^{-1} A^T g
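A minimal sketch of the least-squares / pseudo-inverse computation (sizes and noise level are illustrative):

```python
import numpy as np

# Sketch of the linear-Gaussian MLE: with g = A f + w, w ~ N(0, sigma^2 I)
# and A full column rank, the MLE is the least-squares solution
# f_hat = (A^T A)^{-1} A^T g, i.e. the Moore-Penrose pseudo-inverse applied to g.
rng = np.random.default_rng(3)
n, m, sigma = 100, 5, 0.1                      # illustrative sizes
A = rng.normal(size=(n, m))
f_true = rng.normal(size=m)
g = A @ f_true + sigma * rng.normal(size=n)

f_ls, *_ = np.linalg.lstsq(A, g, rcond=None)   # numerically stable LS solve
f_pinv = np.linalg.pinv(A) @ g                 # same estimate via the pseudo-inverse
print(np.allclose(f_ls, f_pinv), f_ls)
```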
Maximum likelihood
Properties (MLE is optimal for the linear model)
Linear observation in Gaussian noise:  g = A f + w,  w ~ N(0, σ² I)
MLE:  f̂ = (A^T A)^{-1} A^T g
• Is the Minimum Variance Unbiased (MVU) estimator
  [E[f̂] = f, and cov(f̂) is the minimum among all unbiased estimators]
• Is efficient (it attains the Cramér-Rao Lower Bound (CRLB))
• Its PDF is f̂ ~ N( f, σ² (A^T A)^{-1} )
Maximum likelihood (characterization)
Appealing properties of MLE
1. The MLE is consistent: f̂_n → f₀ in probability (f₀ denotes the true parameter)
2. The MLE is equivariant: if f̂ is the MLE estimate of f, then T(f̂) is the
MLE estimate of T(f)
3. The MLE (under appropriate regularity conditions) is asymptotically Normal
and optimal or efficient:
let g₁, ..., g_n be a sequence of IID vectors and I(f₀) the (per-sample)
Fisher information matrix; then √n (f̂_n − f₀) → N(0, I(f₀)^{-1}) in distribution
The exponential family
Definition: the set {p(g|f) : f ∈ Θ} is an exponential family of
dimension k if there are functions c(f), h(g), η_i(f) and t_i(g), i = 1, ..., k,
such that
    p(g|f) = c(f) h(g) exp( Σ_{i=1}^k η_i(f) t_i(g) )
t(g) = (t₁(g), ..., t_k(g)) is a sufficient statistic for f, i.e., p(g | t(g), f) does not depend on f
Theorem: (Neyman-Fisher Factorization) t(g) is a sufficient statistic for
f iff p(g|f) can be factored as
    p(g|f) = a( t(g), f ) b(g)
The exponential family
Natural (or canonical) form
Given an exponential family, it is always possible to introduce the change
of variables θ = η(f) and the reparameterization of the family such that
    p(g|θ) = h(g) exp( θ^T t(g) ) / Z(θ)
Since p(g|θ) is a PDF, it must integrate to one, so the partition function is
    Z(θ) = ∫ h(g) exp( θ^T t(g) ) dg
The exponential family (The partition function)
Computing moments from the derivatives of the partition function: after some calculus,
    ∇_θ log Z(θ) = E[ t(g) ]
    ∇²_θ log Z(θ) = cov[ t(g) ]
The exponential family (IID sequences)
Let p(g|θ) = h(g) exp( θ^T t(g) ) / Z(θ) be a member of an exponential family defined by t and Z
The density of the IID sequence g = (g₁, ..., g_n) is
    p(g|θ) = [ Π_i h(g_i) ] exp( θ^T Σ_i t(g_i) ) / Z(θ)^n
which belongs to the exponential family defined by the sufficient statistic Σ_i t(g_i)
Examples of exponential families
Many of the most common probabilistic models belong to exponential
families; e.g., Gaussian, Poisson, Bernoulli, binomial, exponential,
gamma, and beta.
Example:
Canonical form
Examples of exponential families (Gaussian)
Example:  p(g|μ, σ²) = (2πσ²)^{-1/2} exp( −(g − μ)² / (2σ²) )
Canonical form:  t(g) = (g, g²),  θ = ( μ/σ², −1/(2σ²) ),
    p(g|θ) = exp( θ^T t(g) ) / Z(θ),   Z(θ) = √( π / (−θ₂) ) exp( −θ₁² / (4θ₂) )
Computing maximum likelihood estimates
Very often the MLE cannot be found analytically. Commonly
used numerical methods:
1. Newton-Raphson
2. Scoring
3. Expectation Maximization (EM)
Newton-Raphson method
    f^{k+1} = f^k − [ ∇² l(f^k) ]^{-1} ∇ l(f^k),   where l(f) = log p(g|f)
Scoring method: replace the Hessian ∇² l(f^k) by its expected value −I(f^k)
(the negative Fisher information matrix), which can be computed off-line:
    f^{k+1} = f^k + I(f^k)^{-1} ∇ l(f^k)
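A minimal Newton-Raphson sketch; the model (location parameter of a Cauchy distribution, which has no closed-form MLE) is my own illustrative choice, not from the slides:

```python
import numpy as np

# Illustrative Newton-Raphson MLE sketch: location parameter of a Cauchy
# distribution. Iterate f <- f - l'(f) / l''(f) on the log-likelihood.
rng = np.random.default_rng(4)
theta_true = 1.0
x = theta_true + rng.standard_cauchy(500)

def grad(theta):                       # first derivative of the log-likelihood
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def hess(theta):                       # second derivative of the log-likelihood
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2) ** 2)

theta = np.median(x)                   # robust starting point (matters for Newton)
for _ in range(20):
    theta -= grad(theta) / hess(theta)
print("Newton-Raphson MLE:", theta)
```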
Computing maximum likelihood estimates (EM)
Expectation Maximization (EM) [Dempster, Laird, and Rubin, 1977]
Suppose that log p(g|f) is hard to maximize, but we can find a vector z such that
log p(g, z|f) is easy to maximize and p(g|f) = ∫ p(g, z|f) dz
Idea: iterate between two steps:
E-step: “fill in z” in log p(g, z|f), i.e., average it over p(z | g, f^k)
M-step: maximize the resulting function with respect to f
Terminology
(g, z) - complete data
z - missing data
g - observed data
Expectation maximization
The EM algorithm
1. Pick a starting vector f^0; repeat steps 2. and 3.
2. E-step: Calculate  Q(f | f^k) = E[ log p(g, z|f) | g, f^k ]
3. M-step:  f^{k+1} = arg max_f Q(f | f^k)
Alternatively (GEM): choose any f^{k+1} such that Q(f^{k+1} | f^k) ≥ Q(f^k | f^k)
Expectation maximization
The EM (GEM) algorithm always increases the likelihood.
Define
1.
2.
3.
4.
Kullback-Leibler distance
KL distance maximization
Expectation maximization (why does it work?)
EM: Mixtures of densities
Let z ∈ {1, ..., p} be the random variable that selects the active mode:
P(z = i) = α_i and p(x | z = i) = p_i(x), so that
    p(x) = Σ_{i=1}^p α_i p_i(x),   where α_i ≥ 0 and Σ_{i=1}^p α_i = 1
EM: Mixtures of densities
Consider now that x = (x₁, ..., x_N) is a sequence of IID random variables
Let z₁, ..., z_N be IID random variables, where z_k selects the active
mode in the sample x_k:
    p(x_k | z_k = i) = p_i(x_k),   P(z_k = i) = α_i
EM: Mixtures of densities
Equivalent Q
Where is the sample mean of x, i.e.,
EM: Mixtures of densities
E-step:
M-step:
EM: Mixtures of densities
E-step:
M-step:
EM: Mixtures of Gaussian densities (MOGs)
E-step:  w_{ik} = α_i N(x_k; μ_i, C_i) / Σ_j α_j N(x_k; μ_j, C_j)
M-step:
    α_i ← (1/N) Σ_k w_{ik}
    μ_i ← Σ_k w_{ik} x_k / Σ_k w_{ik}                                (weighted sample mean)
    C_i ← Σ_k w_{ik} (x_k − μ_i)(x_k − μ_i)^T / Σ_k w_{ik}           (weighted sample covariance)
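A minimal sketch of these E- and M-steps for a 1D mixture of Gaussians; the data-generating values loosely echo the 1D example on the next slide, and the initialization is arbitrary:

```python
import numpy as np
from scipy.stats import norm

# Sketch of the EM updates above for a 1D mixture of Gaussians.
# E-step: responsibilities w[i, k] = P(z_k = i | x_k, current parameters).
# M-step: weights, weighted sample means, weighted sample variances.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 1200),          # illustrative 3-mode data
                    rng.normal(3, np.sqrt(3), 600),
                    rng.normal(6, np.sqrt(10), 100)])

p = 3
alpha = np.full(p, 1 / p)                             # mixing weights
mu = np.array([-1.0, 2.0, 8.0])                       # rough initial means
var = np.full(p, np.var(x))                           # initial variances

for _ in range(200):
    # E-step: responsibilities, shape (p, N)
    dens = np.array([a * norm.pdf(x, m, np.sqrt(v))
                     for a, m, v in zip(alpha, mu, var)])
    w = dens / dens.sum(axis=0, keepdims=True)
    # M-step: weighted averages
    Nk = w.sum(axis=1)
    alpha = Nk / len(x)
    mu = (w @ x) / Nk
    var = (w * (x - mu[:, None]) ** 2).sum(axis=1) / Nk

print(alpha, mu, var)
```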
EM: Mixtures of Gaussian densities. 1D Example
p = 3 modes, N = 1900 samples
True parameters (mean, variance, weight):
    0         1        0.6316
    3         3        0.3158
    6        10        0.0526
[Plot: log-likelihood L(f^k) versus EM iteration.]
Estimated parameters (mean, variance, weight):
   -0.0288   1.0287    0.6258
    2.8952   2.5649    0.3107
    6.1687   7.3980    0.0635
EM: Mixtures of Gaussian Densities (MOGs)
Example – 1D (p = 3, N = 1900; same true parameters as on the previous slide)
[Plots: histogram of the data with the estimated and true MOG densities, and with the estimated and true individual modes.]
EM: Mixtures of Gaussian Densities: 2D Example
[Scatter plot of 2D data with k = 3 estimated Gaussian modes.]
MOG with determination of the number of modes [M. Figueiredo, 2002]
Bayesian Estimation
The Bayesian Philosophy ([Wasserman, 2004])
Bayesian Inference
B1 – Probabilities describe degrees of belief, not limiting relative frequencies
B2 – We can make probability statements about parameters, even though
they are fixed constants
B3 – We make inferences about a parameter f by producing a
probability distribution for f
Frequentist or Classical Inference
F1 – Probabilities refer to limiting relative frequencies and are objective
properties of the real world
F2 – Parameters are fixed, unknown constants
F3 – The criteria for obtaining statistical procedures are based on long-run
frequency properties.
The Bayesian Philosophy
Classical Inference: unknown f → observation model p(g|f) → observation g
Bayesian Inference: in addition, prior knowledge about the unknown f is encoded in a prior p(f);
p(f) describes degrees of belief (subjective), not limiting frequency
The Bayesian method
1. Choose a prior density p(f), called the prior (or a priori) distribution,
that expresses our beliefs about f before we see any data
2. Choose the observation model p(g|f) that reflects our beliefs about g
given f
3. Calculate the posterior (or a posteriori) distribution using the
Bayes law (illustrated numerically in the sketch below):
    p(f|g) = p(g|f) p(f) / p(g)
where
    p(g) = ∫ p(g|f) p(f) df
is the marginal on g (other names: evidence, unconditional, predictive)
4. Any inference should be based on the posterior p(f|g)
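A minimal sketch of step 3 on a grid, for a toy scalar model of my own choosing (Gaussian likelihood, uniform prior on [−5, 5]):

```python
import numpy as np
from scipy.stats import norm

# Sketch of the Bayes-law computation on a grid (toy scalar model chosen
# for illustration: Gaussian likelihood, uniform prior on [-5, 5]).
g_obs = 1.3                                   # a single observation
f = np.linspace(-5, 5, 2001)                  # parameter grid
prior = np.ones_like(f) / 10.0                # flat prior density on [-5, 5]
likelihood = norm.pdf(g_obs, loc=f, scale=1.0)

joint = likelihood * prior
evidence = np.trapz(joint, f)                 # p(g), the marginal on g
posterior = joint / evidence                  # Bayes law
print("posterior mean:", np.trapz(f * posterior, f))
```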
The Bayesian method
Example: Let g₁, ..., g_n be IID Bernoulli(f) and let the prior be f ~ Beta(α, β)
For α = β > 1, the prior pulls the estimate towards 1/2
[Plot: Beta(α, β) densities for α = β = 0.5, 1, 2, 10.]
Example (cont.): (Bernoulli observations, Beta prior)
Observation model:  p(g|f) = f^s (1 − f)^{n−s},   s = Σ_i g_i
Prior:  p(f) ∝ f^{α−1} (1 − f)^{β−1}   (Beta(α, β))
Posterior:  p(f|g) ∝ f^{s+α−1} (1 − f)^{n−s+β−1}
Thus,  f | g ~ Beta(α + s, β + n − s)
Example (cont.): (Bernoulli observations, Beta prior)
• Total ignorance: flat prior α = β = 1
• Maximum a posteriori estimate (MAP):
    f̂_MAP = arg max_f p(f|g) = (s + α − 1) / (n + α + β − 2)
The von Mises Theorem
If the prior is continuous and not zero at the location of the
ML estimate, then, asymptotically, the posterior concentrates around the ML estimate
• Note that for large values of n,  f̂_MAP ≈ s/n, the ML estimate
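A minimal sketch of the conjugate Beta-Bernoulli update above, with a flat Beta(1, 1) prior and illustrative data:

```python
import numpy as np

# Sketch of the Beta-Bernoulli update: with s successes in n trials and a
# Beta(a, b) prior, the posterior is Beta(a + s, b + n - s).
rng = np.random.default_rng(6)
f_true, n = 0.7, 100
g = rng.binomial(1, f_true, size=n)
s = g.sum()

a, b = 1.0, 1.0                                # flat prior (total ignorance)
a_post, b_post = a + s, b + n - s              # posterior parameters

f_map = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
f_pm = a_post / (a_post + b_post)              # posterior mean
print("MLE:", s / n, "MAP:", f_map, "PM:", f_pm)
```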
Conjugate priors
In the previous example, the prior and the posterior are both Beta
distributed. We say that the prior is conjugate with respect to the model.
• Formally, let P = {p(f|γ) : γ ∈ Γ} and {p(g|f)} be
two parametrized families of priors and observation models, respectively
• P is a conjugate family for the observation model if the posterior stays in P, i.e.,
    p(f|g) = p(f|γ')   for some γ' ∈ Γ
• Very often, prior information about f is very weak, which allows us to select
conjugate priors
• Why conjugate priors? Computing the posterior density simply amounts
to updating the parameters of the prior
Conjugate priors (Gaussian observation, Gaussian prior)
• Gaussian observation:  g | f ~ N(f, σ²)
• Gaussian prior:  f ~ N(μ, σ_f²)
• The posterior distribution is Gaussian:
    f | g ~ N( (σ_f² g + σ² μ) / (σ² + σ_f²),  σ² σ_f² / (σ² + σ_f²) )
1. The mean of f|g lies in the simplex (segment) defined by {g, μ}
2. The variance of f|g is the parallel combination of the variances σ² and σ_f²
Conjugate priors (Gaussian IID observations, Gaussian prior)
• Gaussian IID observations:  g_i | f ~ N(f, σ²),  i = 1, ..., n
• Gaussian prior:  f ~ N(μ, σ_f²)
• The posterior distribution is Gaussian:
    f | g ~ N( (n σ_f² ḡ + σ² μ) / (σ² + n σ_f²),  σ² σ_f² / (σ² + n σ_f²) ),   ḡ = (1/n) Σ_i g_i
1. The mean of f|g lies in the simplex (segment) defined by {ḡ, μ}
2. The variance of f|g is the parallel combination of the variances σ²/n and σ_f²
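A minimal numerical sketch of this posterior (all parameter values are illustrative):

```python
import numpy as np

# Sketch of the scalar Gaussian-Gaussian conjugate update above:
# g_i ~ N(f, sigma^2) IID, prior f ~ N(mu0, s0^2). The posterior precision
# is the sum of precisions; the posterior mean interpolates between the
# prior mean and the sample mean.
rng = np.random.default_rng(7)
f_true, sigma, n = 2.0, 1.0, 25
mu0, s0 = 0.0, 2.0                              # illustrative prior
g = rng.normal(f_true, sigma, size=n)

post_var = 1.0 / (n / sigma**2 + 1.0 / s0**2)   # "parallel" combination of variances
post_mean = post_var * (g.sum() / sigma**2 + mu0 / s0**2)
print(post_mean, post_var)
```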
Conjugate Priors (Gaussian IID observations, Gaussian prior)
[Plot over the range −15…15 illustrating this example.]
Conjugate Priors (multivariate Gaussian: observation and prior)
• (g, f) jointly Gaussian distributed, with means μ_g, μ_f and covariance blocks C_g, C_gf, C_fg, C_f
• Then
a) the posterior p(f|g) is Gaussian
b) E[f|g] = μ_f + C_fg C_g^{-1} (g − μ_g)
c) cov[f|g] = C_f − C_fg C_g^{-1} C_gf
Conjugate Priors (multivariate Gaussian: observation and prior)
• Linear observation model (f and w independent):  g = A f + w,  f ~ N(μ_f, C_f),  w ~ N(0, C_w)
• Posterior: p(f|g) is Gaussian with
    E[f|g] = μ_f + C_f A^T (A C_f A^T + C_w)^{-1} (g − A μ_f)
    cov[f|g] = C_f − C_f A^T (A C_f A^T + C_w)^{-1} A C_f
Conjugate Priors (multivariate Gaussian: observation and prior)
• Linear observation model (f and w independent):  g = A f + w,  f ~ N(μ_f, C_f),  w ~ N(0, C_w)
• Using the matrix inversion lemma,
    E[f|g] = ( A^T C_w^{-1} A + C_f^{-1} )^{-1} ( A^T C_w^{-1} g + C_f^{-1} μ_f )
• E[f|g] is the solution of the following regularized LS problem:
    f̂ = arg min_f  (g − A f)^T C_w^{-1} (g − A f) + (f − μ_f)^T C_f^{-1} (f − μ_f)
where the prior term (e.g., via the choice of C_f^{-1}) can penalize
oscillatory solutions
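A minimal sketch of the regularized-LS form of the posterior mean, with μ_f = 0 and an identity prior precision chosen purely for illustration:

```python
import numpy as np

# Sketch of the linear-Gaussian posterior mean above: with g = A f + w,
# w ~ N(0, sigma^2 I) and prior f ~ N(0, C_f), the posterior mean solves
#   min_f ||g - A f||^2 / sigma^2 + f^T C_f^{-1} f.
rng = np.random.default_rng(8)
n, m, sigma = 60, 20, 0.2
A = rng.normal(size=(n, m))
f_true = rng.normal(size=m)
g = A @ f_true + sigma * rng.normal(size=n)

Cf_inv = np.eye(m)                               # illustrative prior precision
lhs = A.T @ A / sigma**2 + Cf_inv                # regularized normal equations
rhs = A.T @ g / sigma**2
f_map = np.linalg.solve(lhs, rhs)
print(np.linalg.norm(f_map - f_true))
```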
Improper Priors
• Assume that p(f) = k on a given domain
• Even if the domain of f is unbounded, so that ∫ p(f) df = ∞ and p(f) is not a proper density, the posterior
    p(f|g) = p(g|f) / ∫ p(g|f) df
is still well defined (provided ∫ p(g|f) df < ∞).
• In a sense, improper priors account for a state of total ignorance. This raises
no issues for the Bayesian framework, as long as the posterior is proper.
Bayes Estimators
Bayes estimators
Ingredients of Statistical Decision Theory:
• posterior distribution p(f|g)
  conveys all knowledge about f, given the observation g
• loss function L(f, f̂)
  measures the discrepancy between f and f̂
• a posteriori expected loss
    E[ L(f, f̂) | g ] = ∫ L(f, f̂) p(f|g) df
• optimal Bayes estimator
    f̂_Bayes = arg min_{f̂} E[ L(f, f̂) | g ]
Bayesian framework
• Nuisance Parameter
  Let f = (ψ, λ) and L(f, f̂) = L(ψ, ψ̂)
  λ – nuisance parameter
• The posterior risk depends only on the marginal on ψ,
    p(ψ|g) = ∫ p(ψ, λ | g) dλ
• In a pure Bayesian framework, nuisance parameters are
integrated out
Bayes estimators: Maximum a posteriori probability (MAP)
• Zero-one, “0/1”, loss:  L(f, f̂) = 0 if ||f − f̂|| ≤ ε, and 1 otherwise   (ε is the radius of an ε-ball)
• Maximum a posteriori probability: letting the volume of the ε-ball shrink to zero,
    f̂_MAP = arg max_f p(f|g)
A discrete domain leads to the MAP estimator as well
Bayes Estimators: Posterior Mean (PM)
• Quadratic loss:  L(f, f̂) = (f − f̂)^T Q (f − f̂),
  Q is symmetric and positive definite
• Posterior mean: expanding the a posteriori expected loss, only one term
depends on f̂, and minimizing it gives
    f̂_PM = E[f | g] = ∫ f p(f|g) df
• Valid for any such Q. If Q is diagonal, the loss function
is additive
• E[f|g] may be hard to compute
Bayes estimators: Additive loss
• Let  L(f, f̂) = Σ_i L_i(f_i, f̂_i)
• Then, the minimization is decoupled
• Each component f̂_i of f̂ minimizes the corresponding marginal
a posteriori expected loss:
    f̂_i = arg min_{f̂_i} ∫ L_i(f_i, f̂_i) p(f_i|g) df_i
Bayes Estimators: Additive Loss
• Additive “0/1” loss:
  f̂_i is the maximizer of the posterior marginal p(f_i|g)
• Additive quadratic loss:
  The additive quadratic loss is a quadratic loss with Q = I. Therefore,
the corresponding Bayes estimator is the posterior mean
Example (Gaussian IID observations, Gaussian prior)
• Gaussian IID observations:  g_i | f ~ N(f, σ²),  i = 1, ..., n
• Gaussian prior:  f ~ N(μ, σ_f²)
• The posterior distribution is Gaussian, so the MAP and PM estimates coincide:
    f̂ = (n σ_f² ḡ + σ² μ) / (σ² + n σ_f²)  →  ḡ   as n → ∞
Example (Gaussian observation, Laplacian prior)
    g = f + w,  w ~ N(0, σ²),   p(f) = (λ/2) exp(−λ|f|)
MAP estimate
    f̂_MAP = arg max_f [ −(g − f)²/(2σ²) − λ|f| ]
• The objective is strictly concave, so the maximizer is unique
• Setting the (sub)gradient to zero yields a soft-thresholding rule:
• f̂_MAP = sign(g) max( |g| − λσ², 0 )
Example (Gaussian observation, Laplacian prior)
MAP estimate
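A minimal sketch of the soft-thresholding MAP rule (function name and parameter values are mine):

```python
import numpy as np

# Sketch of the MAP rule for g = f + w, w ~ N(0, sigma^2), and a Laplacian
# prior p(f) proportional to exp(-lam * |f|): soft-thresholding of g.
def map_laplacian(g, sigma, lam):
    t = lam * sigma**2                       # threshold
    return np.sign(g) * np.maximum(np.abs(g) - t, 0.0)

g = np.linspace(-5, 5, 11)
print(map_laplacian(g, sigma=1.0, lam=1.0))
```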
Example (Gaussian observation, Laplacian prior)
PM estimate
No closed-form expression; resort to numerical procedures
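A minimal grid-integration sketch of the posterior mean for this model (grid width and parameter values are mine):

```python
import numpy as np

# Sketch of a numerical posterior-mean computation for the same model
# (Gaussian observation, Laplacian prior): evaluate the unnormalized
# posterior on a grid and integrate.
def pm_laplacian(g, sigma=1.0, lam=1.0):
    f = np.linspace(g - 10 * sigma, g + 10 * sigma, 4001)
    log_post = -(g - f) ** 2 / (2 * sigma**2) - lam * np.abs(f)
    w = np.exp(log_post - log_post.max())    # subtract max to avoid underflow
    return np.trapz(f * w, f) / np.trapz(w, f)

print([round(pm_laplacian(g), 3) for g in (-3.0, -1.0, 0.0, 1.0, 3.0)])
```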
Example (Gaussian observation, Laplacian prior)
[Six density plots over the range −10…10 illustrating this example.]
Example (Gaussian observation, Laplacian prior)
[Plot over the range −5…5 illustrating this example.]
Example (Multivariate Gaussian: observation and prior)
• Linear observation model (f and w independent):  g = A f + w,  f ~ N(0, C_f),  w ~ N(0, σ² I)
• Posterior mean:
    f̂ = ( A^T A + σ² C_f^{-1} )^{-1} A^T g
  f̂ is called the Wiener filter
• If all the eigenvalues of C_f approach infinity (a flat prior), then
    f̂ → ( A^T A )^{-1} A^T g = A⁺ g
which is the Moore-Penrose pseudo (or generalized) inverse of A applied to g