ASP Lecture 3 Estimation Intro 2015


Advanced Signal Processing

Introduction to Estimation Theory

Danilo Mandic,

room 813, ext: 46271

Department of Electrical and Electronic Engineering

Imperial College London, UK
d.mandic@imperial.ac.uk, URL: www.commsp.ee.ic.ac.uk/~mandic


Aims of this lecture

◦ To introduce the notions of: Estimator, Estimate, Estimandum

◦ To discuss the bias and variance in statistical estimation theory, asymptotically unbiased and consistent estimators

◦ Performance metric: the Mean Square Error (MSE)

◦ The bias–variance dilemma and the MSE in this context

◦ To derive a feasible MSE estimator

◦ A class of Minimum Variance Unbiased (MVU) estimators

◦ Extension to the vector parameter case

◦ Point estimators, confidence intervals, statistical goodness of an estimator, the role of noise


Role of estimation in signal processing

(try also the function specgram in Matlab)

◦ An enabling technology in many electronic signal processing systems

1. Radar    4. Image analysis    7. Control
2. Sonar    5. Biomedicine       8. Seismics
3. Speech   6. Communications    9. Almost everywhere ...

◦ Radar and sonar: range and azimuth

◦ Image analysis: motion estimation, segmentation

◦ Speech: features used in recognition and speaker verification

◦ Seismics: oil reservoirs

◦ Communications: equalization, symbol detection

◦ Biomedicine: various applications


Statistical estimation problem

for simplicity, consider a DC level in WGN, x[n] = A + w[n], w[n] ∼ N(0, σ²)

Problem statement: We seek to determine a set of parameters θ = [θ_1, . . . , θ_p]^T from a set of data points x = [x[0], . . . , x[N − 1]]^T, such that the values of these parameters would yield the highest probability of obtaining the observed data. In other words,

max_θ p(x; θ)     (p(x; θ) reads: "p(x) parametrised by θ")

◦ The unknown parameters may be seen as deterministic or random variables

◦ There are essentially two approaches to the statistical estimation problem
– No a priori distribution assumed: Maximum Likelihood
– A priori distribution known: Bayesian estimation

◦ Key problem ⇒ to estimate a group of parameters from a discrete-time signal or dataset.


Estimation of a scalar random variable

Given an N-point dataset x[0], x[1], . . . , x[N − 1] which depends on an unknown parameter θ (scalar), define an "estimator" as some function, g, of the dataset, that is

θ̂ = g(x[0], x[1], . . . , x[N − 1])

which may be used to estimate θ (single parameter case).

(in our DC level estimation problem, θ = A)

◦ This defines the problem of “parameter estimation”

◦ Also need to determine g(·)

◦ Theory and techniques of statistical estimation are available

◦ Estimation based on PDFs which contain unknown but deterministic parameters is termed classical estimation

◦ In Bayesian estimation, the unknown parameters are assumed to be random variables, which may be prescribed "a priori" to lie within some range of allowable parameters (or desired performance)


The statistical estimation problem

First step: to model, mathematically, the data

◦ We employ a Probability Density Function (PDF) to describe the inherently random measurement process, that is

p(x[0], x[1], . . . , x[N − 1]; θ)

which is “parametrised” by the unknown parameter θ

Example: for N = 1, and θ denoting the mean value, a generic form of the PDF for the class of Gaussian PDFs with any value of θ is given by

p(x[0]; θ) = (1/√(2πσ²)) exp[−(1/(2σ²)) (x[0] − θ)²]

[Figure: Gaussian PDFs p(x[0]; θ_1) and p(x[0]; θ_2) for two candidate means, with the observed sample x[0] marked on the horizontal axis.]

Clearly, the observed value of x[0] impacts upon the likely value of θ.
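To see this numerically, the minimal Python/NumPy sketch below evaluates p(x[0]; θ) over a grid of candidate means; the noise variance σ² = 1 and the observation x[0] = 1.2 are assumptions made purely for illustration.

```python
import numpy as np

sigma2 = 1.0          # assumed noise variance
x0 = 1.2              # an illustrative single observation x[0]

# Candidate values of the unknown mean theta
thetas = np.linspace(-2.0, 4.0, 601)

# Gaussian likelihood p(x[0]; theta) for each candidate theta
p = np.exp(-(x0 - thetas) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print("theta maximising p(x[0]; theta):", thetas[np.argmax(p)])
# The likelihood peaks at theta = x[0]: the observed value of x[0]
# directly shapes which values of theta are plausible.
```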


Estimator vs. Estimate

The parameter to be estimated is then viewed as a realisation of the random variable θ

◦ Data are described by the joint PDF of the data and parameters:

p(x, θ) = p(x | θ) p(θ),   where p(x | θ) is the conditional PDF and p(θ) the prior PDF

◦ An estimator is a rule that assigns a value θ̂ to each realisation of x = [x[0], . . . , x[N − 1]]^T

◦ An estimate θ̂ of the parameter θ (the 'estimandum') is the value obtained for a given realisation of x = [x[0], . . . , x[N − 1]]^T, in the form θ̂ = g(x)

Example: for a noisy straight line: p(x; θ) = (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A − Bn)²]

◦ Performance is critically dependent upon this PDF assumption - the estimator should be robust to slight mismatch between the measurement and the PDF assumption


Example: finding the parameters of a straight line

Specification of the PDF is critical in determining a good estimator

In practice, we choose a PDF which fits the problem constraints and any "a priori" information; but it must also be mathematically tractable.

Example: Assume that "on the average" the data are increasing.

[Figure: a noisy straight line x[n] versus n, together with the ideal noiseless line with intercept A at n = 0.]

Data: straight line embedded in random noise w[n] ∼ N(0, σ²)

x[n] = A + Bn + w[n],   n = 0, 1, . . . , N − 1

p(x; A, B) = (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A − Bn)²]

Unknown parameters: A, B ⇔ θ ≡ [A B]^T

Careful: what would be the effects of bias in A and B?
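As a numerical illustration of how this PDF is parametrised by (A, B), the minimal sketch below generates one realisation of the noisy line and evaluates the log-likelihood over a small grid of candidate parameters; the values A = 1, B = 0.5, σ = 1 and N = 50 are assumptions for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, A_true, B_true, sigma = 50, 1.0, 0.5, 1.0   # assumed example values
n = np.arange(N)
x = A_true + B_true * n + sigma * rng.standard_normal(N)   # x[n] = A + Bn + w[n]

def log_likelihood(A, B):
    # log p(x; A, B) for the Gaussian PDF given on the slide
    resid = x - A - B * n
    return -N / 2 * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

# Evaluate over a grid of candidate (A, B) and pick the most likely pair
A_grid = np.linspace(0.0, 2.0, 41)
B_grid = np.linspace(0.0, 1.0, 41)
L = np.array([[log_likelihood(a, b) for b in B_grid] for a in A_grid])
i, j = np.unravel_index(np.argmax(L), L.shape)
print("most likely (A, B) on the grid:", A_grid[i], B_grid[j])
# A bias in A shifts the whole fitted line; a bias in B tilts it,
# with an error that grows linearly with n.
```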


Bias in parameter estimation

Estimation theory (scalar case): estimate the value of an unknown parameter, θ, from a set of observations of a random variable described by that parameter

θ̂ = g(x[0], x[1], . . . , x[N − 1])

Example: given a set of observations from a Gaussian distribution, estimate the mean or variance from these observations.

◦ Recall that in linear mean square estimation, when estimating a value of a random variable y from an observation of a related random variable x, the coefficients a and b in the estimate ŷ = ax + b depend upon the mean and variance of x and y, as well as on their correlation.

The difference between the expected value of the estimate and the actual value θ is called the bias and will be denoted by B.

B = E{θ̂_N} − θ

where θ̂_N denotes the estimate over N data samples, x[0], . . . , x[N − 1]


Asymptotic unbiasedness

If the bias is zero, then the expected value of the estimate is equal to the true value, that is

E{θ̂_N} = θ,  i.e.  B = E{θ̂_N} − θ = 0

and the estimate is said to be unbiased.

If B ≠ 0 then the estimator θ̂ = g(x) is said to be biased.

Example: Consider the sample mean estimator of the signal x[n] = A + w[n], w ∼ N(0, 1), given by

Â = x̄ = (1/(N + 2)) ∑_{n=0}^{N−1} x[n],   that is, θ̂ = Â

Is the above sample mean estimator of the true mean A biased?

More often: an estimator is biased, but the bias B → 0 as N → ∞,

lim_{N→∞} E{θ̂_N} = θ

Such an estimator is said to be asymptotically unbiased.


How about the variance?

◦ It is desirable that an estimator be either unbiased or asymptotically unbiased (think about the power of estimation error due to DC offset)

◦ For an estimate to be meaningful, it is necessary that we use the available statistics effectively, that is,

Var → 0 as N → ∞

or in other words

lim_{N→∞} var{θ̂_N} = lim_{N→∞} E{|θ̂_N − E{θ̂_N}|²} = 0

If θ̂_N is unbiased then E{θ̂_N} = θ, and from the Tchebycheff inequality, ∀ ε > 0

Pr{|θ̂_N − θ| ≥ ε} ≤ var{θ̂_N} / ε²

⇒ if Var → 0 as N → ∞, then the probability that θ̂_N differs by more than ε from the true value will go to zero (showing consistency).

In this case, θ̂_N is said to converge to θ with probability one.
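A minimal numerical check of this bound for the sample mean of x[n] = A + w[n], w[n] ∼ N(0, σ²): the empirical probability Pr{|θ̂_N − θ| ≥ ε} is compared with the Tchebycheff bound var{θ̂_N}/ε² = σ²/(Nε²). The values A = 1, σ = 1 and ε = 0.2 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma, eps, trials = 1.0, 1.0, 0.2, 20000   # assumed example values

for N in (10, 100, 1000):
    x = A + sigma * rng.standard_normal((trials, N))
    theta_hat = x.mean(axis=1)                       # sample mean estimates
    p_emp = np.mean(np.abs(theta_hat - A) >= eps)    # empirical Pr{|error| >= eps}
    bound = sigma**2 / (N * eps**2)                  # Tchebycheff bound var/eps^2
    print(f"N={N:5d}  empirical={p_emp:.4f}  bound={min(bound, 1.0):.4f}")
# Both the empirical probability and the bound shrink as N grows (consistency).
```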


Mean square convergence

Another form of convergence, stronger than convergence with probability one, is mean square convergence.

An estimate θ̂_N is said to converge to θ in the mean-square sense, if

lim_{N→∞} E{|θ̂_N − θ|²} = 0     (the quantity under the limit is the mean square error)

◦ For an unbiased estimator this is equivalent to the previous conditionthat the variance of the estimate goes to zero

◦ An estimate is said to be consistent if it converges, in some sense, tothe true value of the parameter

◦ We say that the estimator is consistent if it is asymptoticallyunbiased and has a variance that goes to zero as N → ∞


Example: Assessing the performance of the Sample Mean as an estimator

Consider the estimation of a DC level, A, in random noise, which could be modelled as

x[n] = A+ w[n]

w[n] ∼ some zero mean random process.

◦ Aim: to estimate A given {x[0], x[1], . . . , x[N − 1]}

◦ Intuitively, the sample mean is a reasonable estimator

Â = (1/N) ∑_{n=0}^{N−1} x[n]

Q1: How close will Â be to A?

Q2: Are there better estimators than the sample mean?


Mean and variance of the Sample Mean estimator

x[n] = A + w[n],   w[n] ∼ N(0, σ²)

Estimator = f(random data)  ⇒  a random variable itself

⇒ its performance must be judged statistically

(1) What is the mean of Â?

E{Â} = E{(1/N) ∑_{n=0}^{N−1} x[n]} = (1/N) ∑_{n=0}^{N−1} E{x[n]} = A    ⇒ unbiased

(2) What is the variance of Â?

Assumption: the samples w[n] are uncorrelated

E{(Â − A)²} = var{Â} = var{(1/N) ∑_{n=0}^{N−1} x[n]} = (1/N²) ∑_{n=0}^{N−1} var{x[n]} = (1/N²) N σ² = σ²/N

Notice the variance → 0 as N → ∞  ⇒ consistent (see your P&A sets)
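Both results are easy to confirm by simulation; the sketch below (assuming A = 2, σ² = 4 and 10,000 independent realisations, values chosen only for illustration) estimates the mean and variance of Â by Monte Carlo and compares the latter with σ²/N.

```python
import numpy as np

rng = np.random.default_rng(2)
A, sigma2, N, trials = 2.0, 4.0, 50, 10000     # assumed example values

# Each row is one realisation of x[n] = A + w[n], n = 0..N-1
x = A + np.sqrt(sigma2) * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)                         # sample mean estimator per realisation

print("E{A_hat}   ~", A_hat.mean(), "(true A =", A, ")")
print("var{A_hat} ~", A_hat.var(), "(sigma^2/N =", sigma2 / N, ")")
# The estimated mean is close to A (unbiased) and the estimated
# variance is close to sigma^2/N, which -> 0 as N grows (consistent).
```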


Minimum Variance Unbiased (MVU) estimation

Aim: to establish “good” estimators of unknown deterministic parameters

Unbiased estimator ⇒ "on the average" yields the true value of the unknown parameter independent of its particular value, i.e.

E(θ̂) = θ,   a < θ < b

where (a, b) denotes the range of possible values of θ

Example: Unbiased estimator for a DC level in White Gaussian Noise (WGN). If we are given

x[n] = A + w[n],   n = 0, 1, . . . , N − 1

where A is the unknown, but deterministic, parameter to be estimated which lies within the interval (−∞, ∞), then the sample mean can be used as an estimator of A, namely

Â = (1/N) ∑_{n=0}^{N−1} x[n]


Careful: the estimator is parameter dependent!

An estimator may be unbiased for certain values of the unknown parameter but not for all; such an estimator is not unbiased

Consider another sample mean estimator:

Ǎ = (1/(2N)) ∑_{n=0}^{N−1} x[n]

Therefore: E{Ǎ} = 0 when A = 0, but E{Ǎ} = A/2 when A ≠ 0 (parameter dependent)

Hence Ǎ is not an unbiased estimator

◦ A biased estimator introduces a "systematic error" which should not generally be present

◦ Our goal is to avoid bias if we can, as we are interested in stochastic signal properties and bias is largely deterministic


Remedy

(also look in the Assignment dealing with PSD in your CW )

Several unbiased estimates of the same quantity may be averaged together, i.e. given the L independent estimates

{θ̂_1, θ̂_2, . . . , θ̂_L}

We may choose to average them, to yield

θ̂ = (1/L) ∑_{l=1}^{L} θ̂_l

Our assumption was that the individual estimators are unbiased, with equal variance, and uncorrelated with one another.

Then (NB: averaging biased estimators will not remove the bias)

E{θ̂} = θ   and   var{θ̂} = (1/L²) ∑_{l=1}^{L} var{θ̂_l} = (1/L) var{θ̂_l}

Note, as L → ∞, θ̂ → θ (consistent)


Effects of averaging for real world data

Problem 3.4 from your P/A sets: heart rate estimation

The heart rate, h, of a patient is automatically recorded by a computer every 100 ms. In one second the measurements {ĥ_1, ĥ_2, . . . , ĥ_10} are averaged to obtain ĥ. Given that E{ĥ_i} = αh for some constant α and var(ĥ_i) = 1 for all i, determine whether averaging improves the estimator if α = 1 and α = 1/2.

ĥ = (1/10) ∑_{i=1}^{10} ĥ_i,    E{ĥ} = (1/10) ∑_{i=1}^{10} E{ĥ_i} = αh

If α = 1 the estimator is unbiased; if α = 1/2 it will not be unbiased unless the estimator is formed as ĥ = (1/5) ∑_{i=1}^{10} ĥ_i.

var{ĥ} = (1/L²) ∑_{i=1}^{10} var{ĥ_i} = 1/10   (for L = 10 and var(ĥ_i) = 1)

[Figure: PDFs p(ĥ_i) before and after averaging, for α = 1 (centred at h) and α = 1/2 (centred at h/2); averaging reduces the spread but not the bias.]
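A minimal Monte Carlo sketch of this problem, assuming Gaussian measurement noise of unit variance and a true rate h = 70 (both assumptions for illustration): for α = 1 averaging reduces the variance while keeping the estimate unbiased, whereas for α = 1/2 the averaged estimate stays centred on h/2 unless rescaled by 1/5 instead of 1/10.

```python
import numpy as np

rng = np.random.default_rng(3)
h, trials = 70.0, 10000                         # assumed true heart rate and trial count

for alpha in (1.0, 0.5):
    # Ten measurements per second: E{h_i} = alpha*h, var(h_i) = 1
    h_i = alpha * h + rng.standard_normal((trials, 10))
    h_avg = h_i.mean(axis=1)                    # plain average (1/10) * sum
    print(f"alpha={alpha}: E{{h_avg}} ~ {h_avg.mean():.2f},"
          f" var ~ {h_avg.var():.3f}, bias ~ {h_avg.mean() - h:.2f}")
# alpha = 1: averaging keeps the estimate unbiased and cuts the variance to 1/10.
# alpha = 1/2: the variance still drops, but the estimate stays biased near h/2;
# only the rescaled estimator (1/5) * sum(h_i) would remove that bias.
```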


Minimum variance criterion

⇒ An optimality criterion is necessary to define an optimal estimator

Mean Square Error (MSE)

MSE(θ̂) = E{(θ̂ − θ)²}

measures the average mean squared deviation of the estimator from the true value.

This criterion leads, however, to unrealisable estimators - namely, ones which are not solely a function of the data

MSE(θ̂) = E{[(θ̂ − E(θ̂)) + (E(θ̂) − θ)]²} = var(θ̂) + [E(θ̂) − θ]² = var(θ̂) + B²(θ̂)

⇒ MSE = VARIANCE OF THE ESTIMATOR + SQUARED BIAS
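This decomposition can be verified numerically for any estimator; the sketch below uses, as an assumed example, the biased estimator θ̂ = (1/(N+2)) ∑ x[n] of a DC level A = 1 in unit-variance WGN and checks that the Monte Carlo MSE matches var + bias².

```python
import numpy as np

rng = np.random.default_rng(4)
A, N, trials = 1.0, 20, 100000                  # assumed example values

x = A + rng.standard_normal((trials, N))        # x[n] = A + w[n], w ~ N(0, 1)
theta_hat = x.sum(axis=1) / (N + 2)             # a deliberately biased estimator

mse  = np.mean((theta_hat - A) ** 2)
var  = theta_hat.var()
bias = theta_hat.mean() - A
print("MSE          ~", mse)
print("var + bias^2 ~", var + bias ** 2)
# The two numbers agree (up to Monte Carlo error): MSE = variance + squared bias.
```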


Example: An MSE estimator with ’gain factor’

Consider the following estimator for DC level in WGN

Â = a · (1/N) ∑_{n=0}^{N−1} x[n]

Task: Find a which results in minimum MSE

Given E{Â} = aA and var(Â) = a²σ²/N, we have

MSE(Â) = a²σ²/N + (a − 1)²A²

Of course, the choice of a = 1 removes the bias; but is it the choice which minimises the MSE?


Continued: MSE estimator with ’gain’

But, can we find a analytically? Differentiating with respect to a yields

∂MSE(Â)/∂a = 2aσ²/N + 2(a − 1)A²

and setting the result to zero gives the optimal value

a_opt = A²/(A² + σ²/N),   but we do not know the value of A

◦ The optimal value depends upon A which is the unknown parameter

◦ Comment - any criterion which depends on the value of the unknown parameter to be found is likely to yield unrealisable estimators

◦ Practically, the minimum MSE estimator needs to be abandoned, and the estimator must be constrained to be unbiased
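The difficulty is easy to see numerically; the sketch below evaluates a_opt = A²/(A² + σ²/N) and the resulting MSE for a few assumed values of A (with σ² = 1 and N = 10 chosen purely for illustration), and compares it with the unbiased choice a = 1.

```python
import numpy as np

sigma2, N = 1.0, 10                       # assumed noise variance and record length

def mse(a, A):
    # MSE(A_hat) = a^2*sigma^2/N + (a - 1)^2 * A^2, from the slide
    return a**2 * sigma2 / N + (a - 1) ** 2 * A**2

for A in (0.5, 1.0, 4.0):
    a_opt = A**2 / (A**2 + sigma2 / N)    # minimiser of the MSE
    print(f"A={A}: a_opt={a_opt:.3f}, MSE(a_opt)={mse(a_opt, A):.4f},"
          f" MSE(a=1)={mse(1.0, A):.4f}")
# a_opt changes with the unknown A, so this 'optimal' estimator cannot be
# realised from the data alone; a = 1 (unbiased) is the practical choice.
```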


A counter-example: A little bias can help

(but the estimator is difficult to control)

Q: Let {y[n]}, n = 1, . . . , N be iid Gaussian variables ∼ N(0, σ²). Consider the following estimate of σ²

σ̂² = (α/N) ∑_{n=1}^{N} y²[n],   α > 0

Find α which minimises the MSE of σ̂².

S: It is straightforward to show that E{σ̂²} = ασ² and

MSE(σ̂²) = E{(σ̂² − σ²)²} = E{σ̂⁴} + σ⁴(1 − 2α)
        = (α²/N²) ∑_{n=1}^{N} ∑_{s=1}^{N} E{y²[n] y²[s]} + σ⁴(1 − 2α)
        = (α²/N²) (N²σ⁴ + 2Nσ⁴) + σ⁴(1 − 2α) = σ⁴ [α²(1 + 2/N) + (1 − 2α)]

The MMSE is obtained for α_min = N/(N + 2) and has the value min MSE(σ̂²) = 2σ⁴/(N + 2). Given that the minimum variance of an unbiased estimator (CRLB, later) is 2σ⁴/N, this is an example of a biased estimator which obtains a lower MSE than the CRLB.
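A Monte Carlo sketch of this counter-example (with assumed values σ² = 2 and N = 10): the MSE of the scaled estimator with α = N/(N + 2) is compared against the unbiased choice α = 1, alongside the theoretical values 2σ⁴/(N + 2) and 2σ⁴/N.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, N, trials = 2.0, 10, 200000            # assumed example values

y = np.sqrt(sigma2) * rng.standard_normal((trials, N))   # y[n] ~ N(0, sigma^2)
s = (y ** 2).sum(axis=1) / N                   # (1/N) * sum of y^2[n]

for alpha, label in ((1.0, "unbiased (alpha = 1)"),
                     (N / (N + 2), "alpha = N/(N+2)")):
    mse = np.mean((alpha * s - sigma2) ** 2)
    print(f"{label:22s} MSE ~ {mse:.4f}")

print("theory: 2*sigma^4/N     =", 2 * sigma2**2 / N)
print("theory: 2*sigma^4/(N+2) =", 2 * sigma2**2 / (N + 2))
# The slightly biased choice alpha = N/(N+2) yields a lower MSE than the
# unbiased estimator, whose variance equals the CRLB of 2*sigma^4/N.
```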


Example: Estimation of hearing threshold (hearing aids)

Objective estimation of hearing threshold:

◦ Design AM or FM stimuli with carrier frequency up to 20 kHz

◦ Modulate this carrier with e.g. 40 Hz envelope

◦ Present this sound to the listener

◦ Measure the electroencephalogram, which should contain a 40 Hz signal if the subject was able to detect the sound

◦ This is a good objective estimate of the hearing curve


Key points

Simulation: Auditory steady state response (ASSR), 40 Hz amplitude modulated

◦ An estimator is a random variable; its performance can only be completely described statistically or by a PDF.

◦ An estimator's performance cannot be conclusively assessed by one computer simulation - we often average M trials of independent simulations, called "Monte Carlo" analysis.

◦ Performance and computation complexity are often traded - the optimal estimator is replaced by a suboptimal but realisable one.

[Figure: left column - the EEG time-series segments and the full record (amplitude ×10⁻⁶ vs Time (s)); right column - the PSD analysis of each segment and, bottom right, the averaged PSD analysis (vs Frequency (Hz)).]

Simulation: Left column: the 24 s of ITE EEG, sampled at 256 Hz, was split into 3 segments of length 8 s. Right column: the respective periodograms (rectangular window). Bottom right: the averaged periodogram; observe the reduction in variance (in red).
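The same variance reduction can be sketched on synthetic data; below, a 40 Hz tone in white noise stands in for the ASSR EEG (the 256 Hz sampling rate, the 24 s record and the amplitudes are assumptions for illustration), the record is split into three 8 s segments, and the spread of a single periodogram is compared with that of the segment-averaged one.

```python
import numpy as np

rng = np.random.default_rng(6)
fs, T, f0 = 256, 24, 40                      # assumed sampling rate (Hz), duration (s), tone (Hz)
t = np.arange(fs * T) / fs
x = 2e-6 * np.sin(2 * np.pi * f0 * t) + 5e-6 * rng.standard_normal(t.size)

def periodogram(seg):
    # Rectangular-window periodogram of one segment
    X = np.fft.rfft(seg)
    return np.abs(X) ** 2 / (fs * seg.size)

segments = np.split(x, 3)                    # three 8 s segments
P = np.array([periodogram(s) for s in segments])
P_avg = P.mean(axis=0)                       # the averaged periodogram

freqs = np.fft.rfftfreq(segments[0].size, d=1 / fs)
band = (freqs > 60) & (freqs < 100)          # a noise-only band away from 40 Hz
print("periodogram variance (60-100 Hz), single segment:", P[0][band].var())
print("periodogram variance (60-100 Hz), averaged      :", P_avg[band].var())
print("peak of the averaged periodogram at", freqs[np.argmax(P_avg)], "Hz")
```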


Desired: minimum variance unbiased (MVU) estimator

Minimising the variance of an unbiased estimator concentrates the PDF of the error about zero ⇒ estimation error is therefore less likely to be large

◦ Existence of the MVU estimator

The MVU estimator is an unbiased estimator with minimum variance for all θ, that is, θ̂_3 on the graph.


Methods to find the MVU estimator

◦ The MVU estimator may not always exist

◦ A single unbiased estimator may not exist – in which case a search for the MVU is fruitless!

1. Determine the Cramer-Rao lower bound (CRLB) and find some estimator which satisfies it

2. Apply the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem

3. Restrict the class of estimators to be not only unbiased, but also linear (BLUE)

4. Sequential vs. block estimators

5. Adaptive estimators


Extensions to the vector parameter case

◦ If θ = [θ_1, θ_2, . . . , θ_p]^T ∈ R^{p×1} is a vector of unknown parameters, an estimator is unbiased if

E(θ̂_i) = θ_i,   a_i < θ_i < b_i,   for i = 1, 2, . . . , p

and by defining E(θ̂) = [E(θ̂_1), E(θ̂_2), . . . , E(θ̂_p)]^T, an unbiased estimator has the property E(θ̂) = θ within the p-dimensional space of parameters

◦ An MVU estimator has the additional property that var(θ̂_i) for i = 1, 2, . . . , p is minimum among all unbiased estimators


Summary and food for thoughts

◦ We are now equipped with performance metrics for assessing the goodness of any estimator (bias, variance, MSE)

◦ Since MSE = var + bias², some biased estimators may yield low MSE. However, we prefer the minimum variance unbiased (MVU) estimators

◦ Even a simple Sample Mean estimator is a very rich example of the advantages of statistical estimators

◦ The knowledge of the parametrised PDF p(data; parameters) is very important for designing efficient estimators

◦ We have introduced statistical "point estimators"; would it be useful to also know the "confidence" we have in our point estimate?

◦ In many disciplines it is useful to design so-called "set membership estimates", where the output of an estimator belongs to a pre-defined bound (range) of values

◦ We will next address linear, best linear unbiased, maximum likelihood, least squares, sequential least squares, and adaptive estimators


Homework: Check another proof for the MSE expression

MSE(θ̂) = var(θ̂) + bias²(θ̂)

Note: var(x) = E[x²] − [E[x]]²   (∗)

Idea: let x = θ̂ − θ → substitute into (∗)

to give   var(θ̂ − θ) [term (1)] = E[(θ̂ − θ)²] [term (2)] − [E[θ̂ − θ]]² [term (3)]   (∗∗)

Let us now evaluate these terms:

(1) var(θ̂ − θ) = var(θ̂)   (θ is a deterministic constant)

(2) E[(θ̂ − θ)²] = MSE

(3) [E[θ̂ − θ]]² = [E[θ̂] − E[θ]]² = [E[θ̂] − θ]² = bias²(θ̂)

Substitute (1), (2), (3) into (**) to give

var(θ̂) = MSE − bias² ⇒ MSE = var(θ̂) + bias²(θ̂)


Recap: Unbiased estimators

Due to the linearity of the expectation operator E{·}, that is

E{a+ b} = E{a}+ E{b}

the sample mean operator can be simply shown to be unbiased, i.e.

E{Â} = (1/N) ∑_{n=0}^{N−1} E{x[n]} = (1/N) ∑_{n=0}^{N−1} A = A

◦ In some applications, the value of A may be constrained to be positive - a component value such as an inductor, capacitor or resistor would be positive (prior knowledge)

◦ For N data points in random noise, unbiased estimators generally have symmetric PDFs centred about their true value, i.e.

Â ∼ N(A, σ²/N)

