
Unit 1: Minimum Variance Unbiased Estimator
Sau-Hsuan Wu
Institute of Communications Engineering, ECE, NCTU

What is estimation all about?
We want to know (or estimate) an unknown quantity $\theta$.
This quantity may not be obtainable directly, or it is a notion (quantity) derived from a set of observations.
In any event, we need to retrieve the value of $\theta$ from a set of observations $\mathbf{x} = [x[0], \ldots, x[N-1]]$, which are related to $\theta$.
We hope to determine $\theta$ according to an estimator $\hat{\theta} = g(x[0], \ldots, x[N-1])$.

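How do we start?
Suppose we have some knowledge about the collected data. Examples:
  $x[n] = A_1 + w[n]$, n = 0,…,N-1
  $x[n] = A_2 + B_2 n + w[n]$, n = 0,…,N-1
For the first model, a natural estimator is the sample mean $\hat{A}_1 = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$. What about $\hat{A}_2$ and $\hat{B}_2$?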

An estimator may be thought of as a rule that assigns a value to $\hat{\theta}$ for each realization of $\mathbf{x}$.

The estimate of $\theta$ is the value of $\hat{\theta}$ for a given realization of $\mathbf{x}$.

How do we assess the performance of estimation?
What concerns us most about the estimate?
  How close will Â be to A?
  Are there better estimators than the sample mean? For instance, what about Ă = x[0]?
  What are the differences between Â and Ă?



Can we have some performance measure?
We model the data by its probability density function (PDF), assuming that the data are inherently random.
As an example, for a single observation $x[0] = \theta + w[0]$ with $w[0] \sim N(0, \sigma^2)$:
$$p(x[0];\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{(x[0]-\theta)^2}{2\sigma^2}\right]$$
We have a class of PDFs where each one is different due to a different value of $\theta$, i.e. the PDFs are parameterized by $\theta$.
The parameter $\theta$ is assumed to be deterministic but unknown.
Given the PDF, we can calculate some statistics of the estimates: Is E(Â) = E(Ă)? Is var(Â) = var(Ă)?



Note: In an actual problem, we are not given a PDF but must choose one that is not only consistent with the problem constraints and any prior knowledge, but also mathematically tractable.

Sometimes, we might want to constrain the estimator to produce values in a certain range.
To incorporate this prior knowledge, we can assume that $\theta$ is no longer deterministic but a random variable having, for instance, a uniform distribution over the interval [-U, U].
Assign a PDF to $\theta$; then the data are described by the joint PDF $p(\mathbf{x}, \theta) = p(\mathbf{x} \mid \theta)\, p(\theta)$.
Any estimator that yields estimates according to the prior knowledge of $\theta$ is termed a Bayesian estimator.



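We conclude for the time being that we are hoping to have an estimator that gives
  An unbiased mean of the estimate: $E(\hat{\theta}) = \theta$
  A minimum variance for the estimate: $\mathrm{var}(\hat{\theta})$ as small as possible

Is an unbiased estimator always the optimal estimator?
Consider a widely used optimality criterion, the minimum mean squared error (MSE) criterion:
$$\mathrm{mse}(\hat{\theta}) = E[(\hat{\theta}-\theta)^2] = \mathrm{var}(\hat{\theta}) + b^2(\theta), \quad b(\theta) = E(\hat{\theta}) - \theta$$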



As an example, consider the modified estimator
$$\breve{A} = a\,\frac{1}{N}\sum_{n=0}^{N-1} x[n]$$
We attempt to find the constant a which yields the minimum MSE.
Since $E(\breve{A}) = aA$ and $\mathrm{var}(\breve{A}) = a^2\sigma^2/N$, we have
$$\mathrm{mse}(\breve{A}) = \frac{a^2\sigma^2}{N} + (a-1)^2A^2$$
Taking the derivative w.r.t. a and setting it to zero leads to
$$a_{opt} = \frac{A^2}{A^2 + \sigma^2/N}$$
The optimal value of a depends upon the unknown parameter A, so the estimator is not realizable.

How do we resolve this dilemma?
Since $\mathrm{mse}(\breve{A}) = \mathrm{var}(\breve{A}) + b^2$, as an alternative we set b = 0 and search for the estimator that minimizes $\mathrm{var}(\breve{A})$: the minimum variance unbiased (MVU) estimator.
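The dependence of the MSE-optimal scale factor on the unknown A can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `mse_scaled_mean` and `optimal_a` are illustrative, and the noise variance is written as `sigma2`.

```python
def mse_scaled_mean(a, A, sigma2, N):
    """MSE of a * (sample mean): variance a^2*sigma2/N plus squared bias (a-1)^2*A^2."""
    return a * a * sigma2 / N + (a - 1.0) ** 2 * A * A


def optimal_a(A, sigma2, N):
    """Scale factor that minimizes the MSE; note that it depends on the unknown A."""
    return A * A / (A * A + sigma2 / N)


if __name__ == "__main__":
    sigma2, N = 1.0, 10
    for A in (0.5, 1.0, 3.0):
        a = optimal_a(A, sigma2, N)
        # Compare the minimum MSE with the MSE of the unbiased choice a = 1.
        print(A, a, mse_scaled_mean(a, A, sigma2, N), mse_scaled_mean(1.0, A, sigma2, N))
```

For every A the scaled estimator beats the unbiased sample mean in MSE, but the scale factor changes with A, which is exactly why it cannot be implemented without knowing A.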



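However, does an MVU estimator always exist for all $\theta$? Example:
  $x[0] \sim N(\theta, 1)$
  $x[1] \sim N(\theta, 1)$ if $\theta \ge 0$, and $x[1] \sim N(\theta, 2)$ if $\theta < 0$
Both of the two estimators
$$\hat{\theta}_1 = \tfrac{1}{2}(x[0] + x[1]), \qquad \hat{\theta}_2 = \tfrac{2}{3}x[0] + \tfrac{1}{3}x[1]$$
are unbiased.
Therefore
  $\mathrm{var}(\hat{\theta}_1) = 18/36$ if $\theta \ge 0$, and $27/36$ if $\theta < 0$
  $\mathrm{var}(\hat{\theta}_2) = 20/36$ if $\theta \ge 0$, and $24/36$ if $\theta < 0$
Neither estimator has a variance uniformly less than or equal to 18/36 for all $\theta$, so neither is better than the other over the whole parameter range.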
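The variances in this example can be reproduced with exact arithmetic. The sketch below assumes the two estimators are the equal-weight average and the (2/3, 1/3) weighted average of x[0] and x[1], as in the standard textbook version of this example; the function names are illustrative.

```python
from fractions import Fraction


def var_linear(w0, w1, var0, var1):
    """Variance of w0*x[0] + w1*x[1] for independent x[0], x[1]."""
    return w0 ** 2 * var0 + w1 ** 2 * var1


def estimator_variances(theta_nonneg):
    """Return (var of equal-weight average, var of (2/3, 1/3) average)."""
    var0 = Fraction(1)
    var1 = Fraction(1) if theta_nonneg else Fraction(2)  # x[1] is noisier when theta < 0
    v_equal = var_linear(Fraction(1, 2), Fraction(1, 2), var0, var1)
    v_weighted = var_linear(Fraction(2, 3), Fraction(1, 3), var0, var1)
    return v_equal, v_weighted


if __name__ == "__main__":
    print("theta >= 0:", estimator_variances(True))
    print("theta <  0:", estimator_variances(False))
```

The equal-weight average wins for nonnegative theta and the weighted average wins for negative theta, so neither has the smaller variance everywhere.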


Cramer-Rao Lower Bound



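How do we measure the accuracy of estimation?
Consider a single observation $x[0] = \theta + w[0]$ with $w[0] \sim N(0, \sigma^2)$.
The unbiased estimator is $\hat{\theta} = x[0]$ and its variance is $\sigma^2$.
The accuracy of estimation improves as $\sigma^2$ decreases.

We can see this from the likelihood function (LK) of $\theta$:
$$p(x[0];\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{(x[0]-\theta)^2}{2\sigma^2}\right]$$

The accuracy of estimation increases with the "sharpness" of the LK, which is inversely proportional to $\sigma^2$ in this case.
Actually, the variance $\sigma^2$ in this case is the negative inverse of the second derivative of the logarithm of the likelihood function (LLK):
$$-\frac{\partial^2 \ln p(x[0];\theta)}{\partial\theta^2} = \frac{1}{\sigma^2}$$
Therefore, in a more rigorous definition, the sharpness of the LK is measured by the curvature of the LLK, i.e. $-\partial^2 \ln p(x[0];\theta)/\partial\theta^2$.
In general, the curvature depends on x[0] as well, thus a more appropriate measure of curvature is the average
$$-E\!\left[\frac{\partial^2 \ln p(x[0];\theta)}{\partial\theta^2}\right]$$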



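Now, we put our problem in a more rigorous statement.
  Define a scalar parameter $\alpha = g(\theta)$.
  Define a PDF of the observation, $p(\mathbf{x};\theta)$, which is parameterized by $\theta$.
  Consider all unbiased estimators $\hat{\alpha}$, namely $E(\hat{\alpha}) = \alpha = g(\theta)$.
  The variance of the estimation is given by $\mathrm{var}(\hat{\alpha}) = \int (\hat{\alpha}-\alpha)^2\, p(\mathbf{x};\theta)\, d\mathbf{x}$.
  We want to find a lower bound for $\mathrm{var}(\hat{\alpha})$.

How do we start?
The hint from the previous example is that the variance is inversely proportional to the average curvature of $\ln p(\mathbf{x};\theta)$.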



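Therefore, we may mathematically associate the variance with the average curvature like this:
$$\mathrm{var}(\hat{\alpha}) \ge \frac{(\partial g(\theta)/\partial\theta)^2}{-E\left[\partial^2 \ln p(\mathbf{x};\theta)/\partial\theta^2\right]}$$
The larger the curvature, the smaller the lower bound for the variance of estimation.

By the Cauchy-Schwarz inequality, we have
$$\left[\int (\hat{\alpha}-\alpha)\,\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\, p(\mathbf{x};\theta)\, d\mathbf{x}\right]^2 \le \int (\hat{\alpha}-\alpha)^2 p(\mathbf{x};\theta)\, d\mathbf{x} \cdot \int \left(\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\right)^2 p(\mathbf{x};\theta)\, d\mathbf{x}$$

Then, can we relate the square of the first-order derivative to the second-order derivative, i.e. find the connection between $E[(\partial \ln p/\partial\theta)^2]$ and $E[\partial^2 \ln p/\partial\theta^2]$?

Observe that, if the integral and the partial derivative are interchangeable,
$$\int \frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\, p(\mathbf{x};\theta)\, d\mathbf{x} = \int \frac{\partial p(\mathbf{x};\theta)}{\partial\theta}\, d\mathbf{x} = \frac{\partial}{\partial\theta}\int p(\mathbf{x};\theta)\, d\mathbf{x} = 0$$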



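Then, differentiating $\int (\partial \ln p/\partial\theta)\, p\, d\mathbf{x} = 0$ once more w.r.t. $\theta$, we have
$$E\!\left[\left(\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\right)^2\right] = -E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right]$$
and, differentiating the unbiasedness constraint $\int \hat{\alpha}\, p\, d\mathbf{x} = g(\theta)$,
$$\int (\hat{\alpha}-\alpha)\,\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\, p(\mathbf{x};\theta)\, d\mathbf{x} = \frac{\partial g(\theta)}{\partial\theta}$$
supposing that the regularity condition $E[\partial \ln p(\mathbf{x};\theta)/\partial\theta] = 0$ holds.

Substituting the results into the Cauchy-Schwarz inequality gives
$$\left(\frac{\partial g(\theta)}{\partial\theta}\right)^2 \le \mathrm{var}(\hat{\alpha}) \cdot \left(-E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right]\right)$$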



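Therefore,
$$\mathrm{var}(\hat{\alpha}) \ge \frac{(\partial g(\theta)/\partial\theta)^2}{-E\left[\partial^2 \ln p(\mathbf{x};\theta)/\partial\theta^2\right]}$$
The equality holds if and only if
$$\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta} = \frac{1}{c(\theta)}(\hat{\alpha}-\alpha)$$
where c might be a function of $\theta$ but not of $\mathbf{x}$.
If $\alpha = g(\theta) = \theta$, then
$$\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta} = \frac{1}{c(\theta)}(\hat{\theta}-\theta)$$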



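Taking the derivative w.r.t. $\theta$ on both sides leads to
$$\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2} = -\frac{1}{c(\theta)} + \frac{\partial (1/c(\theta))}{\partial\theta}(\hat{\theta}-\theta)$$
Taking the expectation on both sides, for unbiased estimators we have
$$-E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right] = \frac{1}{c(\theta)} = I(\theta)$$
Thus,
$$\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta} = I(\theta)(\hat{\theta}-\theta)$$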



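Clearly, the Cramer-Rao bound (CRB)
$$\mathrm{var}(\hat{\theta}) \ge \frac{1}{-E\left[\partial^2 \ln p(\mathbf{x};\theta)/\partial\theta^2\right]}$$
is valid if the regularity condition $E[\partial \ln p(\mathbf{x};\theta)/\partial\theta] = 0$ holds.

Does the regularity condition always hold? It holds when differentiation and integration can be interchanged; refer to Leibniz's rule.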



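The gradient, with respect to $\theta$, of the log likelihood function (LLK) is called the score:
$$s(\theta) = \frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}$$
Under the regularity condition, the expected value of the score is zero, namely $E[s(\theta)] = 0$.

The Fisher information is the variance of the score: $I(\theta) = E[s^2(\theta)]$.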
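A quick Monte Carlo check of these two score properties, as a minimal sketch for the Gaussian mean model x[n] = theta + w[n] with known noise variance (this model choice, the function names, and the simulation settings are assumptions made for illustration). For this model the score is the sum of (x[n] - theta)/sigma^2 and the Fisher information is N/sigma^2.

```python
import random
import statistics


def score(x, theta, sigma2):
    """Score of the Gaussian mean model: d/dtheta of ln p(x; theta)."""
    return sum(xi - theta for xi in x) / sigma2


def simulate_score(theta=1.0, sigma2=2.0, N=5, trials=20000, seed=0):
    """Return the sample mean and sample variance of the score over many data sets."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    scores = []
    for _ in range(trials):
        x = [rng.gauss(theta, sd) for _ in range(N)]
        scores.append(score(x, theta, sigma2))
    return statistics.fmean(scores), statistics.variance(scores)


if __name__ == "__main__":
    mean_s, var_s = simulate_score()
    print("mean of score:", mean_s)            # should be close to 0
    print("variance of score:", var_s)         # should be close to N / sigma2
    print("Fisher information N/sigma2:", 5 / 2.0)
```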



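By definition,
$$I(\theta) = E\!\left[\left(\frac{\partial \ln p(\mathbf{x};\theta)}{\partial\theta}\right)^2\right]$$
If the regularity condition holds,
$$I(\theta) = -E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial\theta^2}\right]$$
then the information is additive for independent x and y: since $\ln p(x,y;\theta) = \ln p(x;\theta) + \ln p(y;\theta)$, we get $I_{x,y}(\theta) = I_x(\theta) + I_y(\theta)$.

Therefore, it is referred to as the information about $\theta$.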



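Cramer-Rao lower bound for a vector parameter
Let the vector $\boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta})$, let $\hat{\boldsymbol{\alpha}}$ be unbiased, and define
$$\mathbf{C}_{\hat{\boldsymbol{\alpha}}} = \int (\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})(\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})^T p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}$$
$$\mathbf{I}(\boldsymbol{\theta}) = \int \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x} = \int \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} \left(\frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}$$
The goal is to prove
$$\mathbf{C}_{\hat{\boldsymbol{\alpha}}} - \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \ge_{p.s.d.} \mathbf{0}$$
where $\partial \mathbf{g}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ is the Jacobian of $\mathbf{g}$; for $\mathbf{g}(\boldsymbol{\theta}) = \boldsymbol{\theta}$ this reduces to $\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{I}^{-1}(\boldsymbol{\theta}) \ge_{p.s.d.} \mathbf{0}$.
Namely, for arbitrary $\mathbf{a}$,
$$\mathbf{a}^T\left[\mathbf{C}_{\hat{\boldsymbol{\alpha}}} - \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T\right]\mathbf{a} \ge 0$$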



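Cauchy-Schwarz inequality: for arbitrary vectors $\mathbf{a}$ and $\mathbf{b}$,
$$\left(\mathbf{a}^T \mathbf{C}_{\hat{\boldsymbol{\alpha}}}\, \mathbf{a}\right)\left(\mathbf{b}^T \mathbf{I}(\boldsymbol{\theta})\, \mathbf{b}\right) = \left[\mathbf{a}^T \int (\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})(\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})^T p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}\; \mathbf{a}\right] \left[\mathbf{b}^T \int \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}\; \mathbf{b}\right]$$
$$\ge \left[\int \mathbf{a}^T (\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})\, \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, \mathbf{b}\; p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}\right]^2$$
$$= \left[\mathbf{a}^T \int \hat{\boldsymbol{\alpha}}\, \frac{\partial p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, d\mathbf{x}\; \mathbf{b} - \mathbf{a}^T \boldsymbol{\alpha} \int \frac{\partial p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, d\mathbf{x}\; \mathbf{b}\right]^2$$
$$= \left[\mathbf{a}^T \frac{\partial}{\partial\boldsymbol{\theta}^T}\!\int \hat{\boldsymbol{\alpha}}\, p(\mathbf{x};\boldsymbol{\theta})\, d\mathbf{x}\; \mathbf{b}\right]^2 = \left[\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{b}\right]^2$$
The second term in the third line vanishes under the regularity condition $\int \partial p(\mathbf{x};\boldsymbol{\theta})/\partial\boldsymbol{\theta}\, d\mathbf{x} = \mathbf{0}$.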



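Therefore, by the Cauchy-Schwarz inequality, we have
$$\left[\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{b}\right]^2 \le \left(\mathbf{a}^T \mathbf{C}_{\hat{\boldsymbol{\alpha}}}\, \mathbf{a}\right)\left(\mathbf{b}^T \mathbf{I}(\boldsymbol{\theta})\, \mathbf{b}\right)$$
Besides, since $\mathbf{b}$ is arbitrary, we can define
$$\mathbf{b} = \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}$$
As a result
$$\left[\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{b}\right]^2 = \left[\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}\right]^2$$
$$\le \left(\mathbf{a}^T \mathbf{C}_{\hat{\boldsymbol{\alpha}}}\, \mathbf{a}\right) \left(\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta})\, \mathbf{I}(\boldsymbol{\theta})\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}\right)$$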



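Since $\mathbf{I}(\boldsymbol{\theta})$ is positive definite, so is $\mathbf{I}^{-1}(\boldsymbol{\theta})$. We have
$$\left[\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}\right]^2 \le \left(\mathbf{a}^T \mathbf{C}_{\hat{\boldsymbol{\alpha}}}\, \mathbf{a}\right) \left(\mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}\right)$$
$$\Rightarrow\; \mathbf{a}^T \mathbf{C}_{\hat{\boldsymbol{\alpha}}}\, \mathbf{a} \ge \mathbf{a}^T \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}$$
$$\Rightarrow\; \mathbf{a}^T \left[\mathbf{C}_{\hat{\boldsymbol{\alpha}}} - \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T\right] \mathbf{a} \ge 0$$
so the bracketed matrix is positive semidefinite.
Since $\mathbf{a}$ is arbitrary, thus
$$\mathbf{C}_{\hat{\boldsymbol{\alpha}}} - \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \ge_{p.s.d.} \mathbf{0}$$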



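When $\mathbf{g}(\boldsymbol{\theta}) = \boldsymbol{\theta}$, we have
$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{I}^{-1}(\boldsymbol{\theta}) \ge \mathbf{0}$$
The conditions for equality in the Cauchy-Schwarz inequality are
$$\mathbf{a}^T(\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha}) = c(\boldsymbol{\theta})\, \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, \mathbf{b} = c(\boldsymbol{\theta})\, \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}^T}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T \mathbf{a}$$
For arbitrary $\mathbf{a}$, this requires
$$\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta})\, \frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \frac{1}{c(\boldsymbol{\theta})}(\hat{\boldsymbol{\alpha}}-\boldsymbol{\alpha})$$
Thus, for $\boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta}) = \boldsymbol{\theta}$,
$$\frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \frac{1}{c(\boldsymbol{\theta})}\, \mathbf{I}(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$$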



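Now take the derivative on both sides w.r.t. $\boldsymbol{\theta}$.
Denote the i-th row of $\mathbf{I}(\boldsymbol{\theta})/c(\boldsymbol{\theta})$ by $\mathbf{I}_i(\boldsymbol{\theta})/c(\boldsymbol{\theta})$, so that $\partial \ln p(\mathbf{x};\boldsymbol{\theta})/\partial\theta_i = [\mathbf{I}_i(\boldsymbol{\theta})/c(\boldsymbol{\theta})](\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$. For each $\theta_j$ we have
$$\frac{\partial^2 \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_i\, \partial\theta_j} = \frac{\partial}{\partial\theta_j}\!\left[\frac{\mathbf{I}_i(\boldsymbol{\theta})}{c(\boldsymbol{\theta})}\right](\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}) - \frac{[\mathbf{I}(\boldsymbol{\theta})]_{ij}}{c(\boldsymbol{\theta})}$$
Taking the expectation, with $E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta}$, the first term vanishes:
$$E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_i\, \partial\theta_j}\right] = -\frac{[\mathbf{I}(\boldsymbol{\theta})]_{ij}}{c(\boldsymbol{\theta})}$$
Putting the row vectors into a matrix returns
$$-E\!\left[\frac{\partial^2 \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\, \partial\boldsymbol{\theta}^T}\right] = \mathbf{I}(\boldsymbol{\theta}) = \frac{1}{c(\boldsymbol{\theta})}\, \mathbf{I}(\boldsymbol{\theta}) \;\Rightarrow\; c(\boldsymbol{\theta}) = 1$$
Eventually, we arrive at
$$\frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \mathbf{I}(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$$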



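For a positive semidefinite matrix $\mathbf{A}$, $[\mathbf{A}]_{ii} \ge 0$:
since $\mathbf{x}^T\mathbf{A}\mathbf{x} \ge 0$ for all $\mathbf{x}$, choosing the unit vector $\mathbf{e}_i$ gives $\mathbf{e}_i^T\mathbf{A}\mathbf{e}_i = [\mathbf{A}]_{ii} \ge 0$.

Therefore,
$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} - \mathbf{I}^{-1}(\boldsymbol{\theta}) \ge \mathbf{0} \;\Rightarrow\; \mathrm{var}(\hat{\theta}_i) = [\mathbf{C}_{\hat{\boldsymbol{\theta}}}]_{ii} \ge [\mathbf{I}^{-1}(\boldsymbol{\theta})]_{ii}$$

Suppose $E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta}$ and $\hat{\boldsymbol{\theta}}$ achieves the CRLB, i.e. $\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \mathbf{I}^{-1}(\boldsymbol{\theta})$.
Now, for a linear transformation $\boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta}) = \mathbf{A}\boldsymbol{\theta} + \mathbf{b}$, the estimator $\hat{\boldsymbol{\alpha}} = \mathbf{A}\hat{\boldsymbol{\theta}} + \mathbf{b}$ satisfies
$$E(\hat{\boldsymbol{\alpha}}) = \mathbf{A}\boldsymbol{\theta} + \mathbf{b} = \boldsymbol{\alpha} \quad\text{and}\quad \mathbf{C}_{\hat{\boldsymbol{\alpha}}} = \mathbf{A}\mathbf{C}_{\hat{\boldsymbol{\theta}}}\mathbf{A}^T = \frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial \mathbf{g}(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T$$
so $\hat{\boldsymbol{\alpha}}$ achieves the CRLB too.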



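We revisit the line fitting problem $x[n] = A + Bn + w[n]$, n = 0,…,N-1.
Let $\mathbf{x} = [x[0], \ldots, x[N-1]]^T$, $\boldsymbol{\theta} = [A\ B]^T$ and
$$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}$$
Then
$$\frac{\partial \ln p(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \frac{\mathbf{H}^T\mathbf{x} - \mathbf{H}^T\mathbf{H}\boldsymbol{\theta}}{\sigma^2} = \frac{\mathbf{H}^T\mathbf{H}}{\sigma^2}\left[(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x} - \boldsymbol{\theta}\right] = \mathbf{I}(\boldsymbol{\theta})(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})$$
with $\mathbf{I}(\boldsymbol{\theta}) = \mathbf{H}^T\mathbf{H}/\sigma^2$ and $\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}$.

An estimator which is unbiased and attains the CRLB is said to be efficient, in that it efficiently uses the data.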



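Efficiency is maintained over a linear transformation.

Let us take a closer look at $\mathbf{I}(\boldsymbol{\theta}) = \mathbf{H}^T\mathbf{H}/\sigma^2$:
$$\mathbf{I}(\boldsymbol{\theta}) = \frac{1}{\sigma^2}\begin{bmatrix} N & \dfrac{N(N-1)}{2} \\ \dfrac{N(N-1)}{2} & \dfrac{N(N-1)(2N-1)}{6} \end{bmatrix}, \qquad \mathbf{I}^{-1}(\boldsymbol{\theta}) = \sigma^2\begin{bmatrix} \dfrac{2(2N-1)}{N(N+1)} & -\dfrac{6}{N(N+1)} \\ -\dfrac{6}{N(N+1)} & \dfrac{12}{N(N^2-1)} \end{bmatrix}$$
Compared with $x[n] = A + w[n]$, in which $\mathrm{var}(\hat{A}) = \sigma^2/N$, now
$$\mathrm{var}(\hat{A}) = \frac{2(2N-1)\sigma^2}{N(N+1)} > \frac{\sigma^2}{N} \ \text{ if } N \ge 2, \quad\text{and}\quad \mathrm{var}(\hat{B}) = \frac{12\sigma^2}{N(N^2-1)}$$
Thus, the CRLB always increases as we estimate more parameters.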
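The closed-form bounds for the line-fitting problem can be cross-checked by building the Fisher information matrix numerically and inverting it. This is a minimal sketch in plain Python; the helper names `fisher_line`, `inv2`, and `crlb_line` are illustrative and not from the slides.

```python
def fisher_line(N, sigma2):
    """Fisher information H^T H / sigma2 for x[n] = A + B*n + w[n], n = 0..N-1."""
    s0 = N
    s1 = sum(range(N))                 # sum of n
    s2 = sum(n * n for n in range(N))  # sum of n^2
    return [[s0 / sigma2, s1 / sigma2], [s1 / sigma2, s2 / sigma2]]


def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def crlb_line(N, sigma2):
    """Diagonal of the inverse Fisher information: bounds on var(A_hat), var(B_hat)."""
    inv = inv2(fisher_line(N, sigma2))
    return inv[0][0], inv[1][1]


if __name__ == "__main__":
    N, sigma2 = 10, 1.0
    bound_A, bound_B = crlb_line(N, sigma2)
    print(bound_A, 2 * (2 * N - 1) * sigma2 / (N * (N + 1)))
    print(bound_B, 12 * sigma2 / (N * (N * N - 1)))
    print("single-parameter bound sigma2/N:", sigma2 / N)
```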



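Extension to the linear model $\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$ with $\mathbf{w} \sim N(\mathbf{0}, \mathbf{C})$
To derive the MVU estimator, we can use a whitening approach.
  Let $\mathbf{C}^{-1} = \mathbf{D}^T\mathbf{D}$, so that $\mathbf{D}\,E\{\mathbf{w}\mathbf{w}^T\}\,\mathbf{D}^T = \mathbf{I}$.
  $\mathbf{x}' = \mathbf{D}\mathbf{x} = \mathbf{D}(\mathbf{H}\boldsymbol{\theta} + \mathbf{w}) = \mathbf{H}'\boldsymbol{\theta} + \mathbf{w}'$ with $\mathbf{w}' \sim N(\mathbf{0}, \mathbf{I})$.
Based on the previous result, we have
$$\hat{\boldsymbol{\theta}} = (\mathbf{H}'^T\mathbf{H}')^{-1}\mathbf{H}'^T\mathbf{x}' = (\mathbf{H}^T\mathbf{D}^T\mathbf{D}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{D}^T\mathbf{D}\mathbf{x} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}$$
and
$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} = (\mathbf{H}'^T\mathbf{H}')^{-1} = (\mathbf{H}^T\mathbf{D}^T\mathbf{D}\mathbf{H})^{-1} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}$$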
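The whitening argument can be illustrated on the simplest special case: estimating a DC level A (so H is a column of ones) in independent noise with unequal, known variances, which makes C diagonal and D = diag(1/sigma_n). This restriction to a diagonal C and the function names are assumptions made to keep the sketch free of matrix libraries.

```python
def estimate_direct(x, variances):
    """(H^T C^-1 H)^-1 H^T C^-1 x for H = ones and C = diag(variances)."""
    num = sum(xi / v for xi, v in zip(x, variances))
    den = sum(1.0 / v for v in variances)
    return num / den, 1.0 / den  # estimate and its variance


def estimate_whitened(x, variances):
    """Whiten with D = diag(1/sigma_n), then apply ordinary least squares."""
    d = [v ** -0.5 for v in variances]
    h_w = d[:]                                  # H' = D H with H all ones
    x_w = [di * xi for di, xi in zip(d, x)]     # x' = D x
    hth = sum(h * h for h in h_w)               # H'^T H'
    estimate = sum(h * xw for h, xw in zip(h_w, x_w)) / hth
    return estimate, 1.0 / hth


if __name__ == "__main__":
    x = [1.0, 2.0, 4.0]
    variances = [1.0, 4.0, 0.25]
    print(estimate_direct(x, variances))
    print(estimate_whitened(x, variances))
```

Both routes give the same weighted average, with the least noisy sample weighted most heavily.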



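Ex. Estimation of the signal-to-noise ratio (SNR)
Given $x[n] = A + w[n]$, n = 0,…,N-1.
Let $\boldsymbol{\theta} = [A\ \sigma^2]^T$ and $g(\boldsymbol{\theta}) = \theta_1^2/\theta_2 = A^2/\sigma^2 = \alpha$. Then,
$$\mathrm{var}(\hat{\alpha}) \ge \frac{\partial g(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\, \mathbf{I}^{-1}(\boldsymbol{\theta}) \left(\frac{\partial g(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)^T = \frac{4A^2}{N\sigma^2} + \frac{2A^4}{N\sigma^4} = \frac{4\alpha + 2\alpha^2}{N}$$
Recall that
$$\mathbf{I}(\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{N}{\sigma^2} & 0 \\ 0 & \dfrac{N}{2\sigma^4} \end{bmatrix}, \qquad \frac{\partial g(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \begin{bmatrix} \dfrac{2A}{\sigma^2} & -\dfrac{A^2}{\sigma^4} \end{bmatrix}$$
The variance of estimation increases with the SNR $\alpha$. Why? The variance depends on the gradient of $g(\boldsymbol{\theta})$.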
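A small numerical check of the SNR bound, evaluating the gradient-times-inverse-Fisher form and the closed form in terms of the SNR side by side. This is a sketch for illustration; the function names are not from the slides.

```python
def snr_crlb(A, sigma2, N):
    """Bound on var(alpha_hat) via grad(g) * I^-1 * grad(g)^T for alpha = A^2 / sigma2."""
    grad = (2.0 * A / sigma2, -A * A / sigma2 ** 2)
    inv_fisher_diag = (sigma2 / N, 2.0 * sigma2 ** 2 / N)  # inverse of diag(N/s2, N/(2 s2^2))
    return grad[0] ** 2 * inv_fisher_diag[0] + grad[1] ** 2 * inv_fisher_diag[1]


def snr_crlb_closed_form(A, sigma2, N):
    """Same bound written in terms of the SNR alpha: (4*alpha + 2*alpha^2) / N."""
    alpha = A * A / sigma2
    return (4.0 * alpha + 2.0 * alpha ** 2) / N


if __name__ == "__main__":
    for A in (0.5, 1.0, 2.0):
        print(A, snr_crlb(A, 0.5, 20), snr_crlb_closed_form(A, 0.5, 20))
```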


Sufficient Statistics



Is there a set of statistics T(x) of x that is sufficient for estimation?
What do we mean by the sufficiency of a set of T(x)?
  We want to have a set of statistics T(x) such that, given T(x), any x[n] of x = [x[0],…,x[N-1]] is independent of A.

Suppose $\hat{A}_1 = \frac{1}{N}\sum_n x[n]$; are the following sets sufficient?
  S1 = {x[0], x[1],…, x[N-1]} (each element is a statistic)
  S2 = {x[0]+x[1], x[2],…, x[N-1]}
  S3 = {$\sum_n x[n]$} ($\sum_n x[n]$ is also a statistic)

Then, what is the minimum set of sufficient statistics? Given $\sum_n x[n] = T_0$, do we still need the individual data?



We say the conditional PDF $p(\mathbf{x} \mid \sum_n x[n] = T_0; A)$ should not depend on A if the statistic $T_0$ is sufficient.

E.g., compare two cases (a) and (b):

For (a), a value of A near $A_0$ is more likely even given $T_0$.

For (b), however, $p(\mathbf{x} \mid \sum_n x[n] = T_0; A)$ is a constant in A.



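Now, we need to determine $p(\mathbf{x} \mid \sum_n x[n] = T_0; A)$ to show that $\sum_n x[n] = T_0$ is sufficient.

By Bayes' rule,
$$p(\mathbf{x} \mid T(\mathbf{x}) = T_0; A) = \frac{p(\mathbf{x}, T(\mathbf{x}) = T_0; A)}{p(T(\mathbf{x}) = T_0; A)}$$
Since T(x) is a direct function of x,
$$p(\mathbf{x}, T(\mathbf{x}) = T_0; A) = p(\mathbf{x}; A)\, \delta(T(\mathbf{x}) - T_0)$$
Clearly, we have
$$p(\mathbf{x}; A)\, \delta(T(\mathbf{x}) - T_0) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right] \delta(T(\mathbf{x}) - T_0)$$

Thus, we have
$$p(\mathbf{x}; A)\, \delta(T(\mathbf{x}) - T_0) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left[-\frac{1}{2\sigma^2}\left(\sum_{n=0}^{N-1}x^2[n] - 2AT_0 + NA^2\right)\right] \delta(T(\mathbf{x}) - T_0)$$

Since $T(\mathbf{x}) = \sum_n x[n] \sim N(NA, N\sigma^2)$,
$$p(T(\mathbf{x}) = T_0; A) = \frac{1}{\sqrt{2\pi N\sigma^2}} \exp\!\left[-\frac{(T_0 - NA)^2}{2N\sigma^2}\right]$$
Thus
$$p(\mathbf{x} \mid T(\mathbf{x}) = T_0; A) = \frac{\sqrt{N}}{(2\pi\sigma^2)^{(N-1)/2}} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right] \exp\!\left[\frac{T_0^2}{2N\sigma^2}\right] \delta(T(\mathbf{x}) - T_0)$$
which does not depend on A.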



In general, identifying potential sufficient statistics is difficult.

An efficient procedure for finding sufficient statistics is to employ the Neyman-Fisher factorization theorem:

If we can factorize $p(\mathbf{x};\theta)$ into $p(\mathbf{x};\theta) = g(T(\mathbf{x}), \theta)\, h(\mathbf{x})$, where
  g is a function that depends on x only through T(x), and
  h is a function that depends only on x,
then T(x) is a sufficient statistic for $\theta$.
The converse is also true: if T(x) is a sufficient statistic, then $p(\mathbf{x};\theta) = g(T(\mathbf{x}), \theta)\, h(\mathbf{x})$.



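Recall $p(\mathbf{x};A)$:
$$p(\mathbf{x};A) = \underbrace{\frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left[-\frac{1}{2\sigma^2}\left(NA^2 - 2A\sum_{n=0}^{N-1}x[n]\right)\right]}_{g(T(\mathbf{x}),\, A)} \cdot \underbrace{\exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right]}_{h(\mathbf{x})}$$
so $T(\mathbf{x}) = \sum_n x[n]$ is a sufficient statistic for A.

Now, we want to estimate $\sigma^2$ of $y[n] = A + x[n]$. Suppose A is given, then define $x[n] = y[n] - A$:
$$p(\mathbf{x};\sigma^2) = \underbrace{\frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}x^2[n]\right]}_{g(T(\mathbf{x}),\, \sigma^2)} \cdot \underbrace{1}_{h(\mathbf{x})}$$
Clearly, $T(\mathbf{x}) = \sum_n x^2[n]$ is a sufficient statistic for $\sigma^2$.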
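One practical consequence of the factorization can be checked numerically: for the Gaussian mean model, two data sets with the same value of the sum of x[n] have a likelihood ratio that does not change with A, because the factor g(T(x), A) cancels. The sketch below assumes a known noise variance; the function names and the sample data are illustrative.

```python
import math


def log_pdf(x, A, sigma2):
    """Log of the joint Gaussian PDF of x[n] = A + w[n] with noise variance sigma2."""
    N = len(x)
    return -0.5 * N * math.log(2.0 * math.pi * sigma2) - sum((xi - A) ** 2 for xi in x) / (2.0 * sigma2)


def log_ratio(x, y, A, sigma2=1.0):
    """ln p(x; A) - ln p(y; A); constant in A whenever sum(x) == sum(y)."""
    return log_pdf(x, A, sigma2) - log_pdf(y, A, sigma2)


if __name__ == "__main__":
    x = [0.2, 1.5, -0.7]   # sum = 1.0
    y = [1.0, 0.0, 0.0]    # sum = 1.0, same sufficient statistic
    z = [1.0, 1.0, 0.0]    # sum = 2.0, different sufficient statistic
    for A in (-1.0, 0.0, 2.5):
        print(A, log_ratio(x, y, A), log_ratio(x, z, A))
```

The first ratio stays fixed as A varies, while the second changes with A because the two data sets differ in the sufficient statistic.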



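Proof of the Neyman-Fisher Factorization Theorem ($\Leftarrow$)

By assumption, $p(\mathbf{x};\theta) = g(T(\mathbf{x}), \theta)\, h(\mathbf{x})$.

Since
$$p(\mathbf{x} \mid T(\mathbf{x}) = T_0; \theta) = \frac{p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0)}{p(T(\mathbf{x}) = T_0; \theta)}, \qquad p(T(\mathbf{x}) = T_0; \theta) = \int p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}$$

We have
$$p(T(\mathbf{x}) = T_0; \theta) = \int g(T(\mathbf{x}), \theta)\, h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x} = g(T_0, \theta) \int h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}$$
Then
$$p(\mathbf{x} \mid T(\mathbf{x}) = T_0; \theta) = \frac{h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)}{\int h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}}$$
which does not depend on $\theta$. Hence T(x) is a sufficient statistic.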



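($\Rightarrow$) Recall
$$p(\mathbf{x} \mid T(\mathbf{x}) = T_0; \theta) = \frac{p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0)}{p(T(\mathbf{x}) = T_0; \theta)}$$
And
$$p(T(\mathbf{x}) = T_0; \theta) = \int p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}$$
Suppose T(x) is a sufficient statistic. Then $p(\mathbf{x} \mid T(\mathbf{x}) = T_0; \theta) = p(\mathbf{x} \mid T(\mathbf{x}) = T_0)$ (not a function of $\theta$).

We can define $p(\mathbf{x} \mid T(\mathbf{x}) = T_0; \theta) = w(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)$ with $\int w(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x} = 1$.

Therefore, we can let
$$w(\mathbf{x}) = \frac{h(\mathbf{x})}{\int h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}}$$

As a result, we have
$$\frac{p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0)}{p(T(\mathbf{x}) = T_0; \theta)} = \frac{h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)}{\int h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}}$$
and
$$p(\mathbf{x};\theta)\, \delta(T(\mathbf{x}) - T_0) = g(T_0, \theta)\, h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)$$
This holds for arbitrary $T_0$, resulting in
$$p(\mathbf{x};\theta) = g(T(\mathbf{x}), \theta)\, h(\mathbf{x})$$
where
$$g(T_0, \theta) = \frac{p(T(\mathbf{x}) = T_0; \theta)}{\int h(\mathbf{x})\, \delta(T(\mathbf{x}) - T_0)\, d\mathbf{x}}$$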

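Ex. We want to estimate the phase $\phi$ of a sinusoid $x[n] = A\cos(2\pi f_0 n + \phi) + w[n]$, n = 0,1,…,N-1.

Suppose A and $f_0$ are given:
$$p(\mathbf{x};\phi) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left[x[n] - A\cos(2\pi f_0 n + \phi)\right]^2\right\}$$
Expand the exponent:
$$\sum_{n=0}^{N-1}x^2[n] - 2A\left(\sum_{n=0}^{N-1}x[n]\cos 2\pi f_0 n\right)\cos\phi + 2A\left(\sum_{n=0}^{N-1}x[n]\sin 2\pi f_0 n\right)\sin\phi + \sum_{n=0}^{N-1}A^2\cos^2(2\pi f_0 n + \phi)$$

In this case, no single sufficient statistic exists; however, with
$$T_1(\mathbf{x}) = \sum_{n=0}^{N-1}x[n]\cos 2\pi f_0 n, \qquad T_2(\mathbf{x}) = \sum_{n=0}^{N-1}x[n]\sin 2\pi f_0 n$$
the PDF factorizes as $p(\mathbf{x};\phi) = g(T_1(\mathbf{x}), T_2(\mathbf{x}), \phi)\, h(\mathbf{x})$, where $h(\mathbf{x}) = \exp[-\frac{1}{2\sigma^2}\sum_n x^2[n]]$ and g collects the remaining terms.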



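The r statistics $T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_r(\mathbf{x})$ are jointly sufficient if $p(\mathbf{x} \mid T_1(\mathbf{x}), \ldots, T_r(\mathbf{x}); \theta)$ does not depend on $\theta$.

If $p(\mathbf{x};\theta) = g(T_1(\mathbf{x}), \ldots, T_r(\mathbf{x}), \theta)\, h(\mathbf{x})$, then $\{T_1(\mathbf{x}), \ldots, T_r(\mathbf{x})\}$ are sufficient statistics for $\theta$.

Now, we know how to obtain the sufficient statistics. How do we apply them to help obtain the MVU estimator?
The Rao-Blackwell-Lehmann-Scheffe (RBLS) Theorem:
If $\breve{\theta}$ is an unbiased estimator of $\theta$ and T(x) is a sufficient statistic for $\theta$, then $\hat{\theta} = E(\breve{\theta} \mid T(\mathbf{x}))$ is
  unbiased,
  a valid estimator for $\theta$ (not dependent on $\theta$), and
  of lesser or equal variance than that of $\breve{\theta}$, for all $\theta$.
If T(x) is complete, then $\hat{\theta}$ is the MVU estimator.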
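The variance reduction promised by the theorem can be seen in a small simulation. The sketch below assumes the Gaussian DC-level model x[n] = A + w[n], takes the crude unbiased estimator x[0], and uses the known fact that for i.i.d. samples E(x[0] | sum of x[n]) equals the sample mean; the model choice, function name, and simulation settings are illustrative.

```python
import random
import statistics


def rao_blackwell_demo(A=1.0, sigma=1.0, N=8, trials=20000, seed=1):
    """Compare the crude estimator x[0] with its conditional expectation given sum(x)."""
    rng = random.Random(seed)
    crude, refined = [], []
    for _ in range(trials):
        x = [rng.gauss(A, sigma) for _ in range(N)]
        crude.append(x[0])
        refined.append(sum(x) / N)  # E(x[0] | sum x) = sum(x) / N for i.i.d. samples
    return (statistics.fmean(crude), statistics.variance(crude),
            statistics.fmean(refined), statistics.variance(refined))


if __name__ == "__main__":
    mean_c, var_c, mean_r, var_r = rao_blackwell_demo()
    print("crude:   mean", mean_c, "variance", var_c)
    print("refined: mean", mean_r, "variance", var_r)
```

Both estimators are centered on A, but conditioning on the sufficient statistic cuts the variance from about sigma^2 to about sigma^2/N.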



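Proof
1> Validity:
$$\hat{\theta} = E(\breve{\theta} \mid T(\mathbf{x})) = \int \breve{\theta}(\mathbf{x})\, p(\mathbf{x} \mid T(\mathbf{x}); \theta)\, d\mathbf{x}$$
By definition, $p(\mathbf{x} \mid T(\mathbf{x}); \theta)$ is not a function of $\theta$; after the integration w.r.t. x, the result is not a function of $\theta$ but of T.

2> Unbiasedness:
$$E(\hat{\theta}) = \int\!\!\int \breve{\theta}(\mathbf{x})\, p(\mathbf{x} \mid T; \theta)\, d\mathbf{x}\; p(T;\theta)\, dT = \int \breve{\theta}(\mathbf{x})\, p(\mathbf{x};\theta)\, d\mathbf{x} = E(\breve{\theta}) = \theta$$
where the last equality holds by assumption.

3> Show $\mathrm{var}(\breve{\theta}) \ge \mathrm{var}(\hat{\theta})$:
$$\mathrm{var}(\breve{\theta}) = E[(\breve{\theta} - \hat{\theta} + \hat{\theta} - \theta)^2] = E[(\breve{\theta} - \hat{\theta})^2] + 2E[(\breve{\theta} - \hat{\theta})(\hat{\theta} - \theta)] + \mathrm{var}(\hat{\theta})$$
Recall that $\hat{\theta}$ is solely a function of T(x), so
$$E[(\breve{\theta} - \hat{\theta})(\hat{\theta} - \theta)] = E_T\!\left[E_{\mathbf{x}|T}(\breve{\theta} - \hat{\theta})\,(\hat{\theta} - \theta)\right] = 0$$
because $E_{\mathbf{x}|T}(\breve{\theta}) = \hat{\theta}$. Hence $\mathrm{var}(\breve{\theta}) = E[(\breve{\theta} - \hat{\theta})^2] + \mathrm{var}(\hat{\theta}) \ge \mathrm{var}(\hat{\theta})$.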



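Finally, a statistic is complete if there is only one function, say g, of the statistic that is unbiased.

$\hat{\theta} = E(\breve{\theta} \mid T(\mathbf{x}))$ is solely a function of T(x). If T(x) is complete, $\hat{\theta}$ is unique and unbiased.

Since T(x) is sufficient, $\hat{\theta}$ is as good as the estimator obtained with another sufficient statistic $T_1(\mathbf{x})$.
Besides, $\mathrm{var}(\hat{\theta}) \le \mathrm{var}(\breve{\theta})$ for all $\theta$ and any unbiased estimator $\breve{\theta}$.
Then, $\hat{\theta}$ must be the MVU estimator.

In summary, the MVU estimator can be found by
  taking any unbiased $\breve{\theta}$ and carrying out $\hat{\theta} = E(\breve{\theta} \mid T(\mathbf{x}))$, or
  alternatively, since there is only one function of T(x) that leads to an unbiased estimator, finding the unique g(T(x)) that makes $\hat{\theta} = g(T(\mathbf{x}))$ unbiased.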



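Completeness of a sufficient statistic
We know that for $x[n] = A + w[n]$, with $w[n] \sim N(0, \sigma^2)$, $T(\mathbf{x}) = \sum_n x[n]$ is sufficient and $g(T(\mathbf{x})) = T(\mathbf{x})/N$ is unbiased.
Suppose there is a second function h for which $E\{h(T(\mathbf{x}))\} = A$. Then $E\{g(T(\mathbf{x})) - h(T(\mathbf{x}))\} = A - A = 0$ for all A.

Since $T \sim N(NA, N\sigma^2)$,
$$\int_{-\infty}^{\infty} v(T)\, \frac{1}{\sqrt{2\pi N\sigma^2}} \exp\!\left[-\frac{(T - NA)^2}{2N\sigma^2}\right] dT = 0 \quad \forall A$$
where $v(T) = g(T) - h(T)$. Let $\tau = T/N$ and $v'(\tau) = v(N\tau)$:
$$\int_{-\infty}^{\infty} v'(\tau)\, \underbrace{\frac{N}{\sqrt{2\pi N\sigma^2}} \exp\!\left[-\frac{N(A - \tau)^2}{2\sigma^2}\right]}_{w(A - \tau)} d\tau = 0 \quad \forall A$$

This is a convolution of $v'(\tau)$ and $w(\tau)$, and it equals 0 for all A; we show that this forces $v'(\tau) = 0$.

Recall that a signal v(t) is zero iff its Fourier transform F(v(t)) = 0.
$F(v'(\tau) * w(\tau)) = V'(f)W(f) = 0$ for all f.
Since W(f) is still Gaussian, and hence nonzero for all f, $V'(f) = 0 \Rightarrow v'(\tau) = 0$, so $g(T(\mathbf{x})) = h(T(\mathbf{x}))$; thus T(x) is complete.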



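Incomplete sufficient statistic
Now consider $x[0] = A + w[0]$, with $w[0] \sim U[-1/2, 1/2]$.
T(x) = x[0] is sufficient. But is T(x) also complete?
Let $v(T) = g(T(\mathbf{x})) - h(T(\mathbf{x}))$. For two unbiased functions g and h we need
$$\int_{-\infty}^{\infty} v(T)\, p(\mathbf{x}; A)\, d\mathbf{x} = 0 \quad \forall A$$
However, x = x[0] = T, so that
$$\int_{-\infty}^{\infty} v(T)\, p(T; A)\, dT = 0 \quad \forall A$$
but
$$p(T; A) = \begin{cases} 1, & A - \tfrac{1}{2} \le T \le A + \tfrac{1}{2} \\ 0, & \text{otherwise} \end{cases}$$

So that
$$\int_{A-1/2}^{A+1/2} v(T)\, dT = 0 \quad \forall A$$
A nonzero $v(T) = \sin(2\pi T)$ will satisfy this condition.

Hence, $v(T) = g(T) - h(T) = \sin(2\pi T)$.
  Let $g(T) = T = x[0]$, since $E\{x[0]\} = A$.
  Then $h(T) = T - \sin(2\pi T)$, and $\hat{A} = x[0] - \sin(2\pi x[0])$ is also an unbiased estimator of A, using the statistic T = x[0].
  x[0] is not complete, so we cannot be sure whether $\hat{A} = x[0]$ is an MVU estimator.

To summarize, we say a sufficient statistic is complete if
$$\int_{-\infty}^{\infty} v(T)\, p(T; \theta)\, dT = 0 \quad \forall \theta$$
is satisfied only by the zero function, i.e. by $v(T) = 0$ for all T.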
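The claim that x[0] and x[0] - sin(2*pi*x[0]) are both unbiased for A under U[-1/2, 1/2] noise can be checked by numerical integration over the support of x[0]. This is a minimal sketch using a midpoint rule; the function names and the number of integration steps are illustrative choices.

```python
import math


def expected_value(func, A, steps=20000):
    """E{func(x[0])} for x[0] ~ U[A - 1/2, A + 1/2], by the midpoint rule (PDF equals 1)."""
    width = 1.0 / steps
    return sum(func(A - 0.5 + (k + 0.5) * width) for k in range(steps)) * width


def g(T):
    """First unbiased function of the statistic: g(T) = T."""
    return T


def h_alt(T):
    """Second unbiased function: h(T) = T - sin(2*pi*T)."""
    return T - math.sin(2.0 * math.pi * T)


if __name__ == "__main__":
    for A in (-0.3, 0.0, 1.7):
        print(A, expected_value(g, A), expected_value(h_alt, A))
```

Both expectations reproduce A because the sine term integrates to zero over any interval of length one, so two different functions of the same statistic are unbiased.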



In summary, the RBLS method can be used to find the MVU estimator even when an efficient estimator does not exist.

The procedure we have learned so far for finding an MVU estimator:
  1> Find a sufficient statistic T(x) for $\theta$ by the Neyman-Fisher factorization theorem.
  2> Determine if T(x) is complete; if so,
  3> Find a function g(T(x)) that yields an unbiased estimate of $\theta$, which is the MVU estimator of $\theta$.
     Alternatively, evaluate $\hat{\theta} = E(\breve{\theta} \mid T(\mathbf{x}))$, where $\breve{\theta}$ is an unbiased estimator.



Ex: Mean of uniform noise
$x[n] = w[n]$, n = 0,1,…,N-1, with $w[n] \sim U[0, \beta]$.
We want to find the MVU estimator for the mean $\theta = \beta/2$.
The initial approach of using the CRLB to find an efficient estimator cannot even be tried, for this PDF does not satisfy the regularity condition $E\{\partial \ln p(\mathbf{x};\theta)/\partial\theta\} = 0$.

Is the sample mean the MVU estimator for this case?

Its variance is $\mathrm{var}(\bar{x}) = \mathrm{var}(x[n])/N = \beta^2/(12N)$.



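Now, we follow the procedure we have learned so far. Define the unit step function
$$u(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$
Then
$$p(x[n];\theta) = \frac{1}{\beta}\left[u(x[n]) - u(x[n] - \beta)\right]$$
where $\beta = 2\theta$. The PDF is
$$p(\mathbf{x};\theta) = \frac{1}{\beta^N}\prod_{n=0}^{N-1}\left[u(x[n]) - u(x[n] - \beta)\right]$$
or
$$p(\mathbf{x};\theta) = \begin{cases} \dfrac{1}{\beta^N}, & 0 < x[n] < \beta,\ n = 0,1,\ldots,N-1 \\ 0, & \text{otherwise} \end{cases}$$

Alternatively, the PDF is nonzero only if $\max x[n] < \beta$ and $\min x[n] > 0$, so that
$$p(\mathbf{x};\theta) = \underbrace{\frac{1}{\beta^N}\, u(\beta - \max x[n])}_{g(T(\mathbf{x}),\, \theta)}\; \underbrace{u(\min x[n])}_{h(\mathbf{x})}$$
and $T(\mathbf{x}) = \max x[n]$ is a sufficient statistic.
We need to determine a function g to make g(T(x)) unbiased. To do so requires us to determine $E\{T(\mathbf{x})\}$.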



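The CDF of T(x):
$$\Pr\{T \le \xi\} = \Pr\{x[0] \le \xi, \ldots, x[N-1] \le \xi\} = \Pr\{x[n] \le \xi\}^N$$
The PDF follows as
$$p_T(\xi) = N \Pr\{x[n] \le \xi\}^{N-1}\, \frac{d \Pr\{x[n] \le \xi\}}{d\xi}$$

But $d \Pr\{x[n] \le \xi\}/d\xi$ is the PDF of x[n], or
$$p_{x[n]}(\xi;\theta) = \begin{cases} 1/\beta, & 0 < \xi < \beta \\ 0, & \text{otherwise} \end{cases}$$
Integrating, we obtain
$$\Pr\{x[n] \le \xi\} = \begin{cases} 0, & \xi < 0 \\ \xi/\beta, & 0 < \xi < \beta \\ 1, & \xi > \beta \end{cases}$$

Which finally yields
$$p_T(\xi) = N\left(\frac{\xi}{\beta}\right)^{N-1}\frac{1}{\beta}, \quad 0 < \xi < \beta$$
We now have
$$E\{T(\mathbf{x})\} = \int_0^\beta \xi\, N\left(\frac{\xi}{\beta}\right)^{N-1}\frac{1}{\beta}\, d\xi = \frac{N}{N+1}\beta = \frac{2N}{N+1}\theta$$

To make it unbiased for $\theta = \beta/2$, we multiply T(x) by (N+1)/(2N):
$$\hat{\theta} = \frac{N+1}{2N}\max x[n]$$
which is the MVU estimator, whose corresponding variance is
$$\mathrm{var}(\hat{\theta}) = \frac{\beta^2}{4N(N+2)}$$
and, for $N \ge 2$,
$$\frac{\beta^2}{4N(N+2)} < \frac{\beta^2}{12N}$$
the variance of the sample mean.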
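A Monte Carlo comparison of the two estimators of the mean of U[0, beta] data, as a minimal sketch. It assumes the unbiased scaling (N+1)/(2N) of the sample maximum for the mean beta/2; the function name and the simulation settings are illustrative.

```python
import random
import statistics


def uniform_mean_estimators(beta=4.0, N=10, trials=20000, seed=2):
    """Simulate the scaled-maximum estimator and the sample mean for U[0, beta] data."""
    rng = random.Random(seed)
    scale = (N + 1) / (2.0 * N)  # makes scale * max(x) unbiased for beta / 2
    from_max, from_mean = [], []
    for _ in range(trials):
        x = [rng.uniform(0.0, beta) for _ in range(N)]
        from_max.append(scale * max(x))
        from_mean.append(sum(x) / N)
    return (statistics.fmean(from_max), statistics.variance(from_max),
            statistics.fmean(from_mean), statistics.variance(from_mean))


if __name__ == "__main__":
    beta, N = 4.0, 10
    m_max, v_max, m_mean, v_mean = uniform_mean_estimators(beta, N)
    print("scaled max : mean", m_max, "variance", v_max, "theory", beta ** 2 / (4 * N * (N + 2)))
    print("sample mean: mean", m_mean, "variance", v_mean, "theory", beta ** 2 / (12 * N))
```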



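Extension to a vector parameter
We want to seek an unbiased vector estimator such that each element has the minimum variance.
Similarly, a vector $\mathbf{T}(\mathbf{x}) = [T_1(\mathbf{x}), T_2(\mathbf{x}), \ldots, T_r(\mathbf{x})]^T$ is said to be sufficient for the estimation of $\boldsymbol{\theta}$
  if $p(\mathbf{x} \mid \mathbf{T}(\mathbf{x}); \boldsymbol{\theta}) = p(\mathbf{x} \mid \mathbf{T}(\mathbf{x}))$, or equivalently
  if $p(\mathbf{x}; \boldsymbol{\theta}) = g(\mathbf{T}(\mathbf{x}), \boldsymbol{\theta})\, h(\mathbf{x})$.
We look for a $\mathbf{T}(\mathbf{x})$ of minimum dimension.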



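Back to a similar example: $x[n] = A\cos(2\pi f_0 n) + w[n]$, n = 0,1,…,N-1.
Now, $\boldsymbol{\theta} = [A, f_0, \sigma^2]^T$ is the unknown vector parameter. The PDF is
$$p(\mathbf{x};\boldsymbol{\theta}) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left[x[n] - A\cos(2\pi f_0 n)\right]^2\right\}$$
Expanding the exponent, we obtain
$$\sum_{n=0}^{N-1}x^2[n] - 2A\sum_{n=0}^{N-1}x[n]\cos(2\pi f_0 n) + A^2\sum_{n=0}^{N-1}\cos^2(2\pi f_0 n)$$
We cannot reduce the PDF to the factorized form, due to the term $\sum_n x[n]\cos(2\pi f_0 n)$ in which the data and the unknown $f_0$ are coupled.

If $f_0$ is known, then $\boldsymbol{\theta} = [A, \sigma^2]^T$.
We are able to factorize the PDF, which gives
$$\mathbf{T}(\mathbf{x}) = \begin{bmatrix} T_1(\mathbf{x}) \\ T_2(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n) \\ \sum_{n=0}^{N-1} x^2[n] \end{bmatrix}$$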



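Now apply the above result to our previous discussion: $x[n] = A + w[n]$, n = 0,1,…,N-1. Again, $\boldsymbol{\theta} = [A, \sigma^2]^T$.

Setting $f_0 = 0$ in $x[n] = A\cos(2\pi f_0 n) + w[n]$, we have
$$\mathbf{T}(\mathbf{x}) = \begin{bmatrix} T_1(\mathbf{x}) \\ T_2(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \sum_{n=0}^{N-1} x[n] \\ \sum_{n=0}^{N-1} x^2[n] \end{bmatrix}$$
Taking the expected values produces
$$E\{\mathbf{T}(\mathbf{x})\} = \begin{bmatrix} NA \\ N E\{x^2[n]\} \end{bmatrix} = \begin{bmatrix} NA \\ N(\sigma^2 + A^2) \end{bmatrix}$$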



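So $T_2(\mathbf{x})$ only helps estimate the second moment, not the variance.

If we transform $\mathbf{T}(\mathbf{x})$ into
$$\mathbf{g}(\mathbf{T}(\mathbf{x})) = \begin{bmatrix} \dfrac{1}{N}T_1(\mathbf{x}) \\ \dfrac{1}{N}T_2(\mathbf{x}) - \left[\dfrac{1}{N}T_1(\mathbf{x})\right]^2 \end{bmatrix} = \begin{bmatrix} \bar{x} \\ \dfrac{1}{N}\sum_{n=0}^{N-1}x^2[n] - \bar{x}^2 \end{bmatrix}$$
Then, $E\{\mathbf{g}(\mathbf{T}(\mathbf{x}))\}$ gives
$$E\{\mathbf{g}(\mathbf{T}(\mathbf{x}))\} = \begin{bmatrix} E(\bar{x}) \\ E\!\left\{\dfrac{1}{N}\sum_{n=0}^{N-1}x^2[n]\right\} - E(\bar{x}^2) \end{bmatrix} = \begin{bmatrix} A \\ \sigma^2 + A^2 - E(\bar{x}^2) \end{bmatrix}$$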



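Since $\bar{x} \sim N(A, \sigma^2/N)$,
$$E(\bar{x}^2) = A^2 + \frac{\sigma^2}{N}$$
Substituting this back into $E\{\mathbf{g}(\mathbf{T}(\mathbf{x}))\}$ yields
$$E\{\mathbf{g}(\mathbf{T}(\mathbf{x}))\} = \begin{bmatrix} A \\ \sigma^2 + A^2 - A^2 - \dfrac{\sigma^2}{N} \end{bmatrix} = \begin{bmatrix} A \\ \dfrac{N-1}{N}\sigma^2 \end{bmatrix}$$
Therefore, multiplying the second element by N/(N-1) yields an unbiased estimator of $\sigma^2$:
$$\mathbf{g}(\mathbf{T}(\mathbf{x})) = \begin{bmatrix} \dfrac{1}{N}T_1(\mathbf{x}) \\ \dfrac{1}{N-1}\left[T_2(\mathbf{x}) - N\left(\dfrac{1}{N}T_1(\mathbf{x})\right)^2\right] \end{bmatrix} = \begin{bmatrix} \bar{x} \\ \dfrac{1}{N-1}\left[\sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2\right] \end{bmatrix}$$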



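Since
$$\sum_{n=0}^{N-1}(x[n] - \bar{x})^2 = \sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2$$
Eventually, we have
$$\hat{\boldsymbol{\theta}} = \mathbf{g}(\mathbf{T}(\mathbf{x})) = \begin{bmatrix} \bar{x} \\ \dfrac{1}{N-1}\left[\sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2\right] \end{bmatrix} = \begin{bmatrix} \bar{x} \\ \dfrac{1}{N-1}\sum_{n=0}^{N-1}(x[n] - \bar{x})^2 \end{bmatrix}$$
Is this the MVU estimator of $\boldsymbol{\theta} = [A, \sigma^2]^T$?
  We have shown that for the Gaussian PDF, T(x) is complete.
  Actually, this is also true for the vector exponential family of PDFs.
Is $\hat{\boldsymbol{\theta}}$ efficient?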



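In fact, $\hat{A}$ and $\hat{\sigma}^2$ are independent, with
$$\hat{A} = \bar{x} \sim N\!\left(A, \frac{\sigma^2}{N}\right), \qquad \frac{(N-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{N-1}$$
Therefore
$$\mathbf{C}_{\hat{\boldsymbol{\theta}}} = \begin{bmatrix} \dfrac{\sigma^2}{N} & 0 \\ 0 & \dfrac{2\sigma^4}{N-1} \end{bmatrix}$$
While the CRLB is
$$\mathbf{I}^{-1}(\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{\sigma^2}{N} & 0 \\ 0 & \dfrac{2\sigma^4}{N} \end{bmatrix}$$
so $\hat{\sigma}^2$ does not attain the bound: the MVU estimator exists here but is not efficient.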
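The gap between the variance of the unbiased sample variance and the CRLB can be observed by simulation. The sketch below assumes Gaussian data x[n] = A + w[n] and compares the empirical variances of the sample mean and the (N-1)-normalized sample variance with the standard values sigma^2/N, 2*sigma^4/(N-1), and the bound 2*sigma^4/N; the function name and settings are illustrative.

```python
import random
import statistics


def gaussian_estimator_variances(A=1.0, sigma2=2.0, N=10, trials=20000, seed=3):
    """Empirical variances of the sample mean and of the unbiased sample variance."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    mean_estimates, var_estimates = [], []
    for _ in range(trials):
        x = [rng.gauss(A, sd) for _ in range(N)]
        mean_estimates.append(statistics.fmean(x))
        var_estimates.append(statistics.variance(x))  # divides by N - 1
    return statistics.variance(mean_estimates), statistics.variance(var_estimates)


if __name__ == "__main__":
    sigma2, N = 2.0, 10
    var_A_hat, var_s2_hat = gaussian_estimator_variances(sigma2=sigma2, N=N)
    print("var(A_hat):     ", var_A_hat, "CRLB", sigma2 / N)
    print("var(sigma2_hat):", var_s2_hat, "theory", 2 * sigma2 ** 2 / (N - 1),
          "CRLB", 2 * sigma2 ** 2 / N)
```

The mean estimate sits on its bound, while the variance estimate stays above its bound by the factor N/(N-1).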