
SM-4331 Advanced Statistics
Chapter 1 (Estimation Theory)

Dr Haziq Jamil
FOS M1.09

Universiti Brunei Darussalam

Semester II 2019/20


Outline

1 Inequalities
  Probability inequalities
  Inequalities for expectations

2 Convergence of random variables
  Limits
  Convergence of random variables
  Two limit theorems: LLN and CLT

3 Point estimation
  Point estimation
  Desirable properties
  Maximum likelihood
  Properties of MLE

4 Confidence sets


Inequalities »

Inequalities

Inequalities are useful tools in establishing various properties of statistical inference methods. They may also provide estimates for probabilities under minimal assumptions on the probability distributions involved.

There are four main inequalities that we will learn:

• Markov’s inequality
• Chebyshev’s inequality
• Cauchy-Schwarz inequality
• Jensen’s inequality


Inequalities » Probability inequalities

Markov’s inequality

In probability theory, Markov’s inequality gives an upper bound for the probability that a non-negative random variable (r.v.) exceeds some positive constant.

Theorem 1 (Markov’s inequality)
Let X be a non-negative r.v. with E(X) < ∞. Then, for any t > 0,

P(X ≥ t) ≤ E(X)/t.

Markov’s inequality relates probabilities to expectations, and provides bounds for the cumulative distribution function of a r.v.
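As a quick sanity check (not part of the original slides), the sketch below simulates a non-negative random variable and compares its empirical tail probability with the Markov bound E(X)/t. The Exp(1) distribution, the seed and the sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # non-negative r.v. with E(X) = 1

for t in [1.0, 2.0, 5.0]:
    empirical = np.mean(x >= t)     # Monte Carlo estimate of P(X >= t)
    bound = x.mean() / t            # Markov bound E(X)/t
    print(f"t={t}: P(X >= t) ~ {empirical:.4f} <= {bound:.4f}")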


Inequalities » Probability inequalities

Proof of Markov’s inequality

Proof.
Let f(x) be the pdf of X. Since X ≥ 0,

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^∞ x f(x) dx
     = ∫_0^t x f(x) dx + ∫_t^∞ x f(x) dx
     ≥ ∫_t^∞ x f(x) dx
     ≥ t ∫_t^∞ f(x) dx
     = t P(X ≥ t).


Inequalities » Probability inequalities

Corollary to Markov’s inequality

Corollary 2
For any r.v. X and any constant t > 0,

P(|X| ≥ t) ≤ E|X| / t,   provided E|X| < ∞;

P(|X| ≥ t) ≤ E(|X|^k) / t^k,   provided E(|X|^k) < ∞.

The tail probability P(|X| ≥ t) is a useful measure in insurance and in risk management in finance. The more moments X has, the smaller the tail probabilities are.


Inequalities » Probability inequalities

Chebyshev’s inequality

In probability theory, Chebyshev’s inequality guarantees that no more than a certain fraction of values can be more than a certain distance from the mean.

Theorem 3 (Chebyshev’s inequality)

Suppose a r.v. X has mean µ and variance σ². Then, for any t > 0,

P(|X − µ| ≥ tσ) ≤ 1/t².

Because it can be applied to completely arbitrary distributions (provided they have a known finite mean and variance), the inequality generally gives a poor bound compared to what might be deduced if more aspects are known about the distribution involved.


Inequalities » Probability inequalities

Chebyshev’s inequality (cont.)

Example 4
Suppose X has mean 0 and variance 1. By Chebyshev’s inequality,

P(|X| ≥ 1) ≤ 1.00
P(|X| ≥ 2) ≤ 0.25
P(|X| ≥ 3) ≤ 0.11

In contrast, suppose that we know that X is normally distributed. Then

P(|X| ≥ 1) ≈ 0.317
P(|X| ≥ 2) ≈ 0.046
P(|X| ≥ 3) ≈ 0.003
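A short sketch (mine, not from the slides) reproducing this comparison; it assumes scipy is available for the standard normal cdf.

import numpy as np
from scipy import stats

# Chebyshev bound versus the exact two-sided tail of a standard normal
for t in [1, 2, 3]:
    chebyshev = 1 / t**2
    normal_tail = 2 * (1 - stats.norm.cdf(t))   # P(|X| >= t) when X ~ N(0, 1)
    print(f"t={t}: Chebyshev <= {chebyshev:.3f}, exact normal = {normal_tail:.3f}")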


Inequalities » Probability inequalities

Proof of Chebyshev’s inequality

Prove this as an exercise. Hint: Use Markov’s inequality.


Inequalities » Inequalities for expectations

Cauchy-Schwarz inequality

This is a very useful inequality that crops up in many different areas of mathematics, such as linear algebra, analysis, probability theory, vector algebra, etc.

Theorem 5 (Cauchy-Schwarz inequality)

Let E(X²) < ∞ and E(Y²) < ∞. Then

|E(XY)|² ≤ E(X²) E(Y²).


Inequalities » Inequalities for expectations

Cauchy-Schwarz inequality (cont.)

The covariance inequality can be proved using Cauchy-Schwarz.

Lemma 6 (Covariance inequality)
Let X and Y be random variables. Then

Var(Y) ≥ Cov(Y, X)² / Var(X).

Proof.
Let µ := EX and ν := EY. Then

|Cov(X, Y)|² = |E[(X − µ)(Y − ν)]|²
             ≤ E[(X − µ)²] E[(Y − ν)²]
             = Var(X) Var(Y).

Rearranging gives the result.


Inequalities » Inequalities for expectations

Convex functions

• A function g is convex if for any x, y and any α ∈ [0, 1],

  g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).

• If g′′(x) > 0 for all x, then g is convex.
• A function g is concave if −g is convex.

Example 7
Examples of convex functions: g1(x) = x² and g2(x) = e^x, since g1′′(x) = 2 > 0 and g2′′(x) = e^x > 0 for all x.

Examples of concave functions: g3(x) = −x² and g4(x) = log(x).


Inequalities » Inequalities for expectations

Convex functions


Inequalities » Inequalities for expectations

Jensen’s inequality

In the context of probability theory,

Theorem 8 (Jensen’s inequality)
If g is convex,

E[g(X)] ≥ g(EX).

Example 9
The following follow directly from Jensen’s inequality:

E(X²) ≥ (EX)²

E(1/X) ≥ 1/EX   (for X > 0, since 1/x is convex on (0, ∞))

E(log X) ≤ log(EX)   (since log is concave)
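A simulation sketch (not from the slides) that checks these three consequences; the Gamma(2, 1) distribution is an arbitrary positive r.v. chosen only so that 1/X and log X are well defined.

import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # any positive r.v. will do

print(np.mean(x**2) >= np.mean(x)**2)            # E(X^2) >= (EX)^2
print(np.mean(1 / x) >= 1 / np.mean(x))          # E(1/X) >= 1/EX for X > 0
print(np.mean(np.log(x)) <= np.log(np.mean(x)))  # E(log X) <= log(EX)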


Convergence » Limits

Limits

Recall the limits of sequences of real numbers x1, x2, . . . :

Definition 10 (Limits of real sequences)
We call x the limit of the real sequence (xn) if for each real number ε > 0, there exists a natural number N such that, for every natural number n ≥ N, we have |xn − x| < ε.

We write lim_{n→∞} xn = x, or simply xn → x. This also means that |xn − x| → 0 as n → ∞. For every measure of closeness ε, the sequence’s terms are eventually that close to the limit.

Example 11
• If xn = c for some constant c ∈ R, then xn → c.
• If xn = 1/n, then xn → 0.
• lim_{n→∞} (1 + 1/n)^n = e.


Convergence » Limits

Limits (cont.)


Convergence » Convergence of random variables

Convergence of random variables

We can say similar things about sequences of random variables, e.g. X is the limit of a sequence (Xn) if |Xn − X| → 0 as n → ∞. There are some subtle issues here:

1. |Xn − X| is itself a r.v., i.e. it takes different values on the sample space Ω. Therefore, |Xn − X| → 0 should hold (almost) everywhere on the sample space. This calls for a probability statement.

2. Since r.v. have distributions, we may also consider convergence of their distributions, FXn(x) → FX(x) for all x.


Convergence » Convergence of random variables

Types of convergence

Let X1, X2, . . . be a sequence of r.v., and X be another r.v. The two main types of convergence for r.v. are defined as follows.

Definition 12 (Convergence in probability)
Xn converges to X in probability if for any constant ε > 0, P(|Xn − X| > ε) → 0 as n → ∞. We write Xn P→ X, or plim_{n→∞} Xn = X.

Definition 13 (Convergence in distribution)
Xn converges to X in distribution if lim_{n→∞} FXn(x) = FX(x) at every x where FX is continuous. We write Xn D→ X.

Remarks:
1. X may be a constant, since a constant is a r.v. with probability mass concentrated on a single point.
2. If Xn P→ X, it also holds that Xn D→ X, but not vice versa.


Convergence » Convergence of random variables

Types of convergence (cont.)

Example 14
Let X ∼ N(0, 1) and Xn = −X for all n ≥ 1. Then, clearly FXn ≡ FX (since −X ∼ N(0, 1) as well). Hence, Xn D→ X.

However, Xn does not converge to X in probability, as for any ε > 0,

P(|Xn − X| > ε) = P(2|X| > ε) = P(|X| > ε/2) > 0,

which does not depend on n. So we cannot have that P(|Xn − X| > ε) → 0 as n → ∞.


Convergence » Convergence of random variables

Types of convergence (cont.)

The convergence of r.v. is especially important when considering how good our estimator is. For example, suppose we have collected data X1, . . . , Xn for estimation purposes, which can be treated as realisations of a sequence of r.v. Let θ̂n = h(X1, . . . , Xn) be an estimator for θ.

• Naturally, we require θ̂n P→ θ.
• But θ̂n is a r.v.; it takes different values with different samples. To consider how good an estimator of θ it is, we hope that the distribution of (θ̂n − θ) becomes more concentrated around zero as n increases.


Convergence » Convergence of random variables

Mean square convergence

It is sometimes more convenient to consider mean square convergence:

Definition 15 (Mean square convergence)
Xn converges in mean square to X if E[(Xn − X)²] → 0 as n → ∞. We write Xn m.s.→ X.

It follows from Markov’s inequality that

P(|Xn − X| ≥ ε) = P(|Xn − X|² ≥ ε²) ≤ E[(Xn − X)²] / ε².

Therefore, if Xn m.s.→ X, it also holds that Xn P→ X.


Convergence » Convergence of random variables

Mean square convergence (cont.)

Example 16
Let

U ∼ Unif(0, 1) and Xn = n if U < 1/n, Xn = 0 otherwise.

Then, for any ε > 0, P(|Xn| > ε) ≤ P(U < 1/n) = 1/n → 0 as n → ∞. Hence, Xn P→ 0.

However,

E(Xn²) = n² P(U < 1/n) = n → ∞,

hence Xn does not converge to 0 in mean square.
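A small simulation sketch (mine, not from the slides) of this example: the tail probability shrinks like 1/n while the second moment grows like n. The threshold 0.5 and the replication count are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(size=200_000)

for n in [10, 100, 1000]:
    xn = np.where(u < 1 / n, n, 0)   # Xn = n if U < 1/n, else 0
    print(f"n={n}: P(|Xn| > 0.5) ~ {np.mean(xn > 0.5):.4f}, E(Xn^2) ~ {np.mean(xn**2):.1f}")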


Convergence » Convergence of random variables

Mean square convergence (cont.)

Example 17
Let Xn = n with probability 1/n, and Xn = 0 with probability 1 − 1/n. Then Xn P→ 0, since for any ε > 0 (and n large enough),

P(|Xn| > ε) = 1/n → 0

as n → ∞. However, E(Xn) = n × 1/n = 1, which does not converge to 0.

Caution
Xn P→ X does not imply E(Xn) → E(X).


Convergence » Convergence of random variables

Relationship between convergences

In general,

Convergence in mean square ⇒ Convergence in probability ⇒ Convergence in distribution.

Caution
When Xn D→ X, we also write Xn D→ FX, where FX is the cdf of X. However, the notation Xn P→ FX does not make sense!


Convergence » Convergence of random variables

Slutzky’s Theorem

Theorem 18 (Slutzky’s Theorem)
Let Xn, Yn, X, and Y be r.v., g a continuous function, and c a real constant. Then,

1. If Xn P→ X and Yn P→ Y, then
   • Xn + Yn P→ X + Y;
   • XnYn P→ XY; and
   • g(Xn) P→ g(X).

2. If Xn D→ X and Yn D→ c, then
   • Xn + Yn D→ X + c;
   • XnYn D→ cX; and
   • g(Xn) D→ g(X).

Caution
Note that Xn D→ X and Yn D→ Y does not in general imply Xn + Yn D→ X + Y.


Convergence » Two limit theorems: LLN and CLT

The (weak) Law of Large Numbers (LLN)

Let X1, X2, . . . be iid with mean µ and variance σ². Let X̄n denote the sample mean, i.e.

X̄n = (1/n) Σ_{i=1}^n Xi.

Recall two simple facts:

E(X̄n) = µ and Var(X̄n) = σ²/n.

Exercise: Prove this.


Convergence » Two limit theorems: LLN and CLT

The (weak) Law of Large Numbers (LLN) (cont.)

Definition 19 (The weak Law of Large Numbers)
As n → ∞, X̄n P→ µ.

The LLN is very natural: as the sample size increases, the sample mean gets closer and closer to the population mean. Furthermore, the distribution of X̄n degenerates to a single-point distribution at µ.

Proof.
Prove this as an exercise. Hint: Use Chebyshev’s inequality.


Convergence » Two limit theorems: LLN and CLT

The (weak) Law of Large Numbers (LLN) (cont.)

Let X1,X2, . . . be the score of randomly thrown dice.
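A small sketch (mine, not from the slides) reproducing the idea behind this figure: simulate fair die throws and watch the running sample mean approach µ = 3.5. The seed and the number of throws are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
throws = rng.integers(1, 7, size=10_000)   # fair die scores, population mean 3.5
running_mean = np.cumsum(throws) / np.arange(1, throws.size + 1)

for n in [10, 100, 1_000, 10_000]:
    print(f"n={n}: sample mean = {running_mean[n - 1]:.3f}")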


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT)

Definition 20 (The Central Limit Theorem)
As n → ∞,

(X̄n − µ) / (σ/√n) D→ N(0, 1).

The standardised sample mean √n(X̄n − µ)/σ is approximately standard normal when the sample size is large. Hence, we can make statements such as

• √n(X̄n − µ)/σ ≈ N(0, 1)
• √n(X̄n − µ) ≈ N(0, σ²)
• X̄n − µ ≈ N(0, σ²/n)
• X̄n ≈ N(µ, σ²/n)

The CLT is one of the reasons why the normal distribution is the most useful and important distribution in statistics.


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT) (cont.)

Example 21
If we take a sample X1, . . . , Xn from Unif(0, 1), the standardised histogram will resemble the density function f(x) = 1_(0,1)(x) (i.e. a horizontal line at 1 from x = 0 to x = 1). Now, the sample mean, when calculated, will be close to µ = E(Xi) = 0.5, provided n is sufficiently large (LLN).

However, the CLT implies that

X̄n ≈ N(0.5, (12n)⁻¹)

since Var(Xi) = 1/12.

If we take many samples of size n and compute the sample mean for each sample, we then obtain many sample means. The standardised histogram of those sample means resembles the pdf of N(0.5, (12n)⁻¹).
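A simulation sketch (not from the slides) of this example: draw many samples of size n from Unif(0, 1) and compare the distribution of the sample means with N(0.5, (12n)⁻¹). The choices n = 30 and 50,000 replications are mine.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 30, 50_000
means = rng.uniform(size=(reps, n)).mean(axis=1)   # one sample mean per row

print(means.mean(), means.var())        # approx 0.5 and 1/(12n)
print(0.5, 1 / (12 * n))
# a tail probability versus its normal approximation
print(np.mean(means > 0.55),
      1 - stats.norm.cdf(0.55, loc=0.5, scale=np.sqrt(1 / (12 * n))))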


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT) (cont.)


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT) (cont.)

Example 22
Suppose X1, . . . , Xn is an iid sample. A natural estimator for the population mean µ = E(Xi) is the sample mean X̄n. By the CLT, we can easily gauge the error of this estimation as follows:

P(|X̄n − µ| > ε) = P(|√n(X̄n − µ)/σ| > √n ε/σ)
                ≈ P(|Z| > √n ε/σ),   where Z ∼ N(0, 1)
                = 2 P(Z > √n ε/σ)
                = 2(1 − Φ(√n ε/σ)).

So with ε and n given, we can find the value Φ(√n ε/σ) from the table for the standard normal distribution, if we know σ.


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT) (cont.)

Remarks.

• Let ε := 2σ/√n = 2√Var(X̄n). Then

  P(|X̄n − µ| < ε) ≈ 2Φ(2) − 1 = 0.954.

  Hence, if one estimates µ by X̄n, and repeats this a large number of times, then about 95% of the time µ is within 2 × s.d.(X̄n) of X̄n. Recall the “68-95-99.7” rule.

• Typically, σ² = Var(Xi) is unknown in practice. We estimate it using the (unbiased) sample variance estimator

  s²n = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².


Convergence » Two limit theorems: LLN and CLT

The Central Limit Theorem (CLT) (cont.)

• Note that the estimate of √Var(X̄n) = σ/√n, given by sn/√n, is called the standard error of the sample mean. In full,

  SE(X̄n) = √[ (1/(n(n − 1))) Σ_{i=1}^n (Xi − X̄n)² ].

• In fact, it still holds that as n → ∞,

  (X̄n − µ) / (sn/√n) D→ N(0, 1),

  which implies that replacing σ with sn in the above example yields the same results. Hence, if one estimates µ by X̄n, and repeats this a large number of times, then about 95% of the time µ is within 2 × SE(X̄n) of X̄n.
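A minimal sketch (mine, not from the slides) computing the sample mean, the sample standard deviation sn and the standard error for one simulated sample; the normal data and the sample size 50 are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=50)   # any sample will do

xbar = x.mean()
s_n = x.std(ddof=1)            # unbiased sample standard deviation (divides by n - 1)
se = s_n / np.sqrt(x.size)     # standard error of the sample mean

print(f"mean = {xbar:.3f}, s_n = {s_n:.3f}, SE = {se:.3f}")
print("rough 95% range for mu:", (xbar - 2 * se, xbar + 2 * se))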


1 Inequalities

2 Convergence of random variables

3 Point estimation

4 Confidence sets


Point estimation »

Fundamental concepts in statistical inference

Let X1, . . . , Xn be a sample from a population that follows some distribution with pdf f(x|θ). The form of the pdf f is known, but the parameter θ of the pdf is unknown. Often, we may specify θ ∈ Θ, where Θ is the parameter space. Note that θ may be a vector θ = (θ1, . . . , θp)⊤.

Example 23
• For N(µ, σ²), θ = (µ, σ²)⊤, so p = 2 and Θ = R × R≥0.
• For Poi(λ), θ = λ, so p = 1 and Θ = R≥0.

Most inference problems can be identified as one of three types:
1. Point estimation
2. Confidence sets
3. Hypothesis testing
for the parameter θ.


Point estimation » Point estimation

Point estimation

GOAL: Provide a single “best guess” of θ, based on observations X1, . . . , Xn. Formally, we may write

θ̂ = θ̂n = g(X1, . . . , Xn)

as a point estimator for θ, where g(X1, . . . , Xn) is a statistic.

Example 24
A natural point estimator for the mean µ = E(X1) is the sample mean µ̂ = X̄n = n⁻¹ Σ_{i=1}^n Xi.

Remark: Parameters to be estimated are unknown constants. Their estimators are viewed as r.v., although in practice, θ̂ takes some concrete value for the observed sample.


Point estimation » Point estimation

Point estimation (cont.)

Estimator vs Estimate
We use the term “estimator” to denote the function that gives the estimate. On the other hand, an “estimate” is the realised value of the estimator function.

Estimator/Estimate vs true value
The standard convention is to denote estimators/estimates of parameters with hats on the respective symbols (e.g. θ̂), whereas true values do not have hats (c.f. θ).

A good estimator should make |θ̂ − θ| as small as possible. However,
1. θ is unknown; and
2. the value of θ̂ changes with the sample observed.

Hence, we seek an estimator θ̂ which makes the mean squared error (MSE) as small as possible for all possible values of θ.


Point estimation » Desirable properties

MSE, bias and standard error

Definition 25 (Bias)
The bias of an estimator θ̂ is defined to be

Biasθ(θ̂) = Eθ(θ̂) − θ.

When Eθ(θ̂) = θ, Biasθ(θ̂) = 0 for all possible values of θ, and θ̂ is called an unbiased estimator for θ.

Definition 26 (Mean squared error)
The MSE of the estimator θ̂ is defined as

MSEθ(θ̂) = Eθ[(θ̂ − θ)²] = Biasθ(θ̂)² + Varθ(θ̂).

Note: Eθ means that expectations are taken with respect to the distribution which uses the true value θ.


Point estimation » Desirable properties

MSE, bias and standard error (cont.)

Definition 27 (Standard error)
The standard error of the estimator θ̂ is defined as

SE(θ̂) = √Var_θ̂(θ̂).

Note that in the definition of the standard error of an estimator, the expectations (variance) are taken using the distribution with estimated parameter θ̂.


Point estimation » Desirable properties

MSE, bias and standard error (cont.)

Example 28
Let Y1, . . . , Yn be a sample from Bern(p), with p unknown. Let p̂ := Ȳn = n⁻¹ Σ_{i=1}^n Yi. Then,

E(p̂) = (1/n) Σ_{i=1}^n E(Yi) = p; and

Var(p̂) = (1/n²) Σ_{i=1}^n Var(Yi) = p(1 − p)/n.

Therefore, Ȳn is an unbiased estimator for p, with standard deviation √(p(1 − p)/n), and standard error SE(p̂) = √(p̂(1 − p̂)/n).

For example, if n = 10 and Ȳn = 0.3, we have p̂ = 0.3 and SE(p̂) = 0.1449, while the standard deviation of p̂ is unknown.
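A simulation sketch (not from the slides) of this example: repeatedly draw Bernoulli samples of size n = 10 with p = 0.3 and check the mean and spread of p̂ against the formulas above. The number of replications is arbitrary.

import numpy as np

rng = np.random.default_rng(6)
p, n, reps = 0.3, 10, 100_000
p_hat = rng.binomial(n, p, size=reps) / n   # one estimate per simulated sample

print(p_hat.mean())                   # approx p, i.e. unbiased
print(p_hat.std())                    # approx sqrt(p(1 - p)/n)
print(np.sqrt(p * (1 - p) / n))       # = 0.1449 for p = 0.3, n = 10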


Point estimation » Desirable properties

Consistency

Definition 29 (Consistency)
θ̂n is a consistent estimator for θ if θ̂n P→ θ as n → ∞.

Consistency is a natural condition for a reasonable estimator, as θ̂n should converge to θ if we have a (theoretically) infinite amount of information. Therefore, a non-consistent estimator should not be used in practice!

Remark
If MSE(θ̂n) → 0, then θ̂n m.s.→ θ (by definition). Therefore, θ̂n P→ θ too, so θ̂n is a consistent estimator for θ.

Example 30
(Cont. e.g. 28) Since Bias(p̂) = 0, MSE(p̂) = Var(p̂) = p(1 − p)/n → 0 as n → ∞. Hence, p̂ is consistent.


Point estimation » Desirable properties

Consistency vs Unbiasedness

Consistency and unbiasedness are two different concepts:
• Unbiasedness (E(θ̂) = θ) is a statement about the expected value of the sampling distribution of the estimator.
• Consistency (plim_n θ̂n = θ) is a statement about “where the sampling distribution of the estimator is going” as the sample size increases.

Both are desirable properties of estimators, though it might be possible for one to be satisfied but not the other (see Example 32).


Point estimation » Desirable properties

Consistency vs Unbiasedness (Cont.)

In both examples, let X1, . . . , Xn be a sample from N(µ, σ²).

Example 31
Define µ̂ = X1. Then µ̂ is unbiased since E(X1) = µ, but it is not consistent, since the distribution of µ̂ is always N(µ, σ²) and will never concentrate around µ even with an infinite sample size.

Example 32
Define σ̂² = n⁻¹ Σ_{i=1}^n (Xi − X̄n)². It is a fact that E(σ̂²) = ((n − 1)/n) σ², which shows that σ̂² is biased in finite samples. On the other hand, we can show

Var(σ̂²) = 2σ⁴(n − 1)/n² → 0

as n → ∞. Therefore, MSE(σ̂²) → 0, and σ̂² is therefore consistent.


Point estimation » Desirable properties

Asymptotic normality

Definition 33 (Asymptotic normality)
An estimator θ̂n is asymptotically normal if

(θ̂n − θ) / SE(θ̂n) D→ N(0, 1).

Alternatively, this is often written loosely as θ̂n ≈ N(θ, SE(θ̂n)²) for large n.

Remarks:
• Many good estimators, such as the maximum likelihood estimator (MLE), least squares estimator (LSE) and method of moments estimator (MME), are asymptotically normal under some mild conditions.
• The desire for asymptotic normality is simply for convenience (e.g. hypothesis testing of parameters).


Point estimation » Maximum likelihood

Estimation methods

Thus far, we have only discussed desirable properties of estimators, but not any specific way of actually obtaining estimates (apart from the sample mean for µ). Many methods exist; to name a few:

• Method of moments: Express population moments as functions of parameters, then substitute in the sample moments.
• Maximum likelihood estimation: Parameter estimates are those which maximise the likelihood function.
• Least squares estimation: Parameter estimates are those which minimise some error function.

And many others... (generalised MoM, minimum mean squared error, Bayesian least squared error, maximum a posteriori, particle filter, Markov chain Monte Carlo methods, etc.)

In this course, we will be focusing on ML estimation.


Point estimation » Maximum likelihood

Likelihood

The likelihood is one of the most fundamental concepts in all types of statistical inference.

Definition 34 (Likelihood)
Suppose that X = (X1, . . . , Xn) has pdf or pmf f(x|θ), and we have observed X = x*. Then, the likelihood function with observation x* is defined as

L(θ) ≡ L(θ|x*) = f(x*|θ).

Pdf/pmf vs likelihood
The pdf/pmf is a function of x, specifying the distribution of the random variable X.

The likelihood is a function of θ, reflecting information on θ contained in the observation x*.


Point estimation » Maximum likelihood

Likelihood (cont.)

HJ (2018, Ch. 3)
It was Fisher in 1922 who introduced the method of maximum likelihood as an objective way of conducting statistical inference. This method of inference is distinguished from the Bayesian school of thought in that only the data may inform deductive reasoning, but not any sort of prior probabilities. Towards the later stages of his career, his work reflected the view that the likelihood is to be more than simply a device to obtain parameter estimates; it is also a vessel that carries uncertainty about estimation. In this light, and in the absence of the possibility of making probabilistic statements, one should look to the likelihood in order to make rational conclusions about an inference problem. Specifically, we may ask two things of the likelihood function: where is the maximum, and what does the graph around the maximum look like? The first of these two problems is maximum likelihood estimation, while the second concerns the Fisher information.


Point estimation » Maximum likelihood

Likelihood (cont.)

Example 35
Suppose that x is the number of successes from a known number n of independent trials with unknown probability of success π. The probability function, and so the likelihood function, is

L(π) = f(x|π) = (n choose x) π^x (1 − π)^(n−x).

Remember, x is known, while π is unknown.

The likelihood function can be graphed as a function of π. It changes shape for different values of x. As an example, the likelihood function for x = 3 when n = 10 is shown in the figure on the next slide.
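A short sketch (mine, not from the slides) that evaluates this likelihood for x = 3, n = 10 and locates its maximiser on a grid; it assumes scipy.stats.binom is available, and the grid resolution is arbitrary.

import numpy as np
from scipy.stats import binom

n, x = 10, 3
pi_grid = np.linspace(0.001, 0.999, 999)
lik = binom.pmf(x, n, pi_grid)     # L(pi) = C(n, x) pi^x (1 - pi)^(n - x)

print(binom.pmf(x, n, 0.3))        # likelihood at pi = 0.3 (relatively large)
print(binom.pmf(x, n, 0.8))        # likelihood at pi = 0.8 (much smaller)
print(pi_grid[np.argmax(lik)])     # grid maximiser, close to x/n = 0.3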


Point estimation » Maximum likelihood

Likelihood (cont.)


Point estimation » Maximum likelihood

Likelihood (cont.)

Notice that the likelihood function shown is not a density function. It does not have an area of one below it.

We use the likelihood function to compare the plausibility of different possible parameter values.
• For instance, the likelihood is much larger for π = 0.3 than for π = 0.8.
• That is, the data x = 3 have a greater probability of being observed if π = 0.3 than if π = 0.8.

Note
In the above argument, we do not need to calculate exact probabilities under different values of π. Only the order (magnitude) of those quantities matters.


Point estimation » Maximum likelihood

Log-likelihood

Let X1, . . . , Xn be iid with pdf f(x|θ). Write X = (X1, . . . , Xn)⊤. Then, the likelihood function is a product of n terms:

L(θ) ≡ L(θ|X) = Π_{i=1}^n f(Xi|θ).

Definition 36 (Log-likelihood)
The log-likelihood function is defined to be

l(θ) ≡ l(θ|X) = log f(X|θ).

Thus, for iid observations, the log-likelihood becomes a sum of n terms:

l(θ) = Σ_{i=1}^n log f(Xi|θ).

This explains why log-likelihood functions are often used with independent observations.


Point estimation » Maximum likelihood

Maximum likelihood estimator

Definition 37 (Maximum likelihood estimator)
A maximum likelihood estimator (MLE), θ̂ = θ̂(X) ∈ Θ, of the parameter θ is defined to be

θ̂ = arg max_{θ∈Θ} L(θ|X) = arg max_{θ∈Θ} log L(θ|X).

By definition, the MLE satisfies L(θ̂|X) ≥ L(θ|X) for all θ ∈ Θ. Obviously, an MLE is the most plausible value for θ as judged by the likelihood function.


Point estimation » Maximum likelihood

Maximum likelihood estimator (cont.)

Definition 38 (Score function)
The score function is defined to be

S(θ|X) = ∂/∂θ log L(θ|X).

In many cases where Θ is continuous and the maximum does not occur at a boundary of Θ, θ̂ is often the solution of the equation

S(θ|X) = ∂/∂θ log L(θ|X) = 0.

This requires solving a system of p simultaneous equations (one for each θk in θ).


Point estimation » Maximum likelihood

Maximum likelihood estimator (cont.)

Example 39
Suppose that Y1, . . . , Yn is an iid random sample from N(µ, σ²), where neither µ nor σ² is known. Then, the log-likelihood function is

l(µ, σ²) = log[ (√(2πσ²))^(−n) e^(−Σ_{i=1}^n (Yi − µ)² / (2σ²)) ]
         = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (Yi − µ)².

Taking the derivative with respect to µ gives us the first component of the score function:

S1(θ) = (1/σ²) Σ_{i=1}^n (Yi − µ).


Point estimation » Maximum likelihood

Maximum likelihood estimator (cont.)

Example 39
Equating this to zero gives the MLE for µ:

(1/σ²) Σ_{i=1}^n (Yi − µ) = 0
Σ_{i=1}^n Yi − nµ = 0
⇒ µ̂ = (1/n) Σ_{i=1}^n Yi =: Ȳn.

Thus, µ̂ = Ȳn.


Point estimation » Maximum likelihood

Maximum likelihood estimator (cont.)

Example 39
The profile log-likelihood remaining is

l(µ̂, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (Yi − µ̂)²
          = const. − (n/2) log σ² − n σ̃² / (2σ²),

where we defined σ̃² = n⁻¹ Σ_{i=1}^n (Yi − µ̂)². Taking the derivative with respect to σ², we get

S2(θ) = −n/(2σ²) + n σ̃²/(2σ⁴).


Point estimation » Maximum likelihood

Maximum likelihood estimator (cont.)

Example 39
Equating S2 to zero, we get

−n/(2σ²) + n σ̃²/(2σ⁴) = 0
⇒ σ̂² = σ̃².

Thus the MLE for σ² is σ̂² = n⁻¹ Σ_{i=1}^n (Yi − Ȳn)².

Remark
The MLE for σ² is biased since

E(σ̂²) = ((n − 1)/n) σ² ≠ σ².


Point estimation » Maximum likelihood

Invariance property of MLE

Definition 40 (One-to-one transformation)
Let g : X → Y be a function. Then g is said to be one-to-one if and only if for every y ∈ Y there is at most one x ∈ X such that g(x) = y. Equivalently, g is one-to-one if and only if for x1, x2 ∈ X, g(x1) = g(x2) implies x1 = x2.

Definition 41 (Invariance property of MLE)
Suppose X ∼ f(x|θ), and ψ = ψ(θ) is a one-to-one transformation. Let θ̂ be the MLE for θ, i.e.

θ̂ = arg max_θ L(θ).

Then, the MLE for ψ is ψ̂ = ψ(θ̂).


Point estimation » Maximum likelihood

Invariance property of MLE (cont.)

Example 42
Let π̂ be the MLE for π after observing data X1, . . . , Xn iid∼ Bern(π). The log-odds of an event happening is given by ν = log(π/(1 − π)), which is a one-to-one transformation of π. Therefore, the MLE for ν is given by

ν̂ = log( π̂ / (1 − π̂) ).


Point estimation » Maximum likelihood

Numerical computation of MLEs

In modern statistical applications, it is typically difficult to find explicit analytical forms for the MLE. These estimators are found more often by iterative procedures built into computer software (see the sketch below).

• An iterative scheme starts with some guess at the MLE and then steadily improves it with each iteration.
• The estimator is considered to be found when it has become numerically stable.
• Sometimes, the iterative procedures become trapped at a local maximum which is not a global maximum.
• There may be a very large number of parameters in a model, which makes such local entrapment more common.
• Some iterative schemes include: the Newton-Raphson scheme, the Fisher scoring algorithm, quasi-Newton methods, gradient descent, conjugate gradients, etc. We won’t go into the details in this course.
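A minimal sketch of this idea, assuming a normal model purely for illustration: the MLE is found numerically with a quasi-Newton method (BFGS via scipy.optimize) by minimising the negative log-likelihood. The reparameterisation through log(sigma) and the starting values are my choices.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
y = rng.normal(loc=2.0, scale=1.5, size=200)

def neg_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)              # optimise log(sigma) so sigma stays positive
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                   # close to y.mean() and y.std()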


Point estimation » Properties of MLE

Fisher information

In simple terms, the Fisher information measures the amount of information that an observable random variable X carries about an unknown parameter θ of the statistical model for X.

Consider the unidimensional case for now: let X ∼ f(x|θ), where θ ∈ R.

Definition 43 (Fisher information)
The Fisher information is defined to be the second moment of the score function, i.e.

I(θ) = E[ (d/dθ log f(X|θ))² ] ∈ R.


Point estimation » Properties of MLE

Fisher information (cont.)

Lemma 44 (Expectation of the score is zero)
Under certain regularity conditions, E[S(θ)] = 0.

Proof.

E[S(θ)] = ∫ (d/dθ log f(x|θ)) f(x|θ) dx
        = ∫ d/dθ f(x|θ) dx
        = d/dθ ∫ f(x|θ) dx = d/dθ (1) = 0.


Point estimation » Properties of MLE

Fisher information (cont.)

Corollary 45 (The Fisher information is the variance of the score)

I(θ) = Var(S(θ)).

Proof.
Prove this as an exercise.


Point estimation » Properties of MLE

Fisher information (cont.)

Lemma 46 (Another definition for the Fisher information)
Under certain regularity conditions, the Fisher information can also be defined as the negative expected value of the second derivative of the log-likelihood:

I(θ) = −E[ d²/dθ² log f(X|θ) ] ∈ R.


Point estimation » Properties of MLE

Fisher information (cont.)

Proof.

d²/dθ² log f(X|θ) = d/dθ ( d/dθ log f(X|θ) )
                  = d/dθ ( (d/dθ f(X|θ)) / f(X|θ) )
                  = [ f(X|θ) d²/dθ² f(X|θ) − (d/dθ f(X|θ)) (d/dθ f(X|θ)) ] / f(X|θ)²
                  = (d²/dθ² f(X|θ)) / f(X|θ) − ( d/dθ log f(X|θ) )².


Point estimation » Properties of MLE

Fisher information (cont.)

Proof (cont.)
Note that

E[ (d²/dθ² f(X|θ)) / f(X|θ) ] = ∫ [ (d²/dθ² f(x|θ)) / f(x|θ) ] f(x|θ) dx = ∫ d²/dθ² f(x|θ) dx
                              = d²/dθ² ∫ f(x|θ) dx = d²/dθ² (1) = 0.

Therefore,

−E[ d²/dθ² log f(X|θ) ] = E[ ( d/dθ log f(X|θ) )² ] = I(θ).


Point estimation » Properties of MLE

Fisher information (cont.)

The above discussion concerned a single random variable X. If we observe instead X = (X1, . . . , Xn)⊤, where the Xi are iid, then

I(θ) = IX(θ) = Σ_{i=1}^n IXi(θ) = n IX1(θ).

That is, the information is additive: the total Fisher information from n iid random variables X1, . . . , Xn is simply the sum of the n unit Fisher informations.


Point estimation » Properties of MLE

Fisher information matrix

If instead θ is p-dimensional, then we have similar results for the Fisher information, which is now a matrix:

• I(θ) = E[ (∂/∂θ log f(X|θ)) (∂/∂θ log f(X|θ))⊤ ] ∈ R^{p×p}.
• I(θ) = −E[ ∂²/(∂θ ∂θ⊤) log f(X|θ) ] ∈ R^{p×p}.
• I(θ) = IX(θ) = Σ_{i=1}^n IXi(θ) = n IX1(θ).


Point estimation » Properties of MLE

Cramér-Rao inequality

Theorem 47 (Cramér-Rao inequality)
Let X ∼ f(x|θ) satisfy some regularity conditions. Let T = T(X) be a statistic with g(θ) = Eθ(T). Then, for any θ ∈ Θ,

Varθ(T) ≥ g′(θ)² / I(θ).

The Cramér-Rao inequality specifies a lower bound on the variance of any unbiased estimator of the parameter g(θ). When the equality holds, T is called the minimum variance unbiased estimator (MVUE) of g(θ).

Important case
For any unbiased estimator θ̂ = T(X), i.e. g(θ) = θ (the identity function),

Var(θ̂) ≥ 1/I(θ).


Point estimation » Properties of MLE

Cramér-Rao inequality (cont.)

We can state a similar theorem for the multivariate case:

Theorem 48 (Multivariate Cramér-Rao inequality)
Let X ∼ f(x|θ) with θ = (θ1, . . . , θp)⊤ ∈ R^p satisfying some regularity conditions. Let I(θ) ∈ R^{p×p} be the Fisher information matrix, and let T = (T1(X), . . . , Tp(X)) be a statistic with g(θ) = Eθ(T). Then, for any θ ∈ Θ,

Varθ(T) ≥ J(θ) I(θ)⁻¹ J(θ)⊤,

where J(θ) is the Jacobian matrix whose (i, j)-th element is given by ∂gi(θ)/∂θj. Note that the matrix inequality A ≥ B is understood to mean that the matrix A − B is positive semidefinite.


Point estimation » Properties of MLE

Cramér-Rao inequality (cont.)

Important case
For any unbiased estimator θ̂ = T(X),

Var(T) ≥ I(θ)⁻¹.

Rather than inverting the whole Fisher information matrix, one can simply take the reciprocal of the corresponding diagonal element of I(θ) to obtain a (possibly loose) lower bound:

Var(Tk) = [Var(T)]kk   (the kth diagonal entry)
        ≥ [I(θ)⁻¹]kk
        ≥ [I(θ)kk]⁻¹.


Point estimation » Properties of MLE

Cramér-Rao inequality (cont.)

Example 49
Let X1, . . . , Xn be a sample from N(µ, σ²). We consider estimators for µ, treating σ² as known. The score function for a single observation is

S(µ) = ∂/∂µ log[ const. × e^(−(X1 − µ)²/(2σ²)) ]
     = ∂/∂µ [ −(X1 − µ)²/(2σ²) ]
     = (X1 − µ)/σ².

Also note that l′′(µ) = S′(µ) = −σ⁻², hence IX1(µ) = −E[S′(µ)] = σ⁻². Therefore,

I(µ) = Σ_{i=1}^n IXi(µ) = n/σ².


Point estimation » Properties of MLE

Cramér-Rao inequality (cont.)

Example 49
For any unbiased estimator µ̂ for µ, the Cramér-Rao theorem says that

Var(µ̂) ≥ I(µ)⁻¹ = σ²/n.

A previous example has shown that

Var(X̄n) = σ²/n,

so therefore X̄n is the MVUE for µ.


Point estimation » Properties of MLE

Asymptotic properties of MLE

Let X1, . . . , Xn be iid with pdf f(x|θ). Write

l(θ) = Σ_{i=1}^n log f(Xi|θ).

Let θ̂ be the MLE which maximises l(θ), and suppose that f fulfils certain regularity conditions.

The MLE (usually) satisfies the following two properties:
1. Consistency; and
2. Asymptotic normality.


Point estimation » Properties of MLE

Consistency

The MLE is consistent in the sense that as n → ∞,

P[ ‖θ̂ − θ‖ > ε ] → 0

for any ε > 0. The operator ‖·‖ is some norm (absolute value norm, 2-norm, etc.) for vectors, analogous to taking absolute values in the unidimensional case.

Consistency requires that an estimator converges to the parameter to be estimated. It is a very mild and modest condition that any reasonable estimator should fulfil. The consistency condition is often used to rule out bad estimators.


Point estimation » Properties of MLE

Asymptotic normality

As n → ∞,

√n (θ̂ − θ) D→ Np(0, IX1(θ)⁻¹).

In other words, for large n, it approximately holds that

θ̂ ∼ Np(θ, IX1(θ)⁻¹/n).

Therefore, asymptotically the MLE is unbiased and attains the Cramér-Rao bound. Any estimator fulfilling this condition is called efficient.

Approximate standard error
An approximate standard error of the jth component of θ̂ (i.e. θ̂j) is the square root of the (j, j)th element of IX1(θ)⁻¹ divided by √n.


Point estimation » Properties of MLE

Asymptotic properties of MLE (cont.)

Example 50
The family of Bernoulli distributions has pmf f(x|p) = p^x (1 − p)^(1−x), x ∈ {0, 1}. Taking logarithms,

log f(x|p) = x log p + (1 − x) log(1 − p),

and the first and second derivatives are

S(p) = x/p − (1 − x)/(1 − p),

S′(p) = −x/p² − (1 − x)/(1 − p)².


Point estimation » Properties of MLE

Asymptotic properties of MLE (cont.)

Example 50
The (unit) Fisher information can then be computed as

I(p) = −E[S′(p)] = E(X)/p² + (1 − E(X))/(1 − p)²
     = p/p² + (1 − p)/(1 − p)²
     = 1/p + 1/(1 − p) = 1/(p(1 − p)).

As we know (or can be shown), the MLE for p is p̂ = X̄n when a sample X1, . . . , Xn is observed, and the asymptotic normality result states that

√n (p̂ − p) D→ N(0, p(1 − p)).
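A simulation sketch (mine, not from the slides) of this asymptotic normality result: the variance of √n(p̂ − p) across many replications should be close to p(1 − p) = 1/I(p). The values p = 0.3, n = 200 and the replication count are arbitrary.

import numpy as np

rng = np.random.default_rng(9)
p, n, reps = 0.3, 200, 50_000

p_hat = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hat - p)         # should be approximately N(0, p(1 - p))

print(z.mean(), z.var())             # approx 0 and p(1 - p)
print(p * (1 - p))                   # = 1/I(p)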


Point estimation » Properties of MLE

Asymptotic properties of MLE (cont.)

Example 50
Of course, since the MLE is p̂ = X̄n (the sample mean), we can also appeal to the CLT, which gives the exact same result:

√n (X̄n − µ) D→ N(0, σ²)

as n → ∞, where µ = E(X) = p and σ² = Var(X) = p(1 − p) for the Bernoulli distribution.

Remark
Recall the “normal approximation to the binomial”: for large n, X ∼ Bin(n, p) can be approximated by X ∼ N(np, np(1 − p)). This is exactly the reason why this approximation is used. Note that the Bernoulli and Binomial distributions are related.


1 Inequalities

2 Convergence of random variables

3 Point estimation

4 Confidence sets


Confidence sets »

Confidence sets

A point estimator is simple to construct and use, but it is not very informative.
• If a different sample is used, the value of the estimate changes.
• The point estimate does not reflect the uncertainty in the estimation.

The confidence interval is the most commonly used confidence set. It is more informative than a point estimator.


Confidence sets »

Confidence sets (cont.)

Example 51
A random sample X1, . . . , Xn is drawn from N(µ, 1). Then we know that √n(X̄n − µ) ∼ N(0, 1) (exactly here; in general this holds asymptotically by the CLT). From this, we can deduce that

P(−1.96 ≤ √n(X̄n − µ) ≤ 1.96) = 0.95,

or

P(X̄n − 1.96/√n ≤ µ ≤ X̄n + 1.96/√n) = 0.95,

so a 95% confidence interval for µ is (X̄n − 1.96/√n, X̄n + 1.96/√n).

Suppose n = 4 and X̄n = 2.25; then a 95% CI is

(2.25 − 1.96/2, 2.25 + 1.96/2) = (1.27, 3.23).


Confidence sets »

Confidence sets (cont.)

QUESTION: What is P(1.27 < µ < 3.23)? Note that µ is an unknown constant!

ANSWER: (1.27, 3.23) is one instance of the random interval which covers µ with probability 0.95.

If one draws 10,000 samples of size n each in order to construct 10,000 intervals of the form (X̄n − 1.96/√n, X̄n + 1.96/√n), then about 95% of the intervals, i.e. about 9,500 intervals, will cover the true value of µ.
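A coverage-simulation sketch (not from the slides) of exactly this statement; the true value µ = 2 and n = 4 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(10)
mu, n, reps = 2.0, 4, 10_000

samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)
lower = xbar - 1.96 / np.sqrt(n)
upper = xbar + 1.96 / np.sqrt(n)

print(np.mean((lower < mu) & (mu < upper)))   # empirical coverage, approximately 0.95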


Confidence sets »

Confidence sets (cont.)

The blue lines represent CIs constructed from replicates of the sample of size n. In this case, hypothetically, the true value µ is known.


Confidence sets »

Confidence interval

Definition 52 (Confidence interval)
Let X = (X1, . . . , Xn)⊤ be a sample from a pdf with parameter θ. If L(X) and U(X) are two statistics for which

P(L(X) < θ < U(X)) = 1 − α,

then (L(X), U(X)) is called a 100(1 − α)% confidence interval for θ.

Remark
1 − α is called the confidence level, which is usually set at 0.90, 0.95 or 0.99. Naturally, for a given α, we shall search for the interval with the shortest length U − L, which gives the most accurate estimation.


Confidence sets »

Approximate CI based on asymptotic normality

Definition 53 (Approximate CI based on asymptotic normality)
If (θ̂ − θ)/SE(θ̂) D→ N(0, 1), then

(θ̂ − Zα/2 SE(θ̂), θ̂ + Zα/2 SE(θ̂))

is an approximate 1 − α confidence interval for θ, where Zα is the top-α point of N(0, 1), i.e. P(Z > Zα) = α for Z ∼ N(0, 1).

Sometimes this approximate interval is known as the Wald interval.

Some values of Zα/2 for various α: for α = 0.1, Zα/2 = 1.64; for α = 0.05, Zα/2 = 1.96 (most commonly used); for α = 0.01, Zα/2 = 2.58.


Confidence sets »

Confidence interval (cont.)

Example 54
We saw earlier that for X1, . . . , Xn ∼ Bern(p),

√n (p̂ − p) D→ N(0, p(1 − p))

as n → ∞, where p̂ = X̄n. Now an approximate 95% confidence interval for p is

p̂ ± 1.96 √(p̂(1 − p̂)/n).

For example, if n = 10 and p̂ = 0.3, an approximate 95% confidence interval for p is

0.3 ± 1.96 √(0.3(1 − 0.3)/10) = (0.016, 0.584).

Whereas if n = 100 and p̂ = 0.3, an approximate 95% confidence interval for p is (0.210, 0.390).
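A small helper (hypothetical, not from the slides; the name wald_ci is my own) that computes this Wald interval for both sample sizes.

import numpy as np

def wald_ci(p_hat, n, z=1.96):
    """Approximate (Wald) confidence interval for a Bernoulli proportion."""
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(wald_ci(0.3, 10))    # wide interval with n = 10
print(wald_ci(0.3, 100))   # much narrower interval with n = 100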


Confidence sets »

Confidence interval (cont.)

Remark
The point estimator p̂ in the above example did not change between n = 10 and n = 100 (of course this is just a made-up example; in reality it might change, but for the argument’s sake, let’s say it did not). However, the confidence interval is much shorter when n = 100, giving a much more accurate estimation.


References I

Casella, G. and R. L. Berger (2002). Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury. ISBN: 978-0-534-24312-8.

Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. ISBN: 978-0-19-850765-9.

Jamil, H. (Oct. 2018). “Regression modelling using priors depending on Fisher information covariance kernels (I-priors)”. PhD thesis. London School of Economics and Political Science.

Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer-Verlag. ISBN: 978-0-387-25145-5. DOI: 10.1007/0-387-30623-4.
