Page 1: An Introduction and Overview - 123.physics.ucdavis.edu

STATISTICAL MODELING AND INFERENCE

An Introduction and Overview

Marie Davidian
Department of Statistics

[email protected]

http://www.stat.ncsu.edu/∼davidian

SAMSI Inverse Problem Workshop
September 21, 2002

1

Page 2: An Introduction and Overview - 123.physics.ucdavis.edu

OUTLINE

1. Introduction: Sources of variation in data

2. Whirlwind review of probability

3. Statistical models

4. Statistical inference: Classical frequentist paradigm

5. Modeling and inference for independent data

6. Hierarchical statistical models for complex data structures

7. Statistical inference: Bayesian paradigm

8. Hierarchical models, revisited

9. Closing remarks

10. Where to learn more . . .

Page 3: An Introduction and Overview - 123.physics.ucdavis.edu

1. Introduction

Statistics:

• “The study of variation”

• NOT a branch of mathematics

• More like a philosophy for thinking about drawing conclusions from

data

Objective of this tutorial: Introduce statistical thinking from the

point of view of fitting models to data

• Lay groundwork for further study

3

Page 4: An Introduction and Overview - 123.physics.ucdavis.edu

Deterministic models: Representation of an “exact” relationship

• System: ẋ(t) = g{t, x(t), θ}

• Solution: y = x(t, θ)

• Objective: “Inverse problem” – Learn about θ

• Example – the design problem (Banks)

Observations: Suppose we observe the system over time and record

values y1, . . . , yn at times 0 ≤ t1 < · · · < tn

• Commonly, observations do not track exactly on the curve

y = x(t, θ)

4

Page 5: An Introduction and Overview - 123.physics.ucdavis.edu

Simple example: Pharmacokinetics of theophylline (anti-asthmatic)

• Understanding of processes of absorption, distribution, elimination

important for developing dosing recommendations

• Common deterministic model: One compartment model with

first-order absorption and elimination following oral dose D

[Diagram: one-compartment model – oral dose D enters the absorption site, is absorbed into the body compartment X(t) at rate ka, and is eliminated at rate ke]

• Assumption: X(t) = V C(t) [constant relationship between drug

concentration in plasma C(t) and amount of drug in body X(t)]

5

Page 6: An Introduction and Overview - 123.physics.ucdavis.edu

System: Xa(t) = amount of drug at absorption site at time t

Ẋ(t) = ka Xa(t) − ke X(t)

Ẋa(t) = −ka Xa(t)

with initial conditions Xa(0) = Xa0 = FD, X(0) = X0 = 0, where F is

the fraction available (assume known)

Closed form solution: Divide by V

C(t) = {ka F D / [V (ka − ke)]} {exp(−ke t) − exp(−ka t)}

Result: If the model is a perfect representation of the system, the

relationship

C(t) = {ka F D / [V (ka − ke)]} {exp(−ke t) − exp(−ka t)}

should describe the concentration observed at time t

6
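As a concrete illustration of this closed-form solution (an editorial aside, not from the original slides), here is a minimal Python sketch of the one-compartment concentration curve; all parameter and dose values are made up purely for demonstration.

```python
import numpy as np

def conc(t, ka, ke, V, F=1.0, D=4.0):
    """One-compartment model with first-order absorption and elimination:
    C(t) = ka*F*D / (V*(ka - ke)) * (exp(-ke*t) - exp(-ka*t))."""
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Evaluate the curve on a grid of times (all parameter values are illustrative only)
t = np.linspace(0.0, 25.0, 101)
curve = conc(t, ka=1.5, ke=0.08, V=0.5)
print(np.round(curve[:5], 3))
```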

Page 7: An Introduction and Overview - 123.physics.ucdavis.edu

Experiment: PK in humans following oral dose

• 12 “healthy volunteers” each given dose D (mg/kg) at time t = 0

• Blood samples drawn at 10 subsequent time points over the next 25

hours for each subject

• Samples assayed for theophylline concentration

• Observe y1, . . . , y10 at times t1, . . . , t10

Objectives:

1. For a specific subject, learn about absorption, elimination,

distribution by determining ka, ke, V ⇒ dosing recommendations

for this subject

2. Learn about how absorption, elimination, and distribution differ

from subject to subject ⇒ dosing recommendations for the

population of likely subjects

7

Page 8: An Introduction and Overview - 123.physics.ucdavis.edu

Data for subject 12: Plot of concentration vs. time

[Figure: theophylline concentration (mg/L) vs. time (hr), 0–25 hr, Subject 12]

8

Page 9: An Introduction and Overview - 123.physics.ucdavis.edu

Data for subject 12: With “fitted model” superimposed

[Figure: theophylline concentration (mg/L) vs. time (hr), 0–25 hr, Subject 12, with the fitted model curve superimposed]

9

Page 10: An Introduction and Overview - 123.physics.ucdavis.edu

Remarks:

• Observed concentrations trace out a pattern over time quite similar

to that dictated by the one compartment model

• But they do not lie exactly on a smooth trajectory

• “Observation error”

Why?

• One obvious reason: Assay is not perfect, cannot measure

concentration exactly (measurement error)

• Other reasons?

10

Page 11: An Introduction and Overview - 123.physics.ucdavis.edu

Remarks:

• Observed concentrations trace out a pattern over time quite similar

to that dictated by the one compartment model

• But they do not lie exactly on a smooth trajectory

• “Observation error”

Why?

• One obvious reason: Assay is not perfect, cannot measure

concentration exactly (measurement error)

• Other reasons?

– Model misspecification

– More complex biological process

– Times/dose recorded incorrectly

– Etc. . .

10

Page 12: An Introduction and Overview - 123.physics.ucdavis.edu

Hypothetically, what’s really going on: More complex biological

process and measurement error

• The model is an idealized representation in some sense

[Figure: C(t) vs. t, 0–20 hr – smooth model curve together with a more complex underlying process and noisy measurements]

11

Page 13: An Introduction and Overview - 123.physics.ucdavis.edu

Sources of variation: The (deterministic) model is a good

representation of the general pattern, but observed concentrations are

subject to

• Intra-subject “fluctuations”

• Assay measurement error

Conceptualization: Can think of what we observe as

yj = f(tj , θ) + εj

• f(t, θ) = C(t), a function of θ = (ka, ke, V )T (and D)

• εj is the deviation between what the (deterministic) model dictates

we would see at tj and what we actually observe due to

measurement error, “biological fluctuations”

12
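To make this conceptualization concrete, a short Python sketch (editorial, not from the slides) simulates observations yj = f(tj, θ) + εj using the one-compartment curve from earlier, with independent normal εj as a simple stand-in for measurement error and fluctuations; the times, parameter values, and error standard deviation are hypothetical.

```python
import numpy as np

def f(t, ka, ke, V, F=1.0, D=4.0):
    # One-compartment mean concentration curve f(t, theta)
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

rng = np.random.default_rng(0)
t_obs = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])
theta = dict(ka=1.5, ke=0.08, V=0.5)         # hypothetical "true" parameters
eps = rng.normal(0.0, 0.5, size=t_obs.size)  # deviations eps_j (measurement error + fluctuation)
y = f(t_obs, **theta) + eps                  # observed concentrations y_j
print(np.round(y, 2))
```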

Page 14: An Introduction and Overview - 123.physics.ucdavis.edu

Thought experiment: Consider measurement error

• A particular blood sample has a “true” concentration

• When we measure this concentration, an error is committed, which

causes “observed” to deviate from “true” (+ or −)

• Suppose we were to measure the same sample over and over – each

time, a possibly different error is committed

• Thus, all such observations would turn out differently (VARY), even

though, ideally, they should be all the same (measuring the same

thing)

Result: Measurement error is a source of variation that leads to

UNCERTAINTY in what we observe

• In actuality, we measure the concentration only once, but it could

have turned out differently

• ⇒ Any determination of θ from data is subject to uncertainty

13

Page 15: An Introduction and Overview - 123.physics.ucdavis.edu

All 12 subjects:

[Figure: theophylline concentration (mg/L) vs. time (hr), 0–25 hr, all 12 subjects]

14

Page 16: An Introduction and Overview - 123.physics.ucdavis.edu

All 12 subjects:

[Figure: theophylline concentration (mg/L) vs. time (hr), 0–25 hr, all 12 subjects]

15

Page 17: An Introduction and Overview - 123.physics.ucdavis.edu

Recall objective 2: Absorption, distribution, elimination in the

population of subjects

• Similar pattern for all subjects, but different features

• ⇒ Each subject has his/her own θ

• Subject-to-subject variation

• Objective, restated – learn about θ values in the (hypothetical)

population of subjects like these

• But we have only seen a sample of 12 subjects from this population

⇒ uncertainty about entire population

• Each subject’s data also subject to uncertainty due to measurement

error, “biological fluctuation”

• How to formalize the objective and take into account uncertainty

from all these sources of variation?

16

Page 18: An Introduction and Overview - 123.physics.ucdavis.edu

Principles:

• Failure to acknowledge uncertainty can lead to erroneous conclusions

• Acknowledging uncertainty requires a formal framework to describe

and assess it

• Acknowledging uncertainty clarifies limitations of what can be

learned from data

Statistical models:

• Formally represent sources of variation leading to uncertainty

• ⇒ In this context: Incorporate a deterministic model in a statistical

framework

• Main tool: probability

17

Page 19: An Introduction and Overview - 123.physics.ucdavis.edu

For example: Statistical model for theophylline concentrations for

subject 12

Yj = f(tj , θ) + εj , j = 1, . . . , n

• εj (and hence Yj) is a random variable with a probability distribution

that characterizes “populations” of possible values of phenomena

like measurement errors, fluctuations that might occur at tj

• Describes pairs (Yj , tj) we might see – describes the “data

generating mechanism”

• Data we observe are realizations of Yj , j = 1, . . . , n: y1, . . . , yn

• The mechanism is characterized by assumptions on the probability

distribution of εj (so, equivalently, on that of Yj)

18

Page 20: An Introduction and Overview - 123.physics.ucdavis.edu

2. Review of probability

“Experiment”: Sample space Ω

• Toss a coin 2 times: Ω = {HH, HT, TH, TT}

• Measure concentration: Ω = all possible conditions

Probability function: For B some collection of subsets A of Ω

• P(A) ≥ 0, P(Ω) = 1, P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) for disjoint Ai

• Properties – P(∅) = 0, P(A) ≤ 1, P(A^c) = 1 − P(A), P(A) ≤ P(B) if A ⊂ B, etc.

Random variable: A function from Ω into ℝ (capital letters) with

new sample space X

• E.g., toss coin 2 times, X(ω) = # heads, X = {0, 1, 2}

• E.g., measure concentration, ε(ω) = error committed, X = (−∞, ∞)

19

Page 21: An Introduction and Overview - 123.physics.ucdavis.edu

Probability function for X:

• X = x, x ∈ X, iff there is ω ∈ Ω such that X(ω) = x

P_X(X = x) = P(ω ∈ Ω : X(ω) = x)

• Customary to speak directly about probability wrt random variables

• X denotes random variable, x denotes possible values (elements of

X , realizations)

Cumulative distribution function for X: F(x) = P(X ≤ x) for all x

(nondecreasing and right continuous)

• X is discrete if F (x) is a step function

• X is continuous if F (x) is continuous

20

Page 22: An Introduction and Overview - 123.physics.ucdavis.edu

Probability mass and density functions:

• Discrete: probability mass function

f(x) = P(X = x) for all x ⇒ F(x) = Σ_{u ≤ x} f(u)

• Continuous: probability density function f(x) satisfies

F(x) = ∫_{−∞}^{x} f(u) du for all x

• f(x) ≥ 0, Σ_x f(x) = 1 or ∫_{−∞}^{∞} f(x) dx = 1

Transformations: Y = g(X) is also a random variable with pmf/pdf

that may be derived from f(x)

⇒ “Probability distribution”

21

Page 23: An Introduction and Overview - 123.physics.ucdavis.edu

Random vectors: (X1, . . . , Xp)^T is a function from Ω into ℝ^p with

pmf/pdf f(x1, . . . , xp)

• E.g., all discrete – f(x1, . . . , xp) = P (X1 = x1, . . . , Xp = xp)

• Marginal pmf/pdf – e.g., fX1(x1) = Σ_{x2,...,xp} f(x1, . . . , xp)

Independence: X1 and X2 are independent if

f(x1, x2) = fX1(x1)fX2(x2), X1 ‖ X2

Expectation (mean, expected value): “Average value”

E{g(X)} = ∫_{−∞}^{∞} g(x) f(x) dx (X continuous)

E{g(X)} = Σ_x g(x) f(x) = Σ_x g(x) P(X = x) (X discrete)

Variance: Second central moment, measure of “spread,” quantifies

variation, var(X) = E[{X − E(X)}²]

• Standard deviation = √var(X), on the same scale as X

22

Page 24: An Introduction and Overview - 123.physics.ucdavis.edu

Covariance and correlation: “Degree of association”

• Covariance between X1 and X2

cov(X1, X2) = E[{X1 − E(X1)}{X2 − E(X2)}]

• Will be > 0 if X1 > E(X1) and X2 > E(X2) or X1 < E(X1) and

X2 < E(X2) tend to happen together

• Will be < 0 if X1 > E(X1) and X2 < E(X2) or X1 < E(X1) and

X2 > E(X2) tend to happen together

• Will = 0 if X1 and X2 are ‖

• Correlation – covariance on a unitless basis

ρX1X2 = corr(X1, X2) = cov(X1, X2) / √{var(X1) var(X2)}

• −1 ≤ ρX1X2 ≤ 1; ρX1,X2 = −1 or 1 iff X1 = a + bX2

23

Page 25: An Introduction and Overview - 123.physics.ucdavis.edu

Conditional probability: Probabilistic statement of “relatedness”

• E.g., weight Y > 200 more likely for X = 6 than X = 5 feet tall

• X, Y discrete: conditional pmf given X = x is function of y

f(y|x) = P(Y = y | X = x) = f(x, y)/fX(x), fX(x) > 0

and satisfies Σ_y f(y|x) = 1 (a pmf for fixed x)

• X, Y continuous: conditional pdf given X = x is function of y

f(y|x) = f(x, y)/fX(x), fX(x) > 0

and satisfies ∫_{−∞}^{∞} f(y|x) dy = 1 (a pdf for fixed x)

• Thus, the conditional distribution of Y given X = x is possibly

different for each x

• Y |X denotes the family of probability distributions so defined

24

Page 26: An Introduction and Overview - 123.physics.ucdavis.edu

Conditional expectation: For g(Y ) a function of Y , define the

conditional expectation of g(Y) given X = x

E{g(Y)|X = x} = E{g(Y)|x} = Σ_y g(y) f(y|x) (discrete)

E{g(Y)|X = x} = E{g(Y)|x} = ∫_{−∞}^{∞} g(y) f(y|x) dy (continuous)

• Conditional expectation is a function of x taking a value in ℝ,

possibly different for each x

• Thus, E{g(Y)|X} is a random variable whose value depends on the

value of X (and takes on values E{g(Y)|x} as X takes on values x)

• Conditional variance defined similarly

25

Page 27: An Introduction and Overview - 123.physics.ucdavis.edu

Independence: If X and Y are independent random variables, then

f(y|x) = f(x, y)/fX(x) = fX(x) fY(y)/fX(x) = fY(y)

and

E{g(Y)|X = x} = E{g(Y)} for any x

so E{g(Y)|X} is a constant random variable and equal to E{g(Y)}

26

Page 28: An Introduction and Overview - 123.physics.ucdavis.edu

Some probability distributions: “∼” means “distributed as”

• X ∼ Poisson(λ) – a model for counts x = 0, 1, 2, . . .

f(x) = P(X = x) = exp(−λ) λ^x / x!, E(X) = λ, var(X) = λ

• Normal or Gaussian distribution: X ∼ N (µ, σ2). For −∞ < x < ∞

f(x) = {1/(√(2π) σ)} exp{−(x − µ)²/(2σ²)}, E(X) = µ, var(X) = σ², σ > 0

symmetric about µ, Z = (X − µ)/σ ∼ N (0, 1) standard normal

• Lognormal distribution: log X ∼ N (µ, σ2)

E(X) = exp(µ + σ²/2), var(X) = {exp(σ²) − 1} exp(2µ + σ²) ∝ E(X)²

Constant coefficient of variation (CV) = √var(X)/E(X)

(“noise-to-signal”) – does not depend on E(X)

27
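As a quick sanity check of the lognormal moment formulas above (an editorial aside, not from the slides), the following Python snippet compares the stated E(X) and var(X) with Monte Carlo estimates; the µ and σ values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 0.4                                   # illustrative values only
x = np.exp(rng.normal(mu, sigma, size=1_000_000))      # X with log X ~ N(mu, sigma^2)

mean_formula = np.exp(mu + sigma**2 / 2)
var_formula = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)

print(mean_formula, x.mean())                 # should agree closely
print(var_formula, x.var())                   # should agree closely
print(np.sqrt(var_formula) / mean_formula)    # constant CV, free of E(X)
```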

Page 29: An Introduction and Overview - 123.physics.ucdavis.edu

(a) Normal pdfs (variances σ₁², σ₂²) and (b) lognormal pdf:

[Figure: (a) two normal densities f(x) centered at µ with variances σ₁² and σ₂²; (b) a right-skewed lognormal density]

28

Page 30: An Introduction and Overview - 123.physics.ucdavis.edu

Multivariate normal distribution: Random vector

X = (X1, . . . , Xp)T has a multivariate (p-variate) normal distribution if

α^T X ∼ normal for all α ∈ ℝ^p

f(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{−(x − µ)^T Σ^{−1} (x − µ)/2},

for x = (x1, . . . , xp)^T ∈ ℝ^p

• E(X) = µ = (µ1, . . . , µp)^T = {E(X1), . . . , E(Xp)}^T

• Σ (p× p) is such that Σjj = var(Xj), Σjk = Σkj = cov(Xj , Xk)

• Σ = E{(X − µ)(X − µ)^T} is the covariance matrix

• The marginal pdfs are univariate normal

29

Page 31: An Introduction and Overview - 123.physics.ucdavis.edu

Two bivariate (p = 2) normal pdfs:

[Figure: perspective and contour plots over (x1, x2) of two bivariate normal pdfs, one with ρX1X2 = 0.0 and one with ρX1X2 = 0.8]

30

Page 32: An Introduction and Overview - 123.physics.ucdavis.edu

3. Statistical models

Basic idea: Random variables and their probability distributions are

the building blocks of statistical models

• A statistical model is a representation of the mechanism by which

data are assumed to arise

• Phenomena that are subject to variation and hence give rise to

uncertainty in the way data may “turn out” are represented by

random variables

• Assumptions on the probability distributions for these random

variables represent assumptions on the nature of such variation

• Return to the theophylline example for a demonstration. . .

31

Page 33: An Introduction and Overview - 123.physics.ucdavis.edu

Recall: For subject 12

Yj = f(tj , θ) + εj , j = 1, . . . , n

f(t, θ) = {ka F D / [V (ka − ke)]} {exp(−ke t) − exp(−ka t)}, θ = (ka, ke, V)^T

• Aggregate effects of measurement error, “biological fluctuations,”

other phenomena represented by random variable εj

• The assumed probability distribution of εj , and hence that of Yj

reflects assumed features of these phenomena

Aside: More formally, could observe the PK process at any time ⇒

Y (t) = f(t, θ) + ε(t), t ≥ 0

• Y(t) [and ε(t)] is a stochastic process with sample paths y(t)

• Yj = Y (tj), εj = ε(tj)

32

Page 34: An Introduction and Overview - 123.physics.ucdavis.edu

Example – Building a statistical model for subject 12 data: Make assumptions that characterize beliefs about variation

Yj = f(tj , θ) + εj

Assumption 1: Nature of εj – “additive effects”

εj = ε1j + ε2j

• ε1j represents measurement error that could be committed at tj

• ε2j represents “fluctuation” that might occur at tj

• Continuous random variables – concentrations in principle can take

on any value (although we may be limited in what we may actually

observe due to resolution of measurement)

• Random vectors ε1 = (ε11, . . . , ε1n)T , ε2 = (ε21, . . . , ε2n)T ,

Y = (Y1, . . . , Yn)T

33

Page 35: An Introduction and Overview - 123.physics.ucdavis.edu

Assumption 2–Measurement error: Some “reasonable”

assumptions on aspects of the joint probability distribution of ε1

• Measuring device is unbiased – does not systematically err in a particular

direction ⇒ E(ε1j) = 0 for each j = 1, . . . , n

(All possible errors for measuring concentration for the sample taken at

any tj “average out” to zero)

• In fact, negative or positive errors are equally likely ⇒ the marginal

probability density of ε1j is symmetric for each j

• Measurement errors at any two times tj , tj′ are “unrelated”

ε1j ‖ ε1j′ ⇒ cov(ε1j , ε1j′) = 0

• Variation among all errors that might occur at any tj is the same ⇒ var(ε1j) = σ₁²

for all j (unaffected by time or “actual concentration” in the sample at

tj) – is this realistic?

34

Page 36: An Introduction and Overview - 123.physics.ucdavis.edu

Assumption 3–“Fluctuations”: Some “reasonable” assumptions on

aspects of the joint probability distribution of ε2

• Fluctuations tend to “track” the smooth trajectory f(t, θ) over time

(sample path) but can be “above” or “below” at any point in time ⇒ E(ε2j) = 0

(All possible fluctuations at any particular time “average out” to zero)

• In fact, negative or positive fluctuations at a particular time are equally

likely ⇒ the marginal probability density of ε2j is symmetric

• Variation among fluctuations that might occur at any tj is the same ⇒ var(ε2j) = σ₂²

• Fluctuations “close together” in time (at times tj, tj′) tend to behave

“similarly,” with extent of “similarity” decreasing as |tj − tj′| ↑

cov(ε2j, ε2j′) = C(|tj − tj′|) ⇒ corr(ε2j, ε2j′) = c(|tj − tj′|)

for decreasing functions C(·), c(·) with C(0) = σ₂² and c(0) = 1

35

Page 37: An Introduction and Overview - 123.physics.ucdavis.edu

Assumption 3–“fluctuations,” continued:

• E.g., for corr(ε2j, ε2j′) = c(|tj − tj′|), take c(u) = exp(−φu²)

(so correlation between fluctuations at two times is nonnegative,

reflecting “similarity”)

• Extent and direction of measurement error at any time tj unrelated to

fluctuations at tj or any other time ⇒ ε1j ‖ ε2j′

for any tj , tj′ , j, j′ = 1, . . . , n

Remarks:

• The foregoing assumptions are not the only assumptions one could

make, but exemplify the considerations involved

• The normal probability distribution is a natural choice to represent

the assumption of symmetry

36

Page 38: An Introduction and Overview - 123.physics.ucdavis.edu

Result: Taken together, these assumptions imply

Y ∼ Nn{f(θ), σ₁²In + σ₂²Γ}, ψ = (θ^T, σ₁², σ₂², φ)^T (1)

• f(θ) = {f(t1, θ), . . . , f(tn, θ)}^T

• In is the (n × n) identity matrix, Γ is (n × n) with ones on the diagonal and

(j, j′) off-diagonal element exp{−φ(tj − tj′)²}, e.g., Γ₁₂ = Γ₂₁ = exp{−φ(t1 − t2)²}

• Each marginal is a normal density, e.g., Yj ∼ N{f(tj, θ), σ₁² + σ₂²}

• E(Yj) = f(tj, θ), var(Yj) = σ₁² + σ₂², cov(Yj, Yj′) = σ₂² exp{−φ(tj − tj′)²}

• Interpretation – f(t, θ) is the result of averaging across all possible

sample paths of the fluctuation process and measurement errors, so

representing the “inherent trajectory” for subject 12

37
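The covariance structure in (1) is easy to build numerically. Below is a short Python sketch (an editorial illustration, not from the slides) that assembles σ₁²In + σ₂²Γ with the Gaussian correlation function; the variance components and φ value are hypothetical.

```python
import numpy as np

def covariance(t, sigma1_sq, sigma2_sq, phi):
    """sigma1^2 * I_n + sigma2^2 * Gamma, where Gamma_jk = exp(-phi*(t_j - t_k)^2)."""
    t = np.asarray(t, dtype=float)
    diff = t[:, None] - t[None, :]
    gamma = np.exp(-phi * diff**2)
    return sigma1_sq * np.eye(t.size) + sigma2_sq * gamma

t_obs = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])
Sigma = covariance(t_obs, sigma1_sq=0.2, sigma2_sq=0.1, phi=0.5)
print(np.round(Sigma[:3, :3], 3))   # nearby times show covariance near sigma2^2
```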

Page 39: An Introduction and Overview - 123.physics.ucdavis.edu

Common simplification:

• If the tj are far apart in time, |tj − tj′ | is large, and hence

exp{−φ(tj − tj′)²} is close to zero ⇒ “correlation among fluctuations

at t1, . . . , tn is negligible”

• Approximate by assuming ε2j ‖ ε2j′ ⇒ cov(ε2j , ε2j′) = 0 and thus

Γ = In ⇒ Yj ‖ Yj′ ⇒ cov(Yj, Yj′) = 0,

and var(Yj) = σ² = σ₁² + σ₂²

• The statistical model becomes

Y ∼ Nn{f(θ), σ²In}, ψ = (θ^T, σ²)^T (2)

38

Page 40: An Introduction and Overview - 123.physics.ucdavis.edu

4. Statistical inference I

Key point: A statistical model like (1) or (2) describes all possible

probability distributions for random vector Y representing the data

generating mechanism for observations we might see at t1, . . . , tn

• E.g., for (1), possible probability distributions are specified by

different values of the parameter ψ = (θ^T, σ₁², σ₂², φ)^T ∈ Ψ

• The big question: Which value of ψ truly governs the mechanism?

• In particular, we are interested in θ (σ₁², σ₂², φ are required to

describe things fully, but are a nuisance . . . more later)

Objective: If we collect data [so observe a single realization of

Y = (Y1, . . . , Yn)T ], what can we learn about ψ?

• . . . and how can we account for the fact that things could have

turned out differently (i.e., a different realization)?

39

Page 41: An Introduction and Overview - 123.physics.ucdavis.edu

Conceptually:

• Think of the statistical model as a formal representation of the

“population” of all possible realizations of Y1, . . . , Yn we would ever

see

• When we collect data, we observe a sample from the population;

i.e., a single realization of Y1, . . . , Yn, y1, . . . , yn

Objective, restated: What can we learn about the “true value” of ψ

(which determines the nature of the population) from a sample?

• How uncertain will we be?

Statistical inference (loosely speaking): Making statements

about a population on the basis of only a sample

40

Page 42: An Introduction and Overview - 123.physics.ucdavis.edu

Parameter (point) estimation: Construct a function of Y1, . . . , Yn

that, if evaluated at a particular realization y1, . . . , yn, yields a numerical

value that gives information on the true value of ψ

• Estimator: The function itself

• Estimate: The numerical value for the particular realization

• Estimation: Used both to denote the procedure (estimator) and

actual calculation of a numerical value (estimate)

Example: Ordinary least squares (OLS) estimator θ̂(Y)

θ̂(Y) = arg min_θ Σ_{j=1}^n {Yj − f(tj, θ)}²

• For a particular data set, the OLS estimate θ̂

θ̂ = arg min_θ Σ_{j=1}^n {yj − f(tj, θ)}²

41
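As an illustration of computing the OLS estimate for the one-compartment mean function (an editorial sketch, not from the slides), one can minimize the residual sum of squares numerically; scipy.optimize.least_squares is one convenient choice, and the data values here are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def f(t, theta, F=1.0, D=4.0):
    ka, ke, V = theta
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# Hypothetical observed concentrations y_j at times t_j
t_obs = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])
y_obs = np.array([2.8, 4.6, 6.9, 8.0, 7.5, 6.7, 5.9, 4.4, 2.9, 1.9])

# OLS: minimize sum_j {y_j - f(t_j, theta)}^2 over theta = (ka, ke, V)
fit = least_squares(lambda th: y_obs - f(t_obs, th), x0=[1.0, 0.1, 0.5])
theta_hat = fit.x
print("OLS estimate (ka, ke, V):", np.round(theta_hat, 3))
```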

Page 43: An Introduction and Overview - 123.physics.ucdavis.edu

Remark: Distinction between estimator and estimate routinely abused

Convention: Emphasis that estimator is a function of Y usually

suppressed (write ψ̂(Y) as ψ̂)

Question: How “good” is ψ̂(Y) as an estimator?

• A question about the procedure

Key idea: An estimator is a function of Y ⇒ for any ψ, ψ̂ has a

probability distribution (depending on that of Y and hence on ψ)

• Each realization of Y yields a value – sample space for ψ̂

• “All possible” ψ̂ values from “all possible data sets,” of which we

observe only one

• Large variance – another realization might give very different result

⇒ lots of uncertainty

• Small variance – similar answer from another realization ⇒ mild

uncertainty

42

Page 44: An Introduction and Overview - 123.physics.ucdavis.edu

Sampling distribution: The probability distribution of an estimator

• Properties characterize uncertainty in estimation procedure

(estimator)

• Unbiased estimator – E_ψ(ψ̂) = ψ

• Sampling covariance matrix var_ψ(ψ̂)

• Sampling variance and standard error for kth component

var_ψ(ψ̂k), √var_ψ(ψ̂k)

• Key factors determining sampling variance

– Variance of Y (may be out of our control)

– Sample size n (often under our control)

43

Page 45: An Introduction and Overview - 123.physics.ucdavis.edu

(Very) simple example: f(t, θ) = θ1 + θ2t in model

Y ∼ Nn{f(θ), σ²In}, ψ = (θ^T, σ²)^T ⇒ f(θ) = Xθ

• OLS estimator θ̂(Y) = (X^T X)^{−1} X^T Y

• Unbiased – E_ψ(θ̂) = θ, Sampling covariance

var_ψ(θ̂) = σ²(X^T X)^{−1} = σ² n^{−1} (n^{−1} X^T X)^{−1}

– Depends on σ² and n, var(θ̂k) = σ²(X^T X)^{−1}_{kk}

• Sampling distribution – θ̂ ∼ N{θ, σ²(X^T X)^{−1}}

Absolute necessity: Report of estimate should always be

accompanied by estimate of standard error

σ̂² = (n − p)^{−1} Σ_{j=1}^n {Yj − f(tj, θ̂)}² ⇒ SE(θ̂k) = √{σ̂²(X^T X)^{−1}_{kk}}

• If θ̂2 = 2.0 and SE(θ̂2) = 3.0 ⇒ pretty uncertain

• If θ̂2 = 2.0 and SE(θ̂2) = 0.03 ⇒ feeling pretty good!

44
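For the simple linear case f(t, θ) = θ1 + θ2 t, the estimator, its sampling covariance, and the standard errors have closed forms; the following Python sketch (an editorial illustration with made-up data) computes them directly.

```python
import numpy as np

# Hypothetical data (t_j, y_j) for the straight-line model f(t, theta) = theta1 + theta2 * t
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.3, 13.8, 16.1])

X = np.column_stack([np.ones_like(t), t])           # design matrix
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate (X'X)^{-1} X'y

n, p = X.shape
resid = y - X @ theta_hat
sigma2_hat = resid @ resid / (n - p)                # estimate of sigma^2
cov_hat = sigma2_hat * np.linalg.inv(X.T @ X)       # estimated sampling covariance
se = np.sqrt(np.diag(cov_hat))                      # standard errors

print("theta_hat:", np.round(theta_hat, 3))
print("SE:", np.round(se, 3))
```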

Page 46: An Introduction and Overview - 123.physics.ucdavis.edu

Confidence interval: Probability statement to refine statement of

uncertainty

• Linear example: T = (θ̂k − θk)/SE(θ̂k) ∼ t_{n−2} (“Student’s t_{n−2} dist’n”)

• If t_{1−α/2} is such that P(T ≥ t_{1−α/2}) = α/2, then

P{θ̂2 − t_{1−α/2} SE(θ̂2) ≤ θ2 ≤ θ̂2 + t_{1−α/2} SE(θ̂2)} = 1 − α

• A probability pertaining to the sampling distribution of θ̂2: For “all

possible realizations of Y of size n,” probability is 1 − α that the

endpoints of [ θ̂2 − t_{1−α/2} SE(θ̂2) , θ̂2 + t_{1−α/2} SE(θ̂2) ] include

(the fixed value) θ2

• Provides more information than just SE: how “large” or “small” SE

must be relative to θ2 to feel “confident” that procedure of data

generation and estimation provides a reliable understanding

45

Page 47: An Introduction and Overview - 123.physics.ucdavis.edu

Confidence interval: A probability statement about the procedure by

which an estimator is constructed from a realization of Y

• Interpretation: “For all possible realizations of Y of size n, if we

were to calculate the interval according to the estimation procedure,

100(1 − α)% of such intervals would ‘cover’ θ2”

• α is chosen by the analyst; e.g. α = 0.05

Example:

• θ̂2 = 2.0, SE(θ̂2) = 3.0 gives [−3.88, 7.88] ⇒ no confidence

• θ̂2 = 2.0, SE(θ̂2) = 0.03 gives [1.94, 2.06] ⇒ feeling pretty

confident!

Warning: The numerical values themselves are meaningless except for the impression they give about the quality of the procedure

• Wrong: The probability is 1− α that θ2 is between −3.88 and 7.88.

• What we CAN say: We are 100(1 − α)% “confident” that intervals

constructed this way would “cover” θ2

46
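Continuing the linear sketch above (editorial, not from the slides), a t-based confidence interval for a single coefficient can be computed from the estimate and its standard error; here α = 0.05, and the degrees of freedom are the residual n − p (the values below are assumptions for illustration).

```python
import numpy as np
from scipy import stats

# Inputs from an OLS fit (illustrative values): estimate, standard error, residual df
theta2_hat, se_theta2, df = 2.0, 0.03, 8
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df)          # t_{1-alpha/2} quantile
ci = (theta2_hat - t_crit * se_theta2, theta2_hat + t_crit * se_theta2)
print(f"{100*(1-alpha):.0f}% CI for theta2: [{ci[0]:.3f}, {ci[1]:.3f}]")
```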

Page 48: An Introduction and Overview - 123.physics.ucdavis.edu

Parametric statistical model: A statistical model in which the

probability distribution is completely specified, e.g.,

Y ∼ Nn{f(θ), σ²In} for subject 12

• ⇒ All possible probability distributions are multivariate normal with

this mean and covariance matrix (over all ψ)

• If we believe this, know everything

Semiparametric statistical model: A statistical model in which

the probability distribution of Y is only partially specified, e.g.,

E(Y ) = f(θ), var(Y ) = σ2In

• All we’re willing to say

• Larger class of possible probability distributions

Trade-off: Fewer assumptions ⇔ protection if wrong ⇔ could be “too

broad”

47

Page 49: An Introduction and Overview - 123.physics.ucdavis.edu

Maximum likelihood estimation: Most popular approach for

parametric models

The likelihood function: Suppose p(y |ψ) is the pmf/pdf for the

sample Y under the assumed parametric model

• The notation emphasizes that p(y) is indexed by ψ (and is a

function of y for fixed ψ)

• Given that Y=y is observed, the likelihood function of ψ defined by

L(ψ | y) = p(y |ψ)

• A function of ψ for fixed y

Discrete case: Suppose ψ1 and ψ2 are two possible values for ψ with

P(Y = y|ψ1) = L(ψ1 | y) > L(ψ2 | y) = P (Y = y|ψ2)

• The y we actually observed is more likely to have occurred if ψ = ψ1 than

if ψ = ψ2 ⇒ ψ1 is “more plausible”

48

Page 50: An Introduction and Overview - 123.physics.ucdavis.edu

Maximum likelihood estimator (MLE): For each y ∈ Y, let ψ̂(y) be a parameter value where L(ψ | y) attains its maximum as a function

of ψ, with y held fixed. A maximum likelihood estimator for ψ is ψ̂(Y)

• Intuitively reasonable – MLE is the parameter value for which the

observed y is “most likely”

• Has certain optimality properties under the assumption that the

specified probability model is correct

– Invariance

– “Most precise” (≈ smallest sampling variance for n “large”)

Example: Y ∼ Nn{f(θ), σ²In}

• OLS estimator is the MLE

In general: For complex statistical models, likelihood function is

complex function of ψ ⇒ computational issues

49

Page 51: An Introduction and Overview - 123.physics.ucdavis.edu

Remarks:

• Identifiability of statistical models. A parameter ψ for a family of

probability distributions is identifiable if distinct values of ψ correspond to

distinct pmf/pdfs. I.e., if the pmf/pdf for Y is p(y |ψ) under the model,

then

ψ ≠ ψ′ ⇔ p(y |ψ) is not the same function of y as p(y |ψ′)

• A feature of the model, NOT of an estimator or estimation procedure

• Difficulty with nonidentifiability in inference. Observations governed by

p(y |ψ) and p(y |ψ′) look the same ⇒ no way to know if ψ or ψ′ is the

true value (same likelihood function value)

• How to fix nonidentifiability. Revise model, impose constraints, make

assumptions

• Limitations of data. Even if a model is identifiable, without sufficient

data, it may be practically impossible to estimate some parameters,

because the information needed is not available in the data

50

Page 52: An Introduction and Overview - 123.physics.ucdavis.edu

Sampling distribution for estimators in complex models:

• Not possible to derive a closed form expression for ψ̂(Y) ⇒ derivation of exact sampling distribution intractable

• In semiparametric models – no distribution assumption

• Approximation – “large sample (asymptotic) theory”

(such as n →∞)

For most “regular” estimators: ψ̂n (p × 1)

• ψ̂n is consistent (“approaches” true ψ in probabilistic sense as

n → ∞)

• ψ̂n has an “approximate” sampling distribution for n → ∞:

n^{1/2}(ψ̂n − ψ) ∼̇ Np(0, C) or ψ̂n ∼̇ Np(ψ, n^{−1}Cn), C = lim_{n→∞} Cn

• Comparison of competing estimators – compare C, “asymptotic

relative efficiency”

51

Page 53: An Introduction and Overview - 123.physics.ucdavis.edu

5. Modeling/inference: Independent data

Recall subject 12: Assumed correlation over time negligible

Y ∼ Nn{f(θ), σ²In} ⇒ E(Yj) = f(tj, θ), var(Yj) = σ², Yj ‖ Yj′

• Assumed particular structure Yj = f(tj, θ) + εj with

εj = ε1j + ε2j

(overall deviation = measurement error + “fluctuation”)

and σ² = σ₁² + σ₂²

• OLS estimator is the MLE for θ if we believe all of this

• σ̂² = (n − p)^{−1} Σ_{j=1}^n {Yj − f(tj, θ̂)}²

• Do we believe? Can we check?

52

Page 54: An Introduction and Overview - 123.physics.ucdavis.edu

εj = Yj − f(tj, θ)

⇒ Plot “residuals” rj = Yj − f(tj, θ̂) vs. “predicted values” f(tj, θ̂) or

time tj

Informal impression:

• Magnitude of rj/σ̂ increases with f(tj, θ̂)

• Roughly symmetric about 0

• No apparent pattern with time

Suggestion: Independence and normality may be okay, but σ2

increases with “inherent” concentration level

• “Fluctuations” and/or measurement deviations increase with

inherent concentration

53

Page 55: An Introduction and Overview - 123.physics.ucdavis.edu

rj/σ̂ vs. f(tj, θ̂):

[Figure: standardized residuals rj/σ̂ vs. predicted values f(tj, θ̂) for Subject 12; residuals range roughly from −1.5 to 1.5 over predicted values 0 to 12]

54

Page 56: An Introduction and Overview - 123.physics.ucdavis.edu

Subject-matter knowledge – Assay error is dominant source of

variation ⇒ εj ≈ ε1j

• Error in determining concentrations well-known to be greater for

higher concentrations ⇒ suggests

var(εj) = V{f(tj, θ), γ}, V(u, γ) ↑ in u

• E.g., popular choice of V – V{f(tj, θ), γ} = σ² f^{2γ}(tj, θ)

• γ = 1 ⇒ constant CV σ – “Multiplicative error”

Yj = f(tj, θ)(1 + δj), E(δj) = 0, var(δj) = σ², εj = f(tj, θ)δj

⇒ Yj ∼ normal, lognormal, . . .

• Could include fluctuations var(εj) = V(tj, γ) + σ₂²

Result: ψ = (θ^T, σ², γ)^T

• Even though θ is of central interest, must get the entire probability

model correct

55

Page 57: An Introduction and Overview - 123.physics.ucdavis.edu

Inference: MLE or something else

• Yj ∼ N[f(tj, θ), V{f(tj, θ), γ}] – MLE is NOT OLS – OLS arises

from constant variance (over j) assumption

• Using OLS anyway leads to inefficient inference on θ (and can be

biased for small n)

• Problem with MLE – if true distribution is NOT normal, inference

compromised; e.g., sensitive to outliers

• Better approach – generalized least squares (GLS) estimator (sort of

like weighted least squares); solve

Σ_{j=1}^n V^{−1}{f(tj, θ), γ} {Yj − f(tj, θ)} (∂/∂θ) f(tj, θ) = 0

NOT the same as minimizing Σ_{j=1}^n {Yj − f(tj, θ)}² / V{f(tj, θ), γ}

(Also need estimator for γ. . . )

56
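One common way to solve the GLS estimating equation is to iterate between updating θ by weighted least squares with weights 1/V{f(tj, θ), γ} held fixed at the current fit and recomputing the weights. The Python sketch below (editorial; γ is treated as known here for simplicity, although the slide notes it must also be estimated) illustrates that iteration; function names, data values, and starting values are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def f(t, theta, F=1.0, D=4.0):
    ka, ke, V = theta
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def gls_fit(t, y, theta0, gamma=1.0, n_iter=5):
    """Iteratively reweighted least squares with variance model sigma^2 * f^{2*gamma}."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        weights = 1.0 / np.maximum(f(t, theta), 1e-8) ** (2 * gamma)  # 1 / V{f(t_j, theta), gamma}
        # Minimizing the weighted sum of squares with the weights held FIXED;
        # at convergence this solves the GLS estimating equation
        res = least_squares(lambda th: np.sqrt(weights) * (y - f(t, th)), theta)
        theta = res.x
    return theta

t_obs = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])
y_obs = np.array([2.8, 4.6, 6.9, 8.0, 7.5, 6.7, 5.9, 4.4, 2.9, 1.9])
print(np.round(gls_fit(t_obs, y_obs, theta0=[1.0, 0.1, 0.5]), 3))
```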

Page 58: An Introduction and Overview - 123.physics.ucdavis.edu

Remarks:

• In large samples, GLS estimator is the “best” estimator if only

willing to assume mean and variance (semiparametric model)

• If, in addition, normality is correct, MLE is “better,” but can be led

astray if assumption of normality is wrong, GLS is “safer”

• Given a means to calculate f(t, θ) at each tj , straightforward to

implement (including estimating γ)

• If assumption of negligible correlation is incorrect, all of this is

suspect – need fancier techniques

• Note that must estimate all components of ψ, including those not of

central interest – efficiency of estimation of θ will depend on that of

estimators for “nuisance parameters”

57

Page 59: An Introduction and Overview - 123.physics.ucdavis.edu

Take-away messages:

• Must consider a statistical model that embodies realistic

assumptions

• The estimation procedure is dictated by the statistical model

• Must estimate all parameters, not only those of interest

• OLS is OFTEN NOT the appropriate thing to do!

58

Page 60: An Introduction and Overview - 123.physics.ucdavis.edu

6. Hierarchical statistical models and inference

Recall objective 2: 12 subjects drawn from population of interest,

θ = (ka, ke, V )T

• Learn about θ values in the population (can we be more specific?)

• Must acknowledge measurement error, etc, in data that would come

from each subject

• Moreover, want to learn about whole population from a sample of

only 12 subjects

Required: A statistical model that reflects the data-generating

mechanism

• Sample m subjects from the population

• For the ith subject, ascertain concentration at each of ni time

points (could be different for each subject i)

59

Page 61: An Introduction and Overview - 123.physics.ucdavis.edu

All 12 subjects:

[Figure: theophylline concentration (mg/L) vs. time (hr), 0–25 hr, all 12 subjects]

60

Page 62: An Introduction and Overview - 123.physics.ucdavis.edu

Formalization: “Learn about θ values in the population”

• Conceptualize the (infinitely-large) population as all possible θ (one

for each subject) – how θs would “turn out” if we sampled subjects

• ⇒ Represent by a (joint) probability distribution for θ (with mean,

covariance matrix, etc)

• ⇒ “Average value of θ” (mean), “variability of ka, ke, V ” (diagonal

elements of covariance matrix), “associations between PK

processes” (off-diagonal elements)

Thus: Think of potential θ values if we draw m subjects independently

as independent random vectors θi, each with this probability distribution,

e.g.,

θi ∼ Np(θ∗, D), i = 1, . . . , m

• When we actually do this, we obtain realizations of θi, i = 1, . . . , m

• Objective, formalized – Estimate θ∗ and D

61

Page 63: An Introduction and Overview - 123.physics.ucdavis.edu

Data for subject i: Given θi

• Concentrations arise from an assumed mechanism for an individual

subject as before, e.g.

Yij = f(tij , θi) + εij , j = 1, . . . , ni

ti = (ti1, . . . , ti,ni)^T are the time points at which i would be

observed, f(ti, θi) = {f(ti1, θi), . . . , f(ti,ni, θi)}^T

• Yi = (Yi1, . . . , Yi,ni)^T random vector of observations from i

• Same considerations for εij as before, e.g., εij = ε1ij + ε2ij

Result: Specify a family of conditional probability distributions for

Yi|θi, e.g.,

Yi|θi ∼ Nni{f(ti, θi), σ²Ini}

• E(Yi|θi) = f(ti, θi), var(Yi|θi) = σ²Ini

• Taking σ2 the same for all i reflects belief that assay error is similar

for blood samples from any subject

62
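To illustrate the two-stage structure (an editorial sketch, not from the slides), the snippet below first draws subject-specific θi from a multivariate normal population model and then generates each subject's concentrations given θi. All population values are hypothetical, and the parameters are sampled on the log scale here simply to keep ka, ke, V positive, which is a choice of the sketch rather than the slides' exact specification.

```python
import numpy as np

def f(t, theta, F=1.0, D=4.0):
    ka, ke, V = theta
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

rng = np.random.default_rng(2)
m, sigma = 12, 0.5
t_i = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])

# Stage 2: population model, log theta_i ~ N(theta_star, D) (log scale keeps parameters positive)
theta_star = np.log([1.5, 0.08, 0.5])
D = np.diag([0.1, 0.05, 0.05])

data = []
for i in range(m):
    theta_i = np.exp(rng.multivariate_normal(theta_star, D))   # subject-specific (ka, ke, V)
    # Stage 1: Y_i | theta_i ~ N{f(t_i, theta_i), sigma^2 I}
    y_i = f(t_i, theta_i) + rng.normal(0.0, sigma, size=t_i.size)
    data.append(y_i)

print(np.round(data[0], 2))   # concentrations for the first simulated subject
```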

Page 64: An Introduction and Overview - 123.physics.ucdavis.edu

Or a fancier model:

Yi|θi ∼ Nni{f(ti, θi), σ₁²Ri(γ) + σ₂²Γi(φ)}

Ri(γ) = diag{f^{2γ}(ti1, θi), . . . , f^{2γ}(ti,ni, θi)}, (ni × ni)

{Γi(φ)}jj = 1, {Γi(φ)}jj′ = exp{−φ(tij − tij′)²}, (ni × ni)

• Common σ₁², σ₂², γ, φ across subjects reflects assumption of similar

pattern of variation due to each source, could be modified

• So, e.g., E(Yij|θi) = f(tij, θi), var(Yij|θi) = σ₁² f^{2γ}(tij, θi) + σ₂²

63

Page 65: An Introduction and Overview - 123.physics.ucdavis.edu

Assumptions: Y = (Y1^T, . . . , Ym^T)^T and θ = (θ1^T, . . . , θm^T)^T

• Assume θi independent ⇒ θi are exchangeable (all permutations of

θ1, . . . , θm have the same joint probability distribution)

• Also assume Yi are ‖ of each other, and Yi ‖ θi′, i′ ≠ i

• Thus, joint pmf/pdf of (Y T , θT ) is

p(y, θ) = ∏_{i=1}^m p(yi, θi)

Hierarchical statistical model: Combining leads to a two-stage

hierarchy

1. Assumption for family of conditional probability distributions for

Yi|θi ⇒ pmf/pdf p(yi|θi) (ni-dimensional, depending on ti)

2. Assumption for probability distribution for θi ⇒ pmf/pdf p(θi) [same for all i; e.g., Np(θ∗, D)]

64

Page 66: An Introduction and Overview - 123.physics.ucdavis.edu

p(y, θ) = ∏_{i=1}^m p(yi, θi)

Remarks:

• Model in terms of parameters ψ = {θ∗^T, vech(D)^T, σ₁², σ₂², γ, φ}^T,

vech(D) = vector of distinct elements of D

• Note that p(yi|θi)p(θi) = p(yi, θi) for each i by definition of

conditional pmf/pdf ⇒ ∏_{i=1}^m p(yi, θi) = ∏_{i=1}^m p(yi|θi) p(θi)

• The model contains observable and unobservable random

components – Yi, i = 1, . . . , m, are observed, θi are not – but both

are required in the formulation to reflect all sources of variation

• Because θi are not observed, would like the probability distribution

of Y alone. . .

65

Page 67: An Introduction and Overview - 123.physics.ucdavis.edu

Result: (Marginal) probability distribution for observable random

vector Y has pmf/pdf

p(y) = ∫ p(y, θ) dθ = ∫ ∏_{i=1}^m p(yi, θi) dθ = ∏_{i=1}^m ∫ p(yi|θi) p(θi) dθi

= ∏_{i=1}^m ∫ p(yi|θi; σ₁², σ₂², γ, φ) p(θi; θ∗, D) dθi (3)

where (3) highlights dependence on components of ψ

Objective: Find estimator for ψ

66

Page 68: An Introduction and Overview - 123.physics.ucdavis.edu

Natural approach: Maximum likelihood

L(ψ|y) = ∏_{i=1}^m ∫ p(yi|θi; σ₁², σ₂², γ, φ) p(θi; θ∗, D) dθi

• MLE maximizes L(ψ|y) in ψ

• Complication – intractable p-dimensional integration

– Quadrature or variant – bad in high dimensions

– Approximate integral for large ni

– Other approaches. . .

Model refinement: θi aren’t exchangeable

• For example, ke is associated with weight, e.g.,

E(kei|Wi = wi) = β1 + β2wi for weight Wi

• In general θi|Wi ∼ Np{h(β, Wi), D}

67
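Each factor of this likelihood involves an integral over θi that generally has no closed form. A crude way to see what is being computed is simple Monte Carlo integration over the θi distribution, sketched below in Python (an editorial illustration; in higher dimensions or with peaked integrands, quadrature, analytic approximations, or importance sampling are used instead, as the slide indicates). The data, population values, and log-scale parameterization are assumptions of the sketch.

```python
import numpy as np

def f(t, theta, F=1.0, D=4.0):
    ka, ke, V = theta
    return ka * F * D / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

def marginal_likelihood_i(y_i, t_i, theta_star, D_mat, sigma, n_draws=20_000, seed=3):
    """Monte Carlo estimate of integral p(y_i | theta_i) p(theta_i) d theta_i:
    average p(y_i | theta_i) over draws of theta_i from its population distribution."""
    rng = np.random.default_rng(seed)
    draws = np.exp(rng.multivariate_normal(theta_star, D_mat, size=n_draws))  # log-scale draws
    like = np.empty(n_draws)
    for k, th in enumerate(draws):
        resid = y_i - f(t_i, th)
        like[k] = np.exp(-0.5 * np.sum(resid**2) / sigma**2) / (np.sqrt(2 * np.pi) * sigma) ** len(y_i)
    return like.mean()

t_i = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 18.0, 24.0])
y_i = np.array([2.8, 4.6, 6.9, 8.0, 7.5, 6.7, 5.9, 4.4, 2.9, 1.9])
print(marginal_likelihood_i(y_i, t_i, np.log([1.5, 0.08, 0.5]), np.diag([0.1, 0.05, 0.05]), sigma=0.5))
```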

Page 69: An Introduction and Overview - 123.physics.ucdavis.edu

Sampling distribution for MLE ψ̂: Usual approach

• Approximate for m →∞, ni fixed

• or m → ∞, min ni → ∞

Software:

1. SAS proc nlmixed, macro nlinmix

(http://www.sas.com/rnd/app/da/new/danlmm.html,

http://ftp.sas.com/techsup/download/stat/nlmm800.html)

2. Splus nlme() function (http://nlme.stat.wisc.edu/)

3. NONMEM, SAS macro nlmem

(http://c255.ucsf.edu/nonmem0.html,

http://www-personal.umich.edu/∼agalecki/)

Drawbacks – 1, 2 require user to provide forward solution; 3 has certain

(but not arbitrary) systems hardwired, all can be flaky for

high-dimensional θ

68

Page 70: An Introduction and Overview - 123.physics.ucdavis.edu

Other examples:

• (From Bedaux and Kooijman, 1994) – Sample of m insects exposed

to cadmium-contaminated food until equilibrium, then stopped –

for Cd concentration in the post-food period (starting at t = 0),

Q̇(t) = −k Q(t), Q(0) = Q0, θ = (k, Q0)^T

for each insect

– Data collection – at each of 0 ≤ t1 < · · · < tm, a single insect is

selected, Cd concentration measured (sacrifice the insect)

– Each insect gives rise to a single observation Yi (ni = 1)

– Each insect has its own θ = (k, Q0)^T governing Cd kinetics ⇒ insect i has θi = (ki, Q0i)^T

– Hierarchical model

Yi = f(ti, θi) + εi ⇒ p(yi|θi), θi ∼ p(θi)

– Observable quantities – Y = (Y1, . . . , Ym)T ⇒ p(yi)

69

Page 71: An Introduction and Overview - 123.physics.ucdavis.edu

Other examples:

• (From Bortz, 2002) – In vitro experiment involving m “identical”

cultures of HIV retrovirus strain – interest in process of culture

growth over time, complex 4-dim system

Ẋ(t) = g{t, X(t), θ}

– Data collection – at each of 0 ≤ t1 < · · · < tm, a dish is chosen

(at random) and total number of cells measured (destroy the

dish)

– Although ideally, all dishes should give rise to identical growth

process, subtle variation in conditions for each dish ⇒ different

processes across dishes

– ⇒ Each dish has its own θi

– Same hierarchical model as for insects

70

Page 72: An Introduction and Overview - 123.physics.ucdavis.edu

7. Statistical inference II

So far: We have considered the classical, frequentist approach to

statistical inference

• Objective – Estimate parameters in a suitable statistical model,

where the parameters are regarded as fixed quantities

• Theme – think of what would happen across repeated samples from

the probability distribution dictated by the model (“all possible

samples we could have ended up with”)

• ⇒ The likelihood function and sampling distribution are the

cornerstones

Not surprisingly: As in many disciplines, there is a competing point

of view and philosophical debate. . .

71

Page 73: An Introduction and Overview - 123.physics.ucdavis.edu

Bayesian inference: Basic idea

• Treat parameters like ψ as random vectors

• Make inference on ψ in terms of probability statements about ψ

Fundamental elements: In the Bayesian approach

• We still specify a probability distribution describing the “data

generating mechanism” ⇒ p(y|ψ), now viewed as a conditional

pmf/pdf

• Also specify a pmf/pdf for the prior distribution of ψ ⇒ p(ψ)

• Statistical inference is based on implied posterior distribution with

pmf/pdf p(ψ|y) ⇒ the conditional pmf/pdf for ψ given the data

• The mode of this distribution is used as the “estimate”

• “Uncertainty” about this estimate is characterized by the entire

posterior distribution (e.g., its variance)

72

Page 74: An Introduction and Overview - 123.physics.ucdavis.edu

Rationale and advantages:

• Can formally incorporate prior belief/opinion, knowledge, or info on

ψ through specification of the prior distribution

• Can use the prior distribution to impose constraints on plausible

values for ψ

• Interpretation can be easier than for some frequentist constructs

(e.g., confidence intervals)

Possible disadvantages:

• Sensitivity of inferences to choice of prior?

• “Who cares what you believe/think?” (frequentist criticism)

73

Page 75: An Introduction and Overview - 123.physics.ucdavis.edu

Bayes’ theorem: Cornerstone of Bayesian inference

p(ψ | y) = p(y |ψ) p(ψ) / p(y) = p(y |ψ) p(ψ) / ∫ p(y |ψ) p(ψ) dψ

so that the posterior pmf/pdf satisfies p(ψ | y) ∝ p(y |ψ)p(ψ)

• If ψ is a scalar, use Bayes theorem directly

• If ψ is multidimensional, use the marginal posterior pmf/pdf

• For example, ψ = (ψ1^T, ψ2^T)^T, ψ1 = parameter of interest (e.g., θ),

ψ2 = “nuisance parameters”

p(ψ1, ψ2 | y) ∝ p(y |ψ1, ψ2)p(ψ1, ψ2)

and the marginal posterior for ψ1 is found by averaging over ψ2

p(ψ1 | y) = ∫ p(ψ1, ψ2 | y) dψ2 = ∫ p(ψ1 |ψ2, y) p(ψ2 | y) dψ2 ⇒ p(ψ1 | y) is a mixture

74

Page 76: An Introduction and Overview - 123.physics.ucdavis.edu

p(ψ1 | y) = ∫ p(ψ1, ψ2 | y) dψ2

Issue: Finding the marginal posterior involves potentially

high-dimensional integration (more in a moment. . . )

Data generating model/likelihood specification: Same as before

Prior specification: Must choose a probability distribution and values

of its parameters that reflect prior belief/knowledge/information about

components of ψ

• Prior elicitation from “experts”

• Description of historical data, results from literature

• Sometimes criticized for focus on (analytical or computational)

convenience

75

Page 77: An Introduction and Overview - 123.physics.ucdavis.edu

Simple example: Yj independent, Yj ∼ N(θ1 + θ2 tj, σ²), Y = (Y1, . . . , Yn)^T ⇒

Y ∼ Nn(Xθ, σ²In) so that p(y |ψ) is a normal pdf, ψ = θ

(σ2 known)

• Choose prior ψ ∼ N(Tψ∗, G) ⇒ p(ψ) is a normal pdf

• Posterior pdf p(ψ | y) may be shown analytically to be that of

N{A(σ^{−2}X^T y + G^{−1}Tψ∗), A}, A = (σ^{−2}X^T X + G^{−1})^{−1}

• Posterior mode (= posterior mean due to symmetry) is

A(σ^{−2}X^T y + G^{−1}Tψ∗)

which depends on the actual data observed, y, and on ψ∗, G characterizing the prior

• Would need to specify ψ∗ and G

76
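The closed-form normal posterior above is easy to compute directly. The Python sketch below (editorial, with made-up data, prior mean, and prior covariance, and with T taken as the identity for illustration) evaluates the posterior mean/mode and the covariance A.

```python
import numpy as np

# Hypothetical data for the straight-line model, with sigma^2 treated as known
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.3, 13.8, 16.1])
sigma2 = 0.25

X = np.column_stack([np.ones_like(t), t])
T = np.eye(2)                       # illustrative choice: prior mean is T @ psi_star
psi_star = np.array([0.0, 1.0])     # prior mean (assumed)
G = np.diag([10.0, 10.0])           # prior covariance (assumed, fairly diffuse)

# Posterior: N{ A (sigma^{-2} X'y + G^{-1} T psi_star), A }, A = (sigma^{-2} X'X + G^{-1})^{-1}
A = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(G))
post_mean = A @ (X.T @ y / sigma2 + np.linalg.inv(G) @ (T @ psi_star))
print("posterior mean/mode:", np.round(post_mean, 3))
print("posterior covariance:\n", np.round(A, 4))
```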

Page 78: An Introduction and Overview - 123.physics.ucdavis.edu

Simple example, continued: In fact, if we take G^{−1} = 0 ⇒ “noninformative prior,” posterior mode becomes

(X^T X)^{−1}X^T y = θ̂_OLS

and posterior distribution is

N{θ̂_OLS, σ²(X^T X)^{−1}}

• Same estimate as using frequentist inference

• Compare to frequentist sampling distribution

θ̂_OLS | θ ∼ N{θ, σ²(X^T X)^{−1}}

Of course: Most problems are not this nice and yield different

approach from frequentist

• Nuisance parameters, intractable integrals, etc

• Luckily, there is a natural computational strategy. . .

77

Page 79: An Introduction and Overview - 123.physics.ucdavis.edu

“Bayesian confidence intervals:” Credible interval or set C

• For chosen α, C satisfies

1 − α = P(ψ ∈ C | y) = ∫_C p(ψ | y) dψ

• “The probability that ψ is in C, given the observed data y, is 1−α”

78

Page 80: An Introduction and Overview - 123.physics.ucdavis.edu

Reconciling frequentist and Bayesian approaches:

• Bayesian approach often leads to procedures with “good” properties

in the frequentist sense (provided that prior distributions introduce

only weak information)

• In fact, lead to same procedure for some statistical models!

• Under these conditions, qualitatively similar inferences

• Adopting a Bayesian perspective allows use of computational

strategies in problems that would be intractable using frequentist

tools (e.g., hierarchical models with high-dimensional integration)

• Bayesian approach provides natural way to impose constraints,

exploit previous knowledge

79

Page 81: An Introduction and Overview - 123.physics.ucdavis.edu

8. Hierarchical models and Bayesian inference

Recall: Theophylline example Yi = (Yi1, . . . , Yini)T , i = 1, . . . , m

p(yi | θi, σ²) : Yij = f(tij, θi) + εij ⇒ Yi|θi ∼ Nni{f(ti, θi), σ²Ini}

p(θi | θ∗, D) : θi ∼ Np(θ∗, D)

• σ2, θ∗, vech(D) are now random variables/vectors

• p(θi | θ∗, D) is sometimes called the prior

• ψ = {θ∗^T, vech(D)^T, σ²}^T

• Interested in θ∗, D

To be Bayesian: Need to specify a joint prior distribution for ψ

• Often called the hyperprior in this context, with hyperparameters

given by the analyst

80

Page 82: An Introduction and Overview - 123.physics.ucdavis.edu

Popular specification: Take θ∗, D, σ2 independent with

θ∗ ∼ Np(δ, G), D^{−1} ∼ Wishart{(ρD∗)^{−1}, ρ}, σ^{−2} ∼ Gamma(ν/2, ντ/2)

• Hyperparameters δ, G, ρ, D∗, ν, τ

Posterior distribution for θ∗: High-dimensional integration

p(θ∗|y) = ∫∫ [∏_{i=1}^m ∫ p(yi|θi, σ²) p(θi|θ∗, D) dθi] p(θ∗|δ, G) p(D|ρ, D∗) p(σ²|ν, τ) dD dσ²

divided by

∫∫∫ [∏_{i=1}^m ∫ p(yi|θi, σ²) p(θi|θ∗, D) dθi] p(θ∗|δ, G) p(D|ρ, D∗) p(σ²|ν, τ) dD dσ² dθ∗

• Yuck

Fortunately: It is possible to “do” these integrals by simulation, and

end up with a sample from the desired marginal posterior, from which it

may be approximated

• Markov chain Monte Carlo (MCMC) techniques

81

Page 83: An Introduction and Overview - 123.physics.ucdavis.edu

Basic idea: Gibbs sampling

• For random variables U = (U1, . . . , UK), given Y , the full

conditional distributions pk(uk | uℓ, ℓ ≠ k, y) completely determine the

joint dist’n p(u1, . . . , uK |y) and hence the marginals pk(uk|y)

• Algorithm: Start with u1^(0), . . . , uK^(0)

Draw U1^(1) ∼ p1(u1 | u2^(0), . . . , uK^(0), y)

Draw U2^(1) ∼ p2(u2 | u1^(1), u3^(0), . . . , uK^(0), y)

...

Draw UK^(1) ∼ pK(uK | u1^(1), . . . , u_{K−1}^(1), y)

• Usefulness – Can show (U1^(t), . . . , UK^(t)) ∼̇ p(u1, . . . , uK |y) as t → ∞

⇒ can generate samples from the joint posterior (and hence get

samples from marginal posteriors) by sampling from full conditionals!

• For complex models, actually need fancier algorithm – FULL

DETAILS ON MONDAY AFTERNOON!

82
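To make the Gibbs recipe concrete (an editorial sketch, not from the slides), here is a tiny sampler for a case where the full conditionals are known exactly: a bivariate normal target with correlation ρ, for which each full conditional is univariate normal. The target and ρ value are purely illustrative.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, seed=4):
    """Gibbs sampling for (U1, U2) ~ N2(0, [[1, rho], [rho, 1]]):
    each full conditional is U_k | U_other ~ N(rho * U_other, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    u1, u2 = 0.0, 0.0                      # arbitrary starting values u^(0)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        u1 = rng.normal(rho * u2, np.sqrt(1 - rho**2))   # draw U1 given current U2
        u2 = rng.normal(rho * u1, np.sqrt(1 - rho**2))   # draw U2 given updated U1
        draws[t] = (u1, u2)
    return draws[burn_in:]                 # discard early iterates ("burn-in")

sample = gibbs_bivariate_normal(rho=0.8)
print("sample correlation:", np.corrcoef(sample.T)[0, 1])   # should be near 0.8
```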

Page 84: An Introduction and Overview - 123.physics.ucdavis.edu

Full conditional distributions for the theophylline model:

(θ∗ | σ², D, θi, i = 1, . . . , m, y) ∼ Np{U(mD^{−1}θ̄ + G^{−1}δ), U}

(D^{−1} | σ², θ∗, θi, i = 1, . . . , m, y) ∼ Wishart{(S + ρD∗)^{−1}, m + ρ}

(σ^{−2} | D, θ∗, θi, i = 1, . . . , m, y) ∼ Gamma{(ν + N)/2, (J + ντ)/2}

p(θi | σ², D, θ∗, θℓ, ℓ ≠ i, y) ∝ exp(−σ^{−2}J/2 − Q/2) (4)

• Except for (4), it is straightforward to generate samples from these

distributions

• To deal with (4), require fancier approach – MONDAY

AFTERNOON

U = (mD^{−1} + G^{−1})^{−1}, θ̄ = m^{−1} Σ_{i=1}^m θi, S = Σ_{i=1}^m (θi − θ∗)(θi − θ∗)^T,

J = Σ_{i=1}^m {yi − f(ti, θi)}^T {yi − f(ti, θi)}, Q = Σ_{i=1}^m (θi − θ∗)^T D^{−1}(θi − θ∗)

83

Page 85: An Introduction and Overview - 123.physics.ucdavis.edu

Software:

1. Bayesian inference Using Gibbs Sampling, BUGS

(http://www.mrc-bsu.cam.ac.uk/bugs/)

2. For common pharmacokinetic models PKBUGS

(http://www.med.ic.ac.uk/divisions/60/pkbugs_web/home.html)

3. More general (e.g., physiologically-based PK models) MCSim

(http://fredomatic.free.fr/)

4. These sites will lead you to more. . .

Drawbacks: 1 cannot do really complex models, 2 has only certain PK

models hard-wired, 3 has steep learning curve

84

Page 86: An Introduction and Overview - 123.physics.ucdavis.edu

9. Closing remarks

• To take appropriate account of sources of variation in data, a

deterministic model should be embedded in a statistical model that

incorporates realistic assumptions

• Inverse problem should be regarded as parameter

estimation/inference in the statistical model framework

• Parameter estimation should be accompanied by assessment of

uncertainty

• Personal belief – Bayesian formulation implemented via Markov

chain Monte Carlo methods is the way to go for complex

hierarchical models

• Important issue for future work – Design

85

Page 87: An Introduction and Overview - 123.physics.ucdavis.edu

10. References and where to read more

Bedaux, J.J.M. and Kooijman, S.A.L.M. (1994) Stochasticity in deterministic

models. In Handbook of Statistics, Vol. 12, G.P. Patil and C.R. Rao, eds.

Elsevier Science, pp. 561–581.

Bortz, D.M. (2002) Modeling, analysis, and estimation of an in vitro HIV

infection using functional differential equations. Ph.D. dissertation,

Department of Mathematics, North Carolina State University.

Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for

Data Analysis, Second Edition. Chapman and Hall/CRC.

Carroll, R.J. and Ruppert, D. (1988) Transformation and Weighting in

Regression. Chapman and Hall.

Casella, G. and Berger, R.L. (2002) Statistical Inference, Second Edition.

Duxbury Press.

Chen, M.H., Shao, Q.M., and Ibrahim, J.G. (2000) Monte Carlo Methods in

Bayesian Computation. Springer.

86

Page 88: An Introduction and Overview - 123.physics.ucdavis.edu

Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated

Measurement Data. Chapman and Hall/CRC.

Gelman, A., Bois, F., and Jiang, J. (1996) Physiological pharmacokinetic

analysis using population modeling and informative prior distributions.

Journal of the American Statistical Association 91, 1400–1412.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data

Analysis. Chapman and Hall/CRC.

Kaipio, J.P., Kolehmainen, V., Somersalo, E., and Vauhkonen, M. (2000)

Statistical inversion and Monte Carlo sampling methods in electrical

impedance tomography. Inverse Problems 16, 1487–1522.

Mosegaard, K. and Sambridge, M. (2002) Monte Carlo analysis of inverse

problems. Inverse Problems 18, R29–R54.

Robert, C.P. and Casella, G. (1999) Monte Carlo Statistical Methods.

Springer.

87

