Statistical Inverse Problems,
Model Reduction and
Inverse Crimes
Erkki Somersalo, Helsinki University of Technology, Finland
Firenze, March 22–26, 2004
CONTENTS OF THE LECTURES
1. Statistical inverse problems: A brief review
2. Model reduction, discretization invariance
3. Inverse crimes
Material based on the forthcoming book
Jari Kaipio and Erkki Somersalo: Computational and Statistical Inverse Problems. Springer-Verlag (2004)
STATISTICAL INVERSE PROBLEMS
Bayesian paradigm, or “subjective probability”:
1. All variables are random variables
2. The randomness reflects the subject’s uncertainty about the actual values
3. The uncertainty is encoded into probability distributions of the variables
Notation: Random variables X, Y, E, etc.
Realizations: If X : Ω → R^n, we denote
X(ω) = x ∈ R^n.
Probability densities:
P{X ∈ B} = ∫_B π_X(x) dx = ∫_B π(x) dx.
Hierarchy of the variables:
1. Unobservable variables of primary interest, X
2. Unobservable variables of secondary interest, E
3. Observable variables, Y
Example: Linear inverse problem with additive noise,
y = Ax + e,   A ∈ R^{m×n}.
Stochastic extension: Y = AX + E.
Conditioning: Joint probability density of X and Y:
P{X ∈ A, Y ∈ B} = ∫_{A×B} π(x, y) dx dy.
Marginal densities:
P{X ∈ A} = P{X ∈ A, Y ∈ R^m} = ∫_{A×R^m} π(x, y) dx dy,
in other words,
π(x) = ∫_{R^m} π(x, y) dy.
Conditional probability:
P{X ∈ A | Y ∈ B} = ∫_{A×B} π(x, y) dx dy / ∫_B π(y) dy.
Shrink B into a single point y:
P{X ∈ A | Y = y} = ∫_A (π(x, y)/π(y)) dx = ∫_A π(x | y) dx,
where
π(x | y) = π(x, y)/π(y),   or   π(x, y) = π(x | y) π(y).
Bayesian solution of an inverse problem: Given a measurement y = y_observed
of the observable variable Y, find the posterior density of X,
π_post(x) = π(x | y_observed).
The prior density π_pr(x) expresses all prior information independent of the measurement.
The likelihood density π(y | x) is the likelihood of a measurement outcome y given x.
Bayes formula:
π(x | y) = π_pr(x) π(y | x) / π(y).
Three steps of Bayesian inversion:
1. Construct the prior density
2. Construct the likelihood density
3. Extract useful information from the posterior density
Example: Linear model with additive noise,
Y = AX + E,
where the density π_noise of E is known. Fixing X = x yields
π(y | x) = π_noise(y − Ax),
and so
π(x | y) ∝ π_pr(x) π_noise(y − Ax).
Assume that X and E are mutually independent and Gaussian,
X ∼ N(x_0, Γ_pr),   E ∼ N(0, Γ_e),
where Γ_pr ∈ R^{n×n} and Γ_e ∈ R^{m×m} are symmetric positive (semi)definite. Then
π_pr(x) ∝ exp(−(1/2)(x − x_0)^T Γ_pr^{-1} (x − x_0)),
π(y | x) ∝ exp(−(1/2)(y − Ax)^T Γ_e^{-1} (y − Ax)).
From Bayes formula, the posterior density is Gaussian,
π(x | y) ∼ N(x*, Γ_post),
where
x* = x_0 + Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} (y − A x_0),
Γ_post = Γ_pr − Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} A Γ_pr.
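A minimal numerical sketch of these two formulas; the dimensions, the matrix A and the covariances below are illustrative choices, not values from the lectures:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear model y = A x + e
n, m = 20, 15
A = rng.standard_normal((m, n))
Gamma_pr = 1.0 * np.eye(n)          # prior covariance
Gamma_e = 0.01 * np.eye(m)          # noise covariance
x0 = np.zeros(n)                    # prior mean

x_true = rng.multivariate_normal(x0, Gamma_pr)
y = A @ x_true + rng.multivariate_normal(np.zeros(m), Gamma_e)

# Posterior mean and covariance of the Gaussian linear model
S = A @ Gamma_pr @ A.T + Gamma_e                 # m x m matrix
K = Gamma_pr @ A.T @ np.linalg.inv(S)            # "gain" matrix
x_star = x0 + K @ (y - A @ x0)
Gamma_post = Gamma_pr - K @ A @ Gamma_pr

print("posterior mean (first entries):", x_star[:3])
print("posterior std  (first entries):", np.sqrt(np.diag(Gamma_post))[:3])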
Special case: Assume that
x_0 = 0,   Γ_pr = γ^2 I,   Γ_e = σ^2 I.
In this case,
x* = A^T (A A^T + α^2 I)^{-1} y,   α = σ/γ,
known as the Wiener filtered solution (m × m problem), or, equivalently,
x* = (A^T A + α^2 I)^{-1} A^T y,
which is the Tikhonov regularized solution (n × n problem).
Engineering rule of thumb: If n < m, use Tikhonov; if m < n, use Wiener.
(In practice, ATA or AAT should often not be calculated.)
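The two formulas can be checked to agree numerically; a small sketch with an illustrative random A (for sizes this small, forming A^T A and A A^T explicitly is harmless):

import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 50                        # here m < n: the "Wiener" (m x m) form is cheaper
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
alpha = 0.5

# Wiener-filtered solution: solve an m x m system
x_wiener = A.T @ np.linalg.solve(A @ A.T + alpha**2 * np.eye(m), y)

# Tikhonov-regularized solution: solve an n x n system
x_tikhonov = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ y)

# The two formulas give the same vector (up to rounding)
print(np.allclose(x_wiener, x_tikhonov))   # True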
Frequently asked question: How do you determine α?
Bayesian paradigm: Either
1. You know γ and σ; then α = σ/γ,
or
2. You don’t know them; make them part of the estimation problem.
This is the empirical Bayes approach.
Example: If γ in the previous example is unknown, write
π_pr(x | γ) ∝ (1/γ^n) exp(−‖x‖^2 / (2γ^2)),
and write
π_pr(x, γ) = π_pr(x | γ) π_h(γ),
where π_h is a hyperprior or hierarchical prior.
Determine π(x, γ | y).
BAYESIAN ESTIMATION
Classical inversion methods produce estimates of the unknown.
In contrast, the Bayesian approach produces a probability density that can be used
• to produce estimates,
• to assess the quality of estimates (statistical and classical).
Example: Conditional mean (CM) and maximum a posteriori (MAP) estimates:
x_CM = ∫_{R^n} x π(x | y) dx,
x_MAP = arg max π(x | y).
Calculating the MAP estimate is an optimization problem, the CM estimate an integration problem.
Monte Carlo integration: If n is large, quadrature methods are not feasible.
MC methods: Assume that we have a sample
S = {x^1, x^2, . . . , x^N},   x^j ∈ R^n.
Write
x_CM = ∫_{R^n} x π(x | y) dx ≈ Σ_{j=1}^N w_j x^j,
where w_j ∝ π(x^j | y), normalized so that Σ_j w_j = 1.
Importance sampling: Generate the sample S randomly.
Simple but inefficient (in particular when n is large).
A better idea: Generate the sample using the density π(x | y).
Ideal case: The points x^j are distributed according to the density π(x | y), and
x_CM = ∫_{R^n} x π(x | y) dx ≈ (1/N) Σ_{j=1}^N x^j.
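A small sketch of the weighted-sample approximation; the two-dimensional unnormalized density post below is an invented toy posterior, used only to illustrate self-normalized importance sampling with a uniform proposal:

import numpy as np

rng = np.random.default_rng(2)

# Toy unnormalized posterior on R^2 (an illustrative choice, not from the lectures)
def post(x):
    return np.exp(-0.5 * (x[..., 0]**2 + 4.0 * (x[..., 1] - x[..., 0]**2)**2))

# Importance sampling with a uniform proposal over [-3, 3]^2
N = 100_000
sample = rng.uniform(-3.0, 3.0, size=(N, 2))
w = post(sample)
w /= w.sum()                               # self-normalized weights

x_cm = (w[:, None] * sample).sum(axis=0)   # weighted sample mean approximates the CM estimate
print("CM estimate:", x_cm)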
Markov chain Monte Carlo (MCMC) methods: Generate the sample sequentially,
x^0 → x^1 → . . . → x^j → x^{j+1} → . . . → x^N.
Idea: Define a transition probability P(x^j, B_{j+1}),
P(x^j, B_{j+1}) = P{X^{j+1} ∈ B_{j+1}, provided that X^j = x^j}.
Assuming that X^j has probability density π_j(x^j),
P{X^{j+1} ∈ B_{j+1}} = ∫_{R^n} P(x^j, B_{j+1}) π_j(x^j) dx^j = π_{j+1}(B_{j+1}).
Choose the transition kernel so that π(x | y) is an invariant measure:
∫_B π(x | y) dx = ∫_{R^n} P(x′, B) π(x′ | y) dx′.
Then all the variables X^j are distributed according to π(x | y).
Best known algorithms:
Metropolis-Hastings, Gibbs sampler.
Gibbs sampler: Update one component at a time as follows:
Given x^j = [x^j_1, x^j_2, . . . , x^j_n]:
Draw x^{j+1}_1 from t ↦ π(t, x^j_2, . . . , x^j_n | y),
draw x^{j+1}_2 from t ↦ π(x^{j+1}_1, t, x^j_3, . . . , x^j_n | y),
...
draw x^{j+1}_n from t ↦ π(x^{j+1}_1, x^{j+1}_2, . . . , x^{j+1}_{n−1}, t | y).
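A minimal sketch of these componentwise updates for a two-dimensional Gaussian target standing in for π(x | y); the target, the correlation ρ and the chain length are illustrative choices:

import numpy as np

rng = np.random.default_rng(3)

# Target: a correlated bivariate Gaussian, standing in for pi(x | y)
rho = 0.9
N = 5000
x = np.zeros(2)
chain = np.empty((N, 2))

for j in range(N):
    # draw x1 from pi(x1 | x2) = N(rho * x2, 1 - rho^2)
    x[0] = rho * x[1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    # draw x2 from pi(x2 | x1) = N(rho * x1, 1 - rho^2)
    x[1] = rho * x[0] + np.sqrt(1 - rho**2) * rng.standard_normal()
    chain[j] = x

print("sample mean       :", chain.mean(axis=0))           # close to (0, 0)
print("sample correlation:", np.corrcoef(chain.T)[0, 1])   # close to rho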
Define a cost function Ψ : R^n × R^n → R.
The Bayes cost of an estimator x̂ = x̂(y) is defined as
B(x̂) = E{Ψ(X, x̂(Y))} = ∫∫ Ψ(x, x̂(y)) π(x, y) dx dy.
Further, we can write
B(x̂) = ∫ [∫ Ψ(x, x̂(y)) π(y | x) dy] π_pr(x) dx = ∫ B(x̂ | x) π_pr(x) dx = E{B(x̂ | x)},
where
B(x̂ | x) = ∫ Ψ(x, x̂(y)) π(y | x) dy
is the conditional Bayes cost.
The Bayes cost method: Fix Ψ and define the estimator x̂_B so that
B(x̂_B) ≤ B(x̂)
for all estimators x̂ of x.
By Bayes formula,
B(x̂) = ∫ [∫ Ψ(x, x̂(y)) π(x | y) dx] π(y) dy.
Since π(y) ≥ 0 and x̂(y) depends only on y,
x̂_B(y) = arg min ∫ Ψ(x, x̂) π(x | y) dx = arg min E{Ψ(X, x̂) | y}.
Mean square error criterion: Choose Ψ(x, x̂) = ‖x − x̂‖^2, giving
B(x̂) = E{‖X − X̂‖^2} = trace(corr(X − X̂)),
where X̂ = x̂(Y), and
corr(X − X̂) = E{(X − X̂)(X − X̂)^T} ∈ R^{n×n}.
This Bayes estimator is called the mean square estimator x̂_MS. We have
x̂_MS = ∫ x π(x | y) dx = x_CM.
We have
E{‖X − x̂‖^2 | y} = E{‖X‖^2 | y} − 2 E{X | y}^T x̂ + ‖x̂‖^2
                 = E{‖X‖^2 | y} − ‖E{X | y}‖^2 + ‖E{X | y} − x̂‖^2
                 ≥ E{‖X‖^2 | y} − ‖E{X | y}‖^2,
and equality holds only if
x̂(y) = E{X | y} = x_CM.
Furthermore,
E{X − x_CM} = E{X − E{X | Y}} = 0.
Question: xCM is optimal, but is it informative?
[Figure: two posterior densities with the CM and MAP estimates marked, panels (a) and (b).]
No estimate is foolproof. Optimality is subjective.
DISCRETIZED MODELS
Consider a linear model with additive noise,
y = Af + e,   f ∈ H,   y, e ∈ R^m.
Discretization, e.g. by collocation,
x^n = [f(p_1); f(p_2); . . . ; f(p_n)] ∈ R^n,
Af ≈ A_n x^n,   A_n ∈ R^{m×n}.
Assume that the discretization scheme is convergent,
lim_{n→∞} ‖Af − A_n x^n‖ = 0.
Accurate discrete model:
y = A_N x^N + e,   ‖A_N x^N − Af‖ < tol.
Stochastic extension: Y = A_N X^N + E,
where Y, X^N and E are random variables.
Passing to a coarse mesh. Possible reasons:
1. 2D and 3D applications, problems too large
2. Real time applications
3. Inverse modelling based on prescribed meshing
Coarse mesh model with n < N,
Af ≈ A_n x^n,   ‖A_n x^n − Af‖ > tol.
The stochastic extension of the simple reduced model is
Y = A_n X^n + E.
Inverse crime:
• Write
Y = A_n X^n + E,   (1)
and develop the inversion scheme based on this model,
• generate data with the simple reduced model and test the inversion method with these data.
Usually, inverse crime results are overly optimistic.
Questions:
1. How to model the discretization error?
2. How to model the prior information?
3. Is the inverse crime always significant?
PRIOR MODELLING
Assume a Gaussian model,
X^N ∼ N(x^N_0, Γ^N),
i.e., the prior density is
π_pr(x^N) ∝ exp(−(1/2)(x^N − x^N_0)^T (Γ^N)^{-1} (x^N − x^N_0)).
Projection (e.g. interpolation, averaging or downsampling),
P : R^N → R^n,   X^N ↦ X^n.
Then,
E{X^n} = E{P X^N} = P E{X^N} = P x^N_0,
E{X^n (X^n)^T} = E{P X^N (X^N)^T P^T} = P E{X^N (X^N)^T} P^T,
and therefore,
X^n ∼ N(x^n_0, Γ^n) = N(P x^N_0, P Γ^N P^T).
However, this is not what we normally do!
Example: H = continuous functions on [0, 1].
Discretization by multiresolution bases. Let
ϕ(t) = 1 if 0 ≤ t < 1,   and   ϕ(t) = 0 if t < 0 or t ≥ 1.
Define V^j, 0 ≤ j < ∞, V^j ⊂ V^{j+1}, by
V^j = span{ϕ^j_k | 1 ≤ k ≤ 2^j},
where
ϕ^j_k(t) = 2^{j/2} ϕ(2^j t − k + 1).
Discrete representation,
f^j(t) = Σ_{k=1}^{2^j} x^j_k ϕ^j_k(t) ∈ V^j.
Projector P : x^j ↦ x^{j−1},
P = (1/√2) I_{2^{j−1}} ⊗ [1 1]
  = (1/√2) [ 1 1 0 0 ⋯ 0 0
             0 0 1 1 ⋯ 0 0
                     ⋱
             0 0 0 0 ⋯ 1 1 ] ∈ R^{2^{j−1} × 2^j}.
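A small sketch of this coarsening matrix; the helper haar_projection below is a hypothetical name, and the construction assumes the pairwise 1/√2 weighting given above:

import numpy as np

def haar_projection(j):
    """Coarsening matrix P : R^(2^j) -> R^(2^(j-1)) that pairs up
    neighbouring coefficients with weight 1/sqrt(2)."""
    n_coarse, n_fine = 2**(j - 1), 2**j
    P = np.zeros((n_coarse, n_fine))
    for k in range(n_coarse):
        P[k, 2 * k] = P[k, 2 * k + 1] = 1.0 / np.sqrt(2.0)
    return P

P = haar_projection(3)                     # maps R^8 -> R^4
print(P.shape)                             # (4, 8)
print(np.allclose(P @ P.T, np.eye(4)))     # the rows are orthonormal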
Assume the prior information f ∈ C^2_0([0, 1]).
Second order smoothness prior of X^N, N = 2^j:
π_pr(x^N) ∝ exp(−(1/2) α ‖L^N x^N‖^2) = exp(−(1/2)(x^N)^T [α (L^N)^T L^N] x^N),
where
L^N = 2^{2j} [ −2  1  0  ⋯  0
                1 −2  1
                0  1 −2  ⋱
                         ⋱  1
                0  ⋯  1  −2 ] ∈ R^{N×N}.
The prior covariance is
Γ^N = [α (L^N)^T L^N]^{-1}.
Passing to level n = 2^{j−1} = N/2:
L^n = 2^{2(j−1)} [ −2  1  0  ⋯  0
                    1 −2  1
                    0  1 −2  ⋱
                             ⋱  1
                    0  ⋯  1  −2 ] = P L^N P^T ∈ R^{n×n}.
A natural candidate for the smoothness prior of X^n is
π_pr(x^n) ∝ exp(−(1/2) α ‖L^n x^n‖^2) = exp(−(1/2)(x^n)^T [α (L^n)^T L^n] x^n).
But this is inconsistent, since
Γ^n = [α (L^n)^T L^n]^{-1} ≠ P [α (L^N)^T L^N]^{-1} P^T = P Γ^N P^T.
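A small numerical check of this inconsistency, with illustrative values of j and α, and with L built as the tridiagonal matrix above:

import numpy as np

def second_diff(n, j):
    """Tridiagonal second-difference matrix scaled by 2^(2j) (= 1/h^2)."""
    L = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    return (2.0**(2 * j)) * L

j, alpha = 4, 1.0
N, n = 2**j, 2**(j - 1)

L_N, L_n = second_diff(N, j), second_diff(n, j - 1)
Gamma_N = np.linalg.inv(alpha * L_N.T @ L_N)
Gamma_n = np.linalg.inv(alpha * L_n.T @ L_n)

# Coarsening matrix P (pairwise sums scaled by 1/sqrt(2), as above)
P = np.kron(np.eye(n), np.ones((1, 2))) / np.sqrt(2.0)

Gamma_n_projected = P @ Gamma_N @ P.T
print(np.allclose(Gamma_n, Gamma_n_projected))     # False: the two coarse priors disagree
print(np.abs(Gamma_n - Gamma_n_projected).max())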
Numerical example:
Af(t) = ∫_0^1 K(t − s) f(s) ds,   K(s) = e^{−κ s^2},
where κ = 15. Sampling:
y_j = Af(t_j) + e_j,   t_j = (j − 1/2)/50,   1 ≤ j ≤ 50,
and
E ∼ N(0, σ^2 I),   σ = 2% of max(Af(t_j)).
Smoothness prior
π_pr(x^N) ∝ exp(−(1/2) α ‖L^N x^N‖^2),   N = 512.
Reduced model with n = 8.
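As a sketch of how the fine-mesh and reduced-model matrices could be assembled, the code below discretizes the convolution kernel with a midpoint quadrature rule; the quadrature choice is an assumption, since the lectures do not state how A was discretized:

import numpy as np

kappa, m = 15.0, 50
t = (np.arange(1, m + 1) - 0.5) / m          # measurement points t_j

def forward_matrix(n_nodes):
    """Midpoint-rule discretization of (Af)(t) = int_0^1 exp(-kappa (t-s)^2) f(s) ds."""
    s = (np.arange(n_nodes) + 0.5) / n_nodes  # quadrature nodes
    h = 1.0 / n_nodes                          # quadrature weight
    return np.exp(-kappa * (t[:, None] - s[None, :])**2) * h

A_N = forward_matrix(512)    # accurate model
A_n = forward_matrix(8)      # reduced model
print(A_N.shape, A_n.shape)  # (50, 512) (50, 8)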
Figure 1: MAP estimate with N = 512, n = 8. Black dots correspond to the coarse-level prior covariance [α (L^n)^T L^n]^{-1}, red dots to the projected prior covariance P Γ^N P^T.
DISCRETIZATION ERROR
From fine mesh to coarse mesh: Complete error model
Y = A_N X^N + E   (2)
  = A_n X^n + (A_N − A_n P) X^N + E
  = A_n X^n + E_discr + E.
Error covariance: Assume that E and X^N are mutually independent,
E ∼ N(0, Γ_e),   X^N ∼ N(x^N_0, Γ^N).
The complete error Ẽ = E_discr + E is Gaussian,
Ẽ ∼ N(ẽ_0, Γ̃_e),
where
ẽ_0 = (A_N − A_n P) x^N_0,
Γ̃_e = (A_N − A_n P) Γ^N (A_N − A_n P)^T + Γ_e.
Error variance:
var(Ẽ) = E{‖Ẽ − ẽ_0‖^2} = E{‖E_discr − ẽ_0‖^2} + E{‖E‖^2}
       = trace((A_N − A_n P) Γ^N (A_N − A_n P)^T) + trace(Γ_e)
       = var(E_discr) + var(E).
The complete error model is noise dominated if
var(E_discr) < var(E),
and modelling error dominated if
var(E_discr) > var(E).
Enhanced error model: Use the likelihood and prior
π(y | x^n) ∝ exp(−(1/2)(y − A_n x^n − y_0)^T Γ̃_e^{-1} (y − A_n x^n − y_0)),
π_pr(x^n) ∝ exp(−(1/2)(x^n − x^n_0)^T (Γ^n_pr)^{-1} (x^n − x^n_0)),
where
y_0 = E{Y} = A_n E{X^n} + ẽ_0 = A_n P x^N_0 + (A_N − A_n P) x^N_0 = A_N x^N_0.
The MAP estimate, denoted by x^n_eem, is
x^n_eem = argmin { ‖L^n_pr (x^n − x^n_0)‖^2 + ‖L_e (A_n x^n − (y − y_0))‖^2 }
        = argmin ‖ [L^n_pr; L_e A_n] x^n − [L^n_pr x^n_0; L_e (y − y_0)] ‖^2,
where
L^n_pr = chol((Γ^n_pr)^{-1}),   L_e = chol(Γ̃_e^{-1}).
This leads to normal equations of size n × n.
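A compact sketch of this estimate; the forward matrix, the projection P and the priors below are illustrative stand-ins, and the Cholesky factors implement the whitening described above:

import numpy as np

rng = np.random.default_rng(4)

# Illustrative sizes: fine mesh N, coarse mesh n, m data points; a random matrix
# stands in for the accurate forward map A_N
N, n, m = 64, 8, 20
A_N = rng.standard_normal((m, N)) / N
E_interp = np.kron(np.eye(n), np.ones((N // n, 1)))        # piecewise-constant interpolation
P = np.kron(np.eye(n), np.ones((1, N // n))) / (N // n)    # block-averaging projection
A_n = A_N @ E_interp                                        # a simple reduced model

# Fine-mesh smoothness prior X^N ~ N(0, (L^T L)^{-1}) and white noise (illustrative)
L = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
Gamma_N = np.linalg.inv(L.T @ L)
x_N0 = np.zeros(N)
sigma = 1e-2
Gamma_e = sigma**2 * np.eye(m)

# Complete error covariance and data mean (here y0 = A_N x_N0 = 0)
D = A_N - A_n @ P
Gamma_tilde = D @ Gamma_N @ D.T + Gamma_e
y0 = A_N @ x_N0

# Coarse prior obtained by projecting the fine-mesh prior
Gamma_n, x_n0 = P @ Gamma_N @ P.T, P @ x_N0

# Simulate data with the accurate model (x_true = L^{-1} w has covariance Gamma_N)
x_true = np.linalg.solve(L, rng.standard_normal(N))
y = A_N @ x_true + sigma * rng.standard_normal(m)

# Whitening factors and the stacked least-squares form of the MAP estimate
L_pr = np.linalg.cholesky(np.linalg.inv(Gamma_n)).T
L_e = np.linalg.cholesky(np.linalg.inv(Gamma_tilde)).T
K = np.vstack([L_pr, L_e @ A_n])
b = np.concatenate([L_pr @ x_n0, L_e @ (y - y0)])
x_eem = np.linalg.lstsq(K, b, rcond=None)[0]
print(x_eem)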
Note: The enhanced error model is not the complete error model, because X^n is correlated with the complete error Ẽ through X^N.
Complete error model: Assume, for the moment, that X^n and Y have zero mean. We have
X^n = P X^N,   Y = A_N X^N + E.
The variable Z = [X^n; Y] is Gaussian, with zero mean and covariance
E{Z Z^T} = [ E{X^n (X^n)^T}   E{X^n Y^T}
             E{Y (X^n)^T}     E{Y Y^T}   ]
         = [ P Γ^N P^T        P Γ^N A_N^T
             A_N Γ^N P^T      A_N Γ^N A_N^T + Γ_e ].
From this, calculate the conditional density π(x^n | y):
π(x^n | y) ∼ N(x^n_cem, Γ^n_cem),
where
x^n_cem = P x^N_0 + P Γ^N_pr A_N^T [A_N Γ^N_pr A_N^T + Γ_e]^{-1} (y − A_N x^N_0),
and
Γ^n_cem = P Γ^N_pr P^T − P Γ^N_pr A_N^T [A_N Γ^N_pr A_N^T + Γ_e]^{-1} A_N Γ^N_pr P^T.
Note: The computation of x^n_cem requires solving an m × m system, independently of n. (Compare to x^n_eem.)
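A sketch of the zero-mean version of these formulas with illustrative matrices; note that only an m × m system is solved:

import numpy as np

rng = np.random.default_rng(5)

# Zero-mean illustrative setup, same flavour as above: fine mesh N, coarse mesh n, m data
N, n, m = 64, 8, 20
A_N = rng.standard_normal((m, N)) / N
P = np.kron(np.eye(n), np.ones((1, N // n))) / (N // n)

L = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
Gamma_N = np.linalg.inv(L.T @ L)
sigma = 1e-2
Gamma_e = sigma**2 * np.eye(m)

x_true = np.linalg.solve(L, rng.standard_normal(N))   # a draw from N(0, Gamma_N)
y = A_N @ x_true + sigma * rng.standard_normal(m)

# Conditional mean and covariance of X^n given Y = y: only an m x m system is solved
S = A_N @ Gamma_N @ A_N.T + Gamma_e
G = P @ Gamma_N @ A_N.T                               # cross covariance of X^n and Y
x_cem = G @ np.linalg.solve(S, y)
Gamma_cem = P @ Gamma_N @ P.T - G @ np.linalg.solve(S, G.T)

print(np.linalg.norm(x_cem - P @ x_true))             # error of the coarse-mesh CM estimate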
Example: Full angle tomography.
Figure 2: True object and the discretized model (X-ray source and detector geometry).
Intensity decrease along a line segment dℓ:
dI = −I µ dℓ,
where µ = µ(p) ≥ 0, p ∈ Ω, is the mass absorption.
Let I_0 be the intensity of the transmitted X-ray. The received intensity I satisfies
log(I / I_0) = ∫_{I_0}^{I} dI / I = −∫_ℓ µ(p) dℓ(p).
Inverse problem of X-ray tomography: Estimate µ : Ω → R_+ from the values of its integrals along a set of straight lines passing through Ω.
Figure 3: Sinogram data.
Gaussian structural smoothness prior: Three weakly correlated subregions. Inside each region, the pixels are mutually correlated.
Figure 4: Prior geometry.
Construction of the prior: Pixel centers p_j, 1 ≤ j ≤ N.
Divide the pixels into cliques C_1, C_2 and C_3. In medical imaging, this is called image segmentation.
Define the neighbourhood system N = {N_i | 1 ≤ i ≤ N}, N_i ⊂ {1, 2, . . . , N}, where
j ∈ N_i if and only if the pixels p_i and p_j are neighbours and belong to the same clique.
Define the density of a Markov random field X as
π_MRF(x) ∝ exp(−(1/2) α Σ_{j=1}^N | x_j − c_j Σ_{i∈N_j} x_i |^2) = exp(−(1/2) α x^T B x),
where the coupling constant c_j depends on the clique.
The matrix B is singular.
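A small sketch of the matrix B for a one-dimensional chain of pixels with the coupling c_j = 1/|N_j| (both the chain geometry and this coupling are illustrative choices); it confirms that constant images lie in the null space, so B is singular:

import numpy as np

npix = 10
# 1D chain of pixels: neighbours are the adjacent pixels (all in one clique)
neighbours = [[i for i in (j - 1, j + 1) if 0 <= i < npix] for j in range(npix)]

# Coupling c_j = 1/|N_j|, so each term is x_j minus the average of its neighbours
M = np.eye(npix)
for j, Nj in enumerate(neighbours):
    for i in Nj:
        M[j, i] -= 1.0 / len(Nj)

B = M.T @ M                                 # pi_MRF(x) ∝ exp(-alpha/2 * x^T B x)
print(np.linalg.matrix_rank(B))             # npix - 1: B is singular
print(np.allclose(B @ np.ones(npix), 0))    # constant vectors are in the null space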
Remedy: Select a few points {p_j | j ∈ I′′}, where I′′ ⊂ I = {1, 2, . . . , N}. Let I′ = I \ I′′.
Denote x = [x′; x′′].
The conditional density π_MRF(x′ | x′′) (i.e., x′′ fixed) is a proper measure with respect to x′.
Define
π_pr(x) = π_MRF(x′ | x′′) π_0(x′′),
where π_0 is Gaussian, e.g.,
π_0 ∼ N(0, γ_0^2 I).
Figure 5: Four random draws from the prior density.
Data generated on an N = 84 × 84 mesh, inverse solutions computed on an n = 42 × 42 mesh.
Proper data y ∈ R^m and inverse crime data y_ic ∈ R^m:
y = A_N x^N_true + e,   y_ic = A_n P x^N_true + e,
where x^N_true is drawn from the prior density and e is a realization of
E ∼ N(0, σ^2 I),
where
σ^2 = κ m^{-1} trace((A_N − A_n P) Γ^N (A_N − A_n P)^T),   0.1 ≤ κ ≤ 10.
In other words,
0.1 ≤ κ = (noise variance) / (discretization error variance) ≤ 10.
What is the structure of the discretization error? Can we approximate it by Gaussian white noise?
Figure 6: The diagonal Γ_A(k, k) and the first off-diagonal Γ_A(k, k+1) of the discretization error covariance, plotted against the projection number.
Error analysis:
1. Draw a sample x^N_1, x^N_2, . . . , x^N_S, S = 500, from the prior density.
2. Choose the noise level σ = σ(κ) and generate data y_1(κ), y_2(κ), . . . , y_S(κ), both the proper and the inverse crime versions.
3. Calculate the estimates x̂(y_1(κ)), x̂(y_2(κ)), . . . , x̂(y_S(κ)).
4. Estimate the estimation error,
E{‖X − X̂(κ)‖^2} ≈ (1/S) Σ_{j=1}^S ‖x̂(y_j(κ)) − x_j‖^2.
Estimators: CM, CM with the enhanced error model, and truncated CGNR with the Morozov discrepancy principle, discrepancy
δ^2 = τ E{‖E‖^2} = τ m σ(κ)^2,   τ = 1.1.
Figure 7: Estimation errors ‖x̂ − x‖^2 at various noise levels for CG, CG with inverse crime data (CG IC), CM, and CM with the error correction (CM Corr). The dashed line is var(E_discr).
[Figures: reconstructions at error levels 0.0029247, 0.0047491, 0.0060516, 0.0077115 and 0.11093.]
Example: Estimation error: If x̂ = x̂(y) is an estimator, define the relative estimation error as
D(x̂) = E{‖X − X̂‖^2} / E{‖X‖^2}.
Observe: D(0) = 1, and
D(x̂_CM) ≤ D(x̂)
for any estimator x̂.
Test case: Limited-angle tomography. Reconstructions with truncated singular value decomposition (TSVD) versus the CM estimate.
Calculate D(x̂_TSVD) and D(x̂_CM) by ensemble averaging (S = 500).
TSVD estimate: y = Ax + e.
SVD decomposition: A = U D V^T, where
U = [u_1, u_2, . . . , u_m] ∈ R^{m×m},   V = [v_1, v_2, . . . , v_n] ∈ R^{n×n},
and
D = diag(d_1, d_2, . . . , d_{min(n,m)}) ∈ R^{m×n},   d_1 ≥ d_2 ≥ . . . ≥ d_{min(n,m)} ≥ 0.
x_TSVD(y, r) = Σ_{j=1}^r (1/d_j)(u_j^T y) v_j,
and the truncation parameter r is chosen, e.g., by the Morozov discrepancy principle,
‖y − A x_TSVD(y, r)‖^2 ≤ τ E{‖E‖^2} < ‖y − A x_TSVD(y, r − 1)‖^2.
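A sketch of this estimator with the discrepancy-based choice of r, assuming white noise E ∼ N(0, σ^2 I); the test problem is an illustrative random ill-conditioned matrix, not the tomography model:

import numpy as np

def tsvd_morozov(A, y, noise_var, tau=1.1):
    """Truncated SVD estimate with the truncation level r chosen as the smallest r
    whose discrepancy ||y - A x_r||^2 falls below tau * E||E||^2."""
    m, n = A.shape
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    delta2 = tau * m * noise_var            # tau * E||E||^2 for E ~ N(0, noise_var * I)
    x = np.zeros(n)
    for r in range(len(d)):
        x = x + (U[:, r] @ y / d[r]) * Vt[r]
        if np.sum((y - A @ x)**2) <= delta2:
            return x, r + 1
    return x, len(d)

# Illustrative use on a random ill-conditioned problem
rng = np.random.default_rng(6)
m, n = 40, 60
A = rng.standard_normal((m, n)) @ np.diag(0.5**np.arange(n)) @ rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
sigma = 1e-3
y = A @ x_true + sigma * rng.standard_normal(m)

x_tsvd, r = tsvd_morozov(A, y, sigma**2)
print("truncation level r =", r)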
[Figures: reconstructions and histograms of the estimation error ‖x̂ − x‖^2 (density versus ‖x̂ − x‖^2) for the CM and TSVD estimates.]
CONCLUSIONS
• The Bayesian approach is useful for incorporating complex prior information into inverse solvers.
• It is not a method for producing a single estimator, although it can be used as a tool for that, too.
• It facilitates error analysis of discretization, modelling and estimation by deterministic methods.
• Working with ensembles makes it possible to analyze non-linear problems as well (e.g. EIT, OAST).