Statistical Inverse Problems,
Model Reduction and
Inverse Crimes
Erkki Somersalo, Helsinki University of Technology, Finland
Firenze, March 22–26, 2004
CONTENTS OF THE LECTURES
1. Statistical inverse problems: A brief review
2. Model reduction, discretization invariance
3. Inverse crimes
Material based on the forthcoming book
Jari Kaipio and Erkki Somersalo: Computational and Statistical Inverse Problems. Springer-Verlag (2004)
STATISTICAL INVERSE PROBLEMS
Bayesian paradigm, or “subjective probability”:
1. All variables are random variables
2. The randomness reflects the subject’s uncertainty about the actual values
3. The uncertainty is encoded into probability distributions of the variables
Notation: Random variables X, Y, E, etc.
Realizations: If X : Ω → R^n, we denote
X(ω) = x ∈ R^n.
Probability densities:
P{X ∈ B} = ∫_B π_X(x) dx = ∫_B π(x) dx.
Hierarchy of the variables:
1. Unobservable variables of primary interest, X
2. Unobservable variables of secondary interest, E
3. Observable variables, Y
Example: Linear inverse problem with additive noise,
y = Ax + e,   A ∈ R^{m×n}.
Stochastic extension: Y = AX + E.
Conditioning: Joint probability density of X and Y:
P{X ∈ A, Y ∈ B} = ∫_{A×B} π(x, y) dx dy.
Marginal densities:
P{X ∈ A} = P{X ∈ A, Y ∈ R^m} = ∫_{A×R^m} π(x, y) dx dy,
in other words,
π(x) = ∫_{R^m} π(x, y) dy.
Conditional probability:
P{X ∈ A | Y ∈ B} = ∫_{A×B} π(x, y) dx dy / ∫_B π(y) dy.
Shrink B into a single point y:
P{X ∈ A | Y = y} = ∫_A (π(x, y)/π(y)) dx = ∫_A π(x | y) dx,
where
π(x | y) = π(x, y)/π(y),   or   π(x, y) = π(x | y) π(y).
Bayesian solution of an inverse problem: Given a measurement y = y_observed
of the observable variable Y, find the posterior density of X,
π_post(x) = π(x | y_observed).
The prior density π_pr(x) expresses all prior information independent of the measurement.
The likelihood density π(y | x) is the likelihood of a measurement outcome y given x.
Bayes formula:
π(x | y) = π_pr(x) π(y | x) / π(y).
Three steps of Bayesian inversion:
1. Construct the prior density
2. Construct the likelihood density
3. Extract useful information from the posterior density
Example: Linear model with additive noise,
Y = AX + E,
where the density π_noise of E is known. Fixing X = x yields
π(y | x) = π_noise(y − Ax),
and so
π(x | y) ∝ π_pr(x) π_noise(y − Ax).
Assume that X and E are mutually independent and Gaussian,
X ∼ N(x_0, Γ_pr),   E ∼ N(0, Γ_e),
where Γ_pr ∈ R^{n×n} and Γ_e ∈ R^{m×m} are symmetric positive (semi)definite. Then
π_pr(x) ∝ exp(−(1/2)(x − x_0)^T Γ_pr^{-1} (x − x_0)),
π(y | x) ∝ exp(−(1/2)(y − Ax)^T Γ_e^{-1} (y − Ax)).
From Bayes formula, the posterior density is Gaussian,
π(x | y) ∼ N(x*, Γ_post),
where
x* = x_0 + Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} (y − A x_0),
Γ_post = Γ_pr − Γ_pr A^T (A Γ_pr A^T + Γ_e)^{-1} A Γ_pr.
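A minimal numerical sketch of these two formulas; the dimensions, the matrix A and the covariances below are illustrative choices, not values from the lectures:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear model y = A x + e
n, m = 20, 15
A = rng.standard_normal((m, n))
Gamma_pr = 1.0 * np.eye(n)          # prior covariance
Gamma_e = 0.01 * np.eye(m)          # noise covariance
x0 = np.zeros(n)                    # prior mean

x_true = rng.multivariate_normal(x0, Gamma_pr)
y = A @ x_true + rng.multivariate_normal(np.zeros(m), Gamma_e)

# Posterior mean and covariance of the Gaussian linear model
S = A @ Gamma_pr @ A.T + Gamma_e                 # m x m matrix
K = Gamma_pr @ A.T @ np.linalg.inv(S)            # "gain" matrix
x_star = x0 + K @ (y - A @ x0)
Gamma_post = Gamma_pr - K @ A @ Gamma_pr

print("posterior mean (first entries):", x_star[:3])
print("posterior std  (first entries):", np.sqrt(np.diag(Gamma_post))[:3])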
Special case: Assume that
x_0 = 0,   Γ_pr = γ^2 I,   Γ_e = σ^2 I.
In this case,
x* = A^T (A A^T + α^2 I)^{-1} y,   α = σ/γ,
known as the Wiener filtered solution (m × m problem), or, equivalently,
x* = (A^T A + α^2 I)^{-1} A^T y,
which is the Tikhonov regularized solution (n × n problem).
Engineering rule of thumb: If n < m, use Tikhonov; if m < n, use Wiener.
(In practice, ATA or AAT should often not be calculated.)
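The two formulas can be checked to agree numerically; a small sketch with an illustrative random A (for sizes this small, forming A^T A and A A^T explicitly is harmless):

import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 50                        # here m < n: the "Wiener" (m x m) form is cheaper
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
alpha = 0.5

# Wiener-filtered solution: solve an m x m system
x_wiener = A.T @ np.linalg.solve(A @ A.T + alpha**2 * np.eye(m), y)

# Tikhonov-regularized solution: solve an n x n system
x_tikhonov = np.linalg.solve(A.T @ A + alpha**2 * np.eye(n), A.T @ y)

# The two formulas give the same vector (up to rounding)
print(np.allclose(x_wiener, x_tikhonov))   # True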
Frequently asked question: How do you determine α?
Bayesian paradigm: Either
1. You know γ and σ; then α = σ/γ,
or
2. You don’t know them; make them part of the estimation problem.
This is the empirical Bayes approach.
Example: If γ in the previous example is unknown, write
π_pr(x | γ) ∝ (1/γ^n) exp(−‖x‖^2 / (2γ^2)),
and write
π_pr(x, γ) = π_pr(x | γ) π_h(γ),
where π_h is a hyperprior or hierarchical prior.
Determine π(x, γ | y).
BAYESIAN ESTIMATION
Classical inversion methods produce estimates of the unknown.
In contrast, the Bayesian approach produces a probability density that can be used
• to produce estimates,
• to assess the quality of estimates (statistical and classical).
Example: Conditional mean (CM) and maximum a posteriori (MAP) estimates:
x_CM = ∫_{R^n} x π(x | y) dx,
x_MAP = arg max π(x | y).
Calculating the MAP estimate is an optimization problem, the CM estimate an integration problem.
Monte Carlo integration: If n is large, quadrature methods are not feasible.
MC methods: Assume that we have a sample
S = {x^1, x^2, . . . , x^N},   x^j ∈ R^n.
Write
x_CM = ∫_{R^n} x π(x | y) dx ≈ Σ_{j=1}^N w_j x^j,
where w_j ∝ π(x^j | y), normalized so that Σ_j w_j = 1.
Importance sampling: Generate the sample S randomly.
Simple but inefficient (in particular when n is large).
A better idea: Generate the sample using the density π(x | y).
Ideal case: The points x^j are distributed according to the density π(x | y), and
x_CM = ∫_{R^n} x π(x | y) dx ≈ (1/N) Σ_{j=1}^N x^j.
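A small sketch of the weighted-sample approximation; the two-dimensional unnormalized density post below is an invented toy posterior, used only to illustrate self-normalized importance sampling with a uniform proposal:

import numpy as np

rng = np.random.default_rng(2)

# Toy unnormalized posterior on R^2 (an illustrative choice, not from the lectures)
def post(x):
    return np.exp(-0.5 * (x[..., 0]**2 + 4.0 * (x[..., 1] - x[..., 0]**2)**2))

# Importance sampling with a uniform proposal over [-3, 3]^2
N = 100_000
sample = rng.uniform(-3.0, 3.0, size=(N, 2))
w = post(sample)
w /= w.sum()                               # self-normalized weights

x_cm = (w[:, None] * sample).sum(axis=0)   # weighted sample mean approximates the CM estimate
print("CM estimate:", x_cm)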
Markov chain Monte Carlo (MCMC) methods: Generate the sample sequentially,
x^0 → x^1 → . . . → x^j → x^{j+1} → . . . → x^N.
Idea: Define a transition probability P(x^j, B_{j+1}),
P(x^j, B_{j+1}) = P{X^{j+1} ∈ B_{j+1}, provided that X^j = x^j}.
Assuming that X^j has probability density π_j(x^j),
P{X^{j+1} ∈ B_{j+1}} = ∫_{R^n} P(x^j, B_{j+1}) π_j(x^j) dx^j = π_{j+1}(B_{j+1}).
Choose the transition kernel so that π(x | y) is an invariant measure:
∫_B π(x | y) dx = ∫_{R^n} P(x′, B) π(x′ | y) dx′.
Then all the variables X^j are distributed according to π(x | y).
Best known algorithms:
Metropolis-Hastings, Gibbs sampler.
Gibbs sampler: Update one component at a time as follows:
Given x^j = [x^j_1, x^j_2, . . . , x^j_n]:
Draw x^{j+1}_1 from t ↦ π(t, x^j_2, . . . , x^j_n | y),
draw x^{j+1}_2 from t ↦ π(x^{j+1}_1, t, x^j_3, . . . , x^j_n | y),
...
draw x^{j+1}_n from t ↦ π(x^{j+1}_1, x^{j+1}_2, . . . , x^{j+1}_{n−1}, t | y).
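A minimal sketch of these componentwise updates for a two-dimensional Gaussian target standing in for π(x | y); the target, the correlation ρ and the chain length are illustrative choices:

import numpy as np

rng = np.random.default_rng(3)

# Target: a correlated bivariate Gaussian, standing in for pi(x | y)
rho = 0.9
N = 5000
x = np.zeros(2)
chain = np.empty((N, 2))

for j in range(N):
    # draw x1 from pi(x1 | x2) = N(rho * x2, 1 - rho^2)
    x[0] = rho * x[1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    # draw x2 from pi(x2 | x1) = N(rho * x1, 1 - rho^2)
    x[1] = rho * x[0] + np.sqrt(1 - rho**2) * rng.standard_normal()
    chain[j] = x

print("sample mean       :", chain.mean(axis=0))           # close to (0, 0)
print("sample correlation:", np.corrcoef(chain.T)[0, 1])   # close to rho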
Define a cost function Ψ : R^n × R^n → R.
The Bayes cost of an estimator x̂ = x̂(y) is defined as
B(x̂) = E{Ψ(X, x̂(Y))} = ∫∫ Ψ(x, x̂(y)) π(x, y) dx dy.
Further, we can write
B(x̂) = ∫ [∫ Ψ(x, x̂(y)) π(y | x) dy] π_pr(x) dx = ∫ B(x̂ | x) π_pr(x) dx = E{B(x̂ | x)},
where
B(x̂ | x) = ∫ Ψ(x, x̂(y)) π(y | x) dy
is the conditional Bayes cost.
The Bayes cost method: Fix Ψ and define the estimator x̂_B so that
B(x̂_B) ≤ B(x̂)
for all estimators x̂ of x.
By Bayes formula,
B(x̂) = ∫ [∫ Ψ(x, x̂(y)) π(x | y) dx] π(y) dy.
Since π(y) ≥ 0 and x̂(y) depends only on y,
x̂_B(y) = arg min ∫ Ψ(x, x̂) π(x | y) dx = arg min E{Ψ(X, x̂) | y}.
Mean square error criterion: Choose Ψ(x, x̂) = ‖x − x̂‖^2, giving
B(x̂) = E{‖X − X̂‖^2} = trace(corr(X − X̂)),
where X̂ = x̂(Y), and
corr(X − X̂) = E{(X − X̂)(X − X̂)^T} ∈ R^{n×n}.
This Bayes estimator is called the mean square estimator x̂_MS. We have
x̂_MS = ∫ x π(x | y) dx = x_CM.
We have
E{‖X − x̂‖^2 | y} = E{‖X‖^2 | y} − 2 E{X | y}^T x̂ + ‖x̂‖^2
                 = E{‖X‖^2 | y} − ‖E{X | y}‖^2 + ‖E{X | y} − x̂‖^2
                 ≥ E{‖X‖^2 | y} − ‖E{X | y}‖^2,
and equality holds only if
x̂(y) = E{X | y} = x_CM.
Furthermore,
E{X − x_CM} = E{X − E{X | Y}} = 0.
Question: xCM is optimal, but is it informative?
[Figure: two posterior densities with the CM and MAP estimates marked, panels (a) and (b).]
No estimate is foolproof. Optimality is subjective.
DISCRETIZED MODELS
Consider a linear model with additive noise,
y = Af + e,   f ∈ H,   y, e ∈ R^m.
Discretization, e.g. by collocation,
x^n = [f(p_1); f(p_2); . . . ; f(p_n)] ∈ R^n,
Af ≈ A_n x^n,   A_n ∈ R^{m×n}.
Assume that the discretization scheme is convergent,
lim_{n→∞} ‖Af − A_n x^n‖ = 0.
Accurate discrete model:
y = A_N x^N + e,   ‖A_N x^N − Af‖ < tol.
Stochastic extension: Y = A_N X^N + E,
where Y, X^N and E are random variables.
Passing to a coarse mesh. Possible reasons:
1. 2D and 3D applications, problems too large
2. Real time applications
3. Inverse modelling based on prescribed meshing
Coarse mesh model with n < N,
Af ≈ A_n x^n,   ‖A_n x^n − Af‖ > tol.
The stochastic extension of the simple reduced model is
Y = A_n X^n + E.
Inverse crime:
• Write
Y = A_n X^n + E,   (1)
and develop the inversion scheme based on this model,
• generate data with the simple reduced model and test the inversion method with these data.
Usually, inverse crime results are overly optimistic.
Questions:
1. How to model the discretization error?
2. How to model the prior information?
3. Is the inverse crime always significant?
PRIOR MODELLING
Assume a Gaussian model,
X^N ∼ N(x^N_0, Γ^N),
i.e., the prior density is
π_pr(x^N) ∝ exp(−(1/2)(x^N − x^N_0)^T (Γ^N)^{-1} (x^N − x^N_0)).
Projection (e.g. interpolation, averaging or downsampling),
P : R^N → R^n,   X^N ↦ X^n.
Then,
E{X^n} = E{P X^N} = P E{X^N} = P x^N_0,
E{X^n (X^n)^T} = E{P X^N (X^N)^T P^T} = P E{X^N (X^N)^T} P^T,
and therefore,
X^n ∼ N(x^n_0, Γ^n) = N(P x^N_0, P Γ^N P^T).
However, this is not what we normally do!
Example: H = continuous functions on [0, 1].
Discretization by multiresolution bases. Let
ϕ(t) = 1 if 0 ≤ t < 1,   and   ϕ(t) = 0 if t < 0 or t ≥ 1.
Define V^j, 0 ≤ j < ∞, V^j ⊂ V^{j+1}, by
V^j = span{ϕ^j_k | 1 ≤ k ≤ 2^j},
where
ϕ^j_k(t) = 2^{j/2} ϕ(2^j t − k + 1).
Discrete representation,
f^j(t) = Σ_{k=1}^{2^j} x^j_k ϕ^j_k(t) ∈ V^j.
Projector P : x^j ↦ x^{j−1},
P = (1/√2) I_{2^{j−1}} ⊗ [1 1]
  = (1/√2) [ 1 1 0 0 ⋯ 0 0
             0 0 1 1 ⋯ 0 0
                     ⋱
             0 0 0 0 ⋯ 1 1 ] ∈ R^{2^{j−1} × 2^j}.
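A small sketch of this coarsening matrix; the helper haar_projection below is a hypothetical name, and the construction assumes the pairwise 1/√2 weighting given above:

import numpy as np

def haar_projection(j):
    """Coarsening matrix P : R^(2^j) -> R^(2^(j-1)) that pairs up
    neighbouring coefficients with weight 1/sqrt(2)."""
    n_coarse, n_fine = 2**(j - 1), 2**j
    P = np.zeros((n_coarse, n_fine))
    for k in range(n_coarse):
        P[k, 2 * k] = P[k, 2 * k + 1] = 1.0 / np.sqrt(2.0)
    return P

P = haar_projection(3)                     # maps R^8 -> R^4
print(P.shape)                             # (4, 8)
print(np.allclose(P @ P.T, np.eye(4)))     # the rows are orthonormal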
Assume the prior information f ∈ C^2_0([0, 1]).
Second order smoothness prior of X^N, N = 2^j:
π_pr(x^N) ∝ exp(−(1/2) α ‖L^N x^N‖^2) = exp(−(1/2)(x^N)^T [α (L^N)^T L^N] x^N),
where
L^N = 2^{2j} [ −2  1  0  ⋯  0
                1 −2  1
                0  1 −2  ⋱
                         ⋱  1
                0  ⋯  1  −2 ] ∈ R^{N×N}.
The prior covariance is
Γ^N = [α (L^N)^T L^N]^{-1}.
Passing to level n = 2^{j−1} = N/2:
L^n = 2^{2(j−1)} [ −2  1  0  ⋯  0
                    1 −2  1
                    0  1 −2  ⋱
                             ⋱  1
                    0  ⋯  1  −2 ] = P L^N P^T ∈ R^{n×n}.
A natural candidate for the smoothness prior of X^n is
π_pr(x^n) ∝ exp(−(1/2) α ‖L^n x^n‖^2) = exp(−(1/2)(x^n)^T [α (L^n)^T L^n] x^n).
But this is inconsistent, since
Γ^n = [α (L^n)^T L^n]^{-1} ≠ P [α (L^N)^T L^N]^{-1} P^T = P Γ^N P^T.
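A small numerical check of this inconsistency, with illustrative values of j and α, and with L built as the tridiagonal matrix above:

import numpy as np

def second_diff(n, j):
    """Tridiagonal second-difference matrix scaled by 2^(2j) (= 1/h^2)."""
    L = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    return (2.0**(2 * j)) * L

j, alpha = 4, 1.0
N, n = 2**j, 2**(j - 1)

L_N, L_n = second_diff(N, j), second_diff(n, j - 1)
Gamma_N = np.linalg.inv(alpha * L_N.T @ L_N)
Gamma_n = np.linalg.inv(alpha * L_n.T @ L_n)

# Coarsening matrix P (pairwise sums scaled by 1/sqrt(2), as above)
P = np.kron(np.eye(n), np.ones((1, 2))) / np.sqrt(2.0)

Gamma_n_projected = P @ Gamma_N @ P.T
print(np.allclose(Gamma_n, Gamma_n_projected))     # False: the two coarse priors disagree
print(np.abs(Gamma_n - Gamma_n_projected).max())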
Numerical example:
Af(t) = ∫_0^1 K(t − s) f(s) ds,   K(s) = e^{−κ s^2},
where κ = 15. Sampling:
y_j = Af(t_j) + e_j,   t_j = (j − 1/2)/50,   1 ≤ j ≤ 50,
and
E ∼ N(0, σ^2 I),   σ = 2% of max(Af(t_j)).
Smoothness prior
π_pr(x^N) ∝ exp(−(1/2) α ‖L^N x^N‖^2),   N = 512.
Reduced model with n = 8.
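As a sketch of how the fine-mesh and reduced-model matrices could be assembled, the code below discretizes the convolution kernel with a midpoint quadrature rule; the quadrature choice is an assumption, since the lectures do not state how A was discretized:

import numpy as np

kappa, m = 15.0, 50
t = (np.arange(1, m + 1) - 0.5) / m          # measurement points t_j

def forward_matrix(n_nodes):
    """Midpoint-rule discretization of (Af)(t) = int_0^1 exp(-kappa (t-s)^2) f(s) ds."""
    s = (np.arange(n_nodes) + 0.5) / n_nodes  # quadrature nodes
    h = 1.0 / n_nodes                          # quadrature weight
    return np.exp(-kappa * (t[:, None] - s[None, :])**2) * h

A_N = forward_matrix(512)    # accurate model
A_n = forward_matrix(8)      # reduced model
print(A_N.shape, A_n.shape)  # (50, 512) (50, 8)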
Figure 1: MAP estimate with N = 512, n = 8. Black dots correspond to the coarse-level prior covariance [α (L^n)^T L^n]^{-1}, red dots to the projected prior covariance P Γ^N P^T.
DISCRETIZATION ERROR
From fine mesh to coarse mesh: Complete error model
Y = A_N X^N + E   (2)
  = A_n X^n + (A_N − A_n P) X^N + E
  = A_n X^n + E_discr + E.
Error covariance: Assume that E and X^N are mutually independent,
E ∼ N(0, Γ_e),   X^N ∼ N(x^N_0, Γ^N).
The complete error Ẽ = E_discr + E is Gaussian,
Ẽ ∼ N(ẽ_0, Γ̃_e),
where
ẽ_0 = (A_N − A_n P) x^N_0,
Γ̃_e = (A_N − A_n P) Γ^N (A_N − A_n P)^T + Γ_e.
Error variance:
var(Ẽ) = E{‖Ẽ − ẽ_0‖^2} = E{‖E_discr − ẽ_0‖^2} + E{‖E‖^2}
       = trace((A_N − A_n P) Γ^N (A_N − A_n P)^T) + trace(Γ_e)
       = var(E_discr) + var(E).
The complete error model is noise dominated if
var(E_discr) < var(E),
and modelling error dominated if
var(E_discr) > var(E).
Enhanced error model: Use the likelihood and prior
π(y | x^n) ∝ exp(−(1/2)(y − A_n x^n − y_0)^T Γ̃_e^{-1} (y − A_n x^n − y_0)),
π_pr(x^n) ∝ exp(−(1/2)(x^n − x^n_0)^T (Γ^n_pr)^{-1} (x^n − x^n_0)),
where
y_0 = E{Y} = A_n E{X^n} + ẽ_0 = A_n P x^N_0 + (A_N − A_n P) x^N_0 = A_N x^N_0.
The MAP estimate, denoted by x^n_eem, is
x^n_eem = argmin { ‖L^n_pr (x^n − x^n_0)‖^2 + ‖L_e (A_n x^n − (y − y_0))‖^2 }
        = argmin ‖ [L^n_pr; L_e A_n] x^n − [L^n_pr x^n_0; L_e (y − y_0)] ‖^2,
where
L^n_pr = chol((Γ^n_pr)^{-1}),   L_e = chol(Γ̃_e^{-1}).
This leads to normal equations of size n × n.
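A compact sketch of this estimate; the forward matrix, the projection P and the priors below are illustrative stand-ins, and the Cholesky factors implement the whitening described above:

import numpy as np

rng = np.random.default_rng(4)

# Illustrative sizes: fine mesh N, coarse mesh n, m data points; a random matrix
# stands in for the accurate forward map A_N
N, n, m = 64, 8, 20
A_N = rng.standard_normal((m, N)) / N
E_interp = np.kron(np.eye(n), np.ones((N // n, 1)))        # piecewise-constant interpolation
P = np.kron(np.eye(n), np.ones((1, N // n))) / (N // n)    # block-averaging projection
A_n = A_N @ E_interp                                        # a simple reduced model

# Fine-mesh smoothness prior X^N ~ N(0, (L^T L)^{-1}) and white noise (illustrative)
L = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
Gamma_N = np.linalg.inv(L.T @ L)
x_N0 = np.zeros(N)
sigma = 1e-2
Gamma_e = sigma**2 * np.eye(m)

# Complete error covariance and data mean (here y0 = A_N x_N0 = 0)
D = A_N - A_n @ P
Gamma_tilde = D @ Gamma_N @ D.T + Gamma_e
y0 = A_N @ x_N0

# Coarse prior obtained by projecting the fine-mesh prior
Gamma_n, x_n0 = P @ Gamma_N @ P.T, P @ x_N0

# Simulate data with the accurate model (x_true = L^{-1} w has covariance Gamma_N)
x_true = np.linalg.solve(L, rng.standard_normal(N))
y = A_N @ x_true + sigma * rng.standard_normal(m)

# Whitening factors and the stacked least-squares form of the MAP estimate
L_pr = np.linalg.cholesky(np.linalg.inv(Gamma_n)).T
L_e = np.linalg.cholesky(np.linalg.inv(Gamma_tilde)).T
K = np.vstack([L_pr, L_e @ A_n])
b = np.concatenate([L_pr @ x_n0, L_e @ (y - y0)])
x_eem = np.linalg.lstsq(K, b, rcond=None)[0]
print(x_eem)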
Note: The enhanced error model is not the complete error model, because X^n is correlated with the complete error Ẽ through X^N.
Complete error model: Assume, for the moment, that X^n and Y have zero mean. We have
X^n = P X^N,   Y = A_N X^N + E.
The variable Z = [X^n; Y] is Gaussian, with zero mean and covariance
E{Z Z^T} = [ E{X^n (X^n)^T}   E{X^n Y^T}
             E{Y (X^n)^T}     E{Y Y^T}   ]
         = [ P Γ^N P^T        P Γ^N A_N^T
             A_N Γ^N P^T      A_N Γ^N A_N^T + Γ_e ].
From this, calculate the conditional density π(x^n | y):
π(x^n | y) ∼ N(x^n_cem, Γ^n_cem),
where
x^n_cem = P x^N_0 + P Γ^N_pr A_N^T [A_N Γ^N_pr A_N^T + Γ_e]^{-1} (y − A_N x^N_0),
and
Γ^n_cem = P Γ^N_pr P^T − P Γ^N_pr A_N^T [A_N Γ^N_pr A_N^T + Γ_e]^{-1} A_N Γ^N_pr P^T.
Note: The computation of x^n_cem requires solving an m × m system, independently of n. (Compare to x^n_eem.)
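A sketch of the zero-mean version of these formulas with illustrative matrices; note that only an m × m system is solved:

import numpy as np

rng = np.random.default_rng(5)

# Zero-mean illustrative setup, same flavour as above: fine mesh N, coarse mesh n, m data
N, n, m = 64, 8, 20
A_N = rng.standard_normal((m, N)) / N
P = np.kron(np.eye(n), np.ones((1, N // n))) / (N // n)

L = -2.0 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)
Gamma_N = np.linalg.inv(L.T @ L)
sigma = 1e-2
Gamma_e = sigma**2 * np.eye(m)

x_true = np.linalg.solve(L, rng.standard_normal(N))   # a draw from N(0, Gamma_N)
y = A_N @ x_true + sigma * rng.standard_normal(m)

# Conditional mean and covariance of X^n given Y = y: only an m x m system is solved
S = A_N @ Gamma_N @ A_N.T + Gamma_e
G = P @ Gamma_N @ A_N.T                               # cross covariance of X^n and Y
x_cem = G @ np.linalg.solve(S, y)
Gamma_cem = P @ Gamma_N @ P.T - G @ np.linalg.solve(S, G.T)

print(np.linalg.norm(x_cem - P @ x_true))             # error of the coarse-mesh CM estimate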
Example: Full angle tomography.
Figure 2: True object and the discretized model (X-ray source and detector geometry).
Intensity decrease along a line segment dℓ:
dI = −I µ dℓ,
where µ = µ(p) ≥ 0, p ∈ Ω, is the mass absorption.
Let I_0 be the intensity of the transmitted X-ray. The received intensity I satisfies
log(I / I_0) = ∫_{I_0}^{I} dI / I = −∫_ℓ µ(p) dℓ(p).
Inverse problem of X-ray tomography: Estimate µ : Ω → R_+ from the values of its integrals along a set of straight lines passing through Ω.
Figure 3: Sinogram data.
Gaussian structural smoothness prior: Three weakly correlated subregions. Inside each region, the pixels are mutually correlated.
Figure 4: Prior geometry.
Construction of the prior: Pixel centers p_j, 1 ≤ j ≤ N.
Divide the pixels into cliques C_1, C_2 and C_3. In medical imaging, this is called image segmentation.
Define the neighbourhood system N = {N_i | 1 ≤ i ≤ N}, N_i ⊂ {1, 2, . . . , N}, where
j ∈ N_i if and only if the pixels p_i and p_j are neighbours and belong to the same clique.
Define the density of a Markov random field X as
π_MRF(x) ∝ exp(−(1/2) α Σ_{j=1}^N | x_j − c_j Σ_{i∈N_j} x_i |^2) = exp(−(1/2) α x^T B x),
where the coupling constant c_j depends on the clique.
The matrix B is singular.
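A small sketch of the matrix B for a one-dimensional chain of pixels with the coupling c_j = 1/|N_j| (both the chain geometry and this coupling are illustrative choices); it confirms that constant images lie in the null space, so B is singular:

import numpy as np

npix = 10
# 1D chain of pixels: neighbours are the adjacent pixels (all in one clique)
neighbours = [[i for i in (j - 1, j + 1) if 0 <= i < npix] for j in range(npix)]

# Coupling c_j = 1/|N_j|, so each term is x_j minus the average of its neighbours
M = np.eye(npix)
for j, Nj in enumerate(neighbours):
    for i in Nj:
        M[j, i] -= 1.0 / len(Nj)

B = M.T @ M                                 # pi_MRF(x) ∝ exp(-alpha/2 * x^T B x)
print(np.linalg.matrix_rank(B))             # npix - 1: B is singular
print(np.allclose(B @ np.ones(npix), 0))    # constant vectors are in the null space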
Remedy: Select a few points {p_j | j ∈ I′′}, where I′′ ⊂ I = {1, 2, . . . , N}. Let I′ = I \ I′′.
Denote x = [x′; x′′].
The conditional density π_MRF(x′ | x′′) (i.e., x′′ fixed) is a proper measure with respect to x′.
Define
π_pr(x) = π_MRF(x′ | x′′) π_0(x′′),
where π_0 is Gaussian, e.g.,
π_0 ∼ N(0, γ_0^2 I).
Figure 5: Four random draws from the prior density.
Data generated on an N = 84 × 84 mesh, inverse solutions computed on an n = 42 × 42 mesh.
Proper data y ∈ R^m and inverse crime data y_ic ∈ R^m:
y = A_N x^N_true + e,   y_ic = A_n P x^N_true + e,
where x^N_true is drawn from the prior density and e is a realization of
E ∼ N(0, σ^2 I),
where
σ^2 = κ m^{-1} trace((A_N − A_n P) Γ^N (A_N − A_n P)^T),   0.1 ≤ κ ≤ 10.
In other words,
0.1 ≤ κ = (noise variance) / (discretization error variance) ≤ 10.
What is the structure of the discretization error? Can we approximate it by Gaussian white noise?
Figure 6: The diagonal Γ_A(k, k) and the first off-diagonal Γ_A(k, k+1) of the discretization error covariance, plotted against the projection number.
Error analysis:
1. Draw a sample x^N_1, x^N_2, . . . , x^N_S, S = 500, from the prior density.
2. Choose the noise level σ = σ(κ) and generate data y_1(κ), y_2(κ), . . . , y_S(κ), both the proper and the inverse crime versions.
3. Calculate the estimates x̂(y_1(κ)), x̂(y_2(κ)), . . . , x̂(y_S(κ)).
4. Estimate the estimation error,
E{‖X − X̂(κ)‖^2} ≈ (1/S) Σ_{j=1}^S ‖x̂(y_j(κ)) − x_j‖^2.
Estimators: CM, CM with the enhanced error model, and truncated CGNR with the Morozov discrepancy principle, discrepancy
δ^2 = τ E{‖E‖^2} = τ m σ(κ)^2,   τ = 1.1.
Figure 7: Estimation errors ‖x̂ − x‖^2 at various noise levels for CG, CG with inverse crime data (CG IC), CM, and CM with the error correction (CM Corr). The dashed line is var(E_discr).
[Figures: reconstructions at error levels 0.0029247, 0.0047491, 0.0060516, 0.0077115 and 0.11093.]
Example: Estimation error: If x̂ = x̂(y) is an estimator, define the relative estimation error as
D(x̂) = E{‖X − X̂‖^2} / E{‖X‖^2}.
Observe: D(0) = 1, and
D(x̂_CM) ≤ D(x̂)
for any estimator x̂.
Test case: Limited-angle tomography. Reconstructions with truncated singular value decomposition (TSVD) versus the CM estimate.
Calculate D(x̂_TSVD) and D(x̂_CM) by ensemble averaging (S = 500).
TSVD estimate: y = Ax + e.
SVD decomposition: A = U D V^T, where
U = [u_1, u_2, . . . , u_m] ∈ R^{m×m},   V = [v_1, v_2, . . . , v_n] ∈ R^{n×n},
and
D = diag(d_1, d_2, . . . , d_{min(n,m)}) ∈ R^{m×n},   d_1 ≥ d_2 ≥ . . . ≥ d_{min(n,m)} ≥ 0.
x_TSVD(y, r) = Σ_{j=1}^r (1/d_j)(u_j^T y) v_j,
and the truncation parameter r is chosen, e.g., by the Morozov discrepancy principle,
‖y − A x_TSVD(y, r)‖^2 ≤ τ E{‖E‖^2} < ‖y − A x_TSVD(y, r − 1)‖^2.
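A sketch of this estimator with the discrepancy-based choice of r, assuming white noise E ∼ N(0, σ^2 I); the test problem is an illustrative random ill-conditioned matrix, not the tomography model:

import numpy as np

def tsvd_morozov(A, y, noise_var, tau=1.1):
    """Truncated SVD estimate with the truncation level r chosen as the smallest r
    whose discrepancy ||y - A x_r||^2 falls below tau * E||E||^2."""
    m, n = A.shape
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    delta2 = tau * m * noise_var            # tau * E||E||^2 for E ~ N(0, noise_var * I)
    x = np.zeros(n)
    for r in range(len(d)):
        x = x + (U[:, r] @ y / d[r]) * Vt[r]
        if np.sum((y - A @ x)**2) <= delta2:
            return x, r + 1
    return x, len(d)

# Illustrative use on a random ill-conditioned problem
rng = np.random.default_rng(6)
m, n = 40, 60
A = rng.standard_normal((m, n)) @ np.diag(0.5**np.arange(n)) @ rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
sigma = 1e-3
y = A @ x_true + sigma * rng.standard_normal(m)

x_tsvd, r = tsvd_morozov(A, y, sigma**2)
print("truncation level r =", r)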
[Figures: reconstructions and histograms of the estimation error ‖x̂ − x‖^2 (density versus ‖x̂ − x‖^2) for the CM and TSVD estimates.]
CONCLUSIONS
• The Bayesian approach is useful for incorporating complex prior information into inverse solvers.
• It is not a method for producing a single estimator, although it can be used as a tool for that, too.
• It facilitates error analysis of discretization, modelling and estimation by deterministic methods.
• Working with ensembles makes it possible to analyze non-linear problems as well (e.g. EIT, OAST).