INFORMATION THEORETIC MEASURES AND THEIR APPLICATIONS TO IMAGE REGISTRATION AND SEGMENTATION
By
FEI WANG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2006
Copyright 2006
by
Fei Wang
For my wife, Lin, and my parents.
ACKNOWLEDGMENTS
I would like to first thank my advisor, Dr. Baba C. Vemuri, for everything he has
done for me during my doctoral study. This dissertation would not have taken shape
without his invaluable input. Dr. Vemuri introduced me to the field of medical image
analysis. His insight and experience have guided me throughout my research during
which time he provided numerous invaluable suggestions. It was a great pleasure
for me to conduct this dissertation under his supervision. I would also like to thank
Dr. Anand Rangarajan, Dr. Sartaj Sahni, Dr. Arunava Banerjee and Dr. Tan Wong for their
willingness to serve on my committee. In addition, special thanks go to Dr. Jorg Peters
for attending my PhD oral examination.
My doctoral research has been a happy collaboration with many people. Dr. Vemuri was
involved throughout the whole process; Dr. Rangarajan guided me extensively on the
groupwise point registration work; and Dr. Ilona Schmalfuss and Dr. Stephan Eisenschenk
kindly provided the data for hippocampal segmentation and taught me what little I know of
neuroscience. I have also benefitted from Dr. Thomas E. Davis's guidance when I first
joined the lab. I would also like to thank Dr. Banerjee for stimulating debates, and Dr.
Jeffrey Ho for his professional advice and philosophical discussions. Thanks also go
to Drs. Murali Rao and Yunmei Chen, my co-authors on a set of papers that introduced
the concept of entropy based on probability distributions and established several of its
properties.
Needless to say, I am grateful for the support of my colleagues and friends at the
Computer and Information Science and Engineering Department at the University of
Florida. Dr. Zhizhou Wang, Dr. Jundong Liu, Dr. Tim McGraw, Dr. Eric Spellman, Bing
Jian, Santhosh Kodipaka, Nicholas Lord, Neeti Vohra, Angelos Barmpoutis, Seniha Esen
Yuksel, Özlem Subakan, Ritwik Kumar, Evren Özarslan, Ajit Rajwade, Adrian Peter, Dr.
Jie Zhang and Dr. Hongyu Guo all deserve thanks.
And finally, most importantly, I thank my family. I thank my mother and father
for everything and my brother, too. And of course I thank my dearest Lin for her
understanding and love during the past few years. Their support and encouragement are
my source of strength.
This research was supported in part by the grants NIH RO1-NS42075 and NIH
R01-NS046812. I would also like to acknowledge travel support (for attending various
conferences to present research papers) from the IEEE Computer Society, the Department
of Computer and Information Science and Engineering and the College of Engineering of
the University of Florida.
TABLE OF CONTENTS

                                                                            page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  ix
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   xi

CHAPTER

1 INTRODUCTION . . . 1
  1.1 Image and Point-set Registration . . . 1
    1.1.1 Image Registration . . . 1
    1.1.2 Groupwise Point-sets Registration . . . 3
  1.2 Image Segmentation . . . 4
  1.3 Outline of Remainder . . . 5

2 ENTROPY AND RELATED MEASURES . . . 6
  2.1 Shannon Entropy and Related Measures . . . 6
  2.2 Cumulative Residual Entropy: A New Measure of Information . . . 8
  2.3 Properties of CRE . . . 10
    2.3.1 CRE and Empirical CRE . . . 11
    2.3.2 Robustness of CRE . . . 12

3 APPLICATIONS TO MULTIMODALITY IMAGE REGISTRATION . . . 14
  3.1 Related Work . . . 14
  3.2 Multimodal Image Registration using CCRE . . . 17
    3.2.1 Transformation Model for Non-rigid Motion . . . 21
    3.2.2 Measure Optimization . . . 21
    3.2.3 Computation of P(i > λ, k; µ) and ∂P(i > λ, k; µ)/∂µ . . . 23
    3.2.4 Algorithm Summary . . . 25
  3.3 Implementation Results . . . 25
    3.3.1 Synthetic Motion Experiments . . . 26
      3.3.1.1 Convergence speed . . . 26
      3.3.1.2 Registration accuracy . . . 28
      3.3.1.3 Noise immunity . . . 29
      3.3.1.4 Partial overlap . . . 30
    3.3.2 Real Data Experiments . . . 31

4 DIVERGENCE MEASURES FOR GROUPWISE POINT-SETS REGISTRATION . . . 34
  4.1 Previous Work . . . 36
  4.2 Divergence Measures . . . 38
    4.2.1 Jensen-Shannon Divergence . . . 38
    4.2.2 CDF-JS Divergence . . . 40
  4.3 Methodology . . . 42
    4.3.1 Energy Function for Groupwise Point-sets Registration . . . 43
    4.3.2 JS Divergence in a Hypothesis Testing Framework . . . 44
    4.3.3 Unbiasness Property of the Divergence Measures . . . 45
    4.3.4 Estimating JS and its Derivative . . . 47
      4.3.4.1 Finite mixture models . . . 47
      4.3.4.2 Optimizing the JS divergence . . . 49
    4.3.5 Estimating CDF-JS and its Derivative . . . 50
      4.3.5.1 Optimizing the CDF-JS divergence . . . 52
  4.4 Experiment Results . . . 53
    4.4.1 JS Divergence Results . . . 53
      4.4.1.1 Alignment results . . . 53
      4.4.1.2 Atlas construction results . . . 55
    4.4.2 CDF-JS Divergence Results . . . 56

5 APPLICATIONS TO IMAGE SEGMENTATION . . . 59
  5.1 Related Work . . . 59
  5.2 Registration+Segmentation Model . . . 60
    5.2.1 Gradient flows . . . 63
    5.2.2 Algorithm Summary . . . 66
  5.3 Results . . . 66

6 CONCLUSIONS AND FUTURE WORK . . . 72
  6.1 Contributions of the Dissertation . . . 72
  6.2 Image and Point-sets Registration . . . 72
    6.2.1 Non-rigid Image Registration . . . 72
    6.2.2 Groupwise Point-sets Registration . . . 73
  6.3 Image Segmentation . . . 74

REFERENCES . . . 76

BIOGRAPHICAL SKETCH . . . 83
LIST OF TABLES

Table                                                                       page

3–1 Comparison of the registration results between CCRE and MI for a fixed synthetic deformation field . . . 30

3–2 Comparison of total time taken to achieve registration by CCRE with MI . . . 31

3–3 Comparison of the value S of several brain structures for CCRE and MI . . . 33

5–1 Statistics of the error in estimated non-rigid deformation . . . 68
LIST OF FIGURES

Figure                                                                      page

1–1 Illustration of groupwise registration of corpus callosum point-sets manually extracted from the outer contours of the brain images . . . 4

3–1 CCRE, MI and NMI traces plotted for the misaligned MR & CT image pair . . . 20

3–2 Comparison of convergence speed between CCRE and MI . . . 27

3–3 Plot demonstrating the change of Mean Deformation Error for CCRE and MI registration results with time . . . 28

3–4 Results of application of our algorithm to synthetic data (see text for details) . . . 28

3–5 Registration results of MR T1 and T2 image slice with large non-overlap . . . 30

3–6 Registration results of different subjects of MR & CT brain data with real non-rigid motion (see text for details) . . . 32

4–1 Illustration of corpus callosum point-sets represented as density functions . . . 35

4–2 Results of rigid registration in the noiseless case; 'o' and '+' indicate the model and scene points respectively . . . 54

4–3 Non-rigid registration of the corpus callosum point-sets . . . 54

4–4 Experiment results on seven 2D corpus callosum point-sets . . . 55

4–5 Robustness to outliers in the presence of large noise . . . 57

4–6 Robustness test on 3D swan data . . . 57

4–7 Atlas construction from four 3D hippocampal point-sets . . . 58

5–1 Model illustration . . . 61

5–2 Illustration of the various terms in the evolution of the level set function φ . . . 65

5–3 Results of application of our algorithm to synthetic data . . . 67

5–4 Results of application of our algorithm to a pair of slices from human brain MRIs . . . 69

5–5 Corpus callosum segmentation on a pair of corresponding slices from distinct subjects . . . 70

5–6 Hippocampal segmentation using our algorithm on a pair of brain scans from distinct subjects . . . 71
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
INFORMATION THEORETIC MEASURES AND THEIR APPLICATIONS TOIMAGE REGISTRATION AND SEGMENTATION
By
Fei Wang
August 2006
Chair: Baba C. Vemuri
Major Department: Computer and Information Science and Engineering
Information theory has played a fundamental role in many fields of science and
engineering, including computer vision and medical imaging. In this dissertation, we
present various information theoretic measures and use them to solve several important
problems in medical imaging, namely image registration, point-set registration and
image segmentation.
To measure the information content in a random variable, we first present a novel
measure based on its cumulative distribution that is dubbed Cumulative Residual Entropy
(CRE). This measure parallels the well-known Shannon entropy but has the following
advantages: (1) it is more general than the Shannon entropy as its definition is valid
in the discrete and continuous domains, (2) it possesses more general mathematical
properties and (3) it can be easily computed from sample data and these computations
asymptotically converge to the true values. Based on CRE, we define the cross-CRE
(CCRE) between two random variables, and apply it to solve the image alignment
problem for parameterized transformations. The key strengths of CCRE over the now
popular Mutual Information (based on Shannon's entropy) between the images being
aligned are that the former has a significantly larger tolerance to noise and a much larger
convergence range over the field of parameterized transformations.
The Jensen-Shannon (JS) divergence has long been known as a measure of cohesion
between multiple probability densities. Similar to the idea of defining an entropy measure
based on distributions, we derive a JS divergence based on probability distributions and
dub it the CDF-JS divergence. We then apply the JS and CDF-JS divergences to
the groupwise point-set registration problem, which involves simultaneously registering
multiple shapes (represented as point-sets) for constructing an atlas. Estimating a
meaningful average or mean shape from a set of shapes represented by unlabeled point-
sets is a challenging problem, since this usually involves solving for point correspondence
under a non-rigid motion setting. The novel and robust algorithm we propose avoids the
correspondence problem by minimizing the CDF-JS/JS divergence between the point-sets
represented as probability distribution/density functions. The cost functions are fully
symmetric, with no bias toward any of the given shapes being registered, whose mean
shape is sought. We empirically show that CDF-JS is more robust to noise and outliers
than JS divergence. Our algorithm can be especially useful for creating atlases of various
shapes present in images as well as for simultaneously registering 3D range data sets
without having to establish any correspondence.
In the context of image segmentation, we developed a novel model-based segmentation
technique that segments novel 3D image data by non-rigidly registering an atlas to it.
The key contribution here is a novel variational formulation of the registration-assisted
image segmentation task, which leads to a coupled set of nonlinear PDEs that are solved using
efficient numerical schemes. Our segmentation algorithm is a departure from earlier
methods in that we have a unified variational principle wherein non-rigid registration and
segmentation are simultaneously achieved; unlike previous solutions to this problem, our
algorithm can accommodate image pairs with very distinct intensity distributions.
CHAPTER 1
INTRODUCTION
In 1948, motivated by the problem of efficiently transmitting information over
a noisy communication channel, Claude Shannon introduced a revolutionary new
probabilistic way of thinking about communication and simultaneously created the first
truly mathematical theory of entropy. His ideas created a sensation and were rapidly
developed to create the field of information theory, which employs probability and
ergodic theory to study the statistical characteristics of data and communication systems.
Since then, information theory has played a fundamental role in many fields of science
and engineering including computer vision and medical imaging. In this dissertation, we
endeavor to develop novel information theoretic methods with the application to medical
image analysis.
We examine two applications in particular, image (point-set) registration and
image segmentation. In the first of these applications, we follow a promising avenue of
work in using a probability density or distribution function as the signature of a given
“object” (image or point-set). Then by optimizing certain information theoretic measures
between these functions, we achieve the desired registration. In the segmentation
application, we consider an atlas based approach, in which segmentation and registration
are simultaneously achieved by solving a novel variational principle.
1.1 Image and Point-set Registration
We start with the image registration problem and then move on to the point-set
registration.
1.1.1 Image Registration
The image registration problem is defined as follows: given a pair of images
I1(x, y) and I2(x′, y′), where (x′, y′)^t = T(x, y)^t and T is the matrix corresponding
to the unknown parameterized transformation to be determined, define a match metric
M(I1(x, y), I2(x′, y′)) and optimize M over all T.
The fundamental characteristic of any image registration technique is the type of
spatial transformation or mapping used to properly overlay two images. The transformation can be classified into global and local transformations. A global transformation is
given by a single equation which maps the entire image. Local transformations map the
image differently depending on the spatial location and are thus more difficult to express
succinctly. The most common global transformations are rigid, affine and projective
transformations.
A transformation is called rigid if the distance between points in the image being
transformed is preserved. A rigid transformation can be expressed as

    u(x, y) = (cos(φ) x − sin(φ) y + dx) − x
    v(x, y) = (sin(φ) x + cos(φ) y + dy) − y        (1–1)

where u(x, y) and v(x, y) denote the displacement at point (x, y) along the X and Y
directions, φ is the rotation angle, and (dx, dy) is the translation vector.
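The displacement field of Eq. (1–1) can be written down directly. The following small Python sketch (our own illustration, not code from the dissertation) evaluates it at a point; the function name is hypothetical:

```python
import math

def rigid_displacement(x, y, phi, dx, dy):
    """Displacement (u, v) at point (x, y) under a 2D rigid transform
    with rotation angle phi and translation (dx, dy), as in Eq. (1-1)."""
    u = (math.cos(phi) * x - math.sin(phi) * y + dx) - x
    v = (math.sin(phi) * x + math.cos(phi) * y + dy) - y
    return u, v

# With zero rotation, the displacement is the constant translation:
assert rigid_displacement(3.0, 4.0, 0.0, 1.0, -2.0) == (1.0, -2.0)
```

As the text notes, the defining property of a rigid map is that it preserves the distance between any two transformed points.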
A transformation is called affine when any straight line in the first image is mapped
onto a straight line in the second image with parallelism being preserved. In 2D, the
affine transformation can be expressed as

    u(x, y) = (a11 x + a12 y + dx) − x
    v(x, y) = (a21 x + a22 y + dy) − y        (1–2)

where (a11 a12; a21 a22) denotes an arbitrary real-valued matrix. The scaling transformation,
which has transformation matrix (s1 0; 0 s2), and the shearing transformation, which has
matrix (1 s3; 0 1), are two examples of affine transformations, where s1, s2 and s3 are
positive real numbers.
A more interesting case, in general, is that of a planar surface in motion viewed
through a pinhole camera. This motion can be described as a 2D projective transformation
of the plane:

    u(x, y) = (a0 x + a1 y + a2) / (a6 x + a7 y + 1) − x
    v(x, y) = (a3 x + a4 y + a5) / (a6 x + a7 y + 1) − y        (1–3)

where a0, ..., a7 are the global parameters.
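The distinguishing feature of Eq. (1–3) relative to the affine case is the projective normalization by the denominator a6 x + a7 y + 1. A minimal sketch (our own illustration; the function name is an assumption):

```python
def projective_displacement(x, y, a):
    """Displacement (u, v) at (x, y) under the 2D projective map of
    Eq. (1-3); a = (a0, ..., a7) are the eight global parameters."""
    w = a[6] * x + a[7] * y + 1.0          # projective normalization
    u = (a[0] * x + a[1] * y + a[2]) / w - x
    v = (a[3] * x + a[4] * y + a[5]) / w - y
    return u, v

# When a6 = a7 = 0, the denominator is 1 and the map reduces to an
# affine transformation; here a pure translation by (5, -1):
assert projective_displacement(2.0, 3.0, (1, 0, 5, 0, 1, -1, 0, 0)) == (5.0, -1.0)
```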
When a global transformation does not adequately explain the relationship of a
pair of input images, a local transformation may be necessary. Registering an image
pair obtained at different times with some portion of the body experiencing growth,
or registering two images from different patients, fall into this local transformation
registration category. A motion field is usually used to describe the change in local
transformation problem.
1.1.2 Groupwise Point-sets Registration
Point-set representations of image data, e.g., feature points, are commonly used
in many applications and the problem of registering them frequently arises in a variety
of these application domains. Extensive studies on the point set registration and related
problems can be found in a rich literature covering both theoretical and practical issues
relating to computer vision and pattern recognition.
We are given N point-sets, denoted X^p, p ∈ {1, ..., N}; each point-set X^p
consists of points x^p_i ∈ R^D, i ∈ {1, ..., n_p}, where n_p is the number of points contained
in point-set X^p. The task of multiple point pattern matching, or point-set registration, is
either to establish a consistent point-to-point correspondence between these point-sets or
to recover the spatial transformation which yields the best alignment. For example, we are
given a group of corpus callosum point-sets from brain image scans, shown
in the left column of Figure 1–1. All the point-sets are registered simultaneously to the
point-sets shown in the right column in a symmetric manner, meaning that the registration
result is not biased toward any of the original point-sets. We will discuss these issues in
greater detail in Chapter 4.
Figure 1–1:Illustration of groupwise registration of corpus callosum point-sets manuallyextracted from the outer contours of the brain images.
1.2 Image Segmentation
Image segmentation plays a crucial role in many medical imaging applications by
automating or facilitating the delineation of anatomical structures. The segmentation of
structures from 2D and 3D images is an important first step in analyzing medical data. For
example, it is necessary to segment the brain in an MR image, before it can be rendered in
3D for visualization purposes. Segmentation can also be used to automatically detect the
head and abdomen of a fetus from an ultrasound image. The boundaries can then be used
to get quantitative estimates of organ sizes and provide aid in any necessary diagnoses.
Another important application is registration. It may be easier, or at least less error prone
to segment objects in multiple images prior to registration. This is especially true in
images from different modalities such as CT and MRI.
Image-guided surgery is another important application that needs image segmen-
tation. Recent advances in technology have made it possible to acquire images of the
patient while the surgery is in-progress. The goal is then to segment relevant regions of
interest and overlay them on an image of the patient to help guide the surgeon in his/her
work.
Segmentation is therefore a very important task in medical imaging. However,
manual segmentation is not only a tedious and time consuming process, but is also
inaccurate. Segmentations performed by experts have been shown to vary by up to 20%. It is therefore
desirable to use algorithms that are accurate and require as little user interaction as
possible.
1.3 Outline of Remainder
In the next chapter, we rigorously define a novel measure of information in a random
variable based on its cumulative distribution, which we dub cumulative residual entropy
(CRE). We also connect the measure to the mean residual life function of reliability
engineering. Thereafter follows a chapter on using the measure for multimodal image
registration. In Chapter 4, we present a simultaneous groupwise point-sets registration
and atlas construction algorithm, in which we minimize the proposed divergence
measures between point sets represented as probability densities or distributions. Based
on these new measures, we propose a novel variational principle in Chapter 5 for
solving the registration assisted image segmentation problem. Lastly, we end with some
concluding points and thoughts for future work.
CHAPTER 2
ENTROPY AND RELATED MEASURES
2.1 Shannon Entropy and Related Measures
The concept of entropy is central to the field of information theory and was originally
introduced by Shannon in his seminal paper [1] in the context of communication
theory. The entropy Shannon proposed is a measure of uncertainty in a discrete distribution,
based on the Boltzmann entropy of classical statistical mechanics. The Shannon
entropy of a discrete distribution F is defined by

    H(F) = −Σ_i p_i log p_i.        (2–1)

Since then, this concept and variants thereof have been extensively utilized in numerous
applications of science and engineering, including financial analysis [2], data
compression [3], statistics [4], and information theory [5].
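Definition (2–1) translates directly into code. A small sketch (our own illustration, not from the dissertation; the function name and choice of logarithm base are assumptions):

```python
import math

def shannon_entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log p_i of a discrete
    distribution given as a sequence of probabilities, Eq. (2-1).
    Terms with p_i = 0 contribute nothing, by the usual convention."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair coin carries exactly one bit of uncertainty:
assert abs(shannon_entropy([0.5, 0.5]) - 1.0) < 1e-12
```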
This measure of uncertainty has many important properties which agree with
our intuitive notion of randomness. We mention three: (1) it is always non-negative; (2) it
vanishes if and only if the outcome is certain; (3) entropy is increased by the addition of
an independent component, and decreased by conditioning. However, extending this
notion to continuous distributions poses some challenges. A straightforward extension of
the discrete case to a continuous distribution F with density f, called the differential entropy,
reads

    H(F) = −∫ f(x) log f(x) dx.        (2–2)
However, this definition raises the following concerns.

1. It is defined in terms of the density of the random variable, which in general may or
may not exist, e.g., when the cumulative distribution function (cdf) is not differentiable.
It would not be possible to define the entropy of a random variable whose density
function is undefined.

2. The Shannon entropy of a discrete distribution is always non-negative, while the
differential entropy of a continuous variable may take any value on the extended real
line.

3. Shannon entropy computed from samples of a random variable lacks the property
of convergence to the differential entropy; i.e., even when the sample size goes to
infinity, the Shannon entropy estimated from these samples does not converge to the
differential entropy [5]. As a consequence, it is in general impossible to approximate
the differential entropy of a continuous variable using the entropy of empirical
distributions.

4. Consider the following situation. Suppose X and Y are two discrete random
variables representing the heights of a group of people, with X taking on the values
{5.1, 5.2, 5.3, 5.4, 5.5}, each with probability 1/5, and Y taking on the values
{5.1, 5.2, 5.3, 5.4, 7.5} (with Yao Ming in this group), again each with probability
1/5. The information content in these two random variables as measured by Shannon
entropy is the same, i.e., Shannon entropy does not bring out any difference between
the two cases. However, if the two random variables represented the winning chances
in a basketball game, the information content in the two random variables would be
considered dramatically different. Nevertheless, Shannon entropy fails to make
any distinction whatsoever between them. For additional discussion of some of these
issues, the reader is referred to [6].
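The height example can be made concrete numerically. In the sketch below (our own illustration; `discrete_cre` is a helper we introduce here, computing the CRE of Section 2.2 for a discrete variable uniform over its support, using the natural logarithm), Shannon entropy sees only the probabilities and so cannot distinguish X from Y, while the distribution-based CRE responds strongly to the 7.5 outlier:

```python
import math

def shannon_entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def discrete_cre(values):
    """CRE of a discrete r.v. uniform over `values` (all non-negative):
    E = -integral of F(l) log F(l) dl, where the survival function
    F(l) = P(X > l) is piecewise constant between sorted support points."""
    xs = sorted(values)
    n = len(xs)
    cre = 0.0
    for i in range(n - 1):
        surv = 1.0 - (i + 1) / n           # P(X > xs[i]) on (xs[i], xs[i+1])
        cre += (xs[i + 1] - xs[i]) * (-surv * math.log(surv))
    return cre

X = [5.1, 5.2, 5.3, 5.4, 5.5]
Y = [5.1, 5.2, 5.3, 5.4, 7.5]              # same probabilities, one outlier

# Shannon entropy depends only on the probabilities (log 5 in both cases),
# so it cannot tell X and Y apart -- but CRE, which sees the values, can:
assert abs(shannon_entropy([1 / 5] * 5) - math.log(5)) < 1e-12
assert discrete_cre(Y) > 5 * discrete_cre(X)
```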
In this work we propose an alternative measure of uncertainty in a random variable
X and call it the Cumulative Residual Entropy (CRE) ofX. The main objective of our
study is to extend Shannon entropy to random variables with continuous distributions.
The concept we propose overcomes the problems mentioned above while retaining
many of the important properties of Shannon entropy. For instance, both are decreased
by conditioning and increased by independent addition, and both obey the data
processing inequality. However, the differential entropy does not share the following
important properties of CRE.
1. CRE has consistent definitions in both the continuous and discrete domains;
2. CRE is always non-negative;
3. CRE can be easily computed from sample data and these computations asymptoti-
cally converge to the true values.
The basic idea in our definition is to replace the density function in Shannon's
definition (2–1) with the cumulative distribution. The distribution function is more regular
than the density function, because the density is computed as the derivative of the distribution.
Moreover, in practice, what is of interest and/or measurable is often the distribution function.
For example, if the random variable is the life span of a machine, then the event of
interest is not whether the life span equals t, but rather whether the life span exceeds
t. Our definition also preserves the well-established principle that the logarithm of
the probability of an event should represent the information content of the event. The
discussion of the properties of CRE in the next few sections is, we trust, convincing
enough to warrant further development of the concept of CRE.
2.2 Cumulative Residual Entropy: A New Measure of Information
In this section, we define an alternate measure of uncertainty in a random variable
and then derive some properties of this new measure. We do not delve into the
proofs but refer the reader to a more comprehensive mathematical treatment in [7].

Definition: Let X = (X1, X2, ..., XN) be a random vector in R^N, and let F(λ) :=
P(|X| > λ) be the cumulative residual distribution, where λ = (λ1, ..., λN) and |X| > λ
means |Xi| > λi for all i. F(λ) is also called the survival function in the reliability
engineering literature. We define the cumulative residual entropy (CRE) of X by

    E(X) = −∫_{R^N_+} F(λ) log F(λ) dλ,        (2–3)

where R^N_+ = {x ∈ R^N : xi ≥ 0}.
CRE can be related to the well-known concept of the mean residual life function in
reliability engineering, which is defined as

    m_F(t) = E(X − t | X ≥ t) = (∫_t^∞ F(x) dx) / F(t).        (2–4)

The function m_F(t) is of fundamental importance in reliability engineering and is often used to
measure departure from exponentiality. CRE can be shown to be the expectation of m_F
[8], i.e.,

    E(X) = E[m_F(X)].        (2–5)
Now we give a few examples.

• Example 1: (CRE of the uniform distribution)

Consider a general uniform distribution with the density function

    p(x) = 1/a for 0 ≤ x ≤ a, and p(x) = 0 otherwise.        (2–6)

Then its CRE is computed as follows:

    E(X) = −∫_0^a P(|X| > x) log P(|X| > x) dx
         = −∫_0^a (1 − x/a) log(1 − x/a) dx
         = a/4.        (2–7)
• Example 2: (CRE of the exponential distribution)

The exponential distribution with mean 1/λ has the density function

    p(x) = λ e^{−λx}.        (2–8)

Correspondingly, the CRE of the exponential distribution is

    E(X) = −∫_0^∞ e^{−λx} log e^{−λx} dx
         = ∫_0^∞ λx e^{−λx} dx
         = 1/λ.        (2–9)
• Example 3: (CRE of the Gaussian distribution)

The Gaussian probability density function is

    p(x) = (1/(√(2π) σ)) exp[−(x − m)²/(2σ²)],        (2–10)

where m is the mean and σ² is the variance. The cumulative distribution function is

    F(x) = 1 − erfc((x − m)/σ),        (2–11)

where erfc denotes the Gaussian tail function

    erfc(x) = (1/√(2π)) ∫_x^∞ exp(−t²/2) dt.

Then the CRE of the Gaussian distribution is

    E(X) = −∫_0^∞ erfc((x − m)/σ) log[erfc((x − m)/σ)] dx.        (2–12)
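The closed-form values in Examples 1 and 2 can be checked by integrating definition (2–3) numerically. The sketch below (our own illustration; the function name, midpoint rule, grid size and truncation of the exponential tail are all our assumptions) recovers a/4 for the uniform case and 1/λ for the exponential case:

```python
import math

def cre_from_survival(survival, upper, n=200000):
    """Numerically evaluate E = -int_0^upper F(l) log F(l) dl by the
    midpoint rule, where survival(l) = P(X > l)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        f = survival((i + 0.5) * h)
        if 0.0 < f < 1.0:                  # F log F -> 0 at F = 0 or 1
            total -= f * math.log(f) * h
    return total

a, lam = 2.0, 3.0
uniform_cre = cre_from_survival(lambda x: 1.0 - x / a, a)
expo_cre = cre_from_survival(lambda x: math.exp(-lam * x), 30.0 / lam)

assert abs(uniform_cre - a / 4) < 1e-3     # Eq. (2-7): a/4
assert abs(expo_cre - 1.0 / lam) < 1e-3    # Eq. (2-9): 1/lambda
```

The Gaussian CRE (2–12) has no elementary closed form, but could be evaluated the same way with the tail function as the survival argument.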
We now state some important properties related to the application of CRE
to image registration. For a complete list of properties, we refer the reader to the more
comprehensive treatment in [7].
2.3 Properties of CRE
The traditional Shannon entropy of a sum of independent variables is larger than that
of either summand. Analogously, we have the following theorem:

Theorem 1 For any non-negative and independent variables X and Y,

    max(E(X), E(Y)) ≤ E(X + Y).

Proof: For a proof, see [7].
Similar to the case of Shannon's entropy, if X and Y are independent random
variables, then E(X, Y) = E[|X|] E(Y) + E[|Y|] E(X), where E[·] denotes expectation
and E(·) the CRE. More generally,

Proposition 1 If the Xi are independent, then

    E(X) = Σ_i ( Π_{j≠i} E[|Xj|] ) E(Xi).

For a proof, see [7].
Conditional entropy is a fundamental concept in information theory. We now define
the concept of conditioning in the context of CRE.

Definition: Given random vectors X and Y ∈ R^N, we define the conditional CRE
E(X|Y) by

    E(X|Y) = −∫_{R^N_+} P(|X| > x | Y) log P(|X| > x | Y) dx.        (2–13)

As in the Shannon entropy case, conditioning reduces CRE.

Proposition 2 For any X and Y,

    E[E(X|Y)] ≤ E(X),        (2–14)

and equality holds iff X is independent of Y.
2.3.1 CRE and Empirical CRE
The next theorem shows one of the salient features of CRE. In the discrete case, Shannon
entropy is always non-negative, and equals zero if and only if the random variable describes a
certain event. However, this is not valid for the Shannon entropy in the continuous case
as defined in Eqn. (2–2). In contrast, CRE does not differentiate between the
discrete and continuous cases in this regard, as shown by the following theorem:

Theorem 2 E(X) ≥ 0, and equality holds if and only if P[|X| = λ] = 1 for some vector
λ, i.e., |Xi| = λi with probability 1.
Shannon entropy computed from samples of a random variable lacks the property
of convergence to the differential entropy (see Eqn. (2–2) for a definition). In contrast, the
CRE E(X) computed from samples converges to the continuous counterpart. This is
summarized in the following result.

Proposition 3 (Weak Convergence). Let the random vectors Xk converge in distribution
to the random vector X, by which we mean

    lim_{k→∞} E[ϕ(Xk)] = E[ϕ(X)]        (2–15)

for all bounded continuous functions ϕ on R^N. If all the Xk are bounded in L^p for some
p > N, then

    lim_{k→∞} E(Xk) = E(X).        (2–16)

Proof: Refer to [7] for the proof.

This is a powerful property: as a consequence of it, the CRE of a random variable
computed from samples converges to the true CRE of the random variable. Note that the
Xk can be samples of a continuous random variable.
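This convergence is easy to observe empirically. In the sketch below (our own illustration, not from the dissertation; `empirical_cre` is a plug-in estimator we introduce, and the sample size, seed and tolerance are assumptions), the CRE estimated from samples of an exponential variable approaches the true value 1/λ from Example 2:

```python
import math
import random

def empirical_cre(samples):
    """Plug-in CRE estimate: insert the empirical survival function of
    the (non-negative) samples into E = -int F log F dl; the empirical
    F is piecewise constant between consecutive order statistics."""
    xs = sorted(samples)
    n = len(xs)
    cre = 0.0
    for i in range(n - 1):
        surv = 1.0 - (i + 1) / n
        if surv > 0.0:
            cre += (xs[i + 1] - xs[i]) * (-surv * math.log(surv))
    return cre

random.seed(0)
lam = 2.0
samples = [random.expovariate(lam) for _ in range(50000)]

# The sample-based estimate approaches the true CRE 1/lambda = 0.5:
assert abs(empirical_cre(samples) - 1.0 / lam) < 0.02
```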
2.3.2 Robustness of CRE
We now investigate the robustness (or lack thereof) of differential entropy and prove that while differential entropy is not robust with respect to small perturbations, CRE, on the contrary, is quite robust. This property plays a key role in demonstrating the noise immunity of CCRE over MI depicted in the experiments in the next chapter.
Theorem 3 Let X be a discrete R.V. taking values (x_1, x_2, ..., x_N) with probabilities p_1, p_2, ..., p_N:

p(X = x_i) = p_i, 1 ≤ i ≤ N (2–17)

X has Shannon entropy H(X) = −∑ p_i log p_i. Let Y_n have density f_n and be independent of X; Z_n = X + Y_n is no longer discrete, and has a density. Let X be as in (2–17) and Y_n as above. Suppose Y_n → 0 in probability. Then

h(X + Y_n) → −∞ (2–18)
Theorem 4 For X and Y_n as defined in Theorem 3,

lim_{Y_n→0} E(X + Y_n) = E(X) (2–19)

Proof: This is a direct consequence of Proposition 3.
Theorems (3) and (4) are very important properties as they prove that CRE is robust to noise, which is not the case for differential entropy. Intuitively, the robustness of CRE may be attributed to the use of a CDF as opposed to a PDF in its definition, i.e., an integral formulation as opposed to a differential formulation; it is well known that the former is more robust than the latter.
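Theorem 4 can also be observed numerically: adding a small independent noise Y_n to a discrete X barely moves the empirical CRE, even though the differential entropy of X + Y_n diverges as the noise shrinks. A minimal sketch (the sampling setup below is our own illustrative assumption, not taken from the text):

```python
import numpy as np

def empirical_cre(samples):
    # -sum over survival intervals of S log S (empirical version of Eqn. 2-3)
    a = np.sort(np.abs(np.asarray(samples, dtype=float)))
    n = a.size
    s = (n - np.arange(1, n)) / n
    pos = s > 0
    return float(-np.sum(np.diff(a)[pos] * s[pos] * np.log(s[pos])))

rng = np.random.default_rng(1)
x = rng.integers(1, 4, size=100_000).astype(float)    # discrete X on {1, 2, 3}
true_cre = (2/3) * np.log(3/2) + (1/3) * np.log(3)    # CRE of X, about 0.6365

# Z_n = X + Y_n for shrinking Gaussian noise: CRE stays near that of X
cre_vals = [empirical_cre(x + rng.normal(0.0, s, x.size)) for s in (0.05, 0.01)]
print(cre_vals)
```

In contrast, a histogram-based differential entropy estimate of Z_n is dominated by the vanishing width of the noise, illustrating Theorem 3.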
CHAPTER 3
APPLICATIONS TO MULTIMODALITY IMAGE REGISTRATION
Matching two or more images under varying conditions – illumination, pose, acquisition parameters, etc. – is ubiquitous in computer vision, medical imaging, geographical information systems, and other fields. In the past several years, information theoretic measures have been very widely used to define the cost functions optimized to achieve a match. An example problem common to all the aforementioned areas is image registration. In the following, we review the literature on existing computational algorithms reported for achieving multimodality image registration, with a focus on non-rigid registration methods. We point out their limitations and hence motivate the need for a new and efficient computational algorithm for achieving our goal.
3.1 Related Work
Non-rigid image registration methods in the literature to date may be classified into feature-based and “direct” methods. Most feature-based methods are limited to determining the registration at the feature locations and require an interpolation at other locations. If, however, the transformation/registration between the images is a global transformation, e.g., rigid, affine, etc., then there is no need for an interpolation step. In the non-rigid case, however, interpolation is required. Also, the accuracy of the registration depends on the accuracy of the feature detector.
Several feature-based methods involve detecting surfaces, landmarks [9, 10, 11, 12], edges, ridges, etc. Most of these assume a known correspondence, with the exception of the work in Chui et al. [9], Jian and Vemuri [13], Wang et al. [14] and Guo et al. [15]. Work reported in Irani and Anandan [16] uses the energy (squared magnitude) in the directional derivative image as a representation scheme for matching, achieved using the
SSD cost function. Recently, Liu et al. [17] reported the use of local frequency in a robust statistical framework using the integral squared error, a.k.a. L2E. The primary advantage of L2E over other robust estimators in the literature is that it has no tuning parameters. The idea of using local phase was also exploited by Mellor and Brady [18], who used mutual information (MI) to match local-phase representations of images and estimated the non-rigid registration between them. However, robustness to significant non-overlap in the field of view (FOV) of the scanners was not addressed. For more on feature-based methods, we refer the reader to the recent survey by Zitova and Flusser [19].
In the context of “direct” methods, the primary matching techniques for intra-modality registration involve the use of normalized cross-correlation, modified SSD, and (normalized) mutual information (MI). Ruiz-Alzola et al. [20] presented a unified framework for non-rigid registration of scalar, vector and tensor data based on template matching. For scalar images, the cost function is an extension of modified SSD using a different definition of inner products. However, this model can only be used on images from the same modality, as it assumes similar intensity values between images. In [21, 22], a level-set based image registration algorithm was introduced that was designed to non-rigidly register two 3D volumes from the same imaging modality. This algorithm was computationally efficient and was used to achieve atlas-based segmentation. Direct methods based on optical-flow estimation form a large class of solutions to the non-rigid registration problem. Hellier et al. [23] proposed a registration method based on a dense robust 3-D estimation of the optical flow with a piecewise parametric description of the deformation field. Their algorithm is unsuitable for multi-modal image registration due to the brightness constancy assumption. Variants of optical flow-based registration that accommodate varying illumination may be used for inter-modality registration, and we refer the reader to [24, 25] for such methods. Guimond et al. [26] reported a multi-modal brain warping technique that uses Thirion’s Demons algorithm [27] with an adaptive intensity correction. The technique, however, was not tested for robustness
with respect to significant non-overlap in the FOVs. More recently, Cuzol et al. [28] introduced a new non-rigid image registration technique that involves a Helmholtz decomposition of the flow field, which is then embedded into the brightness constancy model of optical flow. The Helmholtz decomposition allows one to compute large displacements when the data contains them. This technique is an innovation in accommodating large displacements, not one that allows for inter-modality non-rigid registration. For more on intra-modality methods, we refer the reader to the comprehensive surveys [29, 19].
A popular framework for “direct” methods is based on information theoretic measures [30]. Among them, mutual information (MI), pioneered by Viola and Wells [31] and Collignon et al. [32] and modified in Studholme et al. [33], has been effective in the application of image registration. The registration experiments reported in these works are quite impressive for the case of rigid motion. Handling non-rigid deformations in the MI framework is a very active area of research, and some recent papers reporting results on this problem are [18, 34, 35, 36, 37, 38, 39, 40, 41, 42]. In [34], Mattes et al., and in [35], Rueckert et al., presented mutual information based schemes for matching multi-modal image pairs using B-splines to represent the deformation field on a regular grid. Guetter [43] recently incorporated a learned joint intensity distribution into the mutual information formulation, in which the registration is achieved by simultaneously minimizing the KL divergence between the observed and learned intensity distributions and maximizing the mutual information between the reference and alignment images. Recently, D’Agostino et al. [44] presented an information theoretic approach wherein tissue class probabilities of each image being registered are matched over the space of transformations using a divergence measure between the ideal case (where tissue class labels between images at corresponding voxels are similar) and the actual joint class distributions of both images. This work expects a segmentation of either one of the images being registered. Computational efficiency and
accuracy (in the event of significant non-overlap) are issues of concern in most, if not all, MI-based non-rigid registration methods.
Finally, some registration methods under the direct approach are inspired by models from mechanics, either elasticity [45, 46] or fluid mechanics [47, 48]. Fluid mechanics-based models accommodate large deformations but are computationally expensive. Christensen [49] recently developed an interesting version of these methods, where the forward deformation field and the inverse deformation field are jointly estimated to guarantee the symmetry of the deformation with respect to permutation of the input images. A more general and mathematically rigorous treatment of non-rigid registration that subsumes the fluid-flow methods was presented in Trouve [50]. All these methods, however, are primarily applicable to intra-modality and not inter-modality registration.
3.2 Multimodal Image Registration using CCRE

Based on CRE, the cross-CRE (CCRE) between two random variables was defined and applied to solve the image alignment problem, which is stated as follows: Given a pair of images I_1(x) and I_2(x′), where (x′)^t = T(x)^t and T is the matrix corresponding to the unknown parameterized transformation to be determined, define a match metric M(I_1(x), I_2(x′)) and maximize/minimize M over all T. The class of transformations can be rigid, affine, projective or non-rigid. Several matching criteria have been proposed in the past, some of which were reviewed earlier. Amongst them, mutual information is very popular and is defined as follows for the continuous random variable case:

MI(X, Y) = h(X) + h(Y) − h(X, Y) (3–1)

where h(X) is the differential entropy of the random variable X, given by h(X) = −∫_{−∞}^{∞} p(x) ln p(x) dx, with p(x) the probability density function, which can be estimated from the image data using any of the parametric or nonparametric methods. The reason for defining MI in terms of differential entropy as opposed to Shannon entropy is to facilitate the optimization of MI with respect to the registration parameters using any of the gradient based optimization methods. Note that MI defined using Shannon entropy in discrete form will not converge to the continuous case defined here, due to the fact that Shannon entropy does not converge to the differential entropy (see [5]).
We now define the cross-CRE (CCRE) using the CRE defined in Eqn. 2–3:

C(X, Y) = E(X) − E[E(X|Y)] (3–2)

We will use this quantity as a matching criterion in the image alignment problem.
More specifically, let I_T(x) be a test image we want to register to a reference image I_R(x). The transformation g(x; µ) describes the deformation from V_T to V_R, where V_T and V_R are the continuous domains on which I_T and I_R are defined, and µ is the set of transformation parameters to be determined. We pose the task of image registration as an optimization problem: to align the reference image I_R(x) with the transformed test image I_T(g(x; µ)), we seek the set of transformation parameters µ that maximizes C(I_T, I_R) over the space of smooth transformations, i.e.,

µ = arg max_µ C(I_T ∘ g(x; µ), I_R) (3–3)
The computation of CCRE requires estimates of the marginal and joint probability distributions of the intensity values of the reference and test images. We denote by p(l, k; µ) the joint probability of (I_T ∘ g(x; µ), I_R). Let p_T(l; µ) and p_R(k) represent the marginal probabilities for the test and reference images respectively; L_T and L_R are the discrete sets of intensities associated with the test and reference images respectively. Then we can rewrite CCRE(I_T ∘ g(x; µ), I_R) as follows:

C(I_T ∘ g(x; µ), I_R) = E(I_T) − E[E(I_T ∘ g(x; µ) | I_R)]
  = −∑_{λ∈L_T} ∫_λ^∞ p_T(l; µ) dl log[∫_λ^∞ p_T(l; µ) dl]
  + ∑_{k∈L_R} p_R(k) ∑_{λ∈L_T} ∫_λ^∞ [p(l, k; µ)/p_R(k)] dl log[∫_λ^∞ [p(l, k; µ)/p_R(k)] dl] (3–4)
Let P(i > λ; µ) = ∫_λ^∞ p_T(l; µ) dl and P(i > λ, k; µ) = ∫_λ^∞ p(l, k; µ) dl. Using the fact that p_T(l; µ) = ∑_{k∈L_R} p(l, k; µ), we have P(i > λ; µ) = ∑_{k∈L_R} P(i > λ, k; µ). Eqn. (3–4) can be further simplified, which leads to

C(I_T ∘ g(x; µ), I_R)
  = −∑_{λ∈L_T} P(i > λ; µ) log P(i > λ; µ) + ∑_{k∈L_R} ∑_{λ∈L_T} P(i > λ, k; µ) log [P(i > λ, k; µ)/p_R(k)]
  = −∑_{λ∈L_T} ∑_{k∈L_R} P(i > λ, k; µ) log P(i > λ; µ) (3–5)
  + ∑_{k∈L_R} ∑_{λ∈L_T} P(i > λ, k; µ) log [P(i > λ, k; µ)/p_R(k)]
  = ∑_{λ∈L_T} ∑_{k∈L_R} P(i > λ, k; µ) [log (P(i > λ, k; µ)/p_R(k)) − log P(i > λ; µ)]
  = ∑_{λ∈L_T} ∑_{k∈L_R} P(i > λ, k; µ) log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))] (3–6)
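For intuition, Eqn. (3–6) is easy to evaluate for a discrete joint intensity distribution. The sketch below (hypothetical function name; a minimal illustration, not the registration code) computes CCRE from a joint probability table using cumulative sums for the survival functions. Note that for independent intensities the logarithmic term vanishes identically, so CCRE is zero, consistent with Proposition 2.

```python
import numpy as np

def ccre(joint):
    """CCRE C(X, Y) of a discrete joint table joint[l, k] = p(X=l, Y=k),
    following Eqn. (3-6): sum of P(X>lam, k) log[P(X>lam, k)/(p_R(k) P(X>lam))]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_r = joint.sum(axis=0)                                  # reference marginal
    surv_joint = joint[::-1].cumsum(axis=0)[::-1] - joint    # strict P(X > lam, Y = k)
    surv_marg = surv_joint.sum(axis=1, keepdims=True)        # P(X > lam)
    denom = p_r[None, :] * surv_marg
    ok = (surv_joint > 0) & (denom > 0)                      # convention: 0 log 0 = 0
    return float(np.sum(surv_joint[ok] * np.log(surv_joint[ok] / denom[ok])))

# Independent intensities give C = 0; a deterministic coupling gives C > 0.
p_t = np.array([0.2, 0.5, 0.3])
p_r = np.array([0.6, 0.4])
print(ccre(np.outer(p_t, p_r)))   # 0 up to round-off
print(ccre(np.eye(3) / 3))        # strictly positive
```

The common logarithmic term computed here is the same one that reappears in the gradient expression derived in Section 3.2.2, so in practice it can be computed once and cached.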
To illustrate the difference between CCRE and the now-popular information theoretic cost functions such as MI and NMI, we plot these functions against a parameter of the transformation, for illustrative purposes, say the rotation. The image pair used here consists of MR and CT images that were originally aligned; the MR and CT intensities range from 0 to 255 with means 55.6 and 60.6 respectively. The cost functions are computed over the rotation angle that was applied to the CT image to misalign it with respect to the MR image. In each plot of Figure 3–1 the X-axis shows the 3D rotation angle about the Z axis, while the Y-axis shows the values of CCRE, MI and NMI computed from the misaligned (by a rotation) image pairs. The second row shows a zoomed-in view of the plots over a smaller region, so as to get a detailed view of the cost functions. The following observations are made from this plot:
Figure 3–1: CCRE, MI and NMI traces plotted for the misaligned MR & CT image pair, where misalignment is generated by a rotation of the CT image. First row: over the range −40 to 40. Second row: zoomed-in view between −0.5 and 0.5, where the arrows in the first row signify the position. Note that all three cost functions are implemented with tri-linear interpolation. Third row: the three cost functions implemented with partial volume interpolation [32].
1. Similar to MI and NMI, the maximum of CCRE occurs at a rotation of 0, which confirms that our new information measure needs to be maximized in order to find the optimum transformation between two misaligned images.
2. CCRE exhibits a much larger range of values than MI and NMI. This feature plays an important role in the numerical optimization, since it leads to a more stable numerical implementation by avoiding cancellation, round-off, etc. that often plague arithmetic operations with small numerical values.
3. Upon closer inspection, we observe that CCRE is much smoother than MI and NMI for the MR & CT data pair, which verifies that CCRE is more regular than the other information theoretic measures.
3.2.1 Transformation Model for Non-rigid Motion

We model the non-rigid deformation field between two 3D image pairs using a cubic B-spline basis in 3D. B-splines have a number of desirable properties for use in modeling the deformation field: (1) splines provide inherent control of smoothness (degree of continuity); (2) B-splines are separable in multiple dimensions, which provides computational efficiency. Another feature of B-splines that is useful in a non-rigid registration system is “local control”: changing the location of a single control point modifies only a local neighborhood of that control point.
The basic idea of the cubic B-spline deformation is to deform an object by manipulating an underlying mesh of control points γ_j. The deformation g is defined by a sparse regular control point grid. In the 3D case, the deformation at any point x = [x, y, z]^T in the test image can be interpolated with a linear combination of cubic B-spline convolution kernels:

g(x) = ∑_j δ_j β^(3)((x − γ_j)/Δρ) (3–7)

where β^(3)(x) = β^(3)(x)β^(3)(y)β^(3)(z) and Δρ is the spacing of the control grid. The δ_j are the B-spline expansion coefficients computed from the sample values of the image. For implementation details, we refer the reader to Forsey [51] and Mattes et al. [34].
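A one-dimensional sketch of Eqn. (3–7) is shown below (illustrative only; the control-point layout and coefficient values are our assumptions). It evaluates the deformation as a weighted sum of shifted cubic B-spline kernels; because the cubic B-spline partitions unity, equal coefficients reproduce a constant displacement, which makes a convenient sanity check.

```python
def beta3(u):
    """Cubic B-spline kernel beta^(3)(u), supported on [-2, 2]."""
    u = abs(u)
    if u < 1.0:
        return 2.0 / 3.0 - u * u + u ** 3 / 2.0
    if u < 2.0:
        return (2.0 - u) ** 3 / 6.0
    return 0.0

def deform(x, delta, spacing):
    """1D version of Eqn. (3-7): g(x) = sum_j delta_j * beta3((x - gamma_j)/spacing),
    with control points gamma_j = j * spacing."""
    return sum(d * beta3(x / spacing - j) for j, d in enumerate(delta))

# Equal coefficients: partition of unity gives a constant displacement of 2.5
print(deform(5.3, [2.5] * 12, 1.0))   # 2.5 (up to round-off), for x well inside the grid
```

Local control is visible here as well: each evaluation of g(x) touches only the four kernels whose support contains x.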
3.2.2 Measure Optimization

Calculation of the gradient of the energy function is necessary for its efficient and robust maximization. The gradient of CCRE is given by

∇C = [∂C/∂µ_1, ∂C/∂µ_2, ..., ∂C/∂µ_n] (3–8)
Each component of the gradient can be found by differentiating Eqn. (3–4) with respect to a transformation parameter. We consider the two terms in Eqn. (3–4) separately when computing the derivative. For the first term in Eqn. (3–4), we have

∂E(I_T)/∂µ = ∂/∂µ [−∑_{λ∈L_T} ∫_λ^∞ p_T(l; µ) dl × log(∫_λ^∞ p_T(l; µ) dl)]
  = −∑_{λ∈L_T} (log P(i > λ; µ) + 1) × ∂P(i > λ; µ)/∂µ (3–9)

where P(i > λ; µ) = ∫_λ^∞ p_T(l; µ) dl, and

∂P(i > λ; µ)/∂µ = ∫_λ^∞ [∂p_T(l; µ)/∂µ] dl (3–10)
The derivative of the second term is given by

∂E[E(I_T ∘ g(x; µ) | I_R)]/∂µ
  = ∂/∂µ [∑_{k∈L_R} p_R(k) ∑_{λ∈L_T} ∫_λ^∞ [p(l, k; µ)/p_R(k)] dl × log(∫_λ^∞ [p(l, k; µ)/p_R(k)] dl)]
  = ∑_{k∈L_R} ∑_{λ∈L_T} (log [P(i > λ, k; µ)/p_R(k)] + 1) ∂P(i > λ, k; µ)/∂µ (3–11)

where P(i > λ, k; µ) = ∫_λ^∞ p(l, k; µ) dl, and

∂P(i > λ, k; µ)/∂µ = ∫_λ^∞ [∂p(l, k; µ)/∂µ] dl (3–12)

Combining the derivatives of the two terms, and using the fact that

∂p_T(l; µ)/∂µ = ∂[∑_{k∈L_R} p(l, k; µ)]/∂µ (3–13)
we have the analytic gradient of CCRE:

∂C(I_T ∘ g(x; µ), I_R)/∂µ
  = −∑_{λ∈L_T} [log P(i > λ; µ) + 1] ∂[∑_{k∈L_R} P(i > λ, k; µ)]/∂µ
  + ∑_{k∈L_R} ∑_{λ∈L_T} [log (P(i > λ, k; µ)/p_R(k)) + 1] ∂P(i > λ, k; µ)/∂µ (3–14)
  = −∑_{λ∈L_T} ∑_{k∈L_R} [log P(i > λ; µ) + 1] ∂P(i > λ, k; µ)/∂µ
  + ∑_{λ∈L_T} ∑_{k∈L_R} [log (P(i > λ, k; µ)/p_R(k)) + 1] ∂P(i > λ, k; µ)/∂µ
  = ∑_{λ∈L_T} ∑_{k∈L_R} [log (P(i > λ, k; µ)/p_R(k)) − log P(i > λ; µ)] × ∂P(i > λ, k; µ)/∂µ
  = ∑_{λ∈L_T} ∑_{k∈L_R} log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))] × ∂P(i > λ, k; µ)/∂µ

Note that in the derivation we use the fact that P(i > λ; µ) = ∑_{k∈L_R} P(i > λ, k; µ).
Comparing the expressions for CCRE and its derivative,

C(I_T ∘ g(x; µ), I_R) = ∑_{λ∈L_T} ∑_{k∈L_R} log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))] × P(i > λ, k; µ)
∂C(I_T ∘ g(x; µ), I_R)/∂µ = ∑_{λ∈L_T} ∑_{k∈L_R} log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))] × ∂P(i > λ, k; µ)/∂µ (3–15)

we note that the two formulas in (3–15) are similar to each other and share the common term log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))]. From a computational viewpoint this is quite beneficial: caching the common term not only saves memory, but also makes the calculation of the gradient more efficient. From the formulation, we can also see that calculating CCRE and its derivative requires a method to estimate P(i > λ, k; µ) and ∂P(i > λ, k; µ)/∂µ. We address the computation of these terms in the next subsection.
3.2.3 Computation of P(i > λ, k; µ) and ∂P(i > λ, k; µ)/∂µ

We use the Parzen window technique to estimate the cumulative distribution function and its derivative. The calculation of P(i > λ, k; µ) requires estimates of the cumulative probability distributions of the intensity values of the reference and test images. Let β^(0) be a zero-order spline Parzen window (centered unit pulse) and β^(3) be a cubic spline Parzen window; the smoothed joint probability of (I_R, I_T ∘ g) is given by

p(l, k; µ) = α ∑_{x∈V} β^(0)(k − (I_R(x) − f^0_R)/Δb_R) β^(3)(l − (I_T(g(x; µ)) − f^0_T)/Δb_T) (3–16)

where α is a normalization factor that ensures ∑ p(l, k) = 1, and I_R(x) and I_T(g(x; µ)) are samples of the reference and interpolated test images respectively, normalized by the minimum intensity values, f^0_R, f^0_T, and the intensity range of each bin, Δb_R, Δb_T.

Since P(i > λ, k; µ) = ∫_λ^∞ p(l, k; µ) dl, we have the following:

P(i > λ, k; µ) = ∫_λ^∞ p(l, k; µ) dl
  = α ∑_{x∈V} β^(0)(k − (I_R(x) − f^0_R)/Δb_R) ∫_λ^∞ β^(3)(l − (I_T(g(x; µ)) − f^0_T)/Δb_T) dl
  = α ∑_{x∈V} β^(0)(k − (I_R(x) − f^0_R)/Δb_R) Φ(λ − (I_T(g(x; µ)) − f^0_T)/Δb_T) (3–17)
where Φ(·) is the cumulative residual function of the cubic spline kernel, defined as follows:

Φ(v) = ∫_v^∞ β^(3)(u) du =
  1,                              v < −2
  1 − (v + 2)^4/24,               −2 ≤ v < −1
  1/2 − 2v/3 + v^3/3 + v^4/8,     −1 ≤ v < 0
  1/2 − 2v/3 + v^3/3 − v^4/8,     0 ≤ v < 1
  (v − 2)^4/24,                   1 ≤ v < 2
  0,                              v ≥ 2
(3–18)
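Since the piecewise form of Eqn. (3–18) is easy to get wrong in code, it can be checked directly against its defining property Φ′(v) = −β^(3)(v). A small sketch (illustrative only):

```python
def beta3(u):
    """Cubic B-spline kernel beta^(3)(u)."""
    u = abs(u)
    if u < 1.0:
        return 2.0 / 3.0 - u * u + u ** 3 / 2.0
    if u < 2.0:
        return (2.0 - u) ** 3 / 6.0
    return 0.0

def phi(v):
    """Cumulative residual function of the cubic spline kernel (Eqn. 3-18)."""
    if v < -2.0:
        return 1.0
    if v < -1.0:
        return 1.0 - (v + 2.0) ** 4 / 24.0
    if v < 0.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 + v ** 4 / 8.0
    if v < 1.0:
        return 0.5 - 2.0 * v / 3.0 + v ** 3 / 3.0 - v ** 4 / 8.0
    if v < 2.0:
        return (v - 2.0) ** 4 / 24.0
    return 0.0

# Sanity checks: phi decreases from 1 to 0, and phi'(v) = -beta3(v)
h = 1e-5
for v in (-1.5, -0.5, 0.5, 1.5):
    numeric = (phi(v + h) - phi(v - h)) / (2.0 * h)   # ~ -beta3(v)
    print(v, phi(v), numeric + beta3(v))              # last column ~ 0
```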
Noting that dΦ(u)/du = −β^(3)(u), we can take the derivative of Eqn. 3–17 with respect to µ to get

∂P(i > λ, k; µ)/∂µ
  = (α/Δb_T) ∑_{x∈V} β^(0)(k − (I_R(x) − f^0_R)/Δb_R) Φ′(λ − (I_T(g(x; µ)) − f^0_T)/Δb_T)
    × (−∂I_T(t)/∂t |_{t=g(x;µ)}) ∂g(x; µ)/∂µ
  = (α/Δb_T) ∑_{x∈V} β^(0)(k − (I_R(x) − f^0_R)/Δb_R) β^(3)(λ − (I_T(g(x; µ)) − f^0_T)/Δb_T)
    × (∂I_T(t)/∂t |_{t=g(x;µ)}) ∂g(x; µ)/∂µ (3–19)

where ∂I_T(t)/∂t is the image gradient.
3.2.4 Algorithm Summary

The registration algorithm can be summarized as follows:

1. For the current deformation field, interpolate the test image to obtain I_T ∘ g(x; µ). Calculate P(i > λ, k; µ) and ∂P(i > λ, k; µ)/∂µ using Eqn. (3–17) and Eqn. (3–19) respectively.
2. Compute P(i > λ; µ) as ∑_{k∈L_R} P(i > λ, k; µ), which is used to calculate the common term in both CCRE and its gradient, i.e., log [P(i > λ, k; µ)/(p_R(k) P(i > λ; µ))].
3. Compute the energy function and its gradient using the formulas given in Eqn. (3–15); then use a quasi-Newton method to numerically solve the optimization problem.
4. Update the deformation field g(x; µ). Stop the registration process if the difference between consecutive iterates is less than ε = 0.01, a pre-chosen tolerance; otherwise go to Step 1.
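The iterate-until-tolerance structure of Steps 3–4 can be sketched generically. In the toy loop below, plain gradient ascent on a made-up one-parameter objective stands in for the quasi-Newton update on the CCRE energy (hypothetical names; an illustration of the stopping rule, not the registration code):

```python
def maximize(grad, mu0, step=0.1, eps=0.01, max_iter=1000):
    """Gradient ascent with the Step-4 stopping rule: halt when the
    parameter update between consecutive iterates is smaller than eps."""
    mu = mu0
    for _ in range(max_iter):
        new_mu = mu + step * grad(mu)
        if abs(new_mu - mu) < eps:
            return new_mu
        mu = new_mu
    return mu

# Toy concave objective C(mu) = -(mu - 3)^2, with gradient -2(mu - 3)
mu_star = maximize(lambda mu: -2.0 * (mu - 3.0), mu0=0.0)
print(mu_star)   # near 3
```

In the actual algorithm, µ is the full vector of B-spline control-point parameters and the gradient is Eqn. (3–15).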
3.3 Implementation Results

In this section, we present the results of applying our non-rigid registration algorithm to several data sets. Results are presented for synthetic as well as real data. The first set of experiments was done with synthetic motion. We show the advantage of using the CCRE measure in comparison to other information theoretic registration methods: CCRE is not only more robust, but also converges faster. We begin by applying CCRE to register image pairs for which the ground truth was available.
3.3.1 Synthetic Motion Experiments

In this section, we demonstrate the robustness of CCRE and make a case for its use over mutual information in the alignment problem. The case is made via experiments depicting faster convergence and superior performance on noisy inputs when matching image pairs misaligned by a synthetic non-rigid motion. Additionally, we depict a larger capture range than MI-based methods in the estimation of the motion parameters.
The data used for this experiment are corresponding slices from an MR T1 and T2 image pair from the BrainWeb site at the Montreal Neurological Institute [52]. The two images are originally aligned with each other and are defined on a 1mm isotropic voxel grid in Talairach space, with dimensions (256 × 256). We then apply a known non-rigid transformation to the T2 image, and the goal is to recover this deformation by applying our registration method. The mutual information scheme with which we compare was originally reported in [34] [53], where explicit gradient forms are presented, allowing the application of gradient based optimization methods.
3.3.1.1 Convergence speed

To compare the convergence speed of CCRE versus MI, we design the experiment as follows: with the MR T1 & T2 image pair as our data, we choose the MR T1 image as the source; the target image was obtained by applying a known smooth non-rigid transformation that was procedurally generated. Notice the significant difference between the intensity profiles of the source and target images. For comparison purposes, we use the same gradient descent optimization scheme, let the two registration methods run for the same amount of time, and show the registration results visually and quantitatively.
Figure 3–2: Upper left, MR T1 image as source image; upper right, deformed MR T2 image as target image; lower left and right, results of the estimated transformations using CCRE and MI applied to the source, respectively. Both algorithms run for 30 seconds using the same gradient descent technique.
The source and target image pair, along with the results of the estimated transformations using CCRE and MI applied to the source, are shown in Figure 3–2. Visually, the result generated by CCRE is more similar in shape to the target image than the one produced by MI.
Quantitative assessment of the accuracy of the registration is presented in Figure 3–3, where we plot the mean deformation error (MDE) over time for the CCRE-based and MI-based algorithms. MDE is defined as d_m = (1/card(R)) ∑_{x_i∈R} ||g_0(x_i) − g(x_i)||, where g_0(x_i) and g(x_i) are the ground truth and estimated displacements respectively at voxel x_i, ||·|| denotes the Euclidean norm, and R is the region of interest. In both cases the mean deformation error decreases with time, but the solid line decreases faster than the dotted line. For example, it takes about 5 minutes for MI to reach an error level below 1.2, while CCRE requires only about half that time. This
Figure 3–3: Plot demonstrating the change in mean deformation error over time for the CCRE and MI registration results. The solid line shows the MDE for the CCRE registration result, while the dotted line shows the MDE for the MI result.
empirically validates the faster convergence of the CCRE-based algorithm over the MI-based algorithm.
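The MDE defined above is straightforward to compute from a pair of displacement fields. A small sketch (hypothetical array shapes; illustrative only):

```python
import numpy as np

def mean_deformation_error(g0, g):
    """MDE d_m: mean Euclidean norm of displacement differences over the ROI.
    g0, g are (num_voxels, dim) arrays of ground-truth / estimated displacements."""
    return float(np.mean(np.linalg.norm(np.asarray(g0) - np.asarray(g), axis=1)))

g0 = np.array([[1.0, 0.0], [0.0, 2.0]])   # ground-truth displacements
g  = np.array([[1.0, 0.0], [0.0, 0.0]])   # estimated displacements
print(mean_deformation_error(g0, g))      # (0 + 2) / 2 = 1.0
```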
3.3.1.2 Registration accuracy

Using the same experimental setting as in the previous experiment, we present the registration error in the estimated non-rigid deformation field as an indicator of the accuracy of the estimated deformations. Figure 3–4 depicts the results
Figure 3–4: Results of application of our algorithm to synthetic data (see text for details).
obtained for this image pair, which are organized as follows, from left to right: the first row depicts the source image with the target image segmentation superposed to depict the amount of misalignment, the registered source image obtained using our algorithm superposed with the target segmentation, followed by the target image; the second row depicts the ground truth deformation field used to generate the target image from the MR T2 image, the estimated non-rigid deformation field, followed by a histogram of the estimated magnitude error. Note that the error distribution is mostly concentrated in the small error range, indicating the accuracy of our method. As a further measure of accuracy, we also estimated the average, µ, and the standard deviation, σ, of the error in the estimated non-rigid deformation field, where the error is the angle between the ground truth and estimated displacement vectors. The average and standard deviation are 1.5139 and 4.3211 degrees respectively, which is quite accurate.
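The angular error reported here is the per-voxel angle between the ground-truth and estimated displacement vectors. A minimal sketch (illustrative; not the evaluation code used in the experiments):

```python
import numpy as np

def angle_error_deg(v0, v1):
    """Angle, in degrees, between corresponding rows of two displacement fields."""
    v0 = np.asarray(v0, dtype=float)
    v1 = np.asarray(v1, dtype=float)
    cos = np.sum(v0 * v1, axis=1) / (np.linalg.norm(v0, axis=1) * np.linalg.norm(v1, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards round-off

truth = np.array([[1.0, 0.0], [0.0, 1.0]])
est   = np.array([[1.0, 0.0], [1.0, 0.0]])
errs = angle_error_deg(truth, est)
print(errs.mean(), errs.std())   # mean and std over the field
```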
3.3.1.3 Noise immunity

In the next experiment, we compare the robustness of the two methods (CCRE, MI) in the presence of noise. Again selecting the MR T1 image slice from the previous experiment as our source image, we generate the target image by applying a fixed smooth synthetic deformation field. We conduct this experiment by varying the amount of Gaussian noise added, and for each instance of added noise we register the two images using both techniques. We expect both schemes to fail at some level of noise (“fail” here means that the optimization algorithm diverged). By comparing the noise magnitude at the failure point, we can show the degree to which these methods are noise tolerant. The numerical schemes used to implement these registrations are all based on the BFGS quasi-Newton algorithm.

The mean magnitude of the synthetic motion is 4.37 pixels, with a standard deviation of 1.8852. Table 3–1 shows the registration results for the two schemes. From the table, we observe that MI fails when the standard deviation of the noise is increased to 40, while CCRE is tolerant until 66, a significant difference compared to MI.
Table 3–1: Comparison of the registration results between CCRE and MI for a fixed synthetic deformation field.

       CCRE                      MI
σ      MDE      Std. Dev.       MDE      Std. Dev.
10     1.0816   0.9345          1.3884   1.4538
19     1.1381   1.1702          1.4871   1.5052
30     1.1975   1.3484          1.5204   1.5615
40     1.3373   1.6609          FAIL     –
60     1.3791   1.9072          –        –
66     FAIL     –               –        –
This experiment conclusively demonstrates that CCRE has more noise immunity than MI when dealing with non-rigid motion.
3.3.1.4 Partial overlap

Figure 3–5 depicts an example of registration of the MR T1 and T2 data sets with large non-overlap. The left image of the figure depicts the MR T1 brain scan as the source image, and the right image shows the MR T2 data as the target. Note that the FOVs of the data sets are significantly non-overlapping; the non-overlap was simulated by cutting 66% of the MR T1 (source) image. The middle column depicts the transformed source image along with an edge map of the target (deformed MR T2 image) superimposed on the transformed source. As is evident, the registration is visually quite accurate.
Figure 3–5: Registration results for an MR T1 and T2 image slice with large non-overlap. (left) MR T1 source image before registration; (right) deformed T2 target image; (middle) the transformed MR image superimposed with the edge map from the target image.
3.3.2 Real Data Experiments

In this section, we present the performance of our method on a series of CT & MR data sets containing real non-rigid misalignments. For comparison, we also apply traditional MI, implemented as presented in Mattes et al. [34], to the same data sets. The CT image is of size (512, 512, 120) while the MR image size is (512, 512, 142), with voxel dimensions (0.46, 0.46, 1.5)mm and (0.68, 0.68, 1.05)mm for CT and MR respectively. The registration was performed on reduced volumes (210 × 210 × 120) with the control knots placed every 16 × 16 × 16 voxels. The program was written in C++, and all experiments were run on a 2.6GHz Pentium PC.
Table 3–2: Comparison of total time taken to achieve registration by CCRE and MI.

Data set        1     2     3      4      5      6      7      8
CCRE Time (s)   4827  3452  4345   4038   3910   4510   5470   3721
MI Time (s)     9235  6344  10122  17812  12157  11782  13157  10057
We used a set of eight CT data volumes, and the task was to register these eight volumes to the MR data, chosen as the target image for all registrations, using both the CCRE and MI algorithms. Note that all CT & MR volumes are from different subjects and thus contain real non-rigid motion. The parameters used with both algorithms were identical. For both algorithms, the optimization of the cost function was halted when improvements of at least 0.0001 in the cost function could not be detected. The times required for registering all data sets with our algorithm as well as the MI method are given in Table 3–2. This table shows that, on average, our CCRE algorithm is about 2.5 times faster than the traditional MI approach for this set of experiments. For brevity, we show only one registration result, in Figure 3–6. Here, one slice of the volume is shown in the first row, with the source CT image at left and the reference image at right. The middle image shows the transformed CT image slice superimposed with the edge map from the target image. In the second row, the source image superimposed with the edge map from
the target image is shown on the left, while shown in the middle and at right are the surfaces reconstructed from the transformed source using the CCRE method and from the target MR image, respectively. From this figure, we can see that the source and target images depict considerable non-rigid changes in shape; nevertheless, our method was able to register these two images quite accurately. To validate the conformity of the two reconstructed surfaces, we randomly sampled 30 points from the surface of the transformed source obtained using CCRE, and then estimated the distances of these points to the surface of the target MR volume. The average of these distances is about 0.47mm, which indicates very good agreement between the two surfaces. The resemblance of the shapes reconstructed from the transformed source to the target indicates that our CCRE algorithm succeeded in matching the source CT volume to the target MR image.
Figure 3–6: Registration results for MR & CT brain data from different subjects with real non-rigid motion (see text for details).
The accuracy of the information-theoretic algorithm for non-rigid registration
problems was assessed quantitatively by means of a region-based segmentation task
[54]. ROIs (whole brain, eyes) were segmented automatically in the eight CT data sets
used as the source images, and binary masks were created. The deformation fields between
the CT and MR volumes were computed and used to project the masks from each of the
CT volumes to the MR volume. Contours were manually drawn on a few slices chosen at random
in the MR volume (four slices per volume). The manual contours on the MR volume and the contours
obtained automatically were then compared using an accepted similarity index, defined as two
times the number of pixels in the intersection of the contours divided by the sum of the
number of pixels within each contour [41]. This index varies between zero (complete
disagreement) and one (complete agreement) and is sensitive to both displacement and
differences in size and shape. Table 3–3 lists the mean values of the similarity index for
each structure. It is customarily accepted that a value of the similarity index above 0.80
indicates very good agreement between contours. Our results are well above this value.
For comparison purposes, we also computed the same index for the MI method. We can
conclude from the table that CCRE achieves better registration accuracy than
MI for the task of non-rigid registration of real multi-modal images.
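As a concrete illustration, the similarity index described above is computable in a few lines from two binary masks. The sketch below (Python with NumPy; the masks are made-up toy data, not the CT/MR masks used in the experiments) implements it:

```python
import numpy as np

def similarity_index(mask_a, mask_b):
    """S = 2 |A and B| / (|A| + |B|): 0 = no overlap, 1 = identical regions."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Toy masks: two overlapping 5x5 squares on a 10x10 grid.
a = np.zeros((10, 10), dtype=bool); a[2:7, 2:7] = True   # 25 pixels
b = np.zeros((10, 10), dtype=bool); b[3:8, 3:8] = True   # 25 pixels
print(similarity_index(a, b))   # 2 * 16 / 50 = 0.64
```

Because the index penalizes both missing and spurious pixels symmetrically, a value above 0.80 requires substantial overlap in both position and size.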
Table 3–3:Comparison of the value S of several brain structures for CCRE and MI.
            Volume     1     2     3     4     5     6     7     8
CCRE  Whole Brain  0.987 0.996 0.974 0.962 0.975 0.967 0.988 0.981
      Left Eye     0.925 0.935 0.925 0.907 0.875 0.890 0.834 0.871
      Right Eye    0.840 0.940 0.891 0.872 0.851 0.829 0.910 0.921
MI    Whole Brain  0.986 0.981 0.976 0.960 0.950 0.961 0.942 0.952
      Left Eye     0.911 0.893 0.904 0.791 0.853 0.810 0.851 0.853
      Right Eye    0.854 0.917 0.889 0.814 0.849 0.844 0.897 0.854
CHAPTER 4
DIVERGENCE MEASURES FOR GROUPWISE POINT-SETS REGISTRATION
Matching point patterns is ubiquitous in many fields of engineering and science, e.g.,
medical imaging, sports science, and archaeology. Point sets are widely used in
computer vision to represent boundary points of shapes contained in images, or any other
salient features of objects contained in images. Given two or more images represented
by the salient features contained therein, more often than not one is interested in
matching these (feature) point patterns to determine a linear or a nonlinear transformation
between the coordinates of the feature point sets. Such transformations capture the
changes in the pattern geometry characterized by the given feature point sets.
The primary technical challenge in using point-set representations of shapes is
the correspondence problem. Typically, correspondences can be estimated once the
point-sets are properly aligned with appropriate spatial transformations. If the objects
at hand are deformable, the adequate transformation would obviously be a non-rigid
spatial mapping. Solving for non-rigid deformations between point-sets with unknown
correspondence is a hard problem. In fact, many current methods only attempt to solve
for an affine transformation for the alignment [55]. Furthermore, we also encounter the
bias problem in groupwise point-set registration: if one arbitrarily chooses any one
of the given data sets as a reference, the estimated registration transformation will be
biased toward this chosen reference, and it is desirable to avoid such a bias. The
question that arises is: how do we align all the point-sets in a symmetric manner so that
there is no bias toward any particular point-set?
To overcome these aforementioned problems, we present a novel approach to
simultaneously register multiple point-sets and construct the atlas. The idea is to model
each point set by a kernel probability density or distribution, then quantify the distance
between these probability densities or distributions using information-theoretic measures.
Figure 4–1 illustrates this idea: the right column of the figure shows the density functions
corresponding to the corpus callosum point-sets shown on the left. The distance is
Figure 4–1: Illustration of corpus callosum point-sets represented as density functions.
optimized over a space of coordinate transformations, yielding the desired registrations.
It is obvious that once all the point sets are deformed into the same shape, the distance
measure between these distributions is minimized, since all the distributions are
identical to one another. We impose regularization on each deformation field to prevent
over-deformation of the point-sets (e.g., all the point-sets deforming into a single data
point).
The rest of the chapter is organized as follows: we begin by reviewing the related
literature, followed by a description of the divergence measures we use to
quantify the distance between densities or distributions. We then present the details of
our energy function, and the empirical estimation of the cost functions and their
derivatives. Finally, we show experimental results at the end of this chapter.
4.1 Previous Work
Extensive studies on atlas construction for deformable shapes can be found in the
literature, covering both theoretical and practical issues relating to computer vision and
pattern recognition. According to the shape representation, they can be classified into
two distinct categories. One consists of methods dealing with shapes represented by feature
point-sets; everything else is in the other category, including shapes represented
as curves and surfaces of the shape boundary, where these curves and surfaces may be
either intrinsically or extrinsically parameterized (e.g., using point locations and spline
coefficients).
The work presented in [56] is a representative method using an intrinsic curve
parameterization to analyze deformable shapes. Shapes are represented as elements of
infinite-dimensional spaces, and their pairwise differences are quantified using the lengths
of geodesics connecting them in these spaces; the intrinsic mean (Karcher mean) can
be computed as the point on the manifold (of shapes) which minimizes the sum of squared
geodesic distances between this unknown point and each individual shape lying on the
manifold. However, the method is limited to closed curves, and it has not been extended
to 3D surface shapes. For methods using intrinsic curve or surface representations
[56, 57, 58], further statistical analysis on these representations is much more difficult
than analysis on point representations, but the reward may be higher due to the use of
intrinsic higher-order representations.
Among the methods using point-set parameterizations, the idea of using non-rigid
spatial mapping functions, specifically thin-plate splines [59, 60, 61], to analyze
deformable shapes has been widely adopted. Bookstein's work [59] successfully
initiated the research efforts on the use of thin-plate splines to model the deformation
of shapes. This method is landmark-based; it avoids the correspondence problem since
the placement of corresponding points is driven by the visual perception of experts;
however, it suffers from the typical problems besetting landmark methods, e.g.,
inconsistency. Several significant articles on robust and non-rigid point-set matching have
been published by Rangarajan and collaborators [62, 60, 63] using thin-plate splines.
In their recent work [60], they attempt to extend their framework to the construction of a
mean shape from a set of unlabeled shapes which are represented by unlabeled point-sets.
The main strength of their work is the ability to jointly determine the correspondences
and the non-rigid transformation between each point set and the emerging mean shape using
deterministic annealing and soft-assign. However, in their work, the stability of the
registration result is not guaranteed in the presence of outliers, and hence a good
stopping criterion is required. Unlike their approach, we do not need to first solve a
correspondence problem in order to subsequently solve a non-rigid registration problem.
The active shape model proposed in [64] utilizes points to represent deformable
shapes. This work pioneered the effort to build point distribution models for understanding
deformable shapes [64, 65]. Objects are represented by carefully defined landmark
points, and the variation of shapes is modeled using principal component analysis. These
landmark points are acquired through a more or less manual landmarking process in which
an expert goes through all the samples to mark corresponding points on each sample. It is
a rather tedious process, and its accuracy is limited. In recent work [66], the authors attempt
to overcome this limitation by automatically solving for the correspondences
in a non-rigid setting. The resulting algorithm is very similar to the earlier work in [58]
and is restricted to curves. The work in [55] also uses 2D points to learn shape statistics,
and is quite similar to the active shape model method except that more attention is paid
to the process of generating sample point-sets from the shape. Unlike our method,
the transformations between curves are limited to rigid mappings, and the process is not
symmetric.
There are several papers in the point-set alignment literature which bear a close
relation to the research reported here. For instance, Tsin and Kanade [67] proposed
a kernel-correlation-based point-set registration approach in which the cost function is
proportional to the correlation of two kernel density estimates. It is similar to our
work in that we too model each of the point sets by a kernel density function and then
quantify the (dis)similarity between them using an information-theoretic measure,
followed by an optimization of the (dis)similarity function over a space of coordinate
transformations, yielding the desired transformation. The difference lies in the fact that the
divergence measures used in our work are considerably more general than the information-theoretic
measure used in [67], and can easily be extended to multiple point-sets. More recently,
in [68], Glaunes et al. convert the point matching problem into an image matching
problem by treating points as delta functions. They then "lift" these delta functions and
diffeomorphically match them. The main problem with this technique is that it requires
a 3D spatial integral which must be computed numerically; we do not need this
because the divergence measures are computed empirically. We will show in the
experimental results that our method, when applied to match point-sets, achieves very
good performance in terms of both robustness and accuracy.
4.2 Divergence Measures
In probability theory and information theory, divergence measures generally stand
for measures that quantify the "distance" between probability distributions. If there are
multiple distributions, the divergence serves as a measure of cohesion between these
distributions. Since we are dealing with groupwise point-sets, which will be represented
as multiple probability densities/distributions, we focus on divergence measures
between multiple distributions.
4.2.1 Jensen-Shannon Divergence
Many information and divergence measures exist in the literature on
information theory and statistics. The most famous among them is the Kullback-Leibler
(KL) divergence. The KL divergence (also known as the relative entropy) between two
densities p and q is defined as

    D_KL(p ‖ q) = ∫ p(x) log [ p(x) / q(x) ] dx

It is convex in p, non-negative (though not necessarily finite), and is zero if and only if
p = q. In information theory, it has an interpretation in terms of the length of encoded
messages from a source which emits symbols according to a probability density function.
While the familiar Shannon entropy gives a lower bound on the average length per
symbol a code can achieve, the KL divergence between p and q gives the penalty (in
length per symbol) incurred by encoding a source with density p under the assumption
that it really has density q; this penalty is commonly called redundancy.
To illustrate this, consider the Morse code, designed to send messages in English.
The Morse code encodes the letter "E" with a single dot and the letter "Q" with a
sequence of four dots and dashes. Because "E" is used frequently in English and "Q"
seldom, this makes for efficient transmission. However, if one wanted to use the Morse
code to send messages in Chinese pinyin, which might use "Q" more frequently, one would
find the code less efficient. If we assume, counterfactually, that the Morse code is optimal
for English, this difference in efficiency is the redundancy.
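The redundancy interpretation can be made concrete in a few lines of code. The sketch below (Python; the three-symbol frequencies are invented for illustration, not actual English or pinyin statistics) computes the discrete KL divergence in bits:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2(p(x)/q(x)), in bits per symbol."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0                      # 0 * log(0/q) = 0 by convention
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

# Invented symbol frequencies: q is the distribution a code was designed
# for, p is the source it is actually used on; D_KL(p||q) is the extra
# cost in bits per symbol, i.e., the redundancy.
q = [0.7, 0.2, 0.1]
p = [0.2, 0.2, 0.6]
print(kl_divergence(p, p))   # 0.0: no penalty when the model is exact
print(kl_divergence(p, q))   # positive: the mismatch penalty
```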
Notice that the KL divergence is not symmetric; a popular way to symmetrize it is

    J(p, q) = (1/2) [ D_KL(p ‖ q) + D_KL(q ‖ p) ]

which is called the J-divergence. The Jensen-Shannon (JS) divergence, first introduced in [69],
serves as a measure of cohesion between multiple probability distributions. It has been
used by some researchers as a dissimilarity measure for image registration and retrieval
applications [70, 71], with very good results. It has many desirable properties; to name a
few: 1) the square root of the JS-divergence (in the case when its parameter is fixed to 0.5)
is a metric [72]; 2) the JS-divergence relates to other information-theoretic functionals, such
as the relative entropy or the Kullback divergence, and hence it shares their mathematical
properties as well as their intuitive appeal; 3) the distributions compared using the
JS-divergence can be weighted, which allows one to take into account the different sizes
of the point-set samples from which the probability distributions are computed; 4) the
JS-divergence also allows a different number of cluster centers in each point-set; there is
no requirement that the cluster centers be in correspondence, as is required by Chui et
al. [73]. Given n probability density functions p_i, i ∈ {1, ..., n}, the JS-divergence of
the p_i is defined by

    JS_π(p_1, p_2, ..., p_n) = H(Σ_i π_i p_i) − Σ_i π_i H(p_i)     (4–1)

where π = {π_1, π_2, ..., π_n | π_i > 0, Σ_i π_i = 1} are the weights of the probability density
functions p_i and H(p_i) is the Shannon entropy. The two terms on the right-hand side
of Equation (4–1) are the entropy of p := Σ_i π_i p_i (the π-convex combination of the
p_i) and the same convex combination of the respective entropies. The JS-divergence can
be derived from the KL divergence:

    JS_α(p_1, p_2) = α KL(p_1, αp_1 + (1−α)p_2) + (1−α) KL(p_2, αp_1 + (1−α)p_2)     (4–2)

where α ∈ (0, 1) is a fixed parameter; we will also consider its straightforward
generalization to n distributions.
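As a quick numerical sanity check of the equivalence between Eqns. (4–1) and (4–2), the following sketch (Python; toy three-bin distributions) evaluates the JS-divergence in both its entropy form and its KL form and shows they agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p log p (natural log)."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def js_entropy_form(p1, p2, alpha):
    """Eq. (4-1) for n = 2: entropy of the mixture minus mixed entropies."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mix = alpha * p1 + (1 - alpha) * p2
    return entropy(mix) - alpha * entropy(p1) - (1 - alpha) * entropy(p2)

def js_kl_form(p1, p2, alpha):
    """Eq. (4-2): weighted KL divergences to the mixture."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mix = alpha * p1 + (1 - alpha) * p2
    return alpha * kl(p1, mix) + (1 - alpha) * kl(p2, mix)

p1, p2, alpha = [0.1, 0.4, 0.5], [0.3, 0.3, 0.4], 0.3
print(js_entropy_form(p1, p2, alpha))
print(js_kl_form(p1, p2, alpha))   # identical value: the two forms agree
```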
4.2.2 CDF-JS Divergence
In Chapter 2, we defined an entropy measure based on the probability distribution
instead of the density function. The distribution function is more regular because
it is defined in an integral form, unlike the density function, which is the derivative of
the distribution. The definition of the Cumulative Residual Entropy also preserves the
well-established principle that the logarithm of the probability of an event should represent
the information content of the event. CRE has been shown to be more immune to noise and
outliers. Based on this idea, we can define a KL-divergence measure between Cumulative
Distribution Functions (CDFs).
Definition: Let Pr(X_1 > x) and Pr(X_2 > x) be the cumulative residual distributions of
two random variables X_1 and X_2, respectively. We define the CDF-KL divergence by

    KD(P_1, P_2) = ∫ Pr(X_1 > x) ln [ Pr(X_1 > x) / Pr(X_2 > x) ] dx     (4–3)

Following the same relationship between the Jensen-Shannon divergence and the KL
divergence, we can derive the so-called CDF-JS divergence (denoted as J) from the
definition of the CDF-KL divergence; the result is stated in the following theorem.
Theorem 5 Given n probability distributions P_i, i ∈ {1, ..., n}, the CDF-JS divergence
of the P_i is given by

    J(P_1, P_2, ..., P_n) = E(Σ_i π_i P_i) − Σ_i π_i E(P_i)     (4–4)

where π = {π_1, π_2, ..., π_n | π_i > 0, Σ_i π_i = 1} are the weights of the probability
distributions P_i and E is the Cumulative Residual Entropy defined in Eqn. (2–3) of
Chapter 2.
Proof: Without loss of generality, we prove the two-random-variable case, for
which the CDF-JS divergence can be written as follows:

    J(P_1, P_2)
      = α KD(P_1, P) + (1−α) KD(P_2, P)
      = α ∫ Pr(X_1 > x) ln [ Pr(X_1 > x) / Pr(X > x) ] dx + (1−α) ∫ Pr(X_2 > x) ln [ Pr(X_2 > x) / Pr(X > x) ] dx
      = α ∫ Pr(X_1 > x) ln Pr(X_1 > x) dx + (1−α) ∫ Pr(X_2 > x) ln Pr(X_2 > x) dx
        − ∫ [ α Pr(X_1 > x) + (1−α) Pr(X_2 > x) ] ln Pr(X > x) dx
      = −α E(P_1) − (1−α) E(P_2) − ∫ [ α Pr(X_1 > x) + (1−α) Pr(X_2 > x) ] ln Pr(X > x) dx
                                                                         (4–5)

where P is the distribution function corresponding to the density function
p = α p_1 + (1−α) p_2, the convex combination of the two probability densities;
therefore

    Pr(X > x) = ∫_x^∞ p(u) du = ∫_x^∞ [ α p_1(u) + (1−α) p_2(u) ] du
              = α Pr(X_1 > x) + (1−α) Pr(X_2 > x)     (4–6)

Consequently, the CDF-JS divergence for two random variables can be rewritten as

    J(P_1, P_2) = −α E(P_1) − (1−α) E(P_2) − ∫ Pr(X > x) ln Pr(X > x) dx
                = E(P) − α E(P_1) − (1−α) E(P_2)     (4–7)
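The CDF-JS divergence of Theorem 5 can be estimated directly from samples by replacing each survival function Pr(X > x) with its empirical counterpart on a grid. The sketch below (Python; a 1D toy example with uniform weights, a simple Riemann approximation of the CRE integral, and made-up Gaussian samples) illustrates that the divergence vanishes for identical sample sets and is positive for shifted ones:

```python
import numpy as np

def survival(samples, grid):
    """Empirical survival function P(X > x) evaluated on a grid."""
    s = np.asarray(samples, float)
    return np.array([(s > x).mean() for x in grid])

def cre(surv, dx):
    """Cumulative Residual Entropy: -sum P(X>x) ln P(X>x) dx (Riemann sum)."""
    p = surv[surv > 0]
    return float(-np.sum(p * np.log(p)) * dx)

def cdf_js(sample_sets, grid, dx):
    """Eq. (4-4) with uniform weights: CRE of mean survival minus mean CRE."""
    survs = [survival(s, grid) for s in sample_sets]
    mix = np.mean(survs, axis=0)          # convex combination of survivals
    return cre(mix, dx) - float(np.mean([cre(s, dx) for s in survs]))

rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 8.0, 651)
dx = grid[1] - grid[0]
a = rng.normal(0.0, 1.0, 2000)
b = rng.normal(1.5, 1.0, 2000)
print(cdf_js([a, a.copy()], grid, dx))   # 0: identical sample sets
print(cdf_js([a, b], grid, dx))          # positive: shifted sets differ
```

Non-negativity follows from the concavity of −t ln t applied pointwise to the survival functions.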
4.3 Methodology
In this section, we present the details of the proposed simultaneous atlas construction and
non-rigid registration method. Note that atlas construction normally requires the task of
non-rigid registration following which an atlas is constructed from the registered data.
However, in our work, the atlas emerges as a byproduct of the non-rigid registration. The
basic idea is to model each point set by a probability density or distribution function, then
quantify the distance between these functions using an information-theoretic measure.
The distance measure is optimized over a space of coordinate transformations yielding the
desired transformations. We will begin by presenting the energy function for solving the
groupwise point-sets registration problem.
4.3.1 Energy Function for Groupwise Point-sets Registration
We use the following notation: the data point-sets are denoted by X^p, p ∈ {1, ..., N}.
Each point-set X^p consists of points x^p_i ∈ R^D, i ∈ {1, ..., n_p}. The atlas point-set is
denoted by Z. Assume that each point-set X^p is related to Z via a function f^p, and let µ^p be
the set of transformation parameters associated with each function f^p. To compute the
mean shape from these point-sets and register them to the emerging mean shape, we need
to recover these transformation parameters. With the objective function being the JS-divergence
or CDF-JS divergence between the distributions of the deformed point-sets, represented
as P_i = p(f^i(X^i)), the atlas construction problem can now be formulated as

    min_{µ^i}  D(P_1, P_2, ..., P_N) + λ Σ_{i=1}^N ‖L f^i‖²     (4–8)

In (4–8), D is the divergence measure for multiple distributions, for which we propose to use
either the JS divergence or the CDF-JS divergence. The weight parameter λ is a positive constant,
and the operator L determines the kind of regularization imposed. For example, L could
correspond to a thin-plate spline, a Gaussian radial basis function, etc. Each choice of L
is in turn related to a kernel and a metric of the deformation from and to Z.
Following the approach in [73], we choose the thin-plate spline (TPS) to represent the
non-rigid deformation. Given n control points x_1, ..., x_n in R^d, a general non-rigid
mapping f : R^d → R^d represented by a thin-plate spline can be written analytically as

    f(x) = W U(x) + Ax + t

Here Ax + t is the linear part of f. The nonlinear part is determined by a d × n matrix W,
and U(x) is an n × 1 vector consisting of n basis functions U_i(x) = U(x, x_i) = U(‖x − x_i‖),
where U(r) is the kernel function of the thin-plate spline. For example, if the dimension is 2
(d = 2) and the regularization functional is defined on the second derivatives of f, we have
U(r) = (1/8π) r² ln(r). Therefore, the cost function for non-rigid registration can be
formulated as an energy functional in a regularization framework, where the regularization
term in Equation (4–8) is governed by the bending energy of the thin-plate spline warping
and can be explicitly given by trace(WKW^T), where K = (K_ij), K_ij = U(‖x_i − x_j‖),
describes the internal structure of the control point-sets. Note that the linear part can be
obtained by an initial affine registration, after which an optimization can be performed to
find the parameters W.
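A minimal sketch of the TPS mapping f(x) = WU(x) + Ax + t described above (Python; toy control points, and the kernel taken as U(r) = r² ln r with the 1/(8π) constant dropped, which only rescales W):

```python
import numpy as np

def tps_kernel(r):
    """U(r) = r^2 ln r, with U(0) = 0 (the 1/(8*pi) constant is dropped)."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def tps_warp(x, ctrl, W, A, t):
    """f(x) = W U(x) + A x + t for 2-D points.

    x: (m, 2) queries, ctrl: (n, 2) control points,
    W: (2, n) bending weights, A: (2, 2) and t: (2,) affine part."""
    r = np.linalg.norm(x[:, None, :] - ctrl[None, :, :], axis=2)  # (m, n)
    return tps_kernel(r) @ W.T + x @ A.T + t

# With zero bending weights and an identity affine part, f is the identity.
ctrl = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.zeros((2, 4)); A = np.eye(2); t = np.zeros(2)
x = np.array([[0.25, 0.75], [0.5, 0.5]])
print(tps_warp(x, ctrl, W, A, t))   # returns x unchanged
```

Nonzero rows of W bend the plane away from the affine map; the bending energy trace(WKW^T) penalizes exactly those rows.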
4.3.2 JS Divergence in a Hypothesis Testing Framework
In this section we show that the Jensen-Shannon divergence can be interpreted in the
framework of statistical hypothesis testing. To see this, we construct a likelihood ratio
between i.i.d. samples drawn from a mixture (Σ_a π_a p_a) and i.i.d. samples drawn from a
heterogeneous collection of densities (p_1, p_2, ..., p_N), with the samples being indexed by
the specific member distribution in the family from which they are drawn. Assume that n_1
samples are drawn from p_1, n_2 from p_2, etc. Let the total number of pooled samples be
defined as M := Σ_{a=1}^N n_a. The likelihood ratio then is

    Λ = [ ∏_{k=1}^M Σ_{a=1}^N π_a p_a(x_k) ] / [ ∏_{a=1}^N ∏_{k_a=1}^{n_a} p_a(x^a_{k_a}) ]     (4–9)

where {x_k} consists of the points x^a_i, i ∈ {1, ..., n_a}, a ∈ {1, ..., N}, i.e., the pooled data
of all the samples. In contrast to the typical statistical test relative to a threshold, we seek
the maximum of the likelihood ratio in Eqn. (4–9). The following theorem shows the
relationship between the Jensen-Shannon divergence and the likelihood ratio.
Theorem 6 Given N probability density functions p_a, a ∈ {1, ..., N}, maximizing the
likelihood ratio in Eqn. (4–9) is equivalent to minimizing the Jensen-Shannon divergence
between the N probability densities p_a, a ∈ {1, ..., N}.
Proof: The proof follows by taking the logarithm of the likelihood ratio; using the
weak law of large numbers, we can show that the log-likelihood ratio is the negative of
the Jensen-Shannon divergence.
We seek to maximize the probability that the samples are drawn from the mixture rather
than from separate members of the family (p_1, p_2, ..., p_N). In the context of groupwise
matching of point-sets, this makes eminent sense, since maximizing the above ratio is
tantamount to increasing the chance that all of the observed point-sets are warped
versions of the same underlying warped and pooled data model. The notion of the pooled
data model is defined as follows. In our process of groupwise registration, the warping
does not have a fixed target data set. Instead, the warping is between the input data sets
and an evolving target which we call the pooled model. The target evolves to a fully
registered pooled data set at the end of the optimization process. The pooled model then
consists of the input data sets which have undergone groupwise matching and are now fully
registered with each other. The connection to the JS-divergence arises from the fact that
the negative logarithm of the above ratio (Eqn. 4–9) asymptotically converges to the
JS-divergence when the samples are assumed to be drawn from the mixture Σ_a π_a p_a.
4.3.3 Unbiasedness Property of the Divergence Measures
Typically we are required to construct an atlas from a very large number of point-sets, and
this process will usually take a long time, since the computational complexity grows
polynomially with the number of point-sets (N) that we want to register.
However, the following hierarchical method significantly reduces the computational
complexity.
Assume that we are given N point-sets from which we are going to construct the atlas.
We can then divide the N point-sets into m subsets (generally m ≪ N), and
construct m atlases, one from each subset, using our algorithms; all the point-sets within
each subset are thereby registered. Then we can either construct a single atlas from these m
atlas point-sets, or we can further divide the m atlas point-sets into even smaller subsets and
follow the same process until a single atlas is constructed. The remaining question is
whether the atlas thus obtained is biased. The following theorem leads us to the
answer.
Theorem 7 Given N probability distributions P_a, a ∈ {1, ..., N}, each having a weight
π_a in the JS or CDF-JS divergence, divide the N distributions into m subsets such that
the i-th subset contains n_i distributions P_a, a ∈ {k^(i)_1, k^(i)_2, ..., k^(i)_{n_i}}, with
Σ_i n_i = N. Let S_i be the convex combination of all the distributions in the i-th subset,
with weights π_{k^(i)_j}/β_i, where β_i = Σ_j π_{k^(i)_j}, i.e., S_i = Σ_{j=1}^{n_i} π_{k^(i)_j} P_{k^(i)_j} / β_i.
We then have the following relationship between the divergence of the P_a and the
divergence of the S_i:

    D_π(P_1, P_2, ..., P_N) − D_β(S_1, S_2, ..., S_m)
      = Σ_{i=1}^m β_i D_{π_{k^(i)}/β_i}(P_{k^(i)_1}, P_{k^(i)_2}, ..., P_{k^(i)_{n_i}})     (4–10)

Proof: The relationship in Eqn. (4–10) follows by simple algebraic manipulation: the
π-mixture of the P_a equals the β-mixture of the S_i, so the mixture-entropy terms cancel
and the remaining entropy sums regroup subset by subset.
In our registration algorithm, all the point-sets are represented as probability distributions,
and the atlas thus constructed can be considered a convex combination of these
distributions. Therefore, we can treat the P_a and S_i as the distributions corresponding to
the point-sets and to the atlases constructed from the subsets, respectively, and from
Theorem 7 the relationship in Eqn. (4–10) holds between their divergences. Notice that the
right-hand side of Eqn. (4–10) is the sum of the JS/CDF-JS divergences of the distributions
within all the subsets, which are minimized in each step of
the hierarchical method we proposed. Intuitively, if these point-sets are aligned properly,
the corresponding distribution functions should be statistically similar. Therefore the
divergences within all the subsets should be zero or very close to zero, which means the right-hand
side of Eqn. (4–10) is zero. Consequently, the JS/CDF-JS divergence of the P_a and the
divergence of the S_i are equal to each other; minimizing the JS/CDF-JS divergence
of the resultant atlas point-sets is therefore equivalent to minimizing the divergence of the
original point-sets, implying that there is no bias toward any particular partitioning of the
point-sets.
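Eqn. (4–10) is easy to verify numerically for the JS divergence. The sketch below (Python; random toy distributions and an arbitrary two-subset partition) checks that the divergence of the pooled collection decomposes exactly as stated:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def js(dists, weights):
    """JS divergence of discrete distributions with the given weights."""
    dists = [np.asarray(d, float) for d in dists]
    mix = sum(w * d for w, d in zip(weights, dists))
    return entropy(mix) - sum(w * entropy(d) for w, d in zip(weights, dists))

rng = np.random.default_rng(1)
P = [rng.dirichlet(np.ones(5)) for _ in range(4)]   # four toy distributions
pi = np.full(4, 0.25)
groups = [[0, 1], [2, 3]]                           # arbitrary partition

beta = [float(pi[g].sum()) for g in groups]                   # subset weights
S = [sum(pi[a] * P[a] for a in g) / b for g, b in zip(groups, beta)]

lhs = js(P, pi) - js(S, beta)
rhs = sum(b * js([P[a] for a in g], pi[g] / b) for g, b in zip(groups, beta))
print(lhs, rhs)   # the two sides of Eq. (4-10) coincide
```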
Having introduced the cost function and the transformation model, the task now is to
design an efficient way to estimate the empirical divergence measures between multiple
densities or distributions, and to derive the analytic gradient of the estimated divergence in
order to reach the optimal solution efficiently. We design two completely different
approaches for estimating the JS and CDF-JS divergences: a finite mixture
model for the JS divergence and the Parzen window technique for the CDF-JS
divergence, the details of which are introduced next.
4.3.4 Estimating JS and its Derivative
4.3.4.1 Finite mixture models
Considering the point-set as a collection of Dirac delta functions, it is natural to think of a
finite mixture model as the representation of a point-set. The most frequently used mixture
model, the Gaussian mixture [74], is defined as a convex combination of Gaussian
component densities.
To model each point-set as a Gaussian mixture, we define a set of cluster centers, one set
per point-set, to serve as the Gaussian mixture centers. Since the feature point-sets are
usually highly structured, we can expect them to cluster well. Furthermore, we can greatly
improve the algorithm's efficiency by using a limited number of clusters. Note that we can
choose the cluster centers to be the point-set itself if the point-set is quite small.
The cluster-center point-sets are denoted by V^p, p ∈ {1, ..., N}. Each point-set V^p
consists of points v^p_i ∈ R^D, i ∈ {1, ..., K_p}. Note that there are K_p points in each V^p,
and the number of clusters for each point-set may be different (in our implementation, the
number of clusters was usually chosen to be proportional to the size of the point-set).
The cluster centers are estimated by a clustering process over the original sample
points x^p_i, and we only need to do this once, before the process of joint atlas estimation
and point-set registration. In our implementation, we utilize the deterministic annealing
(DA) procedure, with its proven benefit of robustness in clustering [75]. We begin by
specifying the density function of each point-set:

    p^p(x) = Σ_{a=1}^{K_p} α^p_a p(x | v^p_a)     (4–11)

In Equation (4–11), the occupancy probability, which is different for each data point-set, is
denoted by α^p. The component density p(x | v^p_a) is

    p(x | v^p_a) = 1 / [ (2π)^{D/2} |Σ_a|^{1/2} ] exp( −(1/2) (x − v^p_a)^T Σ_a^{−1} (x − v^p_a) )     (4–12)

The probability of the point-set X^p coming from this mixture is then

    Pr(X^p | V^p, α^p) = ∏_{i=1}^{n_p} p^p(x^p_i) = ∏_{i=1}^{n_p} Σ_{a=1}^{K_p} α^p_a p(x^p_i | v^p_a)     (4–13)

Later, we set the occupancy probabilities to be uniform and make the covariance matrices
Σ_a proportional to the identity matrix in order to simplify the atlas estimation
procedure.
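A sketch of the mixture density of Equations (4–11)–(4–12) under the simplifications adopted above (uniform occupancies α^p_a = 1/K and isotropic, identical covariances Σ_a = σ²I_D; Python, with toy cluster centers):

```python
import numpy as np

def gmm_density(x, centers, sigma):
    """p(x) = (1/K) sum_a N(x; v_a, sigma^2 I): Eqs. (4-11)-(4-12) with
    uniform occupancies and isotropic, identical covariances."""
    x = np.atleast_2d(np.asarray(x, float))          # (m, D) query points
    centers = np.asarray(centers, float)             # (K, D) cluster centers
    D = centers.shape[1]
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma ** 2) ** (D / 2.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1) / (len(centers) * norm)

# Toy cluster centers standing in for one point-set's V^p.
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
near = gmm_density([[1.0, 0.5]], centers, sigma=0.5)
far = gmm_density([[10.0, 10.0]], centers, sigma=0.5)
print(near, far)   # the density is much larger near the centers
```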
For simplicity, we choose β_i = 1/N, ∀i = 1, 2, ..., N. Let
Q^{x^j_i}_p := Σ_{a=1}^K α^p_a p(f^j(x^j_i) | f^p(v^p_a)) be a mixture model containing the
component densities p(f^j(x^j_i) | f^p(v^p_a)), where

    p(f^j(x^j_i) | f^p(v^p_a)) = 1 / [ (2π)^{D/2} |Σ_a|^{1/2} ]
        exp( −(1/2) (f^j(x^j_i) − f^p(v^p_a))^T Σ_a^{−1} (f^j(x^j_i) − f^p(v^p_a)) )     (4–14)

and {Σ_a, a ∈ {1, ..., K}} is the set of cluster covariance matrices. For the sake of
simplicity and ease of implementation, we assume that the occupancy probabilities are
uniform (α^p_a = 1/K) and the covariance matrices Σ_a are isotropic, diagonal, and identical
(Σ_a = σ² I_D). Having specified the density function of the data, we can then rewrite the
divergence measure in Equation (4–8) as follows,
    JS_β(P_1, P_2, ..., P_N) = (1/N) Σ_{j=1}^N [ H(Σ_i (1/N) P_i) − H(P_j) ]     (4–15)

For each term in this sum, we can estimate the entropy difference using the weak law of
large numbers:

    H(Σ_i (1/N) P_i) − H(P_j)
      ≈ −(1/n_j) Σ_{i=1}^{n_j} log [ (Q^{x^j_i}_1 + Q^{x^j_i}_2 + ... + Q^{x^j_i}_N) / N ]
        + (1/n_j) Σ_{i=1}^{n_j} log Q^{x^j_i}_j
      = (1/n_j) Σ_{i=1}^{n_j} log [ N Q^{x^j_i}_j / (Q^{x^j_i}_1 + Q^{x^j_i}_2 + ... + Q^{x^j_i}_N) ]
Combining these terms, we have

    JS(P_1, P_2, ..., P_N)
      = (1/N) { (1/n_1) Σ_{i=1}^{n_1} log [ N Q^{x^1_i}_1 / (Q^{x^1_i}_1 + Q^{x^1_i}_2 + ... + Q^{x^1_i}_N) ]
        + (1/n_2) Σ_{i=1}^{n_2} log [ N Q^{x^2_i}_2 / (Q^{x^2_i}_1 + Q^{x^2_i}_2 + ... + Q^{x^2_i}_N) ]
        + ... + (1/n_N) Σ_{i=1}^{n_N} log [ N Q^{x^N_i}_N / (Q^{x^N_i}_1 + Q^{x^N_i}_2 + ... + Q^{x^N_i}_N) ] }     (4–16)
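The estimate in Eqn. (4–16) reduces to evaluating each mixture Q at the pooled sample points. The sketch below (Python; for simplicity each point-set serves as its own cluster-center set, σ is a toy value, and the point-sets are left untransformed) implements it:

```python
import numpy as np

def mixture_density(x, centers, sigma):
    """Q_p evaluated at samples x: the isotropic Gaussian mixture of Eq. (4-14)."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    D = centers.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (D / 2.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1) / (len(centers) * norm)

def empirical_js(point_sets, sigma):
    """Law-of-large-numbers estimate of Eq. (4-16) with weights 1/N."""
    N = len(point_sets)
    total = 0.0
    for j, Xj in enumerate(point_sets):
        # Q[p, i] = density of point-set p's mixture at sample x^j_i.
        Q = np.array([mixture_density(Xj, Xp, sigma) for Xp in point_sets])
        total += float(np.mean(np.log(N * Q[j] / Q.sum(axis=0)))) / N
    return total

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, (100, 2))
X2 = rng.normal(0.0, 1.0, (100, 2)) + np.array([3.0, 0.0])
print(empirical_js([X1, X1.copy()], sigma=0.2))   # 0 for identical sets
print(empirical_js([X1, X2], sigma=0.2))          # larger when displaced
```

In the registration setting, each X^p would first be warped by f^p(·; µ^p), and the optimizer would drive this estimate toward zero.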
4.3.4.2 Optimizing the JS divergence
Computation of the gradient of the energy function is necessary in the minimization
process when employing a gradient-based scheme. If this can be done in analytic form,
it leads to an efficient optimization method. We now present the analytic form of the
gradient of the JS-divergence (our cost function):

    ∇JS = [ ∂JS/∂µ^1, ∂JS/∂µ^2, ..., ∂JS/∂µ^N ]     (4–17)
Each component of the gradient may be found by differentiating Eqn. (4–16) with respect
to the transformation parameters. To compute this gradient, let us first calculate the
derivative of Q^{x^j_i}_p with respect to µ^l:

    ∂Q^{x^j_i}_p / ∂µ^l =
      −[ 1 / ((2π)^{D/2} σ^{D+2} K) ] Σ_{a=1}^K exp( −|F_jp|²/(2σ²) ) ( F_jp · ∂f^j(x^j_i)/∂µ^l )   if l = j ≠ p
       [ 1 / ((2π)^{D/2} σ^{D+2} K) ] Σ_{a=1}^K exp( −|F_jp|²/(2σ²) ) ( F_jp · ∂f^p(v^p_a)/∂µ^l )   if l = p ≠ j
       [ 1 / ((2π)^{D/2} σ^{D+2} K) ] Σ_{a=1}^K exp( −|F_jp|²/(2σ²) ) ( F_jp · [ ∂f^p(v^p_a)/∂µ^l − ∂f^j(x^j_i)/∂µ^l ] )   if l = p = j
                                                                         (4–18)

where F_jp := f^j(x^j_i) − f^p(v^p_a). Based on this, it is straightforward to derive the gradient
of the JS-divergence with respect to the transformation parameters µ^l, which is given by
    ∂JS/∂µ^l = (1/N) {
      −(1/n_1) Σ_{i=1}^{n_1} [ 1 / (Q^{x^1_i}_1 + Q^{x^1_i}_2 + ... + Q^{x^1_i}_N) ] ∂Q^{x^1_i}_l/∂µ^l
      −(1/n_2) Σ_{i=1}^{n_2} [ 1 / (Q^{x^2_i}_1 + Q^{x^2_i}_2 + ... + Q^{x^2_i}_N) ] ∂Q^{x^2_i}_l/∂µ^l − ...
      −(1/n_l) Σ_{i=1}^{n_l} [ 1 / (Q^{x^l_i}_1 + Q^{x^l_i}_2 + ... + Q^{x^l_i}_N) ] [ ∂Q^{x^l_i}_1/∂µ^l + ... + ∂Q^{x^l_i}_N/∂µ^l ]
      +(1/n_l) Σ_{i=1}^{n_l} ( 1 / Q^{x^l_i}_l ) ∂Q^{x^l_i}_l/∂µ^l
      − ... − (1/n_N) Σ_{i=1}^{n_N} [ 1 / (Q^{x^N_i}_1 + Q^{x^N_i}_2 + ... + Q^{x^N_i}_N) ] ∂Q^{x^N_i}_l/∂µ^l }     (4–19)
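Analytic gradients such as Eqn. (4–19) are error-prone to implement, and a standard safeguard is to compare them against central finite differences. The sketch below (Python; demonstrated on a toy quadratic rather than the JS cost itself) shows the pattern:

```python
import numpy as np

def numerical_grad(f, mu, eps=1e-6):
    """Central-difference gradient: (f(mu+e) - f(mu-e)) / (2 eps) per axis."""
    g = np.zeros_like(mu, dtype=float)
    for i in range(mu.size):
        e = np.zeros_like(g)
        e[i] = eps
        g[i] = (f(mu + e) - f(mu - e)) / (2.0 * eps)
    return g

# Toy check: f(mu) = ||mu||^2 has analytic gradient 2 mu.
mu = np.array([0.5, -1.0, 2.0])
f = lambda m: float(np.dot(m, m))
print(numerical_grad(f, mu))   # close to [1.0, -2.0, 4.0]
```

The same check, with f replaced by the empirical JS cost and mu by a slice of the transformation parameters, validates an implementation of Eqns. (4–18)–(4–19).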
4.3.5 Estimating CDF-JS and its Derivative
We use the Parzen window technique to estimate the cumulative distribution function
and its derivative. The calculation of the CDF-JS divergence requires the estimation of the
cumulative probability distribution of each point-set. Without loss of generality, we
discuss only the derivation for the 2D case, which can be extended to the 3D case easily. Each
point x^a_i, with coordinates [x^a_i, y^a_i], in the point-set X^a is transformed by the function
f^a to f^a(x^a_i, µ^a) = [f^a(x^a_i, µ^a), f^a(y^a_i, µ^a)]. Let β^(3) be a cubic spline Parzen window.
The smoothed probability density function p^a(l, k; µ^a) of the point-set
Xa, a ∈ 1, ..., N is given by
\[
p^a(l, k; \mu^a) = \alpha^a \sum_i^{n_a} \beta^{(3)}\!\left(l - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X}\right)\beta^{(3)}\!\left(k - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y}\right) \tag{4–20}
\]
where $\alpha^a$ is a normalization factor that ensures $\int\!\!\int p(l,k)\,dl\,dk = 1$, $[l,k]$ are the coordinate values along the X and Y axes respectively, and the transformed point coordinates $[f^a(x_i^a,\mu^a), f^a(y_i^a,\mu^a)]$ are normalized by the minimum coordinate values $x_0, y_0$ and the bin widths $\Delta b_X, \Delta b_Y$. From the density function, we can calculate the cumulative residual distribution function via $P^a(l > \lambda, k > \gamma; \mu^a) = \int_\lambda^\infty\!\int_\gamma^\infty p^a(l, k; \mu^a)\,dl\,dk$, where $\lambda \in L_x$, $\gamma \in L_y$, and $L_x, L_y$ are discrete sets of coordinate values along the X and Y axes respectively. In more detail, we have
\[
\begin{aligned}
P^a(\lambda, \gamma; \mu^a) &= \alpha^a \sum_i^{n_a} \int_\lambda^\infty \beta^{(3)}\!\left(l - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X}\right) dl \int_\gamma^\infty \beta^{(3)}\!\left(k - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y}\right) dk \\
&= \alpha^a \sum_i^{n_a} \Phi\!\left(\lambda - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X}\right) \Phi\!\left(\gamma - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y}\right)
\end{aligned}
\]
where $\Phi(\cdot)$ is the cumulative residual function of the cubic spline kernel, defined as

\[
\Phi(v) = \int_v^\infty \beta^{(3)}(u)\, du.
\]

Note that $\frac{d\Phi(u)}{du} = -\beta^{(3)}(u)$.
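For illustration, the kernel $\beta^{(3)}$ and its cumulative residual function $\Phi$ can be checked numerically. The sketch below assumes the standard cubic B-spline normalization (an assumption; the exact form of the kernel is not restated here) and verifies that $\Phi$ decreases from 1 to 0 across the kernel support and that $d\Phi/du = -\beta^{(3)}(u)$.

```python
import numpy as np

def beta3(u):
    """Standard cubic B-spline kernel beta^(3), supported on [-2, 2]."""
    a = np.abs(u)
    return np.where(a < 1, (4 - 6 * a**2 + 3 * a**3) / 6.0,
           np.where(a < 2, (2 - a)**3 / 6.0, 0.0))

def trapezoid(y, x):
    """Plain composite trapezoid rule."""
    return float(((y[1:] + y[:-1]) * np.diff(x)).sum() / 2.0)

def Phi(v, n=20001):
    """Phi(v) = integral of beta3 from v to +inf (the support ends at 2)."""
    u = np.linspace(min(v, 2.0), 2.0, n)
    return trapezoid(beta3(u), u)

# Phi decreases from 1 to 0 across the kernel support
assert abs(Phi(-2.0) - 1.0) < 1e-6
assert abs(Phi(2.0)) < 1e-12
# dPhi/du = -beta3(u), checked with central differences at u = 0.5
h = 1e-4
num = (Phi(0.5 + h) - Phi(0.5 - h)) / (2 * h)
assert abs(num + float(beta3(np.array(0.5)))) < 1e-3
print("Phi is the survival function of beta3")
```

Because $\beta^{(3)}$ is symmetric and integrates to one, $\Phi(0) = 1/2$, which provides a further quick check.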
Having specified the distribution function of the data, we can rewrite Eqn. (4–8) as follows (for simplicity, we choose $\pi^a = \frac{1}{N}$, $\forall a = 1, 2, \ldots, N$):

\[
\begin{aligned}
J(P^1, P^2, \ldots, P^N) &= E\!\left(\sum_{a=1}^N \pi^a P^a\right) - \sum_{a=1}^N \pi^a E(P^a) \\
&= -\sum_\lambda \sum_\gamma P \log P + \frac{1}{N}\sum_a \sum_\lambda \sum_\gamma P^a \log P^a
\end{aligned}
\tag{4–21}
\]
where $P$ is the cumulative residual distribution function for the density function $\frac{1}{N}\sum p^a(l, k; \mu^a)$, which can be expressed as

\[
P(\lambda, \gamma; \mu^a) = \frac{1}{N}\sum_{a=1}^N \alpha^a \sum_i^{n_a} \Phi\!\left(\lambda - \frac{f^a(x_i^a, \mu^a) - x_0}{\Delta b_X}\right)\Phi\!\left(\gamma - \frac{f^a(y_i^a, \mu^a) - y_0}{\Delta b_Y}\right) \tag{4–22}
\]
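The divergence of Eqn. (4–21) can be computed directly from discretized survival functions. The following sketch works in 1-D for brevity (the derivation above is 2-D) with made-up Gaussian point-sets and bin edges, and verifies two basic properties: the CDF-JS divergence vanishes for identical point-sets and is positive for distinct ones.

```python
import numpy as np

def survival(points, edges):
    """Empirical 1-D survival function P(x > lambda) at the given edges."""
    return np.array([(points > e).mean() for e in edges])

def cre(P):
    """Discrete cumulative residual entropy E(P) = -sum P log P (0 log 0 = 0)."""
    Pp = P[P > 0]
    return -(Pp * np.log(Pp)).sum()

def cdf_js(point_sets, edges):
    """CDF-JS divergence with equal weights pi_a = 1/N, as in Eqn. (4-21)."""
    Ps = [survival(ps, edges) for ps in point_sets]
    Pbar = np.mean(Ps, axis=0)
    return cre(Pbar) - np.mean([cre(P) for P in Ps])

rng = np.random.default_rng(0)
edges = np.linspace(-3, 3, 50)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(1.0, 1.0, 500)   # shifted population

# zero for identical populations, positive for shifted ones
assert abs(cdf_js([a, a], edges)) < 1e-12
assert cdf_js([a, b], edges) > 0
print("CDF-JS is nonnegative and vanishes for identical point-sets")
```

Nonnegativity follows from the concavity of $-x\log x$ on $[0,1]$ applied pointwise to the survival values, mirroring the argument for the density-based JS divergence.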
4.3.5.1 Optimizing the CDF-JS divergence

We now present the analytic form of the gradient of the CDF-JS divergence (our cost function):

\[
\nabla J = \left[\frac{\partial J}{\partial \mu^1}, \frac{\partial J}{\partial \mu^2}, \ldots, \frac{\partial J}{\partial \mu^N}\right] \tag{4–23}
\]

Each component of the gradient may be found by differentiating Eqn. (4–21) with respect to the transformation parameters. It can easily be shown that $\frac{\partial P(\lambda,\gamma;\mu^a)}{\partial \mu^a} = \frac{1}{N}\frac{\partial P^a(\lambda,\gamma;\mu^a)}{\partial \mu^a}$. Based on these facts, it is straightforward to derive the gradient of the CDF-JS divergence with respect to the transformation parameters $\mu^a$, which is given by
\[
\begin{aligned}
\frac{\partial J}{\partial \mu^a} &= -\sum_\lambda \sum_\gamma \left[1 + \log P\right]\frac{\partial P(\lambda, \gamma; \mu^a)}{\partial \mu^a} + \frac{1}{N}\sum_\lambda \sum_\gamma \left[1 + \log P^a\right]\frac{\partial P^a(\lambda, \gamma; \mu^a)}{\partial \mu^a} \\
&= \frac{1}{N}\sum_\lambda \sum_\gamma \frac{\partial P^a(\lambda, \gamma; \mu^a)}{\partial \mu^a}\,\log\frac{P^a}{P}
\end{aligned}
\tag{4–24}
\]
As a byproduct of the groupwise registration using the CDF-JS divergence, we obtain the atlas of the given population of data sets, found simply by substituting the estimated transformation functions $f^a$ into the formula for the atlas $p(A) = \sum_{a=1}^N \pi^a P^a$. Note that our algorithm can also yield a biased registration in situations that demand such a solution; this is achieved by fixing one of the data sets (say, the reference) and estimating the transformation from it to the novel scene data. We present experimental results on point-set alignment between two given point-sets, as well as atlas construction from multiple point-sets, in the next section.
4.4 Experiment Results

We now present experimental results using the JS divergence and the CDF-JS divergence for point-set registration, demonstrated on both synthetic and real data sets.
4.4.1 JS Divergence Results

To demonstrate the robustness and accuracy of our algorithm, we first show alignment results obtained by applying the JS-divergence to the point-set matching problem. We then present atlas construction results in the second part.
4.4.1.1 Alignment results

First, to test the validity of our approach, we performed a set of exact rigid registration experiments on both synthetic and real data sets without noise and outliers. Some examples are shown in Figure 4–2. The top row shows the registration result for a 2D real range data set of a road (which was also used in Tsin and Kanade's experiments [67]). The figure depicts the real data and the registered result (using rigid motion): the top left frame contains the two unregistered point-sets superposed on each other, and the top right frame contains the same point-sets after registration using our algorithm. A 3D helix example is presented in the second row (with the same arrangement as the top row). We also tested our method against the KC method [67] and ICP. As expected, our method and the KC method exhibit a much wider convergence basin/range than ICP, and both achieve very high accuracy in the noiseless case.
We also applied our algorithm to non-rigidly register medical data sets (2D point-sets). Figure 4–3 depicts some results of our registration method applied to a set of 2D corpus callosum slices with feature points manually extracted by human experts. The registration result is shown in the left column, with the warping of a 2D grid under the recovered motion shown in the middle column. Our non-rigid alignment performs well in the
[Figure 4–2 panels: "Initial setup" and "After registration" for the 2D road data (top row) and the 3D helix data (bottom row).]
Figure 4–2: Results of rigid registration in the noiseless case. 'o' and '+' indicate the model and scene points respectively.
presence of noise and outliers (Figure 4–3, right column). For comparison, we also tested the TPS-RPM program provided in [62] on this data set, and found that TPS-RPM can correctly register the pair without outliers (Figure 4–3, top left) but fails to match the corrupted pair (Figure 4–3, top right).
[Figure 4–3 panels: "Initial Setup" and "After registration" (left column), "Original point set" and "Deformed point set" (middle column), and the corrupted pair before and after registration (right column).]
Figure 4–3: Non-rigid registration of the corpus callosum data. Left column: two manually segmented corpus callosum slices before and after registration. Middle column: warping of the 2D grid using the recovered motion. Top right: the same slices with one corrupted by noise and outliers, before and after registration.
4.4.1.2 Atlas construction results

In this section, we begin with a simple but illustrative example of our algorithm for 2D atlas estimation. The structure of interest in this experiment is the corpus callosum as it appears in MR brain images. Constructing an atlas for the corpus callosum and subsequently analyzing individual shape variation from "normal" anatomy has been regarded as potentially valuable for the study of brain diseases such as agenesis of the corpus callosum (ACC) and fetal alcohol syndrome (FAS).
[Figure 4–4 panels: "Point-sets Before Registration", the seven input point-sets ("Point Set 1" through "Point Set 7"), and "Deformed Point-sets".]
Figure 4–4: Experiment results on seven 2D corpus callosum point-sets. The first two rows and the left image in the third row show the deformation of each point-set to the atlas, superimposed with the initial point-set (shown in 'o') and the deformed point-set (shown in '*'). Middle image in the third row: the estimated atlas superimposed over all the point-sets. Right: the estimated atlas superimposed over all the deformed point-sets.
We manually extracted points on the outer contour of the corpus callosum from seven normal subjects (shown in Figure 4–4, indicated by 'o'). The recovered deformation between each point-set and the mean shape is superimposed in the first two rows of Figure 4–4. The resulting atlas (mean point-set) is shown in the third row of Figure 4–4, superimposed over all the point-sets. As described earlier, all of these results are computed simultaneously and automatically. This example clearly demonstrates that our
joint matching and atlas construction algorithm can simultaneously align multiple shapes
(modeled by sample point-sets) and compute a meaningful atlas/mean shape.
4.4.2 CDF-JS Divergence Results

First, to see how the CDF-JS method behaves in the presence of noise and outliers, we designed the following procedure to generate a corrupted template point-set from a model set. For a model set with $n$ points, we control the degree of corruption by (1) discarding a subset of size $(1-\rho)n$ from the model point-set, (2) applying a rigid transformation $(R, t)$ to the template, (3) perturbing the points of the template with noise (of strength $\varepsilon$), and (4) adding $(\tau - \rho)n$ spurious, uniformly distributed points to the template. Thus, after corruption, a template point-set has a total of $\tau n$ points, of which only $\rho n$ correspond to points in the model set. Since ICP is known to be sensitive to outliers, we only compare our method with the more robust Jensen-Shannon divergence method in terms of sensitivity to noise and outliers. The comparison is done via a set of 2D experiments. At each of several noise levels and outlier strengths, we generate five models and six corrupted templates from each model, for a total of 30 pairs at each noise and outlier strength setting. For each pair, we use our algorithm and the JS method to estimate the known rigid transformation which was partially responsible for the corruption. The results show that when the noise level is low, both JS and CDF-JS have strong resistance to outliers. However, when the noise level is high, the CDF-JS method exhibits stronger resistance to outliers than the JS method, as shown in Figure 4–5, which confirms that CDF-JS is indeed more robust in the presence of high noise and outlier levels. A 3D example is also presented in Figure 4–6.
Next, we present groupwise registration results on 3D hippocampal point-sets. Four 3D point-sets were extracted from epilepsy patients with left anterior temporal lobe foci identified with EEG. An interactive segmentation tool was used to segment the hippocampus from the 3D brain MRI scans of the four subjects. The point-sets differ in shape, containing 450, 421, 376, and 307 points respectively. In the first
Figure 4–5: Robustness to outliers in the presence of large noise. Errors in the estimated rigid transform vs. the proportion of outliers $((\tau - \rho)/\rho)$ for both our method and the JS method.
[Figure 4–6 panels: "Initial setup" and "After registration" for the 3D swan data.]
Figure 4–6: Robustness test on 3D swan data. ’o’ and ’+’ indicate the model and scenepoints respectively. Note that the scene point-set is corrupted by noise and outliers.
four images of Figure 4–7, the recovered non-rigid deformation between each hippocampal point-set and the atlas is shown, along with a superimposition on all of the original data sets. In the second row of Figure 4–7, we also show the scatter plot of the original point-sets along with all the point-sets after the non-rigid warping. An examination of the two scatter plots clearly shows the efficacy of our recovered non-rigid warping. Note that validation of what an atlas shape ought to be in the real-data case is a difficult problem, and we relegate its resolution to a future paper.
[Figure 4–7 panels: "Pointset 1" through "Pointset 4", "Pooled Pointsets", and "Deformed Pointsets".]
Figure 4–7: Atlas construction from four 3D hippocampal point-sets. The first row and the left image in the second row show the deformation of each point-set to the atlas (represented as cluster centers), superimposed with the initial point-set (shown in green 'o') and the deformed point-set (shown in red '+'). Middle image in the second row: scatter plot of the original four hippocampal point-sets. Right: scatter plot of all the warped point-sets.
CHAPTER 5
APPLICATIONS TO IMAGE SEGMENTATION
In medical imaging applications, segmentation can be a daunting task due to possibly large inhomogeneities in image intensities across an image, e.g., in MR images. These inhomogeneities, combined with volume averaging during imaging and the possible lack of precisely defined shape boundaries for certain anatomical structures, complicate the segmentation problem immensely. One possible solution in such situations is atlas-based segmentation. The atlas, once constructed, can be used as a template and registered non-rigidly to the image being segmented (henceforth called the target image), thereby achieving the desired segmentation. Many methods that achieve atlas-based segmentation are based on a two-stage process involving (i) estimating the non-rigid deformation field between the atlas image and the target image, and then (ii) applying the estimated deformation field to the desired shape/atlas to achieve the segmentation of the corresponding structure(s) in the target image. In this chapter, we develop a novel technique that simultaneously achieves the non-rigid registration and segmentation.

There is a vast body of literature on the tasks of registration and segmentation independently; however, methods that combine them into one algorithm are few and far between. In the following, we briefly review the few existing methods that attempt to achieve simultaneous registration and segmentation.
5.1 Related Work

In one of the earliest attempts at joint registration & segmentation, Bansal et al. [76] developed a min-max entropy framework to rigidly register & segment portal and CT data sets. In [77], Yezzi et al. present a variational principle for achieving simultaneous registration and segmentation; however, the registration part is limited to rigid motions. A similar limitation applies to the technique presented by Noble et al. in [78]. A variational principle in a level-set based formulation was presented by Paragios et al. [79] for segmentation and registration of cardiac MRI data. Their formulation was again limited to rigid motion, and the experiments were limited to 2D images. In Fischl et al. [80], a Bayesian method is presented that simultaneously estimates a linear registration and the segmentation of a novel image. Note that linear registration does not involve non-rigid deformations. The case of joint registration and segmentation with non-rigid registration has not been addressed adequately in the literature, with the exception of the recent work reported in Soatto and Yezzi [81] and Vemuri et al. [82]. However, these methods can only work with image pairs that are necessarily from the same modality or whose intensity profiles are not too disparate.
In this chapter, we present a unified variational principle that simultaneously registers the atlas shape (contour/surface) to the novel brain image and segments the desired shape (contour/surface) in the novel image. Here the atlas serves as a prior in the segmentation process, and the registration of this prior to the novel brain scan assists in segmenting it. Another key feature/strength of our proposed registration+segmentation scheme is that it accommodates image pairs having very distinct intensity distributions, as in multi-modality data sets. More details are presented in Section 5.2.
5.2 Registration+Segmentation Model

We now present our formulation of the joint registration & segmentation model. Let $I_1$ be the atlas image containing the atlas shape $C$, $I_2$ the novel image that needs to be segmented, and $\mathbf{v}$ the vector field from $I_2$ to $I_1$ (i.e., the transformation is centered in $I_2$) defining the non-rigid deformation between the two images. The variational principle describing our formulation of the registration-assisted segmentation problem is given by:

\[
\min E(\mathbf{v}, C) = Seg(I_2, C) + dist(\mathbf{v}(C), C) + Reg(I_1, I_2, \mathbf{v}). \tag{5–1}
\]
Figure 5–1:Model Illustration
Here, the first term denotes the segmentation functional; $C$ is the boundary contour (surface in 3D) of the desired anatomical shape in $I_2$. The second term measures the distance between the transformed atlas $\mathbf{v}(C)$ and the current segmentation $C$ in the novel brain image (i.e., the target image), and the third term denotes the non-rigid registration functional between the two images. Our joint registration & segmentation model is illustrated in Figure 5–1.
For the segmentation functional, we use a piecewise constant Mumford-Shah model, one of the well-known variational models for image segmentation, wherein it is assumed that the image to be segmented can be modeled by piecewise constant regions, as was done in [54]. This assumption simplifies our presentation, but our model can easily be extended to the piecewise smooth case. Additionally, since we are only interested in segmenting a desired anatomical shape (e.g., the hippocampus, the corpus callosum, etc.), we are only concerned with a binary segmentation, i.e., two classes: voxels inside the desired shape and those outside it. These assumptions can easily be relaxed if necessary, but at the cost of making the energy functional more complicated and hence computationally more challenging. The segmentation functional takes the following form:
\[
Seg(I_2, C) = \int_\Omega (I_2 - u)^2\, d\mathbf{x} + \alpha \oint_C ds \tag{5–2}
\]
where $\Omega$ is the image domain and $\alpha$ is a regularization parameter; $u = u_i$ if $\mathbf{x} \in C_{in}$ and $u = u_o$ if $\mathbf{x} \in C_{out}$, where $C_{in}$ and $C_{out}$ denote the regions inside and outside of the curve $C$ representing the desired shape boundaries in $I_2$.
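For a fixed curve $C$, minimizing Eqn. (5–2) over $u$ decouples into the two region means, as the following toy sketch illustrates (the disc image and the boolean mask, which stands in for the interior of $C$, are made up):

```python
import numpy as np

def region_means(image, inside):
    """For a fixed curve C (here a boolean mask of its interior),
    the minimizers of Eqn. (5-2) over u are the region means."""
    ui = image[inside].mean()
    uo = image[~inside].mean()
    return ui, uo

# toy image: bright disc (value 3) on a dark background (value 1)
yy, xx = np.mgrid[0:64, 0:64]
disc = (xx - 32)**2 + (yy - 32)**2 < 10**2
image = np.where(disc, 3.0, 1.0)

ui, uo = region_means(image, disc)
assert ui == 3.0 and uo == 1.0
print(ui, uo)
```

These closed-form updates for $u_i, u_o$ are what make the alternating scheme of Section 5.2.2 cheap per iteration.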
For the non-rigid registration term in the energy function, we use the information-theoretic criterion cross cumulative residual entropy (CCRE), which we introduced in Chapter 2. CCRE was shown to outperform Mutual Information based registration in terms of noise immunity and convergence range, motivating us to pick this criterion over the MI-based cost function. The registration functional is defined by

\[
Reg(I_1, I_2, \mathbf{v}) = -\,\mathcal{C}(I_1(\mathbf{v}(\mathbf{x})), I_2(\mathbf{x})) + \mu \int_\Omega \|\nabla \mathbf{v}(\mathbf{x})\|^2\, d\mathbf{x} \tag{5–3}
\]
where the cross-CRE $\mathcal{C}(I_1, I_2)$ is given by

\[
\mathcal{C}(I_1, I_2) = \mathcal{E}(I_1) - E[\mathcal{E}(I_1/I_2)] \tag{5–4}
\]

with $\mathcal{E}(I_1) = -\int_{\mathbb{R}^+} P(|I_1| > \lambda) \log P(|I_1| > \lambda)\, d\lambda$ and $\mathbb{R}^+ = \{x \in \mathbb{R};\ x \geq 0\}$. Here $\mathbf{v}(\mathbf{x})$ is as before, $\mu$ is the regularization parameter, and $\|\cdot\|$ denotes the Frobenius norm. Using a B-spline representation of the non-rigid deformation, one need only compute this field at the control points of the B-splines and interpolate elsewhere, thus accruing computational advantages. Using this representation, we have derived analytic expressions for the gradient of the energy with respect to the registration parameters, which in turn makes our optimization more robust and efficient.
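The interpolation of the dense field from control-point displacements can be sketched in 1-D with the cubic B-spline kernel (an illustrative stand-in; the actual deformation model here is a B-spline field in 2D/3D). The partition-of-unity property of B-splines means that constant control displacements reproduce a constant field, which is a useful implementation check:

```python
import numpy as np

def beta3(u):
    """Cubic B-spline kernel, support [-2, 2]."""
    a = np.abs(u)
    return np.where(a < 1, (4 - 6 * a**2 + 3 * a**3) / 6.0,
           np.where(a < 2, (2 - a)**3 / 6.0, 0.0))

def displacement(x, mu, h):
    """1-D B-spline displacement field: v(x) = sum_j mu_j beta3(x/h - j)."""
    j = np.arange(len(mu))
    return (mu[None, :] * beta3(x[:, None] / h - j[None, :])).sum(axis=1)

h = 5.0                               # control-point spacing (made up)
mu = np.full(20, 1.7)                 # all control points displaced by 1.7
x = np.linspace(2 * h, 17 * h, 50)    # stay away from the grid boundary
v = displacement(x, mu, h)

# partition of unity: constant control displacements give a constant field
assert np.allclose(v, 1.7)
print("B-spline interpolation reproduces constant displacements")
```

Near the grid boundary the sum is truncated, which is why the sample positions above avoid the outermost control points.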
In order for the registration and the segmentation terms to "talk" to each other, we need a connection term, which is given by

\[
dist(\mathbf{v}(C), C) = \int_R \phi_{\mathbf{v}(C)}(\mathbf{x})\, d\mathbf{x} \tag{5–5}
\]

where $R$ is the region enclosed by $C$, and $\phi_{\mathbf{v}(C)}(\mathbf{x})$ is the embedding signed distance function of the contour $\mathbf{v}(C)$, which can be used to measure the distance between $\mathbf{v}(C)$ and $C$. The level-set function $\phi: \mathbb{R}^2 \to \mathbb{R}$ is chosen so that its zero level-set corresponds to the
transformed template curve $\mathbf{v}(C)$. Letting $E_{dist} := dist(\mathbf{v}(C), C)$, one can show that $\frac{\partial E_{dist}}{\partial C} = \phi_{\mathbf{v}(C)}(C)\,\mathcal{N}$, where $\mathcal{N}$ is the normal to $C$. The corresponding curve evolution equation given by gradient descent is then

\[
\frac{\partial C}{\partial t} = -\phi_{\mathbf{v}(C)}(C)\,\mathcal{N} \tag{5–6}
\]

Not only does the signed distance function representation make it easier for us to convert the curve evolution problem to the level-set framework, it also facilitates the matching of the evolving curve $C$ and the transformed template curve $\mathbf{v}(C)$, without relying on a parametric specification of either $C$ or the transformed template curve. Note that since $dist(\mathbf{v}(C), C)$ is a function of the unknown registration $\mathbf{v}$ and the unknown segmentation $C$, it plays the crucial role of connecting the registration and the segmentation terms.
Combining these three functionals, we obtain the following variational principle for the simultaneous registration+segmentation problem:

\[
\min E(C, \mathbf{v}, u_o, u_i) = \int_\Omega (I_2 - u)^2\, d\mathbf{x} + \alpha_1 \oint_C ds + \alpha_2\, dist(\mathbf{v}(C), C) - \alpha_3\, \mathcal{C}(I_1(\mathbf{v}(\mathbf{x})), I_2(\mathbf{x})) + \alpha_4 \int_\Omega \|\nabla \mathbf{v}(\mathbf{x})\|^2\, d\mathbf{x}. \tag{5–7}
\]
The $\alpha_i$ are weights controlling the contribution of each term to the overall energy function; they can be treated as unknown constants and either set empirically or estimated during the optimization process. This energy function is quite distinct from those in the existing literature because it achieves Mumford-Shah type segmentation in an active contour framework jointly with non-rigid registration and shape distance terms. We now discuss the level-set formulation of the energy function in the following section.
5.2.1 Gradient flows

The level-set method has been used extensively for implementing curve-evolution based segmentation, primarily due to its many advantages over competing approaches. These include the ability to elegantly handle changes in the topology of the curve (splits and merges), the ability to deal with the formation of cusps and corners, which are extremely common in curve evolution, and the numerical stability and efficiency afforded by its implementation. For our model, where the equation for the unknown curve $C$ is coupled with the equations for $\mathbf{v}(\mathbf{x})$, $u_o$, $u_i$, it is convenient to use the level-set approach as proposed in [54].
Taking the variation of $E(\cdot)$ with respect to $C$ and writing down the gradient descent leads to the following curve evolution equation:

\[
\frac{\partial C}{\partial t} = -\left[-(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \kappa + \alpha_2 \phi_{\mathbf{v}(C)}(C)\right]\mathcal{N} \tag{5–8}
\]

Note that equation (5–6) is used in this derivation. Equation (5–8) in the level-set framework is given by:

\[
\frac{\partial \phi}{\partial t} = \left[-(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} + \alpha_2 \phi_{\mathbf{v}(C)}(C)\right]|\nabla \phi| \tag{5–9}
\]
where $u_i$ and $u_o$ are the mean values inside and outside of the curve $C$ in the image $I_2$. To drive the curve towards the template's level-set function $\phi_{\mathbf{v}(C)}$ more efficiently, rather than just having the zero level-sets match, we can add another term $\phi(C)$ to the level-set evolution equation, giving us

\[
\frac{\partial \phi}{\partial t} = \left[-(I_2 - u_i)^2 + (I_2 - u_o)^2 + \alpha_1 \nabla \cdot \frac{\nabla \phi}{|\nabla \phi|} + \alpha_2\left(\phi_{\mathbf{v}(C)}(C) - \phi(C)\right)\right]|\nabla \phi|. \tag{5–10}
\]
As illustrated in Figure 5–2, the two parameters $\alpha_1$ and $\alpha_2$ are used to balance the influence of the shape distance model and the region-based model. Since $\phi(C) = 0$ at any location on the curve by the definition of the level-set function $\phi$, this added term does not affect the curve evolution equation [83].
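The curvature term $\nabla \cdot (\nabla\phi/|\nabla\phi|)$ in Eqns. (5–9) and (5–10) is typically discretized with central differences. A quick sanity check (the grid and radius below are made up) is that, for the signed distance function of a circle, the computed curvature on the contour approximates $1/r$:

```python
import numpy as np

def curvature(phi, eps=1e-8):
    """kappa = div(grad phi / |grad phi|) via central differences."""
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx**2 + gy**2) + eps   # eps guards against division by zero
    nx, ny = gx / mag, gy / mag
    return np.gradient(nx, axis=1) + np.gradient(ny, axis=0)

# signed distance function of a circle of radius 20
yy, xx = np.mgrid[0:128, 0:128]
phi = np.sqrt((xx - 64.0)**2 + (yy - 64.0)**2) - 20.0

kappa = curvature(phi)
# on the contour (distance 20 from the center) kappa should be about 1/20
assert abs(kappa[64, 84] - 1.0 / 20.0) < 1e-2
print(kappa[64, 84])
```

In a production implementation one would use upwind schemes for the data terms and reinitialize $\phi$ periodically; the check above only validates the curvature discretization.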
As mentioned before, we use a B-spline basis to represent the displacement vector field $\mathbf{v}(\mathbf{x}, \mu)$, where $\mu$ denotes the transformation parameters of the B-spline basis.

\[
\frac{\partial E}{\partial \mu} = \alpha_2 \frac{\partial \int_R \phi_{\mathbf{v}(C)}(\mathbf{x})\, d\mathbf{x}}{\partial \mu} - \alpha_3 \frac{\partial \mathcal{C}(I_1(\mathbf{v}(\mathbf{x})), I_2(\mathbf{x}))}{\partial \mu} + \alpha_4 \frac{\partial \int_\Omega \|\nabla \mathbf{v}(\mathbf{x})\|^2\, d\mathbf{x}}{\partial \mu} \tag{5–11}
\]
Figure 5–2: Illustration of the various terms in the evolution of the level-set function $\phi$. To update $\phi$, we combine the standard region-based update term $S$ and the level-set function corresponding to the shape distance term.
The first term of equation (5–11) can be rewritten as follows:

\[
\frac{\partial \int_R \phi_{\mathbf{v}(C)}(\mathbf{x})\, d\mathbf{x}}{\partial \mu} = \int_R \frac{\partial \phi_{\mathbf{v}(C)}(\mathbf{x})}{\partial \mu}\, d\mathbf{x} = \int_R \left.\frac{\partial \phi_{\mathbf{v}(C)}}{\partial \mathbf{v}}\right|_{\mathbf{v}=\mathbf{v}(\mathbf{x},\mu)} \cdot \frac{\partial \mathbf{v}(\mathbf{x}, \mu)}{\partial \mu}\, d\mathbf{x} \tag{5–12}
\]
where $\frac{\partial \phi_{\mathbf{v}(C)}}{\partial \mathbf{v}}$ is the directional derivative in the direction of $\mathbf{v}(\mathbf{x}, \mu)$. The second term of Eqn. (5–11) was derived in Eqn. (3–15) of Chapter 3; we simply state the result here without the derivation for the sake of brevity:

\[
\frac{\partial \mathcal{C}(I_2, I_1 \circ \mathbf{v}(\mathbf{x}; \mu))}{\partial \mu} = \sum_{\lambda \in I_1} \sum_{k \in I_2} \log \frac{P(i > \lambda, k; \mu)}{p_{I_2}(k)\, P(i > \lambda; \mu)} \cdot \frac{\partial P(i > \lambda, k; \mu)}{\partial \mu} \tag{5–13}
\]
where $P(i > \lambda, k; \mu)$ and $P(i > \lambda; \mu)$ are the joint and marginal cumulative residual distributions respectively, and $p_{I_2}(k)$ is the density function of image $I_2$. The last term of Eqn. (5–11) leads to

\[
\frac{\partial \int_\Omega \|\nabla \mathbf{v}(\mathbf{x})\|^2\, d\mathbf{x}}{\partial \mu} = 2 \int_\Omega \nabla \mathbf{v} \cdot \frac{\partial \mathbf{v}}{\partial \mu}\, d\mathbf{x} \tag{5–14}
\]

where both the matrices $\nabla \mathbf{v}$ and $\frac{\partial \mathbf{v}}{\partial \mu}$ are vectorized before the dot product is computed.
Substituting equations (5–12), (5–13), and (5–14) back into equation (5–11), we obtain the analytical gradient of our energy function with respect to the B-spline transformation parameters $\mu$. We then solve for a stationary point of this nonlinear equation numerically using a quasi-Newton method.
5.2.2 Algorithm Summary

Given the atlas image $I_1$ and the unknown subject's brain scan $I_2$, we seek the segmentation result $C$ in $I_2$. Initialize the segmentation in $I_2$ to the atlas contour $C$ and set the initial displacement field to zero.

1. For fixed $C$, update the deformation field using a gradient-based numerical method for one step.

2. For the fixed deformation field $\mathbf{v}$, evolve $\phi$ in $I_2$ and thereby update $C$ as the zero level-set of $\phi$.

3. Stop the registration process if the difference between consecutive iterates is less than $\varepsilon = 0.01$, a pre-chosen tolerance; else go to Step 1.
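The alternating structure above, with its tolerance-based stopping rule, can be sketched generically. The toy quadratic objective below merely stands in for $E(C, \mathbf{v})$; nothing here is the actual registration or segmentation update, only the control flow:

```python
import numpy as np

def alternating_minimize(f, x0, y0, step=0.2, tol=0.01, max_iter=1000):
    """One gradient step in the first block of variables, then one in the
    second, until consecutive iterates differ by less than tol."""
    x, y = float(x0), float(y0)
    h = 1e-5
    for _ in range(max_iter):
        # numerical gradients (stand-ins for the analytic ones in the text)
        gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
        x_new = x - step * gx
        gy = (f(x_new, y + h) - f(x_new, y - h)) / (2 * h)
        y_new = y - step * gy
        if abs(x_new - x) + abs(y_new - y) < tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y

# toy coupled objective standing in for E(C, v): minimum at (1, 2)
f = lambda x, y: (x - 1.0)**2 + (y - 2.0)**2 + 0.5 * (x - y + 1.0)**2

x, y = alternating_minimize(f, 0.0, 0.0)
assert abs(x - 1.0) < 0.1 and abs(y - 2.0) < 0.1
```

The coupling term in the toy objective plays the role of $dist(\mathbf{v}(C), C)$: it is the only term that ties the two blocks of variables together.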
5.3 Results

In this section, we present several example results from applications of our algorithm, on synthetic as well as real data. The first three experiments were performed in 2D, while the fourth was performed in 3D. Note that the image pairs used in all these experiments have significantly different intensity profiles, unlike any of the previous methods reported in the literature for joint registration and segmentation. The synthetic example uses a pair of MR T1- and T2-weighted images from the MNI BrainWeb site [52], which were originally aligned with each other. We used the MR T1 image as the source image, and the target image was generated from the MR T2 image by applying a known non-rigid transformation that was procedurally generated using kernel-based spline representations (cubic B-splines). The possible displacement in each direction varies from −15 to 15 pixels. In this
Figure 5–3: Results of applying our algorithm to synthetic data (see text for details).
case, we present the error in the estimated non-rigid deformation field, using our
algorithm, as an indicator of the accuracy of estimated deformations.
Figure 5–3 depicts the results obtained for this image pair. With the MR T1 image as the source, the target was obtained by applying a synthetically generated non-rigid deformation field to the MR T2 image. Notice the significant difference between the intensity profiles of the source and target images. Figure 5–3 is organized as follows, from left to right: the first row depicts the source image with the atlas-segmentation superposed in red, the registered source image obtained using our algorithm, and the target image with the unregistered atlas-segmentation superposed to depict the amount of misalignment; the second row depicts the ground-truth deformation field used to generate the target image from the MR T2 image, followed by the estimated non-rigid deformation field and, finally, the segmented target. The registration and segmentation are quite accurate upon visual inspection. As a quantitative measure of accuracy, we estimated the average, $\mu$, and the standard deviation, $\sigma$, of the error in the estimated non-rigid deformation field, where the error was computed as the angle between the ground-truth and estimated displacement vectors. The
average and standard deviation are 1.5139 and 4.3211 (in degrees) respectively, which is
quite accurate.
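The angular error measure used above can be implemented directly (the displacement vectors below are made up for illustration):

```python
import numpy as np

def angular_error_deg(v_true, v_est):
    """Per-voxel angle (degrees) between ground-truth and estimated
    displacement vectors, as used to report registration accuracy."""
    dot = (v_true * v_est).sum(axis=-1)
    norms = np.linalg.norm(v_true, axis=-1) * np.linalg.norm(v_est, axis=-1)
    cos = np.clip(dot / norms, -1.0, 1.0)   # clip guards against round-off
    return np.degrees(np.arccos(cos))

v0 = np.array([[1.0, 0.0], [0.0, 2.0]])
v1 = np.array([[1.0, 1.0], [0.0, 1.0]])   # 45 degrees off, and aligned

err = angular_error_deg(v0, v1)
assert abs(err[0] - 45.0) < 1e-6 and abs(err[1]) < 1e-6
print(err.mean(), err.std())
```

Note that the angle is insensitive to the magnitude of the displacements, which is why the magnitude-based MDE of Table 5–1 is reported separately.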
Table 5–1 depicts statistics of the error in the estimated non-rigid deformation when compared to the ground truth. For the mean ground-truth deformation (magnitude of the displacement vector) in Column 1 of each row, 5 distinct deformation fields with this mean were generated and applied to the target image of the given source-target pair to synthesize 5 pairs of distinct data sets. These pairs were input (one at a time) to our algorithm, and the mean ($\mu$) of the mean deformation error (MDE) was computed over the five pairs and is reported in Column 2 of the table. The MDE is defined as $d_m = \frac{1}{card(R)}\sum_{x_i \in R} \|v_0(x_i) - v(x_i)\|$, where $v_0(x_i)$ and $v(x_i)$ are the ground-truth and estimated displacements respectively at voxel $x_i$, $\|\cdot\|$ denotes the Euclidean norm, and $R$ is the volume of the region of interest. Column 3 depicts the standard deviation of the MDE for the five pairs of data in each row. As evident, the mean and the standard deviation of the error are reasonably small, indicating the accuracy of our joint registration + segmentation algorithm. Note that this testing was done on a total of 20 image pairs (40 images), as there are 5 pairs of images per row.

Table 5–1: Statistics of the error in estimated non-rigid deformation.

  µ_g    µ of MDE    σ of MDE
  2.4    0.5822      0.0464
  3.3    0.6344      0.0923
  4.5    0.7629      0.0253
  5.5    0.7812      0.0714
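The MDE can be implemented directly from its definition. The sketch below (with made-up displacement fields) checks that it vanishes for identical fields and equals the shift magnitude for a uniformly shifted field:

```python
import numpy as np

def mde(v0, v):
    """Mean deformation error d_m = (1/|R|) sum_i ||v0(x_i) - v(x_i)||."""
    return np.linalg.norm(v0 - v, axis=-1).mean()

rng = np.random.default_rng(2)
v0 = rng.normal(size=(100, 2))   # stand-in for a ground-truth field

assert mde(v0, v0) == 0.0
# a uniform (0.3, 0.4) shift has norm 0.5 at every voxel
assert abs(mde(v0, v0 + np.array([0.3, 0.4])) - 0.5) < 1e-12
print("MDE behaves as expected")
```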
For the first real-data experiment, we selected two image slices from two different modalities of brain scans. The two slices depict considerable changes in the shape of the ventricles, the region of interest in these data sets. One of the two slices was arbitrarily selected as the source, and segmentation of the ventricle in the source was achieved using an active contour model. The goal was then to automatically find the ventricle in the target image using our algorithm, given the input data along with the segmented ventricles
Figure 5–4: Results of applying our algorithm to a pair of slices from human brain MRIs (see text for details).
in the source image. Figure 5–4 is organized as follows, from left to right: the first row depicts the source image with the atlas-segmentation superposed in black, followed by the target image with the unregistered atlas-segmentation superposed to depict the amount of misalignment; the second row depicts the estimated non-rigid vector field and, finally, the segmented target. As evident from Figure 5–4, the accuracy of the achieved registration+segmentation is visually very good. Note that the non-rigid deformation between the two images in these examples is quite large, and our method was able to simultaneously register and segment the target data sets quite accurately.
The second real-data example was obtained from two brain MRIs of different subjects and modalities; the segmentation of the cerebellum in the source image is given. We selected two "corresponding" slices from these volume data sets to conduct the experiment. Note that even though the number of slices is the same for the two data sets, the slices may not correspond to each other from an anatomical point of view. However, for the purposes of illustrating our algorithm, this is not very critical. We use the corresponding slice of the 3D segmentation of the source as our atlas-segmentation. The results of an application
Figure 5–5: Corpus Callosum segmentation on a pair of corresponding slices from distinct subjects.
of our algorithm are organized as before in Figure 5–5. Once again, the visual quality of the segmentation and registration is very high.
Finally, we present a 3D real-data experiment. Here, the input is a pair of 3D brain scans with the segmentation of the hippocampus in one of the two images (labeled the atlas image), obtained using the well-known PCA on several training data sets. Each data set contains 19 slices of size 256x256. The goal was then to automatically find the hippocampus in the target image given this input. Figure 5–6 depicts the results obtained for this image pair. From left to right, the first image shows the given (atlas) hippocampus surface, followed by one cross-section of this surface overlaid on the source image slice; the third image shows the segmented hippocampus surface from the target image using our algorithm and, finally, the cross-section of the segmented surface overlaid on the target image slice. To validate the accuracy of the segmentation result, we randomly sampled 120 points from the segmented surface and computed the average distance from these points to the ground-truth hand-segmented hippocampal surface in the target image. The hand segmentation was performed by an expert neuroanatomist. The
Figure 5–6: Hippocampal segmentation using our algorithm on a pair of brain scans from distinct subjects (see text for details).
average and standard deviation of the error in the aforementioned distance are 0.8190 and 0.5121 (in voxels) respectively, which is very accurate.
CHAPTER 6
CONCLUSIONS AND FUTURE WORK
6.1 Contributions of the Dissertation

We have introduced a variety of information-theoretic measures and demonstrated various applications. The novel information measures presented in this dissertation include:

• Entropy defined on probability distributions, the Cumulative Residual Entropy (CRE)

• Cross-Cumulative Residual Entropy (CCRE)

• CDF-based Kullback-Leibler (KL) divergence

• CDF-based Jensen-Shannon (JS) divergence

We demonstrated their applications to the following medical image analysis problems:
• Non-rigid image registration.
• Simultaneous groupwise point-sets registration and atlas construction.
• Atlas based image segmentation.
Our contributions to each of these topics are summarized in the following sections.
6.2 Image and Point-sets Registration
6.2.1 Non-rigid Image Registration
For non-rigid image registration, we presented a novel way to register multi-modal
datasets based on a matching criterion called the cross cumulative residual entropy (CCRE)
[84], which measures the similarity between two images. The matching measure is built
on a new information measure, the cumulative residual entropy (CRE), which
is defined using probability distributions instead of probability densities; consequently,
CCRE is valid in both discrete and continuous domains. Furthermore, CCRE inherits
the robustness property of the CRE measure. In [84], we presented results of rigid and
affine registration under a variety of noise levels and showed significantly superior
performance over MI-based methods.
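A discrete sketch of the CCRE criterion may help fix ideas. The histogram-based construction below is my own illustration of the definition CCRE(X, Y) = CRE(X) − E_Y[CRE(X | Y)], with the bin index standing in for the intensity value; it is not the dissertation's implementation:

```python
import numpy as np

def ccre(img1, img2, bins=32):
    """Cross-Cumulative Residual Entropy between two images, sketched as
    CCRE(X, Y) = CRE(X) - E_Y[ CRE(X | Y) ], with the CREs computed from
    survival functions of a joint intensity histogram (bin index ~ intensity)."""
    h, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    pxy = h / h.sum()                                  # joint pmf over bins
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)          # marginals of X and Y

    def cre_pmf(p):
        surv = np.clip(1.0 - np.cumsum(p), 0.0, 1.0)   # P(X > bin k)
        m = surv > 0
        return -np.sum(surv[m] * np.log(surv[m]))

    cond = sum(py[j] * cre_pmf(pxy[:, j] / py[j])
               for j in range(bins) if py[j] > 0)
    return cre_pmf(px) - cond

# Like MI, CCRE peaks under perfect (here: identity) alignment and drops
# when the second image is statistically independent of the first.
rng = np.random.default_rng(2)
a = rng.integers(0, 256, size=(64, 64)).astype(float)
b = rng.permutation(a.ravel()).reshape(a.shape)
ccre_aligned = ccre(a, a)
ccre_indep = ccre(a, b)
```

Maximizing this quantity over transformation parameters, rather than evaluating it at identity as here, yields the registration criterion described in the text.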
The cross-CRE between the two images to be registered is maximized over a space of
smooth, unknown non-rigid transformations represented by tri-cubic
B-splines placed on a regular grid. The analytic gradient of this matching measure is
derived in this dissertation to achieve efficient and accurate non-rigid registration. It turns out
that the gradient of the CCRE has a form similar to that of the cost function, which
greatly saves memory in the optimization process. The matching criterion is
optimized using a quasi-Newton method to recover the transformation parameters.
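To make the transformation model concrete, here is an illustrative cubic B-spline free-form deformation evaluator. The dissertation uses tri-cubic B-splines in 3D; this 2D sketch, with hypothetical function names of my own choosing, shows the same tensor-product construction:

```python
import numpy as np

def bspline_basis(u):
    """The four cubic B-spline basis weights for fractional offset u in [0, 1)."""
    return np.array([
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    ])

def ffd_displacement(x, y, phi, spacing):
    """Displacement at (x, y) under a 2D cubic B-spline FFD whose control-point
    displacements phi[i, j] lie on a regular grid with the given spacing."""
    i, u = divmod(x / spacing, 1.0)
    j, v = divmod(y / spacing, 1.0)
    i, j = int(i), int(j)
    bu, bv = bspline_basis(u), bspline_basis(v)
    d = np.zeros(2)
    for a in range(4):                 # 4x4 neighborhood of control points
        for b in range(4):
            d += bu[a] * bv[b] * phi[i + a, j + b]
    return d

# The basis weights sum to one, so a uniform control grid yields a pure
# translation: every point moves by exactly the control displacement.
phi = np.full((8, 8, 2), 1.5)
d = ffd_displacement(2.3, 3.7, phi, 1.0)
```

The local 4×4 (4×4×4 in 3D) support of the basis is what makes the analytic gradient of the matching criterion with respect to each control point cheap to evaluate.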
The key strengths of our proposed non-rigid registration scheme are demonstrated
through the registration of synthetic as well as real data sets from multi-modality (MR
T1- and T2-weighted, MR & CT) imaging sources. We showed that CCRE not only
accommodates images of varying contrast and brightness, but is also
robust in the presence of noise. CCRE converges faster than other
information theory-based registration methods. Finally, we showed that CCRE is well
suited to situations where the source and target images have fields of view with large
non-overlapping regions (which is quite common in practice). Comparisons were made
between CCRE and traditional MI [34, 51], defined using the Shannon
entropy. All the experiments showed significantly better performance of CCRE over the
MI-based methods currently used in the literature.
Our future work will focus on extending the transformation model to one that permits
spatial adaptation of the transformation's compliance, which will allow us to reduce
the number of degrees of freedom in the overall transformation. Validating non-rigid
registration on real data, with the aid of segmentations and landmarks obtained manually
by a group of trained anatomists, is the goal of our ongoing work.
6.2.2 Groupwise Point-sets Registration
We presented a novel and robust algorithm for the groupwise non-rigid registration of
multiple unlabeled point-sets with no bias toward any of the given point-sets. To quantify
the divergence between multiple probability distributions estimated from the given
point-sets, we proposed several divergence measures, the first of which is the Jensen-Shannon
divergence. Since it lacks robustness, we developed a novel measure based on the
cumulative distribution functions of the point-sets, which we dub the CDF-JS divergence. The measure
parallels the well-known Jensen-Shannon divergence (defined for probability density
functions) but is more regular than the JS divergence, since its definition is based on CDFs
rather than density functions. As a consequence, CDF-JS is more immune to noise and
statistically more robust than JS.
Our proposed methods do not require any knowledge of correspondence between the
input point-sets, and therefore the point-sets need not have the same cardinality. Another
salient feature of our algorithms is that a probabilistic atlas is obtained as a
byproduct of the registration process. Our algorithm can be especially useful for creating
atlases of various shapes present in images, as well as for simultaneously (rigidly or
non-rigidly) registering 3D range data sets without having to establish any
correspondences.
Our future work will focus on using maximum likelihood estimation (MLE) to
automatically determine the weighting coefficients in the divergence measures and the
smoothing term; we are also attempting to extend our techniques to diffeomorphic
point-set matching.
6.3 Image Segmentation
For image segmentation, we presented a novel variational formulation of the
joint (non-rigid) registration and segmentation problem, which requires the solution of a
coupled set of nonlinear PDEs that are solved using efficient numerical schemes. Our
work is a departure from earlier methods in that we presented a unified variational
principle wherein non-rigid registration and segmentation are simultaneously achieved.
Unlike earlier methods presented in the literature, a key feature of our algorithm is that it can
accommodate image pairs having distinct intensity distributions. We presented twenty
examples on synthetic data sets and three on real data sets, along with quantitative accuracy
estimates of the registration in the synthetic case. The accuracy evident in these
experiments is quite satisfactory. Our future efforts will focus on adapting our
algorithm and software for clinical use.
REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, pp. 379–423 and 623–656, 1948.
[2] W. F. Sharpe, Investments. London: Prentice Hall, 1985.
[3] D. Salomon, Data Compression. New York: Springer, 1998.
[4] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[6] G. Jumarie, Relative Information. New York: Springer, 1990.
[7] M. Rao, Y. Chen, B. C. Vemuri, and F. Wang, "Cumulative residual entropy, a new measure of information," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1220–1228, June 2004.
[8] M. Asadi and Y. Zohrevand, "On the dynamic cumulative residual entropy," unpublished manuscript, 2006.
[9] H. Chui, L. Win, R. Schultz, J. Duncan, and A. Rangarajan, "A unified non-rigid feature registration method for brain mapping," Medical Image Analysis, vol. 7, no. 2, pp. 112–130, 2003.
[10] N. Paragios, M. Rousson, and V. Ramesh, "Non-rigid registration using distance functions," Comput. Vis. Image Underst., vol. 89, no. 2-3, pp. 142–165, 2003.
[11] M. A. Audette, K. Siddiqi, F. P. Ferrie, and T. M. Peters, "An integrated range-sensing, segmentation and registration framework for the characterization of intra-surgical brain deformations in image-guided surgery," Comput. Vis. Image Underst., vol. 89, no. 2-3, pp. 226–251, 2003.
[12] A. Leow, P. M. Thompson, H. Protas, and S.-C. Huang, "Brain warping with implicit representations," in International Symposium on Biomedical Imaging, 2004, pp. 603–606.
[13] B. Jian and B. C. Vemuri, "A robust algorithm for point set registration using mixture of Gaussians," in IEEE International Conference on Computer Vision, 2005, pp. 1246–1251.
[14] F. Wang, B. C. Vemuri, A. Rangarajan, I. M. Schmalfuss, and S. J. Eisenschenk, "Simultaneous nonrigid registration of multiple point sets and atlas construction," in European Conference on Computer Vision, 2006, pp. 551–563.
[15] H. Guo, A. Rangarajan, and S. Joshi, "A new joint clustering and diffeomorphism estimation algorithm for non-rigid shape matching," in IEEE Computer Vision and Pattern Recognition, 2004, pp. 16–22.
[16] M. Irani and P. Anandan, "Robust multi-sensor image alignment," in International Conference on Computer Vision, Bombay, India, 1998, pp. 959–965.
[17] J. Liu, B. C. Vemuri, and J. L. Marroquin, "Local frequency representations for robust multimodal image registration," IEEE Transactions on Medical Imaging, vol. 21, no. 5, pp. 462–469, 2002.
[18] M. Mellor and M. Brady, "Non-rigid multimodal image registration using local phase," in Medical Image Computing and Computer-Assisted Intervention, Saint-Malo, France, Sep 2004, pp. 789–796.
[19] B. Zitova and J. Flusser, "Image registration methods: a survey," Image Vision Comput., vol. 21, no. 11, pp. 977–1000, 2003.
[20] J. Ruiz-Alzola, C.-F. Westin, S. K. Warfield, A. Nabavi, and R. Kikinis, "Nonrigid registration of 3D scalar, vector and tensor medical data," in Third International Conference on Medical Image Computing and Computer-Assisted Intervention, A. M. DiGioia and S. Delp, Eds., Pittsburgh, October 11–14, 2000, pp. 541–550.
[21] L. Marroquin, B. Vemuri, S. Botello, F. Calderon, and A. Fernandez-Bouzas, "An accurate and efficient Bayesian method for automatic segmentation of brain MRI," IEEE Transactions on Medical Imaging, pp. 934–945, 2002.
[22] B. C. Vemuri, J. Ye, Y. Chen, and C. M. Leonard, "A level-set based approach to image registration," in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 2000, pp. 86–93.
[23] P. Hellier, C. Barillot, E. Mémin, and P. Pérez, "Hierarchical estimation of a dense deformation field for 3D robust registration," IEEE Transactions on Medical Imaging, vol. 20, no. 5, pp. 388–402, May 2001.
[24] R. Szeliski and J. Coughlan, "Spline-based image registration," Int. J. Comput. Vision, vol. 22, no. 3, pp. 199–218, March 1997.
[25] S. H. Lai and M. Fang, "Robust and efficient image alignment with spatially-varying illumination models," in IEEE Conference on Computer Vision and Pattern Recognition, 1999, pp. II: 167–172.
[26] A. Guimond, A. Roche, N. Ayache, and J. Meunier, "Three-dimensional multimodal brain warping using the demons algorithm and adaptive intensity corrections," IEEE Transactions on Medical Imaging, vol. 20, no. 1, pp. 58–69, 2001.
[27] J.-P. Thirion, "Image matching as a diffusion process: an analogy with Maxwell's demons," Medical Image Analysis, vol. 2, no. 3, pp. 243–260, 1998.
[28] A. Cuzol, P. Hellier, and E. Mémin, "A novel parametric method for non-rigid image registration," in Proc. Information Processing in Medical Imaging (IPMI'05), ser. LNCS, G. Christensen and M. Sonka, Eds., no. 3565, Glenwood Springs, Colorado, USA, July 2005, pp. 456–467.
[29] A. W. Toga and P. M. Thompson, "The role of image registration in brain mapping," Image Vision Comput., vol. 19, no. 1-2, pp. 3–24, 2001.
[30] E. D'Agostino, F. Maes, D. Vandermeulen, and P. Suetens, "Non-rigid atlas-to-image registration by minimization of class-conditional image entropy," in Medical Image Computing and Computer-Assisted Intervention, 2004, pp. 745–753.
[31] P. A. Viola and W. M. Wells, "Alignment by maximization of mutual information," in IEEE International Conference on Computer Vision, MIT, Cambridge, 1995.
[32] A. Collignon, F. Maes, D. Delaere, D. Vandermeulen, P. Suetens, and G. Marchal, "Automated multimodality image registration based on information theory," Proc. Information Processing in Medical Imaging, pp. 263–274, 1995.
[33] C. Studholme, D. Hill, and D. J. Hawkes, "Automated 3D registration of MR and CT images in the head," Medical Image Analysis, vol. 1, no. 2, pp. 163–175, 1996.
[34] D. Mattes, D. R. Haynor, H. Vesselle, T. K. Lewellen, and W. Eubank, "PET-CT image registration in the chest using free-form deformations," IEEE Transactions on Medical Imaging, vol. 22, no. 1, pp. 120–128, 2003.
[35] D. Rueckert, A. F. Frangi, and J. A. Schnabel, "Automatic construction of 3D statistical deformation models of the brain using non-rigid registration," IEEE Transactions on Medical Imaging, vol. 22, no. 8, pp. 1014–1025, 2003.
[36] G. Hermosillo, C. Chefd'hotel, and O. Faugeras, "Variational methods for multimodal image matching," Int. J. Comput. Vision, vol. 50, no. 3, pp. 329–343, 2002.
[37] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes, "Nonrigid registration using free-form deformations: Application to breast MR images," IEEE Transactions on Medical Imaging, vol. 18, no. 8, pp. 712–721, August 1999.
[38] M. E. Leventon and W. E. L. Grimson, "Multimodal volume registration using joint intensity distributions," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cambridge, MA, 1998, pp. 1057–1066.
[39] T. Gaens, F. Maes, D. Vandermeulen, and P. Suetens, "Non-rigid multimodal image registration using mutual information," in Proc. Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 1998, pp. 1099–1106.
[40] D. Loeckx, F. Maes, D. Vandermeulen, and P. Suetens, "Nonrigid image registration using free-form deformations with a local rigidity constraint," in Medical Image Computing and Computer-Assisted Intervention, 2004, pp. 639–646.
[41] G. K. Rohde, A. Aldroubi, and B. M. Dawant, "The adaptive bases algorithm for intensity based nonrigid image registration," IEEE Transactions on Medical Imaging, vol. 22, no. 11, pp. 1470–1479, 2003.
[42] V. Duay, P.-F. D'Haese, R. Li, and B. M. Dawant, "Non-rigid registration algorithm with spatially varying stiffness properties," in International Symposium on Biomedical Imaging, 2004, pp. 408–411.
[43] C. Guetter, C. Xu, F. Sauer, and J. Hornegger, "Learning based non-rigid multi-modal image registration using Kullback-Leibler divergence," in Medical Image Computing and Computer-Assisted Intervention, 2005, pp. 255–262.
[44] E. D'Agostino, F. Maes, D. Vandermeulen, and P. Suetens, "An information theoretic approach for non-rigid image registration using voxel class probabilities," Medical Image Analysis, vol. 10, no. 3, pp. 413–431, 2006.
[45] C. Davatzikos, "Spatial transformation and registration of brain images using elastically deformable models," Comput. Vis. Image Underst., vol. 66, no. 2, pp. 207–222, 1997.
[46] J. C. Gee, M. Reivich, and R. Bajcsy, "Elastically deforming 3D atlas to match anatomical brain images," J. Comput. Assist. Tomogr., vol. 17, no. 2, pp. 225–236, 1993.
[47] M. Bro-Nielsen and C. Gramkow, "Fast fluid registration of medical images," in Proc. of the 4th International Conference on Visualization in Biomedical Computing. London, UK: Springer-Verlag, 1996, pp. 267–276.
[48] G. E. Christensen, R. D. Rabbitt, and M. I. Miller, "Deformable templates using large deformation kinematics," IEEE Transactions on Image Processing, vol. 5, no. 10, pp. 1435–1447, October 1996.
[49] X. Geng, D. Kumar, and G. E. Christensen, "Transitive inverse-consistent manifold registration," in Proc. Information Processing in Medical Imaging, 2005, pp. 468–479.
[50] A. Trouvé, "Diffeomorphisms groups and pattern matching in image analysis," Int. J. Comput. Vision, vol. 28, no. 3, pp. 213–221, 1998.
[51] D. R. Forsey and R. H. Bartels, "Hierarchical B-spline refinement," Computer Graphics, vol. 22, no. 4, pp. 205–212, 1988.
[52] C. Cocosco, V. Kollokian, R.-S. Kwan, and A. Evans, "BrainWeb: online interface to a 3-D MRI simulated brain database," 1997, last accessed: July 2005. [Online]. Available: http://www.bic.mni.mcgill.ca/brainweb/
[53] P. Thevenaz and M. Unser, "Optimization of mutual information for multiresolution image registration," IEEE Transactions on Image Processing, vol. 9, no. 12, pp. 2083–2099, December 2000.
[54] T. Chan and L. Vese, "An active contour model without edges," in Intl. Conf. on Scale-Space Theories in Computer Vision, 1999, pp. 266–277.
[55] N. Duta, A. K. Jain, and M.-P. Dubuisson-Jolly, "Automatic construction of 2D shape models," IEEE Transactions Pattern Anal. Mach. Intell., vol. 23, no. 5, pp. 433–446, 2001.
[56] E. Klassen, A. Srivastava, W. Mio, and S. H. Joshi, "Analysis of planar shapes using geodesic paths on shape spaces," IEEE Transactions Pattern Anal. Mach. Intell., vol. 26, no. 3, pp. 372–383, 2003.
[57] T. B. Sebastian, P. N. Klein, B. B. Kimia, and J. J. Crisco, "Constructing 2D curve atlases," in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, Washington, DC, USA, 2000, pp. 70–77.
[58] H. Tagare, "Shape-based nonrigid correspondence with application to heart motion analysis," IEEE Transactions on Medical Imaging, vol. 18, no. 7, pp. 570–579, 1999.
[59] F. L. Bookstein, "Principal warps: Thin-plate splines and the decomposition of deformations," IEEE Transactions Pattern Anal. Mach. Intell., vol. 11, no. 6, pp. 567–585, 1989.
[60] H. Chui, A. Rangarajan, J. Zhang, and C. M. Leonard, "Unsupervised learning of an atlas from unlabeled point-sets," IEEE Transactions Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 160–172, 2004.
[61] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, 2002.
[62] H. Chui and A. Rangarajan, "A new algorithm for non-rigid point matching," in IEEE Computer Vision and Pattern Recognition, 2000, pp. 2044–2051.
[63] H. Guo, A. Rangarajan, S. Joshi, and L. Younes, "Non-rigid registration of shapes via diffeomorphic point matching," in International Symposium on Biomedical Imaging, 2004, pp. 924–927.
[64] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models: their training and application," Comput. Vis. Image Underst., vol. 61, no. 1, pp. 38–59, 1995.
[65] Y. Wang and L. H. Staib, "Boundary finding with prior shape and smoothness models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 738–743, 2000.
[66] A. Hill, C. J. Taylor, and A. D. Brett, "A framework for automatic landmark identification using a new method of nonrigid correspondence," IEEE Transactions Pattern Anal. Mach. Intell., vol. 22, no. 3, pp. 241–251, 2000.
[67] Y. Tsin and T. Kanade, "A correlation-based approach to robust point set registration," in European Conference on Computer Vision, 2004, pp. 558–569.
[68] J. Glaunes, A. Trouvé, and L. Younes, "Diffeomorphic matching of distributions: A new approach for unlabelled point-sets and sub-manifolds matching," in IEEE Computer Vision and Pattern Recognition, 2004, pp. 712–718.
[69] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions Information Theory, vol. 37, pp. 145–151, 1991.
[70] A. Hero, B. Ma, O. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, pp. 85–95, 2002.
[71] Y. He, A. Ben-Hamza, and H. Krim, "A generalized divergence measure for robust image registration," IEEE Transactions Signal Processing, vol. 51, pp. 1211–1220, 2003.
[72] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Transactions Information Theory, vol. 49, pp. 1858–1860, 2003.
[73] H. Chui and A. Rangarajan, "A new point matching algorithm for non-rigid registration," Computer Vision and Image Understanding (CVIU), vol. 89, pp. 114–141, 2003.
[74] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.
[75] A. L. Yuille, P. Stolorz, and J. Utans, "Statistical physics, mixtures of distributions, and the EM algorithm," Neural Comput., vol. 6, no. 2, pp. 334–340, 1994.
[76] R. Bansal, L. Staib, Z. Chen, A. Rangarajan, J. Knisely, R. Nath, and J. Duncan, "Entropy-based, multiple-portal-to-3D CT registration for prostate radiotherapy using iteratively estimated segmentation," in Medical Image Computing and Computer-Assisted Intervention, 1999, pp. 567–578.
[77] A. Yezzi, L. Zollei, and T. Kapur, "A variational framework for joint segmentation and registration," in IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 2001, pp. 388–400.
[78] P. Wyatt and J. Noble, "MRF-MAP joint segmentation and registration," in Medical Image Computing and Computer-Assisted Intervention, 2002, pp. 580–587.
[79] N. Paragios, M. Rousson, and V. Ramesh, "Knowledge-based registration & segmentation of the left ventricle: A level set approach," in WACV, 2002, pp. 37–42.
[80] B. Fischl, D. Salat, E. Busa, M. Albert, et al., "Whole brain segmentation: Automated labeling of neuroanatomical structures in the human brain," Neuron, vol. 33, pp. 341–355, 2002.
[81] S. Soatto and A. J. Yezzi, "DEFORMOTION: Deforming motion, shape average and the joint registration and segmentation of images," in European Conference on Computer Vision, 2002, pp. 32–57.
[82] B. C. Vemuri, Y. Chen, and Z. Wang, "Registration assisted image smoothing and segmentation," in European Conference on Computer Vision, 2002, pp. 546–559.
[83] T. Zhang and D. Freedman, "Tracking objects using density matching and shape priors," in IEEE International Conference on Computer Vision, 2003, pp. 1056–1062.
[84] F. Wang, B. C. Vemuri, M. Rao, and Y. Chen, "A new & robust information theoretic measure and its application to image alignment," in Proc. Information Processing in Medical Imaging, 2003, pp. 388–400.
BIOGRAPHICAL SKETCH
Fei Wang was born in Yan Cheng, JiangSu, P. R. China. He received his Bachelor of
Science degree from the University of Science and Technology of China, P. R. China, in
2001. He earned his Master of Science and Doctor of Philosophy degrees from the
University of Florida in December 2002 and August 2006, respectively. His research
interests include medical imaging, computer vision, pattern recognition, computer
graphics and shape modeling.