Introducing Anisotropic Minkowski Functionals for Local Structure Analysis and
Prediction of Biomechanical Strength of Proximal Femur Specimens
By
Titas De
Submitted in Partial Fulfillment of the
Requirements for the Degree
Master of Science
Supervised by
Professor Axel W.E. Wismüller
Department of Electrical and Computer Engineering
Arts, Sciences and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences
University of Rochester
Rochester, New York
2013
Biographical Sketch
The author was born in Kolkata, India. He attended the Institute of
Engineering and Management under West Bengal University of Technology, and
graduated with a Bachelor of Technology degree in Electronics and
Communication Engineering with an emphasis on Digital Signal and Image
Processing in July 2011. He began interdisciplinary graduate studies in Electrical
and Computer Engineering at the University of Rochester in August 2011 with
continued study on Digital Signal Processing, Digital Image Processing, Computer
Vision and Medical Imaging. He was awarded merit-based tuition scholarships
from 2011 to 2013. During this time, he pursued research in the Computational
Radiology Lab under Dr. Wismüller (M.D., Ph.D.), a radiologist.
The following presentations and publications were a result of work conducted
during this study:
Axel Wismüller, Titas De, Eva Lochmüller, Felix Eckstein and Mahesh B.
Nagarajan, "Introducing Anisotropic Minkowski Functionals and Quantitative
Anisotropy Measures for Local Structure Analysis in Biomedical Imaging",
Proceedings of SPIE Medical Imaging Conference, 2013.
Acknowledgements
I would like to thank my committee members, Dr. Mark Bocko and Dr.
Kevin Parker, and my advisor Dr. Axel Wismüller, for their attention and guidance
during the course of my research.
Further, I greatly appreciate the direction and guidance received from my
senior lab members Mahesh Nagarajan and Chien-chun Yang, from the
Department of Biomedical Engineering. Their guidance and direction helped me
successfully overcome the challenges and the difficulties encountered
throughout my thesis research and academic life in Rochester, NY.
My thanks also go to all my collaborators, the professors, faculty
members and researchers from different parts of the world, for their help and
support. They include Dr. Felix Eckstein and Dr. Eva Lochmüller from Paracelsus
Medical University in Salzburg, Austria.
Finally, I would also like to thank the Department of Electrical and
Computer Engineering at the University of Rochester for their support through
tuition scholarships.
Abstract
Bone fragility and fracture caused by osteoporosis or injury are prevalent
in adults over the age of 50 and can reduce their quality of life. Hence, predicting
the biomechanical bone strength, specifically of the proximal femur, through
non-invasive imaging-based methods is an important goal for the diagnosis of
osteoporosis as well as for estimating fracture risk. Dual-energy X-ray absorptiometry (DXA)
has been used as a standard clinical procedure for assessment and diagnosis of
bone strength and osteoporosis through bone mineral density (BMD)
measurements. However, previous studies have shown that quantitative
computed tomography (QCT) can be more sensitive and specific for trabecular
bone characterization because it reduces the overlap effects and interferences
from the surrounding soft tissue and cortical shell.
This study proposes a new method to predict the bone strength of
proximal femur specimens from quantitative multi-detector computed
tomography (MDCT) images. Texture analysis methods such as conventional
statistical moments (BMD mean), Isotropic Minkowski Functionals (IMF) and
Anisotropic Minkowski Functionals (AMF) are used to quantify BMD properties of
the trabecular bone micro-architecture. Combinations of these extracted
features are then used to predict the biomechanical strength of the femur
specimens using sophisticated machine learning techniques such as multi-
regression (MultiReg) and support vector regression with linear kernel (SVRlin).
The prediction performance achieved with these feature sets is compared to the
standard approach that uses the mean BMD of the specimens and multi-
regression models using root mean square error (RMSE).
The best prediction performance using Anisotropic Minkowski Functionals
(AMF) gives RMSE = 0.904 ± 0.105, which is significantly better than those
obtained using Isotropic Minkowski Functionals (RMSE = 1.585 ± 0.167) and DXA
BMD (RMSE = 0.960 ± 0.131).
Contributors and Funding Sources
This research was funded in part by the Clinical and Translational Science Award
5-28527 within the Upstate New York Translational Research Network (UNYTRN)
of the Clinical and Translational Science Institute (CTSI), University of Rochester,
and by the Center for Emerging and Innovative Sciences (CEIS), a NYSTAR-
designated center for Advanced Technology.
Table of Contents
Chapter No. Title Page
Chapter 1 Introduction 1
1.1 Motivation for this work 1
1.2 Computer-Aided Diagnosis: Principles and Mechanisms 2
1.3 Research Background and Focus 3
1.4 Experimental Materials and Data 6
Chapter 2 Feature Analysis 12
2.1 Conventional Statistical Features 12
2.2 Minkowski Functionals 13
2.3 Isotropic Minkowski Functionals 14
2.4 Anisotropic Minkowski Functionals 15
2.5 Features obtained for Prediction Performance 17
Chapter 3 Machine Learning Algorithms 18
3.1 Introduction 18
3.2 Linear Regression with one variable 21
3.3 Linear Regression with multiple variables (Multi-Regression) 24
3.4 Logistic Regression 28
3.5 Regularization 32
3.6 Support Vector Machines 35
3.7 Prediction Performance 39
Chapter 4 Bone Strength Prediction: Performance Results 41
4.1 Identification of Femur Region for Analysis 41
4.2 Conventional Statistical Features 42
4.3 Isotropic Minkowski Functionals 43
4.4 Anisotropic Minkowski Functionals 44
4.5 Prediction Performance Comparison and Conclusion 50
Chapter 5 Discussion 52
References 54
List of Tables
Table No. Title Page
1.1 Hounsfield Unit readings for selected substances 11
3.1 Comparison between Gradient Descent and Normal Equation 27
4.1 Values of investigated parameters for femur specimens 41
4.2 Prediction performance of DXA BMD and IMF features 44
4.3 Prediction performance of DXA BMD and AMF features 45
4.4 Prediction performance of DXA BMD combined with AMF and IMF features 51
List of Figures
Figure No. Title Page
1.1 Overview of the experiment setup and methods used 5
1.2 MDCT images of selected femur specimens 7
1.3 Results of ROI-fitting and BMD conversion in selected specimens 11
2.1 2D Gaussian kernels oriented in different directions 16
4.1 Scatter plots showing relationship between measured Failure Load and QCT BMD 42
4.2 Prediction performance of DXA BMD versus MF.Volume (AMF + IMF) 46
4.3 Prediction performance of DXA BMD versus DXA BMD + MF.Volume (AMF + IMF) 47
4.4 Prediction performance of DXA BMD versus MF.Surface (AMF + IMF) 47
4.5 Prediction performance of DXA BMD versus DXA BMD + MF.Surface (AMF + IMF) 48
4.6 Prediction performance of DXA BMD versus MF.Mean Breadth (AMF + IMF) 48
4.7 Prediction performance of DXA BMD versus DXA BMD + MF.Mean Breadth (AMF + IMF) 49
4.8 Prediction performance of DXA BMD versus MF.Euler Characteristic (AMF + IMF) 49
4.9 Prediction performance of DXA BMD versus DXA BMD + MF.Euler Characteristic (AMF + IMF) 50
Chapter 1
Introduction
1.1 Motivation for this work
Examining and interpreting medical images such as MRI and CT can be a
tedious and exhausting task for radiologists; extracting relevant and precise
clinical findings for correct clinical decision-making requires extensive training
and clinical experience [1]. Even with such training and experience, clinical
findings can be overlooked or misinterpreted for various reasons, including
distraction, reader fatigue and overlapping anatomical structures [2-6]. In
addition, interpretation is subject to inter-observer variation, which can also
lead to incorrect decisions. Finally, the inherent constraints of the human
eye-brain visual system limit the ability of radiologists to discern brightness,
morphology and patterns on medical images [7, 8]. As a consequence, making
precise and objective interpretations of clinical findings remains challenging.
Computer-aided diagnosis (CADx) systems, on the other hand, allow
extraction and analysis of image features that are inaccessible to the human
eye-brain visual system and thus provide a more objective and consistent
assessment, which can be used as a complementary opinion to help radiologists
in clinical evaluations [4, 8]. This work proposes a novel CADx system in the
skeletal disease setting, in order to improve accuracy in the diagnosis of
osteoporosis and in fracture risk prediction. A brief summary of the CADx
mechanism, the research background of the relevant diagnostic modality, and
the proposed solution are given below.
1.2 Computer-Aided Diagnosis: Principles and Mechanisms
CADx [1-6], as used in this study, can be divided into four stages:
region/volume of interest (ROI/VOI) selection, texture feature extraction,
decision/regression determining algorithm, and decision output, all of which are
described below.
CADx usually begins with ROI/VOI selection – regions that contain relevant
information for clinical findings such as lesions, or anatomical sites such as
vertebral body or femur head are selected for further detailed investigation.
ROI/VOI selection can be accomplished by manual, semi-automatic or fully
automatic methods.
Texture feature extraction utilizes texture feature analysis methods,
evolved over many years, to extract quantified features that characterize
patterns on an image. Some popular texture analysis methods include
conventional statistics, Minkowski Functionals (MF), Gray-Level Co-occurrence
Matrix (GLCM) [9, 10] and Scaling Index Method (SIM) [11-13]. Although many
texture feature analysis methods exist, the ultimate purpose is similar – to
extract features from the ROIs/VOIs of medical images.
The extracted features are used to construct a mathematical model using
a decision/regression algorithm, also known as machine learning algorithm. This
model is subsequently used to provide quantitative analysis for undetermined
cases where the designated features are provided.
The outcome of such a system can serve as a complementary opinion and
assist radiologists in clinical decision making.
1.3 Research Background and Focus
Research Background
Osteoporosis, a disease related to the imbalance between trabecular bone
formation and resorption, is one of the most common age-related diseases
affecting elderly people [18]. The progression of osteoporosis can lead to
osteoporotic fractures, which not only reduce the quality of life but also
increase the mortality rate [18]. Previous studies have predicted that the
population at risk of osteoporotic fracture will reach 6.26 million worldwide by
the year 2050 [19, 20]. Thus, accurate prediction of osteoporotic fracture risk is
an important aid for clinical assessment and management of osteoporosis [21-25].
Dual-energy X-ray absorptiometry (DXA) has been the standard technique
for measuring bone quality in terms of bone mineral density (BMD) for purposes
of osteoporotic fracture risk estimation [20-24]. BMD measurements through
DXA at the site of the proximal femur have been shown to be highly predictive of
bone fractures when compared to other sites [24-27]. Such BMD measurements can
contribute to increased accuracy in bone fracture risk assessment at the hip.
However, BMD measurements alone do not capture a complete profile
of the trabecular bone microarchitecture, which leads to some inconsistency in
osteoporosis diagnosis. Kanis et al. suggested that the presence of normal values
of BMD within the average range does not necessarily indicate the absence of
osteoporosis but rather a lower risk of developing osteoporosis or related
fractures [25, 29]. In fact, BMD measurements for people with and without
prevalent femur fractures have been shown to overlap, which indicates that
other factors need to be taken into account for bone strength estimation [29]. In
addition, previous studies have also suggested that DXA-derived BMD
measurements are adversely affected by interference from the surrounding cortical
shell, adipose tissue and soft tissue, which results in inaccurate bone
strength estimates and can mislead diagnostic interpretation [26-28, 21-35].
Quantitative computer tomography (QCT), in contrast with DXA, can be
used to eliminate any interference from surrounding tissue and allow a direct and
independent estimation of either the cortical or trabecular compartment; thus
providing an exclusive measure of BMD in the trabecular compartment.
Therefore, QCT can be used to improve the efficacy of bone loss and
fracture risk assessment, as has been previously demonstrated in spinal
fracture studies. In fact, such studies showed that QCT measurements in the
central trabecular region of interest excluded sources of error such as
osteophytes and hypertrophic posterior elements, which may artificially elevate
integral BMD measures and reduce diagnostic efficacy [34-37].
Research Focus
Although BMD measured by QCT is strongly correlated with fracture risk, it
is still not a satisfactory predictor for bone strength due to variations in bone
morphology and structure [38]. Therefore, improving the accuracy of in-vivo
estimation of the biomechanical strength of proximal femurs through novel
techniques is an important goal in osteoporosis research. In this regard, previous
studies have reported that QCT-derived BMD, when used in combination with
anatomical variables such as bone volume, trabecular separation or femoral hip
axis length (HAL), exhibits better bone strength estimation than DXA-derived
BMD in the femur [37-39]. Such findings indicate that bone features other than
BMD may also play a role in determining bone strength [32, 37-39].
Therefore, we propose an improved characterization of trabecular bone,
as visualized on multi-detector CT images, with higher order geometric feature
vectors derived from Isotropic and Anisotropic Minkowski Functionals [9, 10].
Such features, along with conventionally used BMD measurements, are then
used to construct bone strength prediction models with different supervised
machine learning techniques such as multi-regression and support vector
regression with linear kernel [13-16], and the ability of such models to predict the
bone strength is evaluated. The following figure summarizes the
experimental setup and data presented in this research.
Figure 1.1 - Overview of the experimental setup and methods used. The trabecular features (BMD mean, Isotropic and Anisotropic Minkowski Functionals) were extracted from VOIs annotated on MDCT images of the femur specimens, post-processed to facilitate conversion of intensity values from Hounsfield units to BMD values. Two function approximation methods, i.e. multi-regression and support vector regression analysis, were then used to predict the failure load (FL); the similarity between predicted FL and actual values determined through biomechanical testing was quantified through RMSE.
1.4 Experimental Materials and Data
This section describes our experimental materials and the relevant pre-
processing procedures. These include femur specimens, imaging modalities, VOI
selection and biomechanical test, and bone mineral density unit conversion.
Femur Specimens
Left femoral specimens were harvested from fixed human cadavers over a
time period of four years. The donors had dedicated their body to the
investigators at the Institute of Anatomy and Musculoskeletal Research,
Paracelsus Private Medical University Salzburg for educational and research
purposes prior to death, in line with local institutional and legislative
requirements. To exclude donors with diffuse metastatic bone disease and
hematological or metabolic bone disorders other than osteoporosis, biopsies
were obtained from the iliac crest and examined histologically as part of the
general research protocol. The histological assessment was performed by a
surgeon who had been trained as a pathologist for 3 years with a focus on bone
pathology. Specimens where signs of fractures were detected either in
radiographs or during preparation as well as specimens that displayed a fracture
of the femoral shaft (rather than of the proximal femur) during the mechanical
test were excluded. Using the above criteria, a subset of 146 human femur
specimens were used for this study. The bones were removed from the cadavers
with a variable amount of surrounding soft tissues. To create uniform scanning
conditions, the soft tissue surrounding the bones was removed for imaging and
biomechanical testing.
Multi-Detector Computed Tomography (MDCT)
Cross-sectional images of the femora were acquired using a 16-row multi-
detector (MD)-CT scanner (Sensation 16; Siemens Medical Solutions, Erlangen,
Germany). The specimens were placed in plastic bags filled with 4%
formalin/water solution. Air was removed with a vacuum pump and plastic bags
were sealed. These were positioned in the scanner as in an in-vivo exam of the
pelvis and proximal femur with mild internal rotation of the femur. Each
specimen was scanned once, except for 3 specimens which were scanned twice for
precision measurements, with a protocol using collimation and table feed of 0.75
mm, and a reconstruction index of 0.5 mm. A high resolution reconstruction
algorithm (kernel U70u) was used, resulting in an in-plane resolution of 0.19 x
0.19 mm2 and an anisotropic voxel size of 0.19 x 0.19 x 0.5 mm3. A tube voltage of
120 kVp was used with a tube current of 100 mA. The image matrix was 512 x 512 pixels, with a
field of view of 100 mm. For calibration purposes, a reference phantom (Osteo
Phantom, Siemens) was placed below the specimens (Fig. 1.2).
Figure 1.2 - MDCT images of selected femur specimens. From left to right, the specimens are categorized as high, medium and low biomechanical strength, respectively, based on failure load tests. The Osteo phantom used for each specimen is also shown at the bottom.
Image Processing and Volume of Interest (VOI) Selection
The outer surface of the cortical shell of the femur was segmented by
using bone attenuations of the phantom in each image. The specimens were
segmented automatically; however, the shape of the binary mask was manually
corrected if errors in segmentation occurred due to a thin cortical shell caused by
high grade focal bone loss or to adjacent anatomic structures such as blood
vessels penetrating the cortex. The corrections for all specimens were performed
by one of two radiologists. Based on a priori knowledge about the orientation of
the specimens in the CT scans, together with the size, shape and center of mass of
the contours of consecutive slices, the superior part of the femoral head was
identified automatically. A sphere was fitted to the superior surface points of
the femoral head using a Gauss-Newton least-squares technique. The fitted
sphere was scaled down to 75% of its original size to account for cortical bone
and shape irregularities like the fovea capitis, and then saved as the femoral head
volume of interest (VOI). Because a cylinder approximates the shape of the
femur neck, a cylindrical VOI was computed and automatically fitted to the neck
region using a procedure similar to that of the head VOI selection. The resulting
cylinder was saved as the femur neck VOI.
For the trochanter VOI selection, a cone-like VOI was fitted to the
trochanter region based on the bone surface points relative to the neck axis,
using the surface regions corresponding to the trochanter, the inferior part of
the neck and the superior part of the shaft. The main eigenvector of these
regions was used as an initial estimate of the axis of a cone that was fitted to
the bone surface points in these regions. Bone surface points in these regions
were matched to the fitted cone axis and to the original neck axis. The
trochanter bone surface points were then saved as the trochanter VOI. Further
details of the VOI selection algorithms can be found in Huber et al. [43].
Biomechanical Tests
The failure load was assessed using a side-impact test simulating a lateral
fall on the greater trochanter, as described in [54]. Briefly, the femoral shaft
and the downward-facing head could be moved independently of one another while
the load was applied to the greater trochanter using a universal materials testing
machine (Zwick 1445, Ulm, Germany) with a 10 kN force sensor and dedicated
software. The failure load was defined as the peak of the load-deformation curve.
VOI extraction and BMD conversion
The first step was to extract the trabecular VOI from the original MDCT images
(shown in Fig. 1.2). These MDCT images were segmented by the pre-defined VOIs
corresponding to the head, neck and trochanter regions. Three different shapes of
VOI (sphere, cylinder and cone) were designed to fit the different regions (head,
neck and trochanter) of the femur specimens (Fig. 1.3), as described in Huber et al.
[43]. Within each of the extracted VOIs, the Hounsfield Unit (HU) values are converted
into BMD units (mg/cm3) based on the HU values of the Osteo calibration phantom and
the following equation:
BMD = [HA_B / (HU_B - HU_W)] * (HU - HU_W)    (1)
Each of the above variables is explained below.
The calibration phantom is composed of two portions of hydroxyapatite,
with hydroxyapatite density values of HA_W = 0 mg/cm3 and HA_B =
200 mg/cm3 for the water-like and bone-like parts of the calibration phantom,
respectively. In addition to these constants, HU_W and HU_B are the attenuations
(HU readings) from the MDCT image for the water-like and bone-like parts of the
phantom, respectively; the HU values of the water-like and bone-like parts of the
phantom were recorded for each slice throughout the scan.
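A minimal sketch of the conversion in Eq. (1) (function and argument names are illustrative):

```python
import numpy as np

def hu_to_bmd(hu_voi, hu_water, hu_bone, ha_bone=200.0):
    """Convert HU values inside a VOI to BMD (mg/cm^3) via Eq. (1).

    hu_water, hu_bone: phantom HU readings for the water-like and
    bone-like parts (HU_W, HU_B); ha_bone is the constant HA_B = 200 mg/cm^3.
    """
    hu_voi = np.asarray(hu_voi, dtype=float)
    return ha_bone / (hu_bone - hu_water) * (hu_voi - hu_water)
```

In practice, hu_water and hu_bone would be taken from the per-slice phantom readings recorded during the scan.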
The following table lists HU readings for selected substances:

Substance     HU
Air           -1000
Fat           -84
Water         0
Blood         +35 to +45
Muscle        +40
Soft Tissue   100 to 300
Bone          +700 (cancellous bone) to +3000 (dense bone)

Table 1.1 - Hounsfield Unit readings for selected substances. Air has a large negative HU reading, whereas fat has a small negative HU reading. Soft tissue has HU readings between 100 and 300. Bone tissue, depending on its density, has HU readings from 700 to 3000. Note that the BMD of the trabecular region ranges between -300 and 1400 after conversion from HU readings to BMD.
After segmentation, the Hounsfield Unit images within the VOIs were
converted into the BMD unit images (Figure 1.3).
Figure 1.3 - Results of ROI-fitting and BMD conversion in the specimens shown in Figure 1.2. (Top row) Three shapes of ROI (circle, quadrilateral and irregular shape) were fitted to the head, neck and trochanter regions of the femur specimens, respectively. ROI boundaries are overlaid on the corresponding MDCT images of the three regions; from left to right: head, neck and trochanter. Note that the three images are not shown at a consistent scale, since the head region is the largest, the trochanter second and the neck the smallest. (Bottom row) Hounsfield Unit (HU) images within each ROI are converted to the corresponding BMD values.
After the ROI selection and the BMD conversion, the BMD images are then
ready for feature extraction and analysis, which are discussed in Chapter 2.
Chapter 2
Feature Analysis
Feature analysis techniques are used to represent massive amounts of original or
raw information, as found in medical imaging (for example), in a more compact
and concise manner. Once the large volume of medical images is represented by
a compact set of features, these features can be used to construct mathematical
models with machine learning techniques.
This chapter describes three different feature extraction techniques used
to characterize the femur BMD images in this study. These methods include the
conventional statistical features, Isotropic Minkowski Functionals (IMF) and
Anisotropic Minkowski Functionals (AMF).
2.1 Conventional Statistical Features
Conventional statistical features are among the most common and
simplest features used in pattern recognition. Here, the BMD distributions within
VOIs (for 3D images) and ROIs (for 2D images) are characterized by their
statistical moments. The Dual-Energy X-ray Absorptiometry (DXA) Bone Mineral
Density (BMD) images are 2D, whereas the Quantitative Computed Tomography
(QCT) Bone Mineral Density (BMD) images are 3D. The current clinical standard
for bone density evaluation uses the DXA BMD obtained from the 2D DXA image
of the bone, but in this work the morphometric features (IMF and AMF) are
extracted from the 3D QCT images.
2.2 Minkowski Functionals
The concept of Minkowski Functionals is explained in detail in the paper
"Integral-geometry morphological image analysis" by Michielsen and De
Raedt [9]. In short, given a 2D black-and-white image, we can compute the
3 Minkowski Functionals (Area, Perimeter and Euler Characteristic) of the
whole image using the following formulas:

Area = ns
Perimeter = -4*ns + 2*ne
Euler Characteristic = ns - ne + nv

Here ns is the total number of white pixels, ne the total number of edges, and
nv the total number of vertices.
Similarly, given a 3D black-and-white image, we can compute the 4 Minkowski
Functionals (Volume, Surface Area, Mean Breadth and Euler Characteristic) of
the entire binary 3D volume using the following formulas:

Volume = ns
Surface = -6*ns + 2*nf
Mean Breadth = 3*ns - 2*nf + ne
Euler Characteristic = -ns + nf - ne + nv

Here ns is the total number of white voxels, ne the total number of edges, nv
the total number of vertices, and nf the total number of faces.
2.3 Isotropic Minkowski Functionals
As described above, the four Minkowski Functionals (MFs) of a 3D image
(Volume, Surface, Mean Breadth and Euler Characteristic) measure the
topological characteristics of the entire image as a whole. In this study,
instead of calculating the Minkowski Functionals for the entire 3D image, I
calculate them for each white voxel in the binary image using the information
in the local neighborhood of that voxel. The neighborhood voxels, including
the central voxel, are first weighted by a pre-defined kernel of the same size
as the neighborhood, and the resulting weighted voxels are used to calculate
what may be called kernel Minkowski Functionals. Thus, instead of getting just
one value for each Minkowski Functional, we now get a column vector of values,
whose size depends on the number of white voxels in the image.
As an example, consider a 3D black-and-white (binary) image with white
voxels represented by 1s and black voxels by 0s, and let the size of the image
be M x N x P. Let the total number of white voxels in the image be NWP
(< M*N*P), and suppose we use a kernel of dimensions m x n x p to compute the
kernel Minkowski Functionals. The output is a set of 4-D row vectors, each
containing the Volume, Surface, Mean Breadth and Euler Characteristic values
for one white voxel, obtained using the kernel of size m x n x p. The number
of such row vectors equals the number of white voxels in the image, NWP. In
short, the output is an NWP x 4 matrix.
The choice of a suitable kernel is very important, as the kernel describes
the local texture features in the image. The simplest kernel is a plain cubic
kernel with all weights equal to 1. Such a kernel is isotropic in nature, i.e.
it does not change its shape when rotated in any direction. We have named the
kernel Minkowski Functionals obtained using an isotropic kernel (such as a
plain cubic kernel) Isotropic Minkowski Functionals. Among isotropic kernels,
we can also use a Gaussian kernel, which is isotropic (rotation invariant)
when its standard deviation is the same along all three axes (x, y and z).
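As a rough sketch of the per-voxel computation, assuming a plain cubic all-ones kernel and showing only the Volume functional (the other three functionals would use the edge, face and vertex counts of Section 2.2 in the same way):

```python
import numpy as np

def local_mf_volume(vol, m=3):
    """Per-voxel 'kernel' Minkowski Volume with a plain cubic kernel.

    For each white voxel, returns the kernel-weighted count of white
    voxels in its m x m x m neighborhood (zero-padded at the borders).
    The result is a vector of length NWP, the number of white voxels;
    stacking all four functionals would give the NWP x 4 matrix.
    """
    vol = np.asarray(vol, dtype=float)
    r = m // 2
    padded = np.pad(vol, r)                  # zero-pad the borders
    kernel = np.ones((m, m, m))              # isotropic cubic kernel
    out = []
    for i, j, k in np.argwhere(vol > 0):     # loop over white voxels only
        patch = padded[i:i + m, j:j + m, k:k + m]
        out.append(np.sum(kernel * patch))   # weighted voxel count (ns)
    return np.array(out)
```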
2.4 Anisotropic Minkowski Functionals
We now want to impose anisotropy, or direction specificity, on the
measurement of our Minkowski Functionals. So instead of using a Gaussian
kernel that is rotation invariant, i.e. having the same standard deviation
along all three axes, we use Gaussian kernels that have a larger standard
deviation in one specific direction than in the two other orthogonal
directions. (Note: the three directions do not have to be the x, y and z axes;
they can be any three orthogonal directions in 3D space.) As before, we
calculate the kernel Minkowski Functionals for each white voxel, but this
time for a number of kernels oriented in different directions.
Before we discuss this any further, consider direction-oriented kernels.
Although we are working with 3D images and coordinates, the direction
orientation is explained here in 2D for simplicity.
Figure 2.1 - 2D Gaussian kernels oriented in different directions.
Figure 2.1 shows 4 Gaussian kernels oriented at angles of 0, 45, 90 and 135
degrees, respectively, giving a sense of how orientation would look in 3D
coordinate space. The difference between 2D and 3D orientation is that in 2D
only one angle matters, namely theta (the angle between the projections on the
x and y axes), while in 3D both theta and phi matter, where phi is the angle
between the projection in the xy plane and the z axis.
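A direction-oriented 3D Gaussian kernel of the kind described here might be constructed as follows; this is a sketch, and the parameterization by one long and one short standard deviation is an assumption rather than the thesis implementation:

```python
import numpy as np

def oriented_gaussian_kernel(size, sigma_long, sigma_short, direction):
    """size x size x size Gaussian kernel elongated along `direction`.

    The kernel has standard deviation sigma_long along the (unit) axis
    `direction` and sigma_short in the two orthogonal directions, then
    is normalized to sum to 1.
    """
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    r = size // 2
    # All voxel offsets in the cube, as an (size^3, 3) array.
    g = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1].reshape(3, -1).T
    along = g @ d                               # component along long axis
    perp2 = np.sum(g ** 2, axis=1) - along ** 2 # squared orthogonal distance
    k = np.exp(-0.5 * (along ** 2 / sigma_long ** 2
                       + perp2 / sigma_short ** 2))
    k /= k.sum()
    return k.reshape(size, size, size)
```

Rotating the kernel then amounts to changing `direction`, e.g. sampling directions over the theta and phi angles described above.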
So we use Gaussian kernels oriented in different directions in 3D space
to calculate the kernel-wise Minkowski Functionals. In the end, what we get
for each Minkowski Functional (Volume, Surface, Mean Breadth and Euler
Characteristic) is a set of column vectors containing the Minkowski Functional
values for each direction. In short, each white voxel has a set of values for
each Minkowski Functional. We then use this set of values and Principal
Component Analysis to find the resultant angles [a) theta, the angle between
the projected values in the x and y axes; b) phi, the angle between the
projected values in the z axis and the xy plane] and also the fractional
anisotropy (the degree of anisotropy or direction specificity) for each white
voxel. Fractional Anisotropy (FA) is obtained from the three eigenvalues
λ1, λ2 and λ3 of the principal component analysis using the formula

FA = sqrt(1/2) * sqrt( [(λ1 - λ2)^2 + (λ2 - λ3)^2 + (λ3 - λ1)^2] / (λ1^2 + λ2^2 + λ3^2) )
In summary, for each Minkowski Functional we now have 3 column vectors:
theta, phi and the fractional anisotropy (FA). Theta and phi can contain
values only between 0 and 180 degrees, and FA can have values only between 0
and 1. This is in contrast to the Isotropic Minkowski Functionals, where the
minimum and maximum limits depend on the local structure of the 3D image and
also on the size and characteristics of the kernel used.
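A minimal sketch of the FA computation, using the standard fractional-anisotropy definition on three eigenvalues assumed to come from the principal component analysis (the function name is illustrative):

```python
import numpy as np

def fractional_anisotropy(eigvals):
    """FA of three eigenvalues: 0 = fully isotropic, 1 = fully anisotropic."""
    l1, l2, l3 = eigvals
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    return float(np.sqrt(0.5 * num / den))
```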
2.5 Features obtained for Prediction Performance
After obtaining the FA, theta and phi vectors for each Minkowski Functional,
we extract histograms of values from them with pre-defined bin centers.
These histograms are the Anisotropic Minkowski Functionals (AMF) features
which are used in our research for prediction performance. For Isotropic
Minkowski Functionals, which do not have universal minimum and maximum
limits, we first find the min and max limits from the training set. Then we
use these limits to define the bin centers of the histogram. Ultimately we
obtain the histogram features for the entire dataset, which are then fed
into machine learning techniques to obtain prediction performance.
Chapter 3
Machine Learning Algorithms
3.1 Introduction
Two definitions of Machine Learning are offered. Arthur Samuel described
it as: "the field of study that gives computers the ability to learn without being
explicitly programmed." This is an older, informal definition.
Tom Mitchell provides a more modern definition: "A computer program is
said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
Machine Learning involves programming computerized mathematical
models to optimize a performance criterion using example training data or past
experience. Such models are defined with weight parameters in the sense of
weighting the importance of different attributes or features. The model may be
predictive i.e. to make future predictions, or descriptive, i.e. to gain knowledge
from data, or both [49].
Machine Learning uses the theory of statistics in building mathematical
models, denoted as Solution, Decision Function, Target Function, Hypothesis or
Classifiers, where the core task is drawing inferences from a sample. The role of
computer science is two-fold: First, in training, we need efficient algorithms,
known as learning algorithms, to solve the optimization problem, as well as to
store and process the massive amount of training data or training set we
generally have. Second, once a model is learned, its representation and
algorithmic solution for inference or prediction needs to be efficient as well. In
certain applications, the efficiency of the learning or inference algorithms,
namely, its space and time complexity, may be as important as its predictive
accuracy [50].
Supervised Learning
In supervised learning, we are given a data set and already know what our
correct output should look like, having the idea that there is a relationship
between the input and the output [51].
Supervised learning problems are categorized into "regression" and
"classification" problems. In a regression problem, we are trying to predict results
within a continuous output, meaning that we are trying to map input variables to
some continuous function. In a classification problem, we are instead trying to
predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories.
Examples: given data about the size of houses on the real estate market, try to
predict their price. Price as a function of size is a continuous output, so this is a
regression problem.
We could turn this example into a classification problem by instead making our
output about whether the house "sells for more or less than the asking price."
Here we are classifying the houses based on price into two discrete categories.
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with
little or no idea what our results should look like. We can derive structure from
data where we don't necessarily know the effect of the variables [51].
We can derive this structure by clustering the data based on relationships among
the variables in the data.
With unsupervised learning there is no feedback based on the prediction results,
i.e., there is no teacher to correct you. It’s not just about clustering. For example,
associative memory is unsupervised learning.
Example (clustering): take a collection of 1000 essays written on the US
Economy, and find a way to automatically group these essays into a small
number of groups that are somehow similar or related by different variables,
such as word frequency, sentence length, page count, and so on.
Suppose a doctor over years of experience forms associations in his mind
between patient characteristics and illnesses that they have. If a new patient
shows up then based on this patient’s characteristics such as symptoms, family
medical history, physical attributes, mental outlook, etc. the doctor associates
possible illness or illnesses based on what the doctor has seen before with similar
patients. This is not the same as rule based reasoning as in expert systems. In this
case we would like to estimate a mapping function from patient characteristics
into illnesses.
3.2 Linear Regression with one variable
Model Representation
Recall that in regression problems, we are taking input variables and trying to
map the output onto a continuous expected result function.
Linear regression with one variable is also known as "univariate linear
regression." Univariate linear regression is used when you want to predict a
single output value from a single input value. We're doing supervised learning
here, so that means we already have an idea what the input/output cause and
effect should be.
Our hypothesis function has the general form: hθ(x) = θ0 + θ1x. We choose
values for θ0 and θ1 to get our output 'y'. In other words, we are trying to create a
function called hθ that is able to reliably map our input data (the x's) to our
output data (the y's).
Example:
x (input) y (output)
0 4
1 7
2 7
3 8
Now we can make a random guess about our hθ function: θ0 = 2 and θ1 = 2.
The hypothesis function becomes hθ (x) = 2 + 2 x.
So for input of 1 to our hypothesis, y will be 4. This is off by 3.
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function.
This takes an average (actually a sophisticated version of an average) of all the
results of the hypothesis with inputs from x's compared to the actual output y's.
The cost function is:
J(θ0, θ1) = (1/2m) ∑_{i=1}^{m} ( hθ(x(i)) − y(i) )² , where m is the size of the training set.
You can think of this equation as taking the average of the differences of all the
results of our hypothesis and the actual correct results.
Now we are able to concretely measure the accuracy of our predictor function
against the correct results we have, so that we can predict new results we don't
have.
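To make this concrete, the cost function can be evaluated directly on the toy data and the guess θ0 = 2, θ1 = 2 from the previous section; this is a minimal sketch, not code from the thesis.

```python
def J(theta0, theta1, xs, ys):
    """Mean squared error cost J(θ0, θ1) = (1/2m) Σ (hθ(x) − y)²."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [0, 1, 2, 3], [4, 7, 7, 8]   # toy data from the text
cost = J(2, 2, xs, ys)                # 1.75 for the guess θ0 = 2, θ1 = 2
```

A better pair of parameters gives a smaller cost, which is exactly what gradient descent (next) searches for.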
Gradient Descent
So we have our hypothesis function and we have a way of measuring how
accurate it is. Now what we need is a way to automatically improve our
hypothesis function. That's where gradient descent comes in.
Imagine that we graph our hypothesis function based on its parameters θ0 and θ1
(actually we are graphing the cost function for the combinations of parameters).
This can be kind of confusing; we are moving up to a higher level of abstraction.
We are not graphing x and y itself, but the guesses of our hypothesis function.
We put θ0 on the x axis and θ1 on the z axis, with the cost function on the vertical
y axis. The points on our graph will be the result of the cost function using our
hypothesis with those specific θ parameters.
We will know that we have succeeded when our cost function is at the very
bottom of the pits in our graph and our result is 0 (or close to it).
The way we do this is by taking the derivative (the line tangent to a function) of
our cost function. The slope of the tangent is the derivative at that point and it
will give us a direction to move towards. We step down that derivative by a
constant value called alpha (α).
The gradient descent equation is:
repeat until convergence:
θj := θj − α (∂/∂θj) J(θ0, θ1) , for j = 0 and j = 1
Intuitively, this could be thought of as:
repeat until convergence:
θj := θj − α · (slope, i.e. the derivative, at the current point)
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the
gradient descent equation can be derived. We can substitute our actual cost
function and our actual hypothesis function and modify the equation to:
repeat until convergence: {
θ0 := θ0 − α (1/m) ∑_{i=1}^{m} ( hθ(x(i)) − y(i) )
θ1 := θ1 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x(i) ) }
Here m is the size of the training set, θ0 is a constant that will be changing
simultaneously with θ1, and x(i), y(i) are values of the given training set (data).
Note that we have separated out the two cases for θj and that for θ1 we are
multiplying by x(i) at the end due to the derivative.
The point of all this is that if we start with a guess for our hypothesis and then
repeatedly apply these gradient descent equations, our hypothesis will become
more and more accurate.
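The two update equations above can be sketched as a short loop, applied here to the toy data set from Section 3.2. The learning rate alpha and the iteration count are illustrative choices, not values from the thesis.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=2000):
    """Univariate linear regression via the two updates above."""
    m = len(xs)
    t0 = t1 = 0.0
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1  # simultaneous update
    return t0, t1

t0, t1 = gradient_descent([0, 1, 2, 3], [4, 7, 7, 8])  # converges to ~ (4.7, 1.2)
```

The limit (θ0, θ1) ≈ (4.7, 1.2) is the least-squares fit to the toy data, a large improvement on the initial guess from the earlier example.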
3.3 Linear Regression with Multiple variables (Multi-Regression)
Multiple Features
Linear regression with multiple variables is also known as "multivariate linear
regression."
We now introduce notation for equations where we can have any number of
input variables.
xj(i) = value of feature j in the ith training example
x(i) = the column vector of all the input features of the ith training example
m = the number of training examples
n = |x(i)| (the number of features)
Now define the multivariable form of the hypothesis function as follows,
accommodating these multiple features:
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 +⋯+ θnxn
Using the definition of matrix multiplication, our multivariable hypothesis
function can be concisely represented as:
hθ(x) = [ θ0 θ1 ... θn ] [ x0 ; x1 ; ... ; xn ]T = θTx
This is a vectorization of our hypothesis function for one training example.
Now we can collect all m training examples each with n features and record them
in an n+1 by m matrix. In this matrix we let the values of the subscript (feature)
represent the row number (except the initial row is the "zeroth" row), and the
values of the superscript (the training example) represent the column number, as
shown below:
X = [ x0(1)  x0(2)  ⋯  x0(m)
      x1(1)  x1(2)  ⋯  x1(m)
       ⋮      ⋮          ⋮
      xn(1)  xn(2)  ⋯  xn(m) ]  =  [ x(1)  x(2)  ⋯  x(m) ]
Notice above that the first column is the first training example, the second
column is the second training example, and so forth.
Now we can define hθ(x) as a row vector that gives the value of hθ(x) at each of
the m training examples:
hθ(x) = [ θ0x0(1) + θ1x1(1) + ⋯ + θnxn(1)   ⋯⋯   θ0x0(m) + θ1x1(m) + ⋯ + θnxn(m) ]
But again using the definition of matrix multiplication, we can represent this
more concisely:
hθ(x) = [ θ0 θ1 ... θn ] [ x0(1)  x0(2)  ⋯  x0(m)
                           x1(1)  x1(2)  ⋯  x1(m)
                            ⋮      ⋮          ⋮
                           xn(1)  xn(2)  ⋯  xn(m) ]  =  θTX
Cost function
For the parameter vector θ (of type R^(n+1), or equivalently R^((n+1)×1)), the cost function is:
J(θ) = (1/2m) ∑_{i=1}^{m} ( hθ(x(i)) − y(i) )²
The vectorized version is:
J(θ) = (1/2m) (Xθ − y)T(Xθ − y) , where y denotes the vector of all y values.
Gradient Descent for Multiple Variables
The gradient descent equation itself is generally the same form; we just have to
repeat it for our 'n' features:
repeat until convergence: {
θ0 := θ0 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x0(i) )
θ1 := θ1 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x1(i) )
θ2 := θ2 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x2(i) )
…… }
In other words:
repeat until convergence: {
θj := θj − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) xj(i) ) , for j := 0,1,...,n }
The matrix notation (vectorized) of the Gradient Descent rule is:
θ := θ − (α/m) XT (Xθ − y)
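The vectorized rule can be sketched in a few lines. Note that the code assumes the one-example-per-row layout of X (with a leading column of ones for θ0), which is the convention under which the product Xθ is well defined; the toy data is the same small set used earlier, and alpha and the iteration count are illustrative.

```python
import numpy as np

def gd_vectorized(X, y, alpha=0.1, iters=2000):
    """Gradient descent via θ := θ − (α/m) Xᵀ(Xθ − y)."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= (alpha / m) * X.T @ (X @ theta - y)
    return theta

X = np.array([[1.0, 0], [1, 1], [1, 2], [1, 3]])  # leading ones column for θ0
y = np.array([4.0, 7, 7, 8])
theta = gd_vectorized(X, y)  # ~ [4.7, 1.2], the least-squares fit
```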
Normal Equation
The "normal equation" is a version of finding the optimum without iteration.
θ=(XTX)−1XTy
There is no need to do feature scaling with the normal equation.
The following is a comparison of gradient descent and the normal equation:
Gradient Descent Normal Equation
Need to choose alpha No need to choose alpha
Needs many iterations No need to iterate
Works well when n is large Slow if n is very large
Table 3.1 – Comparison between Gradient Descent and Normal Equation
With the normal equation, computing the inversion has complexity O(n3). So if
we have a very large number of features, the normal equation will be slow.
According to Andrew Ng (Professor at Stanford), when n exceeds 10,000 it might
be a good time to go from the normal equation to an iterative process.
Normal Equation Non-invertibility
When implementing the normal equation in Octave we want to use the 'pinv'
function rather than 'inv', i.e. we should use the pseudo-inverse rather than the
actual inverse.
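In Python, numpy's pinv plays the same role as Octave's; a one-line sketch on the earlier toy data:

```python
import numpy as np

X = np.array([[1.0, 0], [1, 1], [1, 2], [1, 3]])  # ones column + one feature
y = np.array([4.0, 7, 7, 8])

# Normal equation θ = (XᵀX)⁻¹ Xᵀy; pinv stays well behaved even when
# XᵀX is singular (the non-invertibility cases discussed next).
theta = np.linalg.pinv(X.T @ X) @ X.T @ y  # ~ [4.7, 1.2] in one step
```

This reaches, in a single step, the same solution that gradient descent approaches iteratively.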
XTX may be non-invertible. The common causes are:
Redundant features, where two features are very closely related (i.e. they
are linearly dependent)
Too many features (e.g. m ≤ n). In this case, delete some features or use
“regularization" (explained later).
Solutions to the above problems include deleting a feature that is linearly
dependent with another or deleting one or more features when there are too
many features.
3.4 Logistic Regression
Now we are switching from regression problems to classification problems. We
should not be confused by the name "Logistic Regression"; it is named that way
for historical reasons and is actually an approach to classification problems, not
regression problems.
Classification
Instead of our output vector y being a continuous range of values, it will only be 0
or 1 i.e. y ∈ {0,1}
0 is usually taken as "negative class" and 1 as "positive class", but you are free to
assign any representation to it. We're only doing two classes for now, and it is
called a "Binary Classification Problem."
One method is to use linear regression and map all predictions greater than 0.5
as a 1 and all less than 0.5 as a 0. This method doesn't work well because
classification is not actually a linear function.
Hypothesis Representation
Our hypothesis should satisfy: 0 ≤ hθ(x) ≤ 1
Our new form uses the "Sigmoid Function," also called the "Logistic Function",
which is as follows:
hθ(x) = g(θTx) , where g(z) = 1 / (1 + e^(−z))
It is the same as the old hypothesis function (for linear regression), except that
we are wrapping it in a call to g(), which is the Logistic Function.
hθ will give us the probability that our output is 1. For example, hθ(x) = 0.7 gives
us the probability of 70% that our output is 1.
hθ(x) = P(y=1|x ;θ) = 1−P(y=0|x ;θ)
Our probability that our prediction is 0 is just the opposite of our probability that
it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of
the hypothesis function as follows:
hθ(x) ≥ 0.5 → y = 1 ; hθ(x) < 0.5 → y = 0
The way our logistic function g behaves is that when its input is greater than or
equal to zero, its output is greater than or equal to 0.5 i.e. g(z) ≥ 0.5 when z ≥ 0.
When its input is less than zero, its output is less than 0.5, i.e. g(z) < 0.5 when z < 0.
Remember that:
z = 0 : e^0 = 1 , g(z) = 1/2
z → ∞ : e^(−z) → 0 , g(z) → 1
z → −∞ : e^(−z) → ∞ , g(z) → 0
So if our input to g is θTx, then that means: hθ(x) = g(θTx) ≥ 0.5 when θTx ≥ 0
From these statements we can now say: θTx ≥ 0 → y =1
θTx < 0 → y=0
Example:
θ = [5, −1, 0]T ; then y = 1 if 5 + (−1)·x1 + 0·x2 ≥ 0 , i.e. if x1 ≤ 5
The decision boundary is the line that separates the area where y=0 and where
y=1. It is created by our hypothesis function.
Again, our hypothesis function need not be linear, and could be a function that
describes a circle or any shape to fit our data.
Cost Function
We cannot use the same cost function that we use for linear regression because
the Logistic Function will cause the output to be wavy, causing many local
optima. In other words, it will not be a convex function.
Instead, our cost function for logistic regression looks like:
J(θ) = (1/m) ∑_{i=1}^{m} Cost( hθ(x(i)), y(i) )
Cost(hθ(x),y) = −log(hθ(x)) if y = 1
Cost(hθ(x),y) = −log(1−hθ(x)) if y = 0
The more our hypothesis is off from y, the larger the cost function’s output. If our
hypothesis is equal to y, then our cost is 0.
Cost(hθ(x),y) = 0 if hθ(x) = y
Cost(hθ(x),y) → ∞ if y = 0 and hθ(x) → 1
Cost(hθ(x),y) → ∞ if y = 1 and hθ(x) → 0
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis
function also outputs 0. If our hypothesis approaches 1, then the cost function
will approach infinity.
If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis
function outputs 1. If our hypothesis approaches 0, then the cost function will
approach infinity.
Simplified Cost Function and Gradient Descent
We can compress our cost function's two conditional cases into one case:
Cost(hθ(x),y) = − y log(hθ(x)) − (1−y) log(1−hθ(x))
Notice that when y is equal to 1, then the second term ((1−y) log(1−hθ(x))) will be
negated and will not affect the result. If y is equal to 0, then the first term
(− y log(hθ(x))) will be negated and will not affect the result.
We can fully write out our entire cost function as follows:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y(i) log( hθ(x(i)) ) + (1 − y(i)) log( 1 − hθ(x(i)) ) ]
A vectorized implementation is:
J(θ) = −(1/m) [ log(g(Xθ))T y + log(1 − g(Xθ))T (1 − y) ]
Gradient Descent
Remember that the general form of gradient descent is:
Repeat until convergence: {
θj := θj − α (∂/∂θj) J(θ) }
We can work out the derivative part using calculus to get:
Repeat until convergence: {
θj := θj − (α/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) xj(i) ) }
Notice that this algorithm is identical to the one we used in linear regression, but
the hypothesis function is different for linear and logistic regression. We still have
to simultaneously update all values in theta.
A vectorized implementation is:
θ := θ − (α/m) XT ( g(Xθ) − y )
3.5 Regularization
The Problem of Overfitting
Regularization is designed to address the problem of overfitting.
High bias or underfitting is when the form of our hypothesis maps poorly to the
trend of the data. It is usually caused by a function that is too simple or uses too
few features.
At the other extreme, overfitting or high variance is caused by a hypothesis
function that fits the available data but does not generalize well to predict new
data. It is usually caused by a complicated function that creates a lot of
unnecessary curves and angles unrelated to the data.
This terminology is applied to both linear and logistic regression.
There are two main options to address the issue of overfitting:
1. Reduce the number of features.
o Manually select which features to keep.
o Use a model selection algorithm.
2. Regularization
o Keep all the features, but reduce the magnitude of the parameters θj
Regularization works well when we have a lot of slightly useful features.
Regularized Linear Regression
Gradient Descent
We will modify our gradient descent function to separate out θ0 from the rest of
the parameters because we do not want to penalize θ0.
Repeat until convergence: {
θ0 := θ0 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x0(i) )
θj := θj − α [ (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) xj(i) ) + (λ/m) θj ] , j ∈ {1,2,...,n}
}
The term (λ/m) θj performs our regularization.
With some manipulation our update rule can also be represented as:
θj := θj ( 1 − α λ/m ) − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) xj(i) )
The first term in the above equation, ( 1 − α λ/m ), will always be less than 1.
Intuitively you can see it as reducing the value of θj by some amount on every
update.
Notice that the second term is now exactly the same as it was before.
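The regularized update can be sketched as below on the earlier toy data: θ0 is updated without the penalty, while each θj (j ≥ 1) is first shrunk by the factor (1 − αλ/m). The data and hyperparameter values are illustrative, not from the thesis.

```python
import numpy as np

def regularized_gd(X, y, lam=1.0, alpha=0.1, iters=2000):
    """Regularized linear regression via the shrink-then-step update."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    shrink = np.ones(X.shape[1])
    shrink[1:] = 1.0 - alpha * lam / m   # θ0 (index 0) is left unpenalized
    for _ in range(iters):
        grad = (1.0 / m) * X.T @ (X @ theta - y)
        theta = shrink * theta - alpha * grad
    return theta

X = np.array([[1.0, 0], [1, 1], [1, 2], [1, 3]])  # ones column + one feature
y = np.array([4.0, 7, 7, 8])
theta = regularized_gd(X, y, lam=10.0)  # slope shrunk relative to the lam = 0 fit
```

Setting lam = 0 recovers the unregularized fit, while larger lam pulls the slope toward zero.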
Regularized Logistic Regression
We can regularize logistic regression in a similar way that we regularize linear
regression. Let's start with the cost function.
Cost Function
Recall that our cost function for logistic regression was:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y(i) log( hθ(x(i)) ) + (1 − y(i)) log( 1 − hθ(x(i)) ) ]
We can regularize this equation by adding a term to the end:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y(i) log( hθ(x(i)) ) + (1 − y(i)) log( 1 − hθ(x(i)) ) ] + (λ/2m) ∑_{j=1}^{n} θj²
Note Well: the second sum, ∑_{j=1}^{n} θj², starts at j = 1 and thus explicitly
excludes the bias term θ0
Gradient Descent
Just like with linear regression, we will want to separately update θ0 and the rest
of the parameters because we do not want to regularize θ0.
Repeat until convergence: {
θ0 := θ0 − α (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) x0(i) )
θj := θj − α [ (1/m) ∑_{i=1}^{m} ( ( hθ(x(i)) − y(i) ) xj(i) ) + (λ/m) θj ] , j = 1,2,...,n
}
This is identical to the gradient descent function presented for linear regression.
3.6 Support Vector Machines
Optimization Objective
The Support Vector Machine (SVM) is yet another type of supervised
machine learning algorithm. It is sometimes cleaner and more powerful.
Recall that in logistic regression, we use the following rules:
if y=1, then hθ(x) ≈ 1 and θTx > 0
if y=0, then hθ(x) ≈ 0 and θTx < 0
Recall the cost function for (unregularized) logistic regression:
J(θ) = −(1/m) ∑_{i=1}^{m} [ y(i) log( hθ(x(i)) ) + (1 − y(i)) log( 1 − hθ(x(i)) ) ]
     = −(1/m) ∑_{i=1}^{m} [ y(i) log( 1 / (1 + e^(−θTx(i))) ) + (1 − y(i)) log( 1 − 1 / (1 + e^(−θTx(i))) ) ]
To make a support vector machine, we will modify the first term of the
cost function [ −log( hθ(x) ) = −log( 1 / (1 + e^(−z)) ) ] so that when θTx (from now
on, we shall refer to this as z) is greater than 1, it outputs 0. Furthermore, for
values of z less than 1, we shall use a straight decreasing line instead of the
sigmoid curve. (In the literature, this is called a hinge loss function.)
Similarly, we modify the second term of the cost function
[ −log( 1 − hθ(x) ) = −log( 1 − 1 / (1 + e^(−z)) ) ] so that when z is less than -1, it
outputs 0. We also modify it so that for values of z greater than -1, we use a
straight increasing line instead of the sigmoid curve.
We shall denote these as cost1(z) and cost0(z) respectively (note that
cost1(z) is the cost for classifying when y=1, and cost0(z) is the cost for classifying
when y=0), and we may define them as follows (where k is an arbitrary constant
defining the magnitude of the slope of the line):
z = θTx , cost0(z) = max(0, k(1+z)) , cost1(z) = max(0, k(1−z))
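These piecewise-linear costs are simple enough to write down directly. The sketch below fixes k = 1, since k only rescales the objective:

```python
def cost1(z, k=1.0):
    # Cost used when y = 1: zero once z >= 1, straight decreasing line below.
    return max(0.0, k * (1 - z))

def cost0(z, k=1.0):
    # Cost used when y = 0: zero once z <= -1, straight increasing line above.
    return max(0.0, k * (1 + z))
```

For instance, cost1 is 0 for any z above 1 and grows linearly as z falls below 1, mirroring the flattened-then-linear shape described above.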
Recall the full cost function from (regularized) logistic regression:
J(θ) = (1/m) ∑_{i=1}^{m} [ y(i) ( −log( hθ(x(i)) ) ) + (1 − y(i)) ( −log( 1 − hθ(x(i)) ) ) ] + (λ/2m) ∑_{j=1}^{n} θj²
Note that the negative sign has been distributed into the sum in the above
equation.
We may transform this into the cost function for support vector machines by
substituting cost0(z) and cost1(z):
J(θ) = (1/m) ∑_{i=1}^{m} [ y(i) cost1( θTx(i) ) + (1 − y(i)) cost0( θTx(i) ) ] + (λ/2m) ∑_{j=1}^{n} θj²
We can optimize this a bit by multiplying this by m (thus removing the m
factor in the denominators). Note that this does not affect our optimization, since
we're simply multiplying our cost function by a constant (for example, minimizing
(u−5)2+1 gives us 5; multiplying it by 10 to make it 10(u−5)2+10 still gives us 5
when minimized).
J(θ) = ∑_{i=1}^{m} [ y(i) cost1( θTx(i) ) + (1 − y(i)) cost0( θTx(i) ) ] + (λ/2) ∑_{j=1}^{n} θj²
Furthermore, convention dictates that we regularize using a factor C, instead of λ,
as given by the following:
J(θ) = C ∑_{i=1}^{m} [ y(i) cost1( θTx(i) ) + (1 − y(i)) cost0( θTx(i) ) ] + (1/2) ∑_{j=1}^{n} θj²
This is equivalent to multiplying the equation by C = 1/λ, and thus results in
the same values when optimized. Now, when we wish to regularize more, we
decrease C, and when we wish to regularize less, we increase C.
Finally, note that the hypothesis of the Support Vector Machine is not
interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of
logistic regression). Instead, it outputs either 1 or 0. (In technical terms, it is a
discriminant function.)
hθ(x) = { 1 if θTx ≥ 0 ; 0 otherwise }
Large Margin Intuition
A useful way to think about Support Vector Machines is to think of them
as Large Margin Classifiers.
If y=1, we want θTx ≥ 1 (not just ≥ 0)
If y=0, we want θTx ≤ −1 (not just < 0)
Now when we set our constant C to a very large value (e.g. 100,000), our
optimizing function will constrain θ so that the sum of the cost over all
examples equals 0. That is, we impose the following constraints on θ:
θTx(i) ≥ 1 if y(i) = 1, and θTx(i) ≤ −1 if y(i) = 0.
If C is very large, then we must choose θ parameters such that:
∑_{i=1}^{m} [ y(i) cost1( θTx(i) ) + (1 − y(i)) cost0( θTx(i) ) ] = 0
This reduces our cost function to:
J(θ) = C·0 + (1/2) ∑_{j=1}^{n} θj² = (1/2) ∑_{j=1}^{n} θj²
Recall the decision boundary from logistic regression (the line separating
the positive and negative examples). In SVMs, the decision boundary has the
special property that it is as far away as possible from both the positive and the
negative examples.
The distance of the decision boundary to the nearest example is called the
margin. Since SVMs maximize this margin, it is often called a Large Margin
Classifier.
The SVM will separate the negative and positive examples by a large margin.
This large margin is only achieved when C is very large.
Data is linearly separable when a straight line can separate the positive and
negative examples.
If we have outlier examples that we don't want to affect the decision boundary,
then we can reduce C.
Increasing and decreasing C is similar to increasing and decreasing λ because it
can simplify our decision boundary.
3.7 Prediction Performance
After the construction of mathematical models with multi-regression and
support vector regression, the models are used to predict the true label value.
During the prediction process, for any feature group, the entire data set is split at
random into 80% training set and 20% test set. These feature vectors from the
training set are used to construct the mathematical models by using Multi-
Regression or Support Vector Regression with linear kernel to predict failure load
values. Once the mathematical models are constructed, the feature vectors from
the test set are fed into the models to predict the failure load values, and
these predicted failure load values are compared with the ground truth by Root-
Mean Square Error (RMSE), defined as:
RMSE = √( (1/n) ∑_{i=1}^{n} ( ŷ(i) − y(i) )² ) , where ŷ(i) is the predicted failure
load, y(i) is the true failure load for the test set, and n is the size of the test set.
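One iteration of this evaluation protocol can be sketched as follows: a random 80/20 split, a linear fit on the training portion, and the RMSE on the held-out portion. The plain least-squares fit stands in for the Multi-Regression model; the feature and label arrays, and all names here, are illustrative placeholders.

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root-mean-square error between predicted and true values."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def one_iteration(X, y, rng):
    """One random 80/20 split: fit on 80%, return RMSE on the 20% test set."""
    m = len(y)
    idx = rng.permutation(m)
    cut = int(0.8 * m)
    tr, te = idx[:cut], idx[cut:]
    Xb = np.column_stack([np.ones(m), X])          # add intercept column
    theta, *_ = np.linalg.lstsq(Xb[tr], y[tr], rcond=None)
    return rmse(Xb[te] @ theta, y[te])

rng = np.random.default_rng(0)
X_demo = np.arange(20.0).reshape(-1, 1)            # placeholder feature column
y_demo = 2 * np.arange(20.0) + 1                   # exactly linear labels
err = one_iteration(X_demo, y_demo, rng)           # near zero on this toy data
```

In the thesis protocol this iteration is repeated fifty times, and the resulting RMSE distributions for different feature groups are compared.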
Fifty iterations of this prediction process are performed and the RMSE
measured from different feature groups such as Isotropic Minkowski Functionals
(IMF) and Anisotropic Minkowski Functionals (AMF) using different regression
methods (Multi-Regression or Support Vector Regression with linear kernel) is
compared to the RMSE with the standard approach, which uses BMD mean with
a Multi-Regression model. A Wilcoxon signed-rank test was used to compare two
RMSE distributions corresponding to the prediction performance of different
features. Significance thresholds were adjusted for multiple comparisons using
the Holm-Bonferroni correction to achieve an overall type I error rate
(significance level) less than α (where α = 0.05) [47, 48].
Chapter 4
Bone Strength Prediction: Performance Results
The prediction performance of the different texture analysis and machine
learning techniques discussed previously is compared with the current clinical
standard.
4.1 Identification of Femur Region for Analysis
The fundamental BMD statistics distributions of the dataset were
examined (Table 4.1) to investigate the correlation between BMD measurements
from different regions and FL. In addition, the FL was estimated using multi-
regression analysis (Fig 4.1) to identify the ideal candidate for further analysis.
Max Min Mean SD r with FL
Age (years) 100 52 79.39 10.57 -
Failure Load (kN) 8.156 0.664 3.943 1.557 -
BMD.mean Head (mg/cm3) 406.91 57.33 218.96 64.73 0.706
BMD.mean Neck (mg/cm3) 225 -46.22 44.98 53.38 0.467
BMD.mean Troch (mg/cm3) 226 -35.52 70.79 52.94 0.596
Table 4.1: Values of investigated parameters for femur specimens. Representative statistical values of Age, Failure Load (FL) and the mean BMD of three femur specimen regions are listed, along with the correlation between the mean BMD of each region and FL. The BMD mean of the head region has the highest correlation. Adapted from [44]
Figure 4.1: Scatterplots show relationships between failure load (FL) and BMD. Coefficients (r) for correlations with failure load were as follows: (a) 0.706 for correlation with quantitative CT (QCT) BMD in femur head, (b) 0.467 for correlation with quantitative CT BMD in femur neck, and (c) 0.596 for correlation with quantitative CT BMD in trochanter (troch.). All correlations were significant (p < 0.001). Each solid line represents the fit to a linear regression model. Adapted from [43].
Since the head region BMD yields the highest correlation (r = 0.706), it was
selected for further texture feature analysis and prediction performance tests. So
in our future analysis, we will only be using the head region.
4.2 Conventional Statistical Features
The DXA BMD value was extracted from the trochanter, neck, ward and
shaft regions of the DXA image of each proximal femur specimen, and the mean
of these 4 BMD values corresponding to these 4 regions, denoted as total DXA
BMD or simply DXA BMD, was used as a feature to construct mathematical
models for prediction of biomechanical strength (failure load).
Using DXA BMD and Multi-Regression, the prediction performance obtained
was RMSE = 0.960 ± 0.131.
Using DXA BMD and Support Vector Regression (SVR) with linear kernel, the
prediction performance obtained was RMSE = 0.959 ± 0.132.
Thus, we can conclude that for the DXA BMD feature, both Multi-Regression and
Support Vector Regression (SVR) give equally good prediction performance.
4.3 Isotropic Minkowski Functionals
In order to obtain the Isotropic Minkowski Functionals (IMFs), we first
need to threshold the BMD image to convert it into a black and white image. In
our case, we have empirically chosen the threshold BMD value to be 400.
Then we need to identify optimal values for free parameters (in this case
the kernel size) in order to obtain the best prediction performance. We have
used a number of different kernel sizes (ranging from 5x5x5 to 19x19x19 in
increments of 2) and fixed histogram bin size of 10 in order to get the IMF
features. We have then evaluated the prediction performance for each kernel
size and each IMF feature i.e. Volume, Surface, Mean Breadth and Euler
Characteristic.
The following table lists the prediction performance obtained using the
Isotropic Minkowski Functionals (IMF) and DXA BMD:
Feature Groups Multi-Regression
(RMSE) SVR
(RMSE)
DXA BMD 0.960 ± 0.131 0.959 ± 0.132
IMF.volume 1.612 ± 0.163 1.585 ± 0.167
DXA BMD + IMF.volume 0.999 ± 0.113 0.992 ± 0.140
IMF.surface 1.701 ± 0.249 1.631 ± 0.200
DXA BMD + IMF.surface 1.003 ± 0.122 0.995 ± 0.146
IMF.mean_breadth 1.695 ± 0.226 1.625 ± 0.190
DXA BMD + IMF.mean_breadth 1.017 ± 0.125 0.985 ± 0.132
IMF.euler 1.669 ± 0.208 1.600 ± 0.183
DXA BMD + IMF.euler 1.026 ± 0.133 0.981 ± 0.134
Table 4.2: Table showing the prediction performance (RMSE) of Feature Groups DXA BMD and Isotropic Minkowski Functionals used in conjunction with Multi-Regression and Support Vector Regression with linear kernel.
From the above table, we can see that the Isotropic Minkowski Functionals by
themselves are not very good. The best prediction performance for Isotropic
Minkowski Functionals alone is given by IMF.volume with RMSE = 1.585 ± 0.167,
which is still significantly worse (p < 0.05) than the standard approach of using
DXA BMD and multi-regression (RMSE = 0.960 ± 0.131).
4.4 Anisotropic Minkowski Functionals
In order to extract the Anisotropic Minkowski Functionals, we have to first
threshold the BMD image to obtain a black and white image, just as with the
Isotropic Minkowski Functionals. As before, we have empirically chosen our
threshold BMD value to be 400.
Then we have to optimize our free parameters (in this case it is the kernel
size and the ratio between the standard deviation of the Gaussian kernel in the
principal direction and its two orthogonal directions) to obtain the best
prediction performance. For this reason, we have chosen a number of different
kernel sizes ranging from 5x5x5 to 19x19x19 in increments of 2, and the ratio
between the standard deviations has been chosen as 2, 4 and 8.
The following table lists the prediction performance obtained using the
Anisotropic Minkowski Functionals (AMF) and DXA BMD :
Feature Groups Multi-Regression
(RMSE) SVR
(RMSE)
DXA BMD 0.960 ± 0.131 0.959 ± 0.132
AMF.volume 1.060 ± 0.126 1.007 ± 0.105
DXA BMD + AMF.volume 0.909 ± 0.111 0.880 ± 0.112
AMF.surface 1.051 ± 0.130 1.018 ± 0.120
DXA BMD + AMF.surface 0.921 ± 0.116 0.894 ± 0.115
AMF.mean_breadth 1.056 ± 0.154 0.998 ± 0.115
DXA BMD + AMF.mean_breadth 0.995 ± 0.128 0.904 ± 0.101
AMF.euler 0.966 ± 0.128 0.904 ± 0.105
DXA BMD + AMF.euler 0.898 ± 0.116 0.838 ± 0.092
Table 4.3: Table showing the prediction performance (RMSE) of Feature Groups DXA BMD and Anisotropic Minkowski Functionals used in conjunction with Multi-Regression and Support Vector Regression with linear kernel.
The following 8 figures will show a comparison of the prediction
performance (measured with RMSE) using features such as DXA BMD,
Anisotropic Minkowski Functionals and Isotropic Minkowski Functionals.
[Note: In the following figures, the RMSE distribution obtained using Multi-
Regression and Support Vector Regression with linear kernel is shown in red and
blue colors respectively. For each RMSE distribution, the central mark
corresponds to the median of the distribution and the top and bottom edges
correspond to the 75th and 25th percentile respectively. The red horizontal line
corresponds to the performance achieved with the standard approach (mean
BMD with multi-regression). The blue line corresponds to the best performance
achieved for the feature groups used in each figure.
Figure 4.2: Prediction performance (RMSE) of feature groups DXA BMD, AMF volume and IMF volume.
Figure 4.3: Prediction performance (RMSE) of feature groups DXA BMD, DXA BMD + AMF volume and DXA BMD + IMF volume.
Figure 4.4: Prediction performance (RMSE) of feature groups DXA BMD, AMF surface and IMF surface.
Figure 4.5: Prediction performance (RMSE) of feature groups DXA BMD, DXA BMD + AMF surface and DXA BMD + IMF surface.
Figure 4.6: Prediction performance (RMSE) of feature groups DXA BMD, AMF mean breadth and IMF mean breadth.
Figure 4.7: Prediction performance (RMSE) of feature groups DXA BMD, DXA BMD + AMF mean breadth and DXA BMD + IMF mean breadth.
Figure 4.8: Prediction performance (RMSE) of feature groups DXA BMD, AMF Euler characteristic and IMF Euler characteristic.
Figure 4.9: Prediction performance (RMSE) of feature groups DXA BMD, DXA BMD + AMF Euler characteristic and DXA BMD + IMF Euler characteristic.
4.5 Prediction Performance Comparison and Conclusion
Across the eight figures above, the general trend is that combining DXA BMD
with any feature, whether IMF or AMF, always yields better prediction
performance than that feature alone. In addition, Support Vector Regression
generally gives better prediction performance (i.e., lower RMSE) than
Multi-Regression.
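This trend can be reproduced in miniature with scikit-learn, which is assumed here purely for illustration (the thesis implementation is not restated); the data below are synthetic stand-ins for DXA BMD and AMF features, and LinearRegression stands in for Multi-Regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stands in for Multi-Regression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-ins: one BMD-like column plus four AMF-like texture
# columns, with failure load as a noisy linear combination of both.
X_bmd = rng.normal(size=(140, 1))
X_amf = rng.normal(size=(140, 4))
y = (2.0 * X_bmd[:, 0] + X_amf @ np.array([0.5, -0.3, 0.2, 0.1])
     + rng.normal(0.0, 0.3, size=140))

def cv_rmse(model, X, y):
    # scikit-learn reports negated MSE; convert to mean RMSE over folds
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores).mean())

X_both = np.hstack([X_bmd, X_amf])
rmse_mr = cv_rmse(LinearRegression(), X_both, y)
rmse_svr = cv_rmse(SVR(kernel="linear"), X_both, y)
rmse_bmd_alone = cv_rmse(LinearRegression(), X_bmd, y)
```

On this synthetic data, the combined feature set outperforms the BMD-like column alone, mirroring the trend seen in the tables.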
The best prediction performance using Anisotropic Minkowski Functionals
(AMF) alone was obtained for a combination of the FA and Phi feature vectors
corresponding to the Euler Characteristic (kernel size = 17, xyratio = 4). The
best prediction performance using Isotropic Minkowski Functionals (IMF)
alone was obtained for the Volume feature vector with kernel size 5.
The summary of the prediction performance obtained using the different
feature groups is shown in the following table:
Feature Groups                                  Multi-Regression (RMSE)   SVR (RMSE)
DXA BMD                                         0.960 ± 0.131             0.959 ± 0.132
Best IMF feature alone                          1.612 ± 0.163             1.585 ± 0.167
Best AMF feature alone                          0.966 ± 0.116             0.904 ± 0.105
Best combination of DXA BMD and IMF feature     1.026 ± 0.133             0.981 ± 0.134
Best combination of DXA BMD and AMF feature     0.898 ± 0.116             0.838 ± 0.092

Table 4.4: Prediction performance (RMSE) of DXA BMD, the best IMF and AMF features alone, and the best combinations of DXA BMD with IMF and AMF features, used in conjunction with Multi-Regression and Support Vector Regression with a linear kernel.
The final conclusions drawn from all the results obtained are as follows:
• The overall best prediction performance was obtained using the best
combination of DXA BMD and an AMF feature, which was significantly
better than DXA BMD alone (p < 10⁻⁴).
• The prediction performance obtained using the best AMF feature alone
was significantly better than that of the best IMF feature alone (p < 10⁻⁴).
• The prediction performance obtained using the best AMF feature alone
was significantly better than that of DXA BMD alone (p < 0.05).
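The significance test that produced these p-values is not restated here; as one hedged illustration, a Wilcoxon signed-rank test on paired RMSE samples (e.g., from repeated cross-validation splits) could be computed as follows, where the data are synthetic and only loosely echo Table 4.4.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
# Hypothetical paired RMSE samples from repeated cross-validation splits;
# the means and spreads are illustrative, not the thesis data.
rmse_combined = rng.normal(0.84, 0.09, size=50)  # DXA BMD + AMF
rmse_bmd_only = rng.normal(0.96, 0.13, size=50)  # DXA BMD alone

# Paired, non-parametric comparison of the two RMSE distributions
stat, p_value = wilcoxon(rmse_combined, rmse_bmd_only)
```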
Chapter 5
Discussion
The correlation of QCT-derived mean BMD, a descriptor of bone mineral
content in the trabecular bone of proximal femur specimens, to bone strength or
failure load (FL) has been established [20,21]. However, one drawback with this
measure is its inability to adequately characterize the micro-architecture of the
femoral trabecular compartment. Previous studies have shown that
supplementing DXA BMD measurements (current clinical standard) with features
that characterize the trabecular bone texture variation and micro-architecture
can improve the corresponding correlation to bone strength on high-resolution
MRI [28,29,37,38] and multi-detector CT [20-22,39]. We specifically investigate
the ability of such features to predict the failure load of femur specimens
through a computer-aided diagnosis approach involving regression analysis. Our
results suggest that the inclusion of texture features derived from Anisotropic
Minkowski Functionals in addition to DXA BMD significantly improves the
accuracy of FL prediction in such proximal femur specimens. This suggests that
such descriptors of trabecular bone quality and trabecular texture variation have
significant potential to aid clinicians in predicting femoral fracture risk in patients
suffering from osteoporosis.
As seen in Figure 4.1, the correlation between measured FL and QCT-
derived mean BMD for head, neck and trochanter regions was the highest for the
head region. Since these findings suggest that regional characterization of
femoral trabeculae could play a significant role in FL prediction, subsequent
feature extraction and regression were focused only on the head region of
the femur.
I have used Isotropic Minkowski Functionals (IMF) and Anisotropic
Minkowski Functionals (AMF) as texture features for characterizing trabecular
bone micro-architecture in the femoral head region. Both IMF and AMF can
characterize the local structure within a region or volume of interest, but
AMF additionally captures the anisotropy of local patterns, information about
the local structure content that IMF does not provide.
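To make the volume and surface functionals concrete, a simple voxel-counting approximation on a binarized 3D volume can be sketched as follows; this is an illustrative simplification, not the implementation used in this work, and the function names are hypothetical.

```python
import numpy as np

def minkowski_volume(mask):
    """First Minkowski functional: count of foreground voxels."""
    return int(mask.sum())

def minkowski_surface(mask):
    """Second Minkowski functional (up to a constant): exposed voxel faces."""
    faces = 0
    m = mask.astype(int)
    for axis in range(3):
        faces += np.abs(np.diff(m, axis=axis)).sum()  # internal foreground/background boundaries
        faces += np.take(m, 0, axis=axis).sum()       # faces on the volume border
        faces += np.take(m, -1, axis=axis).sum()
    return int(faces)

mask = np.zeros((4, 4, 4), dtype=bool)
mask[1:3, 1:3, 1:3] = True  # a 2x2x2 cube: volume 8, surface 24 faces
```

An anisotropic variant would apply directionally weighted smoothing (as in the kernel sweep of Section 4) before binarization, so the same measurements become sensitive to the orientation of the trabecular pattern.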
Our results show that the prediction performance obtained using the best
AMF feature alone (RMSE = 0.904 ± 0.105) was significantly better (p < 10⁻⁴)
than that obtained using the best IMF feature alone (RMSE = 1.585 ± 0.167).
This is because the anisotropy, i.e. the directional patterns in bone structure
(captured by AMF but not by IMF), is highly correlated with bone strength (FL).
The overall best prediction performance was obtained using DXA BMD +
AMF (RMSE = 0.838 ± 0.092), which was significantly better (p < 10⁻⁴) than using
DXA BMD alone (RMSE = 0.960 ± 0.131), and also significantly better (p < 10⁻⁴)
than using AMF alone (RMSE = 0.904 ± 0.105). DXA BMD captures bone mineral
density and bone mineral content information not only from the trabecular
bone but also from the cortical bone, and thus provides insight into overall
bone stability. Anisotropic Minkowski Functionals (AMF), on the other hand,
characterize the structural content (not the bone mineral content) of the
trabecular bone. AMF and DXA BMD therefore capture complementary
information about the femoral bone: combined into a single feature vector,
they provide more information than either feature group individually. This
explains why the combination of DXA BMD and AMF yields the overall best
prediction performance.
References
[1] Maryellen L. Giger, Heang-Ping Chan, John Boone, "Anniversary Paper: History and status of CAD and quantitative image analysis: The role of Medical Physics and AAPM." Med. Phys. 35(12), December 2008.
[2] Maryellen Giger, Heber MacMahon, "Image processing and computer-aided diagnosis." Radiologic Clinics of North America 34 (1996): 565-596.
[3] Maryellen L. Giger, Kunio Doi, Heber MacMahon et al., "An Intelligent Workstation for Computer-aided Diagnosis." RSNA (1993); 13:647-656.
[4] Kunio Doi, Heber MacMahon, Shigehiko Katsuragawa et al., "Computer-aided diagnosis in radiology: potential and pitfalls." European Journal of Radiology 31 (1997): 97-109.
[5] K. Doi, "Current status and future potential of computer-aided diagnosis in medical imaging." The British Journal of Radiology 78 (2005): S3-S19.
[6] Kunio Doi, "Computer-aided diagnosis in medical imaging: Historical review, current status and future potential." Computerized Medical Imaging and Graphics 31 (2007): 198-211.
[7] R. A. Lerski, K. Straughan, L. R. Schad, D. Boyce, et al., "MR image texture analysis – an approach to tissue characterization." Magnetic Resonance Imaging 11 (1993): 873-887.
[8] Georgia D. Tourassi, "Journey toward Computer-aided Diagnosis: Role of Image Texture Analysis." Radiology 213 (1999): 317-320.
[9] K. Michielsen, H. De Raedt, "Integral-geometry morphological image analysis." Physics Reports 347 (2001): 461-538.
[10] Axel Wismüller, Titas De, Eva Lochmüller, Felix Eckstein, and Mahesh B. Nagarajan, "Introducing anisotropic Minkowski functionals and quantitative anisotropy measures for local structure analysis in biomedical imaging." Proc. SPIE 8672, Medical Imaging 2013: Biomedical Applications in Molecular, Structural, and Functional Imaging, 86720I, 2013.
[11] F. Jamitzky, W. Stark, W. Bunk, S. Thalhammer, C. Raeth, T. Aschenbrenner, G. Morfill, and W. Heckl, "Scaling-index method as an image processing tool in scanning-probe microscopy." Ultramicroscopy 86 (2000): 241-246.
[12] C. Raeth, W. Bunk, M. B. Huber, G. E. Morfill, J. Retzlaff, and P. Schuecker, "Analysing large-scale structure – I. Weighted scaling indices and constrained randomization." Monthly Notices Roy. Astron. Soc. 337 (2002): 413-421.
[13] M. B. Huber, S. L. Lancianese, I. Ikpot, M. B. Nagarajan, A. L. Lerner, and A. Wismüller, "Prediction of biomechanical trabecular bone properties with geometric features using MR imaging." Proc. SPIE, R. C. Molthen and J. B. Weaver, Eds., vol. 7626, no. 1, pp. 762610-1-762610-8, 2010.
[14] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York, NY: Wiley-Interscience, 2000.
[15] C. Cortes and V. Vapnik, "Support-vector networks." Machine Learning 20(3) (1995): 273-297.
[16] B. Schoelkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, "New support vector algorithms." Neural Computation 12(5) (2000): 1207-1245.
[17] H. F. Boehm, C. Raeth, R. A. Monetti, D. Mueller, D. Newitt, S. Majumdar, E. Rummeny, G. Morfill, and T. M. Link, "Local 3D scaling properties for the analysis of trabecular bone extracted from high-resolution magnetic resonance imaging of trabecular bone: Comparison with bone mineral density in the prediction of biomechanical strength in vitro." Invest. Radiol. 38(5) (2003): 269-280.
[18] A. M. Tromp, M. E. Ooms, C. Popp-Snijders, J. C. Roos, P. Lips, "Predictors of fractures in elderly women." Osteoporosis Int 11 (2000): 134-140.
[19] Peter Pietschmann, Martina Rauner, Wolfgang Sipos, Katharina Kerschan-Schindl, "Osteoporosis: An Age-Related and Gender-Specific Disease – A Mini-Review." Gerontology 55 (2009): 3-12.
[20] C. Cooper, G. Campion and L. J. Melton III, "Hip Fractures in the Elderly: A World-Wide Projection." Osteoporosis Int 2 (1992): 285-289.
[21] O. Johnell, J. A. Kanis, A. Oden, et al., "Predictive value of BMD for hip and other fractures." J Bone Miner Res 20 (2005): 1185-1194.
[22] B. Abrahamsen, P. Vestergaard, B. Rud, et al., "Ten-year absolute risk of osteoporotic fractures according to BMD T score at menopause: the Danish Osteoporosis Prevention Study." J Bone Miner Res 21 (2006): 796-800.
[23] D. M. Black, S. L. Greenspan, K. E. Ensrud, et al., "The effects of parathyroid hormone and alendronate alone or in combination in postmenopausal osteoporosis." N Engl J Med 349 (2003): 1207-1215.
[24] D. M. Black, M. Steinbuch, L. Palermo, et al., "An assessment tool for predicting fracture risk in postmenopausal women." Osteoporosis Int 12 (2001): 519-528.
[25] J. A. Kanis, F. Borgstrom, C. De Laet, et al., "Assessment of fracture risk." Osteoporosis Int 16 (2005): 581-589.
[26] R. Mazess, B. Collick, J. Trempe, H. Barden, J. Hanson, "Performance evaluation of a dual-energy x-ray bone densitometer." Calcif Tissue Int 44 (1989): 228-232.
[27] H. F. Boehm, F. Eckstein, C. Wunderer, et al., "Improved performance of hip DXA using a novel region of interest in the upper part of the femoral neck: in vitro study using bone strength as a standard of reference." J Clin Densitom 8 (2005): 488-494.
[28] S. R. Cummings, M. C. Nevitt, W. S. Browner, et al., "Risk factors for hip fracture in white women: Study of Osteoporotic Fractures Research Group." N Engl J Med 332 (1995): 767-773.
[29] B. C. Taylor, P. J. Schreiner, K. L. Stone, et al., "Long-term prediction of incident hip fracture risk in elderly white women: study of osteoporotic fractures." J Am Geriatr Soc 52 (2004): 1479-1486.
[30] B. C. Taylor, P. J. Schreiner, K. L. Stone, et al., "Long-term prediction of incident hip fracture risk in elderly white women: study of osteoporotic fractures." J Am Geriatr Soc 52 (2004): 1479-1486.
[31