Evaluation of Image Segmentation · 2010-08-17

Computational Radiology Laboratory Harvard Medical School www.crl.med.harvard.edu

Children’s Hospital Department of Radiology Boston Massachusetts

Evaluation of Image Segmentation

Simon K. Warfield, Ph.D. Associate Professor of Radiology Harvard Medical School


Segmentation
•  Segmentation
–  Identification of structure in images.
–  Many different algorithms and a wide range of principles upon which they are based.
•  Segmentation is used for:
–  Quantitative image analysis
–  Image guided therapy
–  Visualization
•  Evaluation: how do we know when we have a good segmentation?


Validation of Image Segmentation
•  Spectrum of accuracy versus realism in reference standard.
•  Digital phantoms.
–  Ground truth known accurately.
–  Not so realistic.
•  Acquisitions and careful segmentation.
–  Some uncertainty in ground truth.
–  More realistic.
•  Autopsy/histopathology.
–  Addresses pathology directly; resolution.
•  Clinical data?
–  Hard to know ground truth.
–  Most realistic model.


Validation of Image Segmentation
•  Comparison to digital and physical phantoms:
–  Excellent for testing the anatomy, noise and artifact which is modeled.
–  Typically lacks range of normal or pathological variability encountered in practice.

[Figure: MRI of brain phantom from Styner et al., IEEE TMI 2000]


Comparison To Higher Resolution

[Figure panels: MRI, photograph, MRI. Provided by Peter Ratiu and Florin Talos.]


Comparison To Higher Resolution

[Figure panels: photograph, MRI, photograph, microscopy. Provided by Peter Ratiu and Florin Talos.]


Comparison to Autopsy Data
•  Neonate gyrification index
–  Ratio of length of cortical boundary to length of smooth contour enclosing brain surface
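The ratio above can be illustrated with a small sketch. It assumes both contours are available as closed 2D polygons on a single slice; the helper names `closed_contour_length` and `gyrification_index` and the toy coordinates are hypothetical and are not taken from the study.

```python
import math

def closed_contour_length(points):
    """Perimeter of a closed polygon given as a list of (x, y) vertices."""
    n = len(points)
    return sum(math.dist(points[k], points[(k + 1) % n]) for k in range(n))

def gyrification_index(cortical_contour, enclosing_contour):
    """Length of the cortical boundary divided by the length of the
    smooth contour enclosing the brain surface."""
    return (closed_contour_length(cortical_contour)
            / closed_contour_length(enclosing_contour))

# Toy example (assumed data): a square outline with one sulcus-like notch.
enclosing = [(0, 0), (4, 0), (4, 4), (0, 4)]
cortical = [(0, 0), (4, 0), (4, 4), (2.5, 4), (2.5, 2),
            (1.5, 2), (1.5, 4), (0, 4)]
gi = gyrification_index(cortical, enclosing)
print(gi)  # 1.25: the notch adds 4 units of boundary to a perimeter of 16
```

A value of 1 corresponds to a completely smooth surface; deeper or more numerous folds increase the index.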


Staging

[Figure panels: Stage 3, Stage 4, Stage 5, Stage 6]

Stage 3: at 28 weeks gestational age (GA), shallow indentations of the inferior frontal and superior temporal gyrus (1 infant at 30.6 weeks GA; normal range: 28.6 ± 0.5 weeks GA).

Stage 4: at 30 weeks GA, two indentations divide the frontal lobe into three areas; superior temporal gyrus clearly detectable (3 infants, 30.6 ± 0.4 weeks GA; normal range: 29.9 ± 0.3 weeks GA).

Stage 5: at 32 weeks GA, frontal lobe clearly divided into three parts: superior, middle and inferior frontal gyrus (4 infants, 32.1 ± 0.7 weeks GA; normal range: 31.6 ± 0.6 weeks GA).

Stage 6: at 34 weeks GA, temporal lobe clearly divided into three parts: superior, middle and inferior temporal gyrus (8 infants, 33.5 ± 0.5 weeks GA; normal range: 33.8 ± 0.7 weeks GA).

Abe S. et al., “Assessment of cortical gyrus and sulcus formation using MR images in normal fetuses”, Prenatal Diagn 2003.


Neonate GI: MRI Vs Autopsy


GI Increase Is Proportional to Change in Age.


GI Versus Qualitative Staging


Neonate Gyrification


Validation of Image Segmentation
•  STAPLE (Simultaneous Truth and Performance Level Estimation):
–  An algorithm for estimating performance and ground truth from a collection of independent segmentations.
–  Warfield, Zou, Wells, IEEE TMI 2004.
–  Warfield, Zou, Wells, PTRSA 2008.
–  Commowick and Warfield, IEEE TMI 2010.


Validation of Image Segmentation
•  Comparison to expert performance; to other algorithms.
•  Why compare to experts?
–  Experts are currently doing the segmentation tasks that we seek algorithms for:
•  Surgical planning.
•  Neuroscience research.
•  Response to therapy assessment.
•  What is the appropriate measure for such comparisons?


Measures of Expert Performance
•  Repeated measures of volume
–  Intra-class correlation coefficient
•  Spatial overlap
–  Jaccard: area of intersection over union.
–  Dice: increased weight of intersection.
–  Vote counting: majority rule, etc.
•  Boundary measures
–  Hausdorff, 95% Hausdorff.
•  Bland-Altman methodology:
–  Requires a reference standard.
•  Measures of correct classification rate:
–  Sensitivity, specificity (Pr(D=1|T=1), Pr(D=0|T=0))
–  Positive predictive value and negative predictive value (posterior probabilities Pr(T=1|D=1), Pr(T=0|D=0))
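The overlap and classification-rate measures listed above can all be computed from the four confusion counts. The following is a minimal sketch, assuming binary label sequences of equal length and a known reference labeling; the function names and the toy data are hypothetical.

```python
def confusion_counts(decision, truth):
    """Return (TP, FP, FN, TN) for two equal-length binary label sequences."""
    pairs = list(zip(decision, truth))
    tp = sum(1 for d, t in pairs if d == 1 and t == 1)
    fp = sum(1 for d, t in pairs if d == 1 and t == 0)
    fn = sum(1 for d, t in pairs if d == 0 and t == 1)
    tn = sum(1 for d, t in pairs if d == 0 and t == 0)
    return tp, fp, fn, tn

def segmentation_measures(decision, truth):
    """Overlap and classification-rate measures for a binary segmentation.

    Assumes both classes occur in decision and truth, so that no
    denominator is zero.
    """
    tp, fp, fn, tn = confusion_counts(decision, truth)
    return {
        "jaccard": tp / (tp + fp + fn),       # intersection over union
        "dice": 2 * tp / (2 * tp + fp + fn),  # intersection counted twice
        "sensitivity": tp / (tp + fn),        # Pr(D=1 | T=1)
        "specificity": tn / (tn + fp),        # Pr(D=0 | T=0)
        "ppv": tp / (tp + fp),                # Pr(T=1 | D=1)
        "npv": tn / (tn + fn),                # Pr(T=0 | D=0)
    }

# Toy example (assumed data): one missed voxel and one false positive.
truth    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
decision = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = segmentation_measures(decision, truth)
print(m["jaccard"], m["dice"])  # 0.6 0.75
```

Dice and Jaccard are monotonically related (Dice = 2J / (1 + J)), so they rank segmentations identically; Dice simply weights the intersection more heavily.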


Measures of Expert Performance •  Our new approach:

• Simultaneous estimation of hidden “ground truth” and expert performance.

• Enables comparison between and to experts.

• Can be easily applied to clinical data exhibiting range of normal and pathological variability.


How to judge segmentations of the peripheral zone?

1.5T MR of prostate Peripheral zone and segmentations


Estimation Problem
•  Complete data density: $f(D, T \mid p, q)$
•  Binary ground truth $T_i$ for each voxel $i$.
•  Expert $j$ makes segmentation decisions $D_{ij}$.
•  Expert performance characterized by sensitivity $p$ and specificity $q$.
–  We observe expert decisions $D$. If we knew ground truth $T$, we could construct maximum likelihood estimates for each expert’s sensitivity (true positive fraction) and specificity (true negative fraction).
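For reference, with known binary ground truth the maximum likelihood estimates reduce to simple counting. The expressions below are a sketch that assumes $T_i, D_{ij} \in \{0, 1\}$, with $p_j = \Pr(D_{ij}=1 \mid T_i=1)$ and $q_j = \Pr(D_{ij}=0 \mid T_i=0)$; the hat notation is introduced here for illustration.

```latex
\[
\hat{p}_j = \frac{\sum_i T_i \, D_{ij}}{\sum_i T_i},
\qquad
\hat{q}_j = \frac{\sum_i (1 - T_i)(1 - D_{ij})}{\sum_i (1 - T_i)}
\]
```

That is, the estimated sensitivity is the fraction of true foreground voxels that expert $j$ labeled foreground, and the estimated specificity is the fraction of true background voxels that expert $j$ labeled background.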


Expectation-Maximization
•  General procedure for estimation problems that would be simplified if some missing data were available.
•  Key requirements are specification of:
–  The complete data.
–  Conditional probability density of the hidden data given the observed data.
•  Observable data $D$
•  Hidden data $T$, with probability density $f(T \mid D, \hat{\theta})$
•  Complete data $(D, T)$


Expectation-Maximization
•  Solve the incomplete-data log likelihood maximization problem.
•  E-step: estimate the conditional expectation of the complete-data log likelihood function:
$Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$
•  M-step: estimate parameter values:
$\arg\max_{\theta} Q(\theta \mid \hat{\theta})$


Expectation-Maximization
•  Since we don’t know ground truth $T$, treat $T$ as a random variable, and solve for the expert performance parameters that maximize:
$Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$
•  Parameter values $\theta_j = [p_j \; q_j]^\top$ that maximize the conditional expectation of the log-likelihood function are found by iterating two steps:
–  E-step: Estimate probability of hidden ground truth $T$ given a previous estimate of the expert quality parameters, and take the expectation.
–  M-step: Estimate expert performance parameters by comparing $D$ to the current estimate of $T$.


STAPLE
•  Consider binary labels:
–  foreground.
–  background.

•  Spatial correlation of the unknown true segmentation can be modelled with a Markov Random Field.


To Solve for Expert Parameters:


True Segmentation Estimate


Expert Performance Estimate

Now we seek an expression for the conditional expectation of the complete-data log likelihood function that we can maximize.


Expert Performance Estimate

Now, consider each expert separately:

Differentiate this with respect to $p_j$, $q_j$ and solve for zero.


Expert Performance Estimate

p (sensitivity, true positive fraction) : ratio of expert identified class 1 to total class 1 in the image.

q (specificity, true negative fraction) : ratio of expert identified class 0 to total class 0 in the image.
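The E-step and M-step described in these slides can be sketched in plain Python for the binary case. This is a minimal sketch, not the authors' implementation: the function name `staple_binary`, the initial values p = q = 0.9, the fixed iteration count, and the default prior (the mean foreground rate across all decisions) are assumptions for illustration, and the MRF prior is omitted.

```python
def staple_binary(D, prior=None, p_init=0.9, q_init=0.9, iters=50):
    """Binary STAPLE-style EM sketch.

    D[i][j] is expert j's binary decision at voxel i.
    Returns (W, p, q): W[i] estimates Pr(T_i = 1 | D); p[j] and q[j] are
    the estimated sensitivity and specificity of expert j.
    """
    n_voxels, n_experts = len(D), len(D[0])
    if prior is None:
        # Assumption: global prior Pr(T_i = 1) = mean foreground rate.
        prior = sum(map(sum, D)) / (n_voxels * n_experts)
    p = [p_init] * n_experts
    q = [q_init] * n_experts
    W = [0.0] * n_voxels
    for _ in range(iters):
        # E-step: posterior probability that each voxel is foreground.
        for i, row in enumerate(D):
            a, b = prior, 1.0 - prior
            for j, d in enumerate(row):
                a *= p[j] if d == 1 else (1.0 - p[j])
                b *= (1.0 - q[j]) if d == 1 else q[j]
            # Assumes a + b > 0 (no voxel is impossible under the model).
            W[i] = a / (a + b)
        # M-step: weighted true positive / true negative fractions.
        sum_w = sum(W)
        sum_not_w = n_voxels - sum_w
        for j in range(n_experts):
            p[j] = sum(w for w, row in zip(W, D) if row[j] == 1) / sum_w
            q[j] = sum(1.0 - w for w, row in zip(W, D) if row[j] == 0) / sum_not_w
    return W, p, q

# Toy example (assumed data): experts 0 and 1 agree everywhere; expert 2
# misses one foreground voxel and adds one false positive.
D = ([[1, 1, 1]] * 3) + [[1, 1, 0]] + [[0, 0, 1]] + ([[0, 0, 0]] * 5)
W, p, q = staple_binary(D)
print([round(w) for w in W])          # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(p[2], 3), round(q[2], 3))  # 0.75 0.833
```

In this toy case the estimate follows the two agreeing experts, and expert 2 is scored at 3 of 4 foreground voxels and 5 of 6 background voxels relative to that estimate.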


Extension to Several Tissue Labels

•  Complete data density: $f(D, T \mid \theta)$
•  True segmentation $T_i$ for each voxel $i$
–  May be binary
–  May be categorical
•  Expert $j$ makes segmentation decisions $D_{ij}$
•  Expert performance $\theta_{s's}$ characterizes the probability of deciding label $s'$ when the true label is $s$.


Probability Estimate of True Labels


Expert Performance Estimate

Now, consider each expert separately:

Note constraint on sum of parameters. Solve for maximum.


Parameter Estimation

Noting that

We can formulate the constrained optimization problem:


Parameter Estimation

Therefore

And noting that

We find that
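The equations on these slides did not survive extraction. As a hedged reconstruction of the multi-label update they appear to derive, the following uses notation assumed here: $W_{si}$ is the posterior probability that voxel $i$ has true label $s$, and an explicit expert index $j$ is added to the performance parameters, so $\theta_{j s' s} = \Pr(D_{ij} = s' \mid T_i = s)$.

```latex
% E-step: posterior probability of each true label at voxel i
\[
W_{si} = \Pr(T_i = s \mid D_i, \hat{\theta})
       = \frac{f(T_i = s) \prod_j \hat{\theta}_{j D_{ij} s}}
              {\sum_{s''} f(T_i = s'') \prod_j \hat{\theta}_{j D_{ij} s''}}
\]

% M-step for expert j: constrained maximization
\[
\max_{\theta_j} \sum_i \sum_s W_{si} \ln \theta_{j D_{ij} s}
\quad \text{subject to} \quad
\sum_{s'} \theta_{j s' s} = 1 \ \text{for each } s
\]

% Solution via a Lagrange multiplier for each true label s
\[
\hat{\theta}_{j s' s} = \frac{\sum_{i : D_{ij} = s'} W_{si}}{\sum_i W_{si}}
\]
```

For two labels this reduces to the binary case: $\hat{\theta}_{j11}$ is the sensitivity $p_j$ and $\hat{\theta}_{j00}$ is the specificity $q_j$.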


Results: Synthetic Experts
•  Several experiments with known ground truth and known performance parameters.
•  Goal:
–  Determine if STAPLE accurately identifies known ground truth.
–  Determine if STAPLE accurately determines known expert performance parameters.
–  Understand sensitivity of STAPLE with respect to changes in prior hyper-parameters; requirements for number of observations to enable good estimation; convergence characteristics.


Synthetic Experts

10 observations of segmentation by expert with p=q=0.99

[Figure: four segmentations of ten shown; STAPLE ground truth.]

STAPLE p,q estimates:
mean p 0.990237, std. dev. p 0.000616
mean q 0.990121, std. dev. q 0.00071


Synthetic Experts

10 segmentations by experts with p=0.95, q=0.90

[Figure: four segmentations of ten shown; STAPLE ground truth.]

STAPLE p,q estimates:
mean p 0.950104, std. dev. p 0.001201
mean q 0.900035, std. dev. q 0.001685
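One way to generate synthetic experts of this kind is to flip each ground-truth label independently with the chosen error rates. This is a sketch under that assumption; the slides do not state how the synthetic segmentations were produced, and the function names, seed, and image size here are hypothetical.

```python
import random

def simulate_expert(truth, p, q, rng):
    """Draw binary decisions with sensitivity p and specificity q,
    independently at each voxel of a known ground truth."""
    decisions = []
    for t in truth:
        if t == 1:
            decisions.append(1 if rng.random() < p else 0)
        else:
            decisions.append(0 if rng.random() < q else 1)
    return decisions

def empirical_rates(decisions, truth):
    """Observed sensitivity and specificity against the known truth."""
    n_fg = sum(truth)
    n_bg = len(truth) - n_fg
    p_hat = sum(d for d, t in zip(decisions, truth) if t == 1) / n_fg
    q_hat = sum(1 - d for d, t in zip(decisions, truth) if t == 0) / n_bg
    return p_hat, q_hat

rng = random.Random(0)  # fixed seed so the run is repeatable
truth = [1] * 5000 + [0] * 5000
decisions = simulate_expert(truth, p=0.95, q=0.90, rng=rng)
p_hat, q_hat = empirical_rates(decisions, truth)
print(p_hat, q_hat)  # close to 0.95 and 0.90
```

Repeating the draw with different seeds gives a set of simulated segmentations whose true p and q are known, which can then be compared against the values an estimator recovers.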


Expert and Student Segmentations

[Figure panels: test image, expert consensus, Student 1, Student 2, Student 3]


Phantom Segmentation

[Figure panels: image, expert segmentation, student segmentations, voting, STAPLE]


Prostate Peripheral Zone

[Figure panels: frequency of selection by experts; STAPLE truth estimate]

Expert   1      2      3      4      5
p_j      .879   .991   .937   .918   .895
q_j      .998   .994   .999   .999   .999
Dice     .913   .951   .967   .955   .944


A Binary MRF Model for Spatial Homogeneity

Include a prior probability for the neighborhood configuration:


MAP Estimation With MRF Prior


Synthetic Experts

Only three segmentations by different quality experts.

[Figure: three expert segmentations; STAPLE ground truth with MRF prior.]

Expert   True p, q     STAPLE estimate p, q
1        0.95, 0.95    0.9505, 0.9494
2        0.95, 0.90    0.9511, 0.8987
3        0.90, 0.90    0.9000, 0.8987


Cryoablation of Kidney Tumor

Segmentations before training session with radiologist:

[Figure panels: rater frequency; STAPLE with MRF.]

After training session:

Based on the STAPLE performance assessment, we found the training session created a statistically significant increase in performance of the raters.


Newborn MRI Segmentation


Newborn MRI Segmentation

Summary of segmentation quality (posterior probability Pr(T=t|D=t) ) for each tissue type for repeated manual segmentations.


STAPLE Summary
•  Key advantages of STAPLE:
–  Estimates “true” segmentation.
–  Assesses expert performance.
•  Principled mechanism which enables:
–  Comparison of different experts.
–  Comparison of algorithm and experts.
•  Extensions for the future:
–  Can we learn image features that lead to different levels of expert performance?


Acknowledgements

Colleagues contributing to this work:
•  Neil Weisenfeld.
•  Andrea Mewes.
•  Petra Huppi.
•  Olivier Clatz.
•  William Wells.
•  Olivier Commowick.
•  Arne Hans.
•  Heidelise Als.
•  Lianne Woodward.
•  Frank Duffy.
•  Kelly Zou.

This study was supported by: Center for the Integration of Medicine and Innovative Technology; R01 RR021885, R01 GM074068 and R01 HD046855.