Breast Density Scoring with Multiscale Denoising Autoencoders

Kersten Petersen1, Konstantin Chernoff1, Mads Nielsen1,2, and Andrew Y. Ng3

1 Department of Computer Science, University of Copenhagen, Denmark
2 Biomediq A/S, Denmark

3 Department of Computer Science, Stanford University, United States

Abstract. Breast density scoring is an important and difficult problem that has motivated various handcrafted feature representations. In this paper, we propose to automatically generate rich feature sets from labeled or unlabeled data. Our method (MS-DAE) builds upon denoising autoencoders (DAE), an unsupervised feature learning method, and is tailored to perform contextual image segmentation. We use our method to score breast density on 85 mammograms from a placebo-controlled trial on hormone replacement therapy. Our automated scores correlate well with established manual scoring techniques (Cumulus and BI-RADS), while being better at separating out the treatment effect.

1 Introduction

Mammographic breast density reflects the amount of radiodense fibroglandular tissue in the breast and is associated with a strongly increased risk of developing breast cancer [19][13]. Nowadays, breast density is preferably scored using the semi-automated software Cumulus [5]. With Cumulus, the user crops the breast area, sets an intensity threshold to separate dense (white-appearing) from fatty (dark-appearing) tissue, and lets the software compute the ratio between the dense tissue area and the total breast area. Cumulus scoring has repeatedly been shown to be a strong and independent predictor of breast cancer, and is usually more accurate than categorical scores, e.g., the widespread Breast Imaging Reporting and Data System (BI-RADS) score [13].

However, user-assisted intensity thresholding is only of limited use for large epidemiological studies because it is subjective and time-consuming. To overcome these issues, several automated scoring techniques have been proposed, which can be roughly categorized into global thresholding techniques [18][2], local feature-based methods [16][6][15][12], and approaches that estimate the volumetric breast density [7][20]. The variety of approaches indicates that automated breast density scoring is an important, but technically challenging task. For example, one has to account for a low signal-to-noise ratio and variations with respect to scanner settings, radiographer techniques, or patient positioning.

Considerable effort has been spent on devising specialized systems with several processing stages and carefully handcrafted features. For instance, Kallenberg et al. [10] recently proposed an integrated approach with more than 50 hand-selected features that comprise location, intensity, texture, and global context information of the images. Their approach achieves state-of-the-art performance, but introduces a plethora of parameters that need to be controlled.

In contrast, recent work in machine learning has focused on learning features automatically from unlabeled data. Instead of handcrafting the features, these methods aim to learn rich hierarchical feature representations for encoding the signal of the data. This makes them especially attractive for problems in medical image analysis, where specialized features are required, but hard to design by hand. In computer vision, various feature learning methods have been proposed [8][21][3][11]. Most of them have been applied to recognition tasks (audio, objects, or handwritten digits), whereas only a few of them have been demonstrated on image segmentation problems [14]. In this paper, we present an unsupervised feature learning method that is suited for contextual image segmentation. The proposed method builds upon the denoising autoencoder (DAE) [21], and relies on multiple scales for improved segmentation accuracy.

The main contributions of this paper can be summarized as follows. (i) We propose a multiscale feature learning technique for scoring breast density from mammograms. Our data-driven technique requires little domain-specific knowledge and works both on labeled and unlabeled data. (ii) We propose a novel sparsifying activation function. (iii) We evaluate the clinical relevance of our automated technique by comparing it to manual BI-RADS and Cumulus-like density scoring. It turns out that the automated density scores are better at segregating the treatment groups than the manual scores, although the automated method was trained to mimic manual scoring.

The remainder of this paper is organized as follows. In Section 2, we review the basic idea of the denoising autoencoder and explain how it is adapted to contextual image segmentation. Section 3 summarizes our experiments on breast density scoring and Section 4 concludes the paper.

2 Method

Autoencoder An autoencoder is an unsupervised learning algorithm for finding an efficient feature representation f(x) of some D-dimensional input vector x ∈ R^D. Similar to a multilayer perceptron (MLP), an autoencoder is a hierarchical architecture with multiple neural layers. Unlike MLPs, however, the autoencoder aims at encoding the input in the hidden layers by reconstructing the input. In its simplest form, the autoencoder contains one hidden layer. The mapping from the input layer to the D′-dimensional hidden (feature) layer, f(x) = a(Wx + b), is called the encoder, and is parameterized by a weight matrix W ∈ R^{D′×D} and a bias b ∈ R^{D′}. The pointwise mapping a : R → R denotes a nonlinear activation function, typically a sigmoid or hyperbolic tangent.

The mapping from the hidden layer to the output layer, g(f(x)) = W^T f(x) + c, is called the decoder. It is also an affine mapping sent through an activation function, but it typically differs in using the identity as the activation function and a weight matrix equal to the transpose of the encoder's weight matrix. Tying the weights in this way helps to reduce the number of trainable parameters and to move the encoder's weights to a nonlinear regime without paying a high reconstruction cost [21]. The bias of the decoder is written as c ∈ R^D.
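
For illustration, the following is a minimal numpy sketch of a single-hidden-layer autoencoder with tied weights; the class name, the sigmoid choice, and the initialization are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """Single-hidden-layer autoencoder with tied weights: W encodes, W.T decodes."""

    def __init__(self, d_in, d_hidden, rng=None):
        rng = rng or np.random.default_rng(0)
        # Small random initialization; W maps R^D -> R^D'.
        self.W = rng.normal(scale=0.01, size=(d_hidden, d_in))
        self.b = np.zeros(d_hidden)   # encoder bias
        self.c = np.zeros(d_in)       # decoder bias

    def encode(self, x):
        # f(x) = a(Wx + b)
        return sigmoid(self.W @ x + self.b)

    def decode(self, h):
        # g(h) = W^T h + c  (identity activation, tied weights)
        return self.W.T @ h + self.c

    def reconstruct(self, x):
        return self.decode(self.encode(x))
```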

Given a set of training examples X = {x^(n)}_{n=1}^N, the parameters θ = {W, b, c} are trained by minimizing the objective function

J(θ; X) = Σ_{n=1}^N L(x^(n), g(f(x^(n)))) + λ ||W||_F     (1)

using backpropagation. The reconstruction error L(x^(n), g(f(x^(n)))) = ||x^(n) − g(f(x^(n)))||² is defined as the squared loss, whereas the regularizer ||W||_F is the Frobenius norm of the weights, controlled by a weight decay parameter λ. To avoid that the autoencoder learns the identity function, the hidden layer is defined to be smaller than the input layer (under-complete representation).
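
To make Eq. (1) concrete, here is a small sketch of the objective, reusing the hypothetical TiedAutoencoder from the previous sketch; the weight decay value is an arbitrary placeholder.

```python
def autoencoder_loss(model, X, weight_decay=1e-4):
    """Squared reconstruction error over a batch plus Frobenius-norm weight decay (cf. Eq. 1)."""
    recon_err = sum(np.sum((x - model.reconstruct(x)) ** 2) for x in X)
    return recon_err + weight_decay * np.linalg.norm(model.W, "fro")
```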

Denoising Autoencoder In recent years, autoencoder variants have been introduced that are able to learn meaningful over-complete representations, i.e., hidden layers that are larger than the input layer. One principled approach is the denoising autoencoder [21], which sends a corrupted version of each training example through the autoencoder in order to reconstruct the original version. The objective function of the denoising autoencoder is defined as

J_DAE(θ; X) = Σ_{n=1}^N E_{x̃^(n) ∼ q(x̃^(n) | x^(n))} [ L(x^(n), g(f(x̃^(n)))) ]     (2)

where x̃^(n) denotes a corrupted training example, generated from the corruption process q(x̃^(n) | x^(n)). Typical choices for corrupting the input are additive Gaussian noise, x̃^(n) = x^(n) + ε with ε ∼ N(0, σ²I), or binary masking noise, where a prespecified fraction of randomly chosen input units is set to 0.
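
A small sketch of the two corruption processes mentioned above, assuming the inputs are plain numpy vectors; the default noise level and masking fraction are illustrative, not the settings used in the experiments below (which use σ = 0.1 for the intensity part).

```python
import numpy as np

def corrupt_gaussian(x, sigma=0.1, rng=None):
    """Additive Gaussian noise: x_tilde = x + eps, with eps ~ N(0, sigma^2 I)."""
    rng = rng or np.random.default_rng()
    return x + rng.normal(scale=sigma, size=x.shape)

def corrupt_masking(x, fraction=0.25, rng=None):
    """Binary masking noise: set a prespecified fraction of input units to 0."""
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= fraction  # keep each unit with probability 1 - fraction
    return x * mask
```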

Stacked Autoencoder Autoencoders and their variants can also be stacked in a greedy layer-wise manner to learn hierarchies of increasingly abstract features. At each level, the learned parameters of the encoder {W, b} are kept, whereas the decoder is removed. The training data is propagated through the encoder and reconstructed by a new autoencoder. This procedure is repeated until the activations of the last hidden layer have been reconstructed.
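
A hedged sketch of this greedy layer-wise stacking: each layer's encoder output becomes the training data for the next (denoising) autoencoder. The train_dae callable stands in for whichever per-layer optimizer is used and is an assumption, not code from this paper.

```python
import numpy as np

def stack_autoencoders(X, layer_sizes, train_dae):
    """Greedy layer-wise pretraining: keep each trained encoder, discard its decoder.

    X           : (N, D) array of training examples
    layer_sizes : hidden layer widths, e.g. [1000, 1000]
    train_dae   : callable(data, d_hidden) -> fitted autoencoder exposing .encode()
    """
    encoders, data = [], X
    for d_hidden in layer_sizes:
        dae = train_dae(data, d_hidden)                   # fit one (denoising) autoencoder
        encoders.append(dae)                              # keep encoder parameters {W, b}
        data = np.array([dae.encode(x) for x in data])    # propagate data to the next layer
    return encoders
```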

Classifier The activations of the last hidden layer are fed to a supervised learning algorithm. Let us denote the labels of the training data as Y = {y^(n)}_{n=1}^N and assume that each training label y^(n) ∈ C_K is associated with a class label k = 1, . . . , K, written in the 1-of-K coding scheme. The learned feature representation of the last hidden layer, f = f_M ∘ f_{M−1} ∘ . . . ∘ f_1, is the composition of the learned encoders for the hidden layers m = 1, . . . , M. A popular choice for classifying the learned features is the softmax regression model [4],

h(x^(n)) = P(y_k = 1 | f(x^(n)); w) = exp(w_k^T f(x^(n))) / Σ_{k′=1}^K exp(w_{k′}^T f(x^(n))),

which learns the parameter vector w by minimizing the objective function

J(θ_all; X, Y) = Σ_{n=1}^N Σ_{k=1}^K L(h(x^(n)), y^(n)),     (3)

using the multiclass cross entropy L(h(x^(n)), y^(n)) = −Σ_{k=1}^K y_k log(h(x^(n))_k). Here, θ_all comprises the learned parameters of the encoders and the classifier.

The accuracy of the classifier can be improved by fine-tuning the learned weights of the entire architecture with a supervised signal. In this scenario, the pre-trained weights of the encoders serve as an initialization of an MLP that is trained using backpropagation.
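
For illustration, a compact sketch of softmax regression with multiclass cross entropy on the learned features; the weight matrix layout (one row per class) and the omitted bias term are simplifying assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def softmax_predict(Wc, feat):
    """P(y_k = 1 | f(x)) for all k, with one weight vector w_k per row of Wc."""
    return softmax(Wc @ feat)

def cross_entropy_loss(Wc, feats, labels_onehot):
    """Total multiclass cross-entropy over a batch of encoded examples (cf. Eq. 3)."""
    loss = 0.0
    for feat, y in zip(feats, labels_onehot):
        p = softmax_predict(Wc, feat)
        loss -= np.sum(y * np.log(p + 1e-12))   # small epsilon avoids log(0)
    return loss
```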

Multiscale Input Data Our goal is to build input data {(x^(n), y^(n))}_{n=1}^N that facilitates the segmentation of breast tissue. As it is important to capture long-range interactions, we propose to extract the input examples from a scale space representation of the images. Formally, a given image I is embedded into a Gaussian scale space I(s; σ) = I(s) ∗ G_σ, where parameter s denotes the position (or site) and σ determines the scale of a Gaussian. The examples are sampled at scales that match the resolution levels of a Gaussian pyramid with downsampling factor d = 2 and kernel G_1. Thus, if t = 0, 1, . . . , T indexes the considered scales, I(s; σ_t) is obtained by smoothing image I(s) with a Gaussian G_{σ_t}. The standard deviation of this Gaussian, σ_t = sqrt(Σ_{τ=0}^{t−1} d^{2τ}), is given as the square root of the summed Gaussian variances from the first t scale levels of the Gaussian pyramid. The scale space representation is applied both to the raw image intensities as well as to a locally normalized version [17] of the images. The advantage of combining these two versions is that the training examples represent both the intensity and the outline of the structures of interest.
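
As an illustration of this scale schedule, a small helper that computes σ_t as defined above and builds the smoothed images; it assumes a standard Gaussian-blur routine (scipy's gaussian_filter) and is not the paper's preprocessing code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def sigma_at_scale(t, d=2.0):
    """sigma_t = sqrt(sum_{tau=0}^{t-1} d^(2*tau)); sigma_0 = 0 means no smoothing."""
    return np.sqrt(sum(d ** (2 * tau) for tau in range(t)))

def scale_space(image, num_scales=3, d=2.0):
    """Smoothed copies of `image` whose blur matches the levels of a Gaussian pyramid."""
    return [gaussian_filter(image, sigma_at_scale(t, d)) if t > 0 else image.copy()
            for t in range(num_scales)]
```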

Once these two image versions have been created at multiple scales, we can define the input data {(x^(n), y^(n))}_{n=1}^N with respect to reference sites {s^(n)}_{n=1}^N. For training our method, we sample labeled examples from a set of training images. Note that in the unsupervised setting we only use the labels for classification, not for learning the features. For testing an image, we extract unlabeled examples in a sliding window approach.

Fig. 1: (a) original image, (b) Cumulus-like thresholded, automatic prediction (before and after thresholding) for (c-d) softmax on raw input, (e-f) softmax on pre-trained features from MS-DAE, and (g-h) fine-tuned features of (e-f).

Each example (x, y) at reference site s is defined by an appearance and a label neighborhood. The appearance neighborhood constitutes the appearance part x and is defined across multiple scales. The idea is to select, with increasing scale t, neighborhood sites at increasing strides around s. More specifically, at scale t = 0, ν(s) refers to the neighborhood sites around reference site s. For t > 0, a patch of image I (raw or locally normalized) at scale t is given by I(ν(s; t); σ_t), where ν(s; t) = {s + 2^t (r − s) : r ∈ ν(s)} selects the neighborhood sites at scale t. In contrast to common pyramid schemes (e.g., Gaussian pyramids), this definition allows us to efficiently capture larger context areas at coarser scale levels. It also ensures that the neighborhoods at each scale level are equally sized and centered at a common reference site s. The label neighborhood is only defined with respect to the original scale t = 0 and is potentially different in size from the appearance neighborhood. Inspired by previous work in image denoising [9], we specify a neighborhood of labels instead of a single label (e.g., as in recognition tasks) to model label correlations. During testing, we deal with overlapping label neighborhood sites by averaging the label probabilities.
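
A rough sketch of the multiscale appearance neighborhood ν(s; t) for a square patch: the stride between sampled sites doubles with each scale while the patch stays centered on s. The default half-width (giving 28 × 28 patches, as in the experiments below) and the clipping at image borders are assumptions for illustration.

```python
import numpy as np

def multiscale_patch(images_at_scales, site, half=14):
    """Stack patches around `site` from each scale, sampling with stride 2**t.

    images_at_scales : list of 2D arrays, one per scale t (e.g. from scale_space above)
    site             : (row, col) reference site at scale 0
    half             : half-width of the neighborhood; patch is (2*half) x (2*half)
    """
    r0, c0 = site
    patches = []
    for t, img in enumerate(images_at_scales):
        stride = 2 ** t
        rows = r0 + stride * np.arange(-half, half)     # nu(s; t) row sites
        cols = c0 + stride * np.arange(-half, half)     # nu(s; t) column sites
        rows = np.clip(rows, 0, img.shape[0] - 1)       # crude boundary handling
        cols = np.clip(cols, 0, img.shape[1] - 1)
        patches.append(img[np.ix_(rows, cols)])
    return np.concatenate([p.ravel() for p in patches])
```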

3 Experiments

Data We scored breast density on a subpopulation of a previously reported trial [22], which examines the efficacy and safety of oral estradiol continuously combined with drospirenone for the prevention of postmenopausal bone loss. The subpopulation consists of a placebo group with 43 women and a treatment group with 42 women. We evaluated our segmentation method on the right mediolateral mammograms at follow-up. The mammography was performed with a "Planmed Sophie" X-ray unit and digitized using a Vidar scanner at an image resolution of 200 µm per pixel and 12-bit gray scales. A trained radiologist masked out the breast from the background.

Settings We have evaluated multiple versions of our method to investigate the role of the learning architecture and the training protocol. For each model, we computed 1) the area under the ROC curve (AUC) for separating out the treatment effect, 2) Pearson's correlation coefficient between the automated and two manual scoring methods (BI-RADS and a Cumulus-like interactive thresholding technique), and 3) the Dice coefficient to measure the pixel-wise overlap with the Cumulus segmentation. We distinguish two training protocols for the models and refer to them as abbreviated in parentheses: 1) unsupervised learned features (pre-training) that are fixed and classified with a supervised classifier (U) and 2) fine-tuning of the pre-trained features using a softmax classifier in the output layer (S). For each training protocol, we also compute two breast density scores: 1) the predicted dense tissue labels thresholded at 0.5 (C) and 2) the optimal Cumulus-like threshold, estimated from the thresholded predictions (T).

Our multiscale feature learning method is trained according to each of the four protocols (UC, UT, SC, and ST) using input neighborhoods of size 28 × 28 at three consecutive scale levels, i.e., D = 3 × 28² = 1536 neurons. These training vectors are normalized for brightness by subtracting the mean, and PCA-whitened to suppress second-order correlations and to reduce the input dimensionality (99% of the total variation is retained). We perform a 5-fold cross-validation on the 85 given mammograms, where each fold is trained on 60,000 extracted training examples. The labels (dense vs. non-dense) associated with the training reference sites are balanced. Pre-training and fine-tuning of the parameters are performed using L-BFGS with mini-batches of size 2,000. We use a novel sparsifying activation function a(·), the rectified hyperbolic tangent a(z) = max(tanh(z), 0). The intensity part of the input is corrupted using Gaussian noise with standard deviation σ = 0.1.

                 Placebo         Treatment       p-value  AUC   R_B   R_CM  Dice_CM
BI-RADS (B)      2.02 ± 0.61     2.18 ± 0.59     0.241    0.56  1.00  0.86  N/A
Cumulus (CM)     19.65 ± 12.71   24.75 ± 12.19   0.052    0.62  0.86  1.00  1.00
MS-Softmax (ST)  15.27 ± 10.29   15.95 ± 9.77    0.753    0.54  0.20  0.26  0.54
MS-Random (ST)   64.29 ± 38.40   64.75 ± 36.01   0.959    0.54  0.05  0.07  0.45
1-DAE (UT)       6.06 ± 4.79     8.18 ± 7.45     0.097    0.55  0.24  0.34  0.57
2-DAE (UT)       15.09 ± 9.79    20.92 ± 10.44   0.020    0.66  0.58  0.69  0.75
3-DAE (UT)       18.56 ± 12.47   25.97 ± 14.02   0.021    0.66  0.68  0.80  0.80
1-DAE (ST)       37.30 ± 16.61   45.86 ± 17.71   0.026    0.64  0.59  0.76  0.68
2-DAE (ST)       28.52 ± 14.98   39.10 ± 16.72   0.006    0.67  0.64  0.80  0.81
3-DAE (ST)       23.54 ± 14.68   33.14 ± 15.96   0.008    0.67  0.66  0.83  0.80
MS-DAE (UC)      27.72 ± 8.57    32.39 ± 8.18    0.020    0.65  0.72  0.83  0.74
MS-DAE (UT)      22.69 ± 11.94   28.99 ± 11.19   0.025    0.67  0.74  0.84  0.89
MS-DAE (SC)      27.89 ± 10.85   35.30 ± 10.82   0.004    0.68  0.71  0.87  0.78
MS-DAE (ST)      25.32 ± 14.01   34.97 ± 14.41   0.006    0.67  0.71  0.88  0.86

Fig. 2: Percentage density is evaluated in the format (mean ± std for Placebo and Treatment, p-value of a t-test between Placebo and Treatment, area under the ROC curve, Pearson's R to BI-RADS (R_B) and Cumulus (R_CM), and Dice coefficient compared to the Cumulus segmentation), where Gaussianity was confirmed with a chi-squared goodness-of-fit and a Jarque-Bera test. The DAE variants in this table contain two hidden layers with 1000 neurons each. The models with a leading number only use input data from that scale. Bold numbers indicate the best results among the automated methods.
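
For illustration, the rectified hyperbolic tangent used as the activation in the settings above can be written directly; this is a minimal sketch of the function itself, not of the training pipeline.

```python
import numpy as np

def rectified_tanh(z):
    """Sparsifying activation a(z) = max(tanh(z), 0): negative pre-activations map to exactly 0."""
    return np.maximum(np.tanh(z), 0.0)
```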

Results The results of our experiments are summarized in Fig. 2. We examined two specific cases of our architecture in order to create a baseline performance. MS-Softmax corresponds to a softmax classifier that is applied to the multiscale training examples, whereas MS-Random represents our final multiscale denoising autoencoder without parameter learning, i.e., fixed random weights. Both models perform poorly, which indicates that the softmax classifier is hardly able to discriminate the raw input data, and that specialized features are needed to perform well on this dataset.

We investigated the value of our multiscale extension to the DAE by evaluating the performance of the three single-scale DAEs. Each investigated model maps to an 8 × 8 label neighborhood. We see that a larger context helps in scoring breast density. However, our multiscale variant, MS-DAE, which combines the information from three scales, outperforms the single-scale variants. Even the multiscale setups that solely use unlabeled data (UC and UT) perform better with respect to correlation and AUC than any of the single-scale DAEs. Compared to the MS-Softmax classifier, the unsupervised models learn rich feature representations from the data (Fig. 1).

Nearly all tested DAE variants, except for 1-DAE (UT), are better at separating out the treatment effect than both manual scores. The top-performing setup, a fine-tuned multiscale DAE, has an AUC of roughly 0.68 compared to 0.62 for a human expert, although our method was trained to mimic the Cumulus-like scoring. The correlation with the Cumulus score (0.83-0.88) suggests that the proposed method is competitive with the reported correlation coefficients from the literature, e.g., 0.68 [1], 0.70 [6], and 0.91 [10]. Our method seems to learn informative features from unlabeled data that are competitive with complicated hand-engineered features used in previous breast density scoring systems.

4 Conclusion

In this paper, we have presented a multiscale feature learning method for breast density scoring. Our automated scores are comparable to the state of the art in the literature, and correlate well with manual scores obtained by Cumulus-like thresholding. The proposed method was better at separating out the treatment effect than the manual thresholding technique and the categorical BI-RADS score, although our technique was trained to mimic manual scoring. We showed that multiple scales are advantageous for learning rich feature representations for segmentation. As our approach is generic and easy to use, we expect that the MS-DAE model will also be useful for other segmentation problems in medical imaging.

References

1. Z. Aitken, V. A. McCormack, R. P. Highnam, L. Martin, A. Gunasekara, O. Melnichouk, G. Mawdsley, C. Peressotti, M. Yaffe, N. F. Boyd, and I. dos Santos Silva. Screen-film mammographic density and breast cancer risk: A comparison of the volumetric standard mammogram form and the interactive threshold measurement methods. Cancer Epidemiology Biomarkers and Prevention, 19(2):418–428, 2010.

2. S. R. Aylward, B. M. Hemminger, and E. D. Pisano. Mixture modeling for digital mammogram display and analysis. In Digital Mammography, pages 305–312. Kluwer Academic Publishers, 1998.

3. Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153–160, 2006.

4. C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.

5. J. W. Byng, N. F. Boyd, E. Fishell, R. A. Jong, and M. J. Yaffe. The quantitative analysis of mammographic densities. Phys. Med. and Biol., 39(10):1629, 1994.

6. J. J. Heine, M. J. Carston, C. G. Scott, K. R. Brandt, F.-F. Wu, V. S. Pankratz, T. A. Sellers, and C. M. Vachon. An automated approach for estimation of breast density. Cancer Epidemiology Biomarkers and Prevention, 17(11):3090–3097, 2008.

7. R. Highnam and M. Brady. Mammographic Image Analysis. Kluwer, 1999.

8. G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

9. V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In NIPS, pages 769–776, 2008.

10. M. G. J. Kallenberg, M. Lokate, C. H. van Gils, and N. Karssemeijer. Automatic breast density segmentation: an integration of different approaches. Physics in Medicine and Biology, 56(9):2715, September 2011.

11. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998.

12. H. Li, M. L. Giger, O. I. Olopade, and M. R. Chinander. Power spectral analysis of mammographic parenchymal patterns for breast cancer risk assessment. J. Digital Imaging, 21(2):145–152, 2008.

13. V. A. McCormack and I. dos Santos Silva. Breast density and parenchymal patterns as markers of breast cancer risk: A meta-analysis. Cancer Epidemiology Biomarkers & Prevention, 15(6):1159–1169, June 2006.

14. F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 14(9):1360–1371, 2005.

15. A. Oliver, X. Llado, E. Perez, J. Pont, E. R. E. Denton, J. Freixenet, and J. Martí. A statistical approach for breast density segmentation. J. Digital Imaging, 23(5):527–537, 2010.

16. S. Petroudi and M. Brady. Breast density segmentation using texture. In Digital Mammography / IWDM, pages 609–615, 2006.

17. N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):e27, January 2008.

18. A. Torrent, A. Bardera, A. Oliver, J. Freixenet, I. Boada, M. Feixes, R. Marti, X. Llado, J. Pont, E. Perez, S. Pedraza, and J. Martí. Breast density segmentation: A comparison of clustering and region based techniques. In Digital Mammography / IWDM, pages 9–16, 2008.

19. C. Vachon, C. van Gils, T. Sellers, K. Ghosh, S. Pruthi, K. Brandt, and V. S. Pankratz. Mammographic density, breast cancer risk and risk prediction. Breast Cancer Research, 9(6):217, December 2007.

20. S. van Engeland, P. Snoeren, H. Huisman, C. Boetes, and N. Karssemeijer. Volumetric breast density estimation from full-field digital mammograms. IEEE Transactions on Medical Imaging, volume 25, pages 273–282, March 2006.

21. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, 2010.

22. L. Warming, P. Ravn, T. Nielsen, and C. Christiansen. Safety and efficacy of drospirenone used in a continuous combination with 17beta-estradiol for prevention of postmenopausal osteoporosis. Climacteric, 7(1):103–111, Mar 2004.

