Multi-scale guided attention for medical image segmentation

Ashish Sinha and Jose Dolz

Abstract—Even though convolutional neural networks (CNNs) are driving progress in medical image segmentation, standard models still have some drawbacks. First, the use of multi-scale approaches, i.e., encoder-decoder architectures, leads to a redundant use of information, where similar low-level features are extracted multiple times at multiple scales. Second, long-range feature dependencies are not efficiently modeled, resulting in non-optimal discriminative feature representations associated with each semantic class. In this paper we attempt to overcome these limitations with the proposed architecture, by capturing richer contextual dependencies based on the use of guided self-attention mechanisms. This approach is able to integrate local features with their corresponding global dependencies, as well as highlight interdependent channel maps in an adaptive manner. Further, the additional loss between different modules guides the attention mechanisms to neglect irrelevant information and focus on more discriminant regions of the image by emphasizing relevant feature associations. We evaluate the proposed model in the context of abdominal organ segmentation on magnetic resonance imaging (MRI). A series of ablation experiments support the importance of these attention modules in the proposed architecture. In addition, compared to other state-of-the-art segmentation networks, our model yields better segmentation performance, increasing the accuracy of the predictions while reducing the standard deviation. This demonstrates the efficiency of our approach to generate precise and reliable automatic segmentations of medical images. Our code and the trained model are made publicly available at: https://github.com/sinAshish/Multi-Scale-Attention

Index Terms—Convolutional neural networks, Deep learning, Medical image segmentation, Deep attention, Self-attention

I. INTRODUCTION

Semantic segmentation of medical images is a crucial step in the diagnosis, treatment and follow-up of many diseases. Although the automation of this task has been widely studied in the past, manual annotations are still typically used in clinical practice, a process that is time-consuming and prone to inter- and intra-observer variability. Thus, there is a high demand for accurate and reliable automatic segmentation methods that improve workflow efficiency in clinical scenarios, alleviating the workload of radiologists and other medical experts.

Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance in a breadth of visual recognition tasks, becoming very popular due to their powerful, nonlinear feature extraction capabilities. These deep models dominate the literature in medical image segmentation [1] and have achieved outstanding performance in a broad span of applications, including brain [2] or cardiac [3] imaging, for example, becoming the de facto solution for these problems.

A. Sinha is with the Indian Institute of Technology Roorkee, India. E-mail: [email protected].

J. Dolz is with the École de technologie supérieure, Montreal, Canada. E-mail: [email protected].

Manuscript received XXX; revised XXX.

In this scenario, fully convolutional neural networks [4] or encoder-decoder architectures [5], [6] are typically the standard choice. These architectures are commonly composed of a contracting path, which collapses an input image into a set of high-level features, and an expanding path, where high-level features are used to reconstruct a pixel-wise segmentation mask at a single [4] or multiple upsampling steps [5], [6]. Nevertheless, despite their strong representation power, these multi-scale approaches lead to a redundant use of information flow, e.g., similar low-level features are extracted multiple times at different levels within the network. Furthermore, the discriminative power of the learned feature representations for pixel-wise recognition may be insufficient for some challenging tasks, such as medical image segmentation.

Recent works to improve the discriminative ability of feature representations include the use of multi-scale context fusion [7], [8], [9], [10]. Zhao et al. [8] proposed a pyramid network that exploited global information at different scales by aggregating feature maps generated by multiple dilated convolutional blocks. Aggregation of contextual multi-scale information can also be achieved through pooling operations [11]. Even though these strategies may help to capture objects at different scales, contextual dependencies for all image regions are homogeneous and non-adaptive, ignoring the difference between local representation and contextual dependencies for different categories. Further, these multi-context representations are manually designed and therefore lack flexibility. As a consequence, long-range object relationships in the whole image cannot be fully leveraged in these approaches, which is of pivotal importance in many medical imaging segmentation problems.

Alternatively, attention mechanisms have been widely studied in deep CNNs for many computer vision tasks in order to efficiently integrate local and global features, including human pose estimation [12], emotion recognition [13], text detection [14], object detection [15] and classification [16]. Unlike standard multi-scale feature fusion approaches, which compress an entire image into a static representation, attention allows the network to focus on the most relevant features without additional supervision, avoiding the use of multiple similar feature maps and highlighting salient features that are useful for a given task. Semantic segmentation networks have also benefited from attention modules, which has resulted in enhanced models for pixel-wise recognition tasks [17], [18], [19], [20], [21], [22]. For example, Chen et al. [17] proposed an attention mechanism to weight multi-scale features extracted at different scales in the context of natural scene segmentation.

arXiv:1906.02849v1 [cs.CV] 7 Jun 2019


This method improved the segmentation performance over classical average- and max-pooling techniques used to merge multi-scale feature predictions.

Despite the growing interest in integrating attention mechanisms in image segmentation networks for natural scenes, their adoption in medical imaging remains scarce [23], [24], [25], [26], being limited to simple attention models. Thus, in this work, we explore more complex attention mechanisms that can boost the performance of standard deep networks for the task of medical image segmentation. Specifically, we propose a multi-scale guided attention network for medical image segmentation. First, the multi-scale approach generates stacks at different resolutions containing different semantics. While lower-level stacks focus on local appearance, higher-level stacks encode global representations. This multi-scale strategy encourages the attention maps generated at different resolutions to encode different semantic information. Then, at each scale, a stack of attention modules gradually removes noisy areas and emphasizes the regions that are more relevant to the semantic descriptions of the targets. Each attention module contains two independent self-attention mechanisms, which focus on modelling position and channel feature dependencies, respectively. This pair allows the model to capture wider and richer contextual representations and to improve the dependencies between channel maps, resulting in enhanced feature representations. We validate our method on the task of multi-organ segmentation on magnetic resonance imaging (MRI), employing the publicly available CHAOS dataset. Results show that the proposed architecture improves the segmentation performance by successfully modeling rich contextual dependencies over local features.

II. RELATED WORK

A. Medical image segmentation

Even though segmentation of medical images has been widely studied in the past [27], [28], it is undeniable that CNNs are driving progress in this field, leading to outstanding performances in many applications. Most available medical image segmentation architectures are inspired by the well-known fully convolutional neural network (FCN) [4] or UNet [5]. In FCN, the fully connected layers of standard classification CNNs are replaced by convolutional layers to achieve dense pixel prediction in one forward step. To recover the original resolution of the input image, the prediction is upsampled in a single step. Further, to improve the prediction capabilities, skip connections are included in the network by employing the intermediate feature maps. On the other hand, UNet contains contractive and expansive paths created using a combination of convolutional layers with pooling and upsampling layers. Skip connections are used to concatenate the features from contractive and expansive path layers. Many extensions of these networks have been proposed to solve pixel-wise segmentation problems in a wide range of applications [29], [30], [31], [32], [33], [34], [35], [36].

B. Deep attention

Attention mechanisms aim at emphasizing important local regions captured in local features and filtering irrelevant information transferred by global features, improving the modeling of long-range dependencies. These modules have therefore become an essential part of models that need to capture global dependencies. The integration of these attention modules has proved very successful in many vision problems, such as image captioning [37], image question-answering [38], classification [39] or detection [40], among many others. Self-attention [41], [42], [43], [44] has recently attracted the attention of researchers, as it exhibits a good ability to model long-range dependencies while maintaining computational and statistical efficiency. In these modules, the response at each position is calculated by attending to all positions and taking their weighted average in an embedding space. For image vision problems, [18], [19], [43] integrated self-attention to model the relation of local features with their corresponding global dependencies. For instance, the point-wise spatial attention network (PSANet) proposed in [18] allows a flexible and dynamic aggregation of long-range contextual information by connecting each position in the feature map with all the others through self-adaptive attention maps.

Recent works have indicated that attention features generated in a single step may still contain noise introduced from regions that are irrelevant for a given class, leading to sub-optimal results [38], [45]. To overcome this issue, some works have investigated the use of progressive multiple attention layers in the context of visual question answering [38] or zero-shot learning [45]. This strategy gradually filters undesired noise and emphasizes the regions highly relevant to the class semantic representations. To the best of our knowledge, the application of stacked attention modules remains unexplored in semantic segmentation.

C. Medical image segmentation with deep attention

Even though attention mechanisms are becoming popular in many vision problems, the literature on medical image segmentation with attention remains scarce, with simple attention modules [23], [24], [25], [26]. Wang et al. [23] employed attention modules at multiple resolutions to combine local deep attention features (DAF) with global context for prostate segmentation on ultrasound images. To model long-range dependencies, local and global features were combined in a simple attention module, which contains three convolutional layers followed by a softmax function to create the attention map. A similar attention module, composed of two convolutional layers followed by a softmax, was integrated in a hierarchical aggregation framework built on UNet for left atrial segmentation [24]. More recently, additive attention gate modules were integrated in the skip connections of the decoding path of UNet with the goal of better modeling complementary information from the encoder [25].

III. METHODS

A. Overview

Target structures in medical imaging typically present intra- and inter-class diversity in size, shape and texture, particularly if images are processed in 2D. Traditional CNNs for segmentation have a local receptive field, which results in the generation of local feature representations.


Fig. 1: Overview of the proposed multi-scale guided attention network. We resort to ResNet-101 to extract dense local features.

As long-range contextual information is not properly encoded, local feature representations may lead to potential differences between features corresponding to pixels with the same label [19]. This may introduce intra-class inconsistency that can ultimately impact the recognition performance [46]. To tackle this problem, we investigate attention mechanisms to build associations between features. First, global context is captured by employing a multi-scale strategy. Then, the learned features at multiple scales are fed into the guided attention modules, which are composed of a stack of spatial and channel self-attention modules. While the spatial and channel self-attention modules help to adaptively integrate local features with their global dependencies, the stack of attention modules helps to gradually filter noise out, emphasizing relevant information. The overview of the proposed framework is depicted in Figure 1.

B. Multi-scale attention maps

Multi-scale features are known to be useful in computer vision problems even before the deep learning era [47]. In the context of deep segmentation networks, the integration of multi-scale features has demonstrated astonishing performance [17], [48], [49]. Inspired by these works, we make use of learned features at multiple scales, which help to encode both global and local context. Specifically, we follow the multi-scale strategy recently proposed in [23], which is illustrated in Fig. 1. In this setting, features at multiple scales are denoted as F_s, where s indicates the level in the architecture. Since features come at different resolutions for each level s, they are upsampled to a common resolution by employing linear interpolation, leading to enlarged feature maps F'_s. Then, the F'_s from all the scales are concatenated, forming a tensor that is convolved to create a common multi-scale feature map, F_MS = conv([F'_0, F'_1, F'_2, F'_3]). This new multi-scale feature map is combined with each of the feature maps at the different scales and fed into the guided attention modules to generate the attention features A_s:

$$A_s = \mathrm{AttMod}_s\left(\mathrm{conv}\left([F'_s, F_{MS}]\right)\right) \quad (1)$$

where AttMod_s represents the guided attention module at scale s.
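To make the fusion step concrete, the snippet below is a minimal PyTorch sketch of Eq. (1): per-scale features are bilinearly upsampled to a common resolution, concatenated and convolved into F_MS, and each conv([F'_s, F_MS]) is prepared as input for the corresponding attention module. The class name, channel sizes, number of scales and the assumption that all scales share the same channel count are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusion(nn.Module):
    """Sketch of Eq. (1): upsample per-scale features, build F_MS, and
    concatenate it back with each F'_s before the guided attention modules."""

    def __init__(self, in_channels=256, num_scales=4):
        super().__init__()
        # conv([F'_0, ..., F'_3]) -> common multi-scale feature map F_MS
        self.fuse = nn.Conv2d(in_channels * num_scales, in_channels,
                              kernel_size=3, padding=1)
        # one convolution per scale for conv([F'_s, F_MS]) before AttMod_s
        self.per_scale = nn.ModuleList(
            [nn.Conv2d(in_channels * 2, in_channels, kernel_size=3, padding=1)
             for _ in range(num_scales)])

    def forward(self, feats):
        # feats: list of per-scale maps F_s, assumed to share the channel count
        # but to differ in spatial resolution
        target_size = feats[0].shape[2:]
        upsampled = [F.interpolate(f, size=target_size, mode='bilinear',
                                   align_corners=False) for f in feats]   # F'_s
        f_ms = self.fuse(torch.cat(upsampled, dim=1))                     # F_MS
        # inputs to the guided attention modules, one per scale
        return [conv(torch.cat([f_s, f_ms], dim=1))
                for conv, f_s in zip(self.per_scale, upsampled)]
```

The returned list contains the per-scale inputs that each AttMod_s would consume in this sketch.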

C. Spatial and Channel self-attention modules

As introduced earlier, receptive fields in traditional segmentation deep models are reduced to a local vicinity. This limits the capabilities of modeling wider and richer contextual representations. On the other hand, channel maps can be considered as class-specific responses, where different semantic responses are associated with each other. Thus, another strategy to enhance the feature representation of specific semantics is to improve the dependencies between channel maps [50]. To address these limitations of standard CNNs, we employ the position and channel attention modules recently proposed in [19], which are depicted in Figure 2.

a) Position attention module (PAM): Let F ∈ R^{C×W×H} denote an input feature map to the attention module, where C, W and H represent the channel, width and height dimensions, respectively. In the upper branch, F is passed through a convolutional block, resulting in a feature map F^p_0 ∈ R^{C'×W×H}, where C' is equal to C/8¹. Then, F^p_0 is reshaped to a feature map of shape (W×H)×C'. In the second branch, the input feature map F follows the same operations and is then transposed, resulting in F^p_1 ∈ R^{C'×(W×H)}. Both maps are multiplied and a softmax is applied to the resulting matrix to generate the spatial attention map S^p ∈ R^{(W×H)×(W×H)}:

$$s^p_{i,j} = \frac{\exp\left(F^p_{0,i} \cdot F^p_{1,j}\right)}{\sum_{i=1}^{W \times H} \exp\left(F^p_{0,i} \cdot F^p_{1,j}\right)} \quad (2)$$

where s^p_{i,j} evaluates the impact of the i-th position on the j-th position. In the third branch, the input F is fed into a different convolutional block, resulting in F^p_2 ∈ R^{C×W×H}, which has the same shape as F. As in the other branches, F^p_2 is reshaped, becoming F^p_2 ∈ R^{C×(W×H)}. Then it is multiplied by a permuted version of the spatial attention map S^p, and the output is reshaped back to R^{C×W×H}. The attention feature map corresponding to the position attention module, i.e., F_PAM, can therefore be formulated as follows:

$$F_{PAM,j} = \lambda_p \sum_{i=1}^{W \times H} s^p_{i,j}\, F^p_{2,i} + F_j \quad (3)$$

As in [19], the value of λ_p is initialized to 0 and is gradually learned to give more importance to the spatial attention map. Thus, the position attention module selectively aggregates global context into the learned features, guided by the spatial attention map.

¹We use the superscript p to indicate that the feature map belongs to the position attention module. Similarly, we employ the superscript c for the channel attention module features.
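The position attention module can be sketched in PyTorch as follows; it mirrors Eqs. (2)-(3) and the C/8 reduction mentioned in the text, but the layer configuration is an assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


class PositionAttentionModule(nn.Module):
    """Sketch of the PAM (Eqs. 2-3): spatial attention over all W*H positions."""

    def __init__(self, channels):
        super().__init__()
        reduced = channels // 8                                     # C' = C/8
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)    # F^p_0
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)      # F^p_1
        self.value = nn.Conv2d(channels, channels, kernel_size=1)   # F^p_2
        self.softmax = nn.Softmax(dim=-1)
        self.lambda_p = nn.Parameter(torch.zeros(1))  # initialized to 0, learned

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)   # (W*H) x C'
        k = self.key(x).view(b, -1, h * w)                       # C' x (W*H)
        attn = self.softmax(torch.bmm(q, k))                     # S^p: (W*H) x (W*H)
        v = self.value(x).view(b, c, h * w)                      # C x (W*H)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.lambda_p * out + x                           # Eq. (3)
```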


Fig. 2: Details of the position and channel attention modules inspired by [19].

b) Channel attention module (CAM): The pipeline of the channel attention module is depicted at the bottom of Figure 2. The input F ∈ R^{C×W×H} is reshaped in the first two branches of the CAM, and permuted in the second branch, leading to F^c_0 ∈ R^{(W×H)×C} and F^c_1 ∈ R^{C×(W×H)}, respectively. Then, we perform a matrix multiplication between F^c_0 and F^c_1, and obtain the channel attention map S^c ∈ R^{C×C} as:

$$s^c_{i,j} = \frac{\exp\left(F^c_{0,i} \cdot F^c_{1,j}\right)}{\sum_{i=1}^{C} \exp\left(F^c_{0,i} \cdot F^c_{1,j}\right)} \quad (4)$$

where the impact of the i-th channel on the j-th channel is given by s^c_{i,j}. This map is then multiplied by a transposed version of the input F, i.e., F^c_2, and the result is reshaped back to R^{C×W×H}. Similarly to the PAM, the final channel attention features are obtained as:

$$F_{CAM,j} = \lambda_c \sum_{i=1}^{C} s^c_{i,j}\, F^c_{2,i} + F_j \quad (5)$$

where λ_c controls the importance of the channel attention map over the input feature map F. Similarly to λ_p, λ_c is initially set to 0 and gradually learned. This formulation aggregates weighted versions of the features of all the channels into the original features, highlighting class-dependent feature maps and increasing feature discriminability between classes.

At the end of both attention modules, the newly generated features are fed into a convolutional layer before performing an element-wise sum operation to generate the position-channel attention features.
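A corresponding sketch of the channel attention module (Eqs. (4)-(5)) and of the convolution-then-sum fusion described above is given below. It reuses the PositionAttentionModule sketch from the previous section; again, the exact layers are assumptions and not the released code.

```python
import torch
import torch.nn as nn


class ChannelAttentionModule(nn.Module):
    """Sketch of the CAM (Eqs. 4-5): attention over the C channel maps."""

    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)
        self.lambda_c = nn.Parameter(torch.zeros(1))  # initialized to 0, learned

    def forward(self, x):
        b, c, h, w = x.shape
        f0 = x.view(b, c, -1)                       # C x (W*H)
        f1 = f0.permute(0, 2, 1)                    # (W*H) x C
        attn = self.softmax(torch.bmm(f0, f1))      # S^c: C x C
        out = torch.bmm(attn, f0).view(b, c, h, w)  # reweighted channel maps
        return self.lambda_c * out + x              # Eq. (5)


class DualAttention(nn.Module):
    """PAM and CAM in parallel; each output passes through a convolution
    before the element-wise sum described in the text."""

    def __init__(self, channels):
        super().__init__()
        self.pam = PositionAttentionModule(channels)  # sketch from Section III-C(a)
        self.cam = ChannelAttentionModule()
        self.conv_p = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv_p(self.pam(x)) + self.conv_c(self.cam(x))
```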

D. Guiding attention

Given the feature map F at the input of the guided attention module at scale s (generated by concatenating F_MS and F'_s), the module generates attention features via a multi-step refinement. In the first step, F is used by the position and channel attention modules to generate self-attention features. In parallel, we integrate an encoder-decoder network that compresses the input features F into a compact representation in the latent space. The objective is that class information can be embedded in the second position-channel attention module by forcing the semantic representations of both encoder-decoders to be close, which is formulated as:

$$\mathcal{L}_G = \left\| E_1(F) - E_2(F_{SA}) \right\|_2^2 \quad (6)$$

where E_1(F) and E_2(F_SA) are the encoded representations of the first and second encoder-decoder networks, respectively, and F_SA are the attention features generated after the first dual attention module. Specifically, the feature maps reconstructed by the first encoder-decoder (n = 0) are combined with the self-attention features generated by the first attention module through a matrix-multiplication operation to generate F_SA. In addition, to ensure that the reconstructed features correspond to the features at the input of the position-channel attention modules, the outputs of the encoder-decoders are forced to be close to their inputs:

$$\mathcal{L}_{Rec} = \left\| F - \hat{F} \right\|_2^2 + \left\| F_{SA} - \hat{F}_{SA} \right\|_2^2 \quad (7)$$

where F̂ and F̂_SA are the reconstructed feature maps, i.e., D_0(E_0(F)) and D_1(E_1(F_SA)), of the first and second encoder-decoder networks.
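A possible way to compute these two losses for one scale is sketched below. The encoder/decoder arguments (enc0/dec0, enc1/dec1) are placeholders for the two auto-encoders of the guided attention module, and the mean-squared error is used as a stand-in for the squared L2 norms of Eqs. (6)-(7); none of this is taken from the released implementation.

```python
import torch.nn.functional as F


def guided_attention_losses(feat, feat_sa, enc0, dec0, enc1, dec1):
    """Sketch of Eqs. (6)-(7) for one scale.

    feat      : F, input features of the guided attention module
    feat_sa   : F_SA, features after the first dual attention module
    enc*/dec* : the two auto-encoders of the guided attention module
    """
    z0, z1 = enc0(feat), enc1(feat_sa)
    # Eq. (6): the latent codes of both encoder-decoders are forced to be close
    loss_guided = F.mse_loss(z0, z1)
    # Eq. (7): each encoder-decoder must also reconstruct its own input
    loss_rec = F.mse_loss(dec0(z0), feat) + F.mse_loss(dec1(z1), feat_sa)
    return loss_guided, loss_rec
```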

Fig. 3: An illustration of the semantic guided attention module for a given scale s.

As the guided attention module is applied at multiple scales, the combined guided loss for all the modules is:

$$\mathcal{L}_{G_{Total}} = \sum_{s=0}^{S} \mathcal{L}^s_G \quad (8)$$

Similarly, the total reconstruction loss becomes:

$$\mathcal{L}_{Rec_{Total}} = \sum_{s=0}^{S} \mathcal{L}^s_{Rec} \quad (9)$$


where L_Rec1 and L_Rec2 are the reconstruction losses for the encoder-decoder architectures in the first and second block of the guided attention module, i.e., the two terms of Eq. (7).

E. Deep supervision

While the attention modules do not require auxiliary objective functions, we found that the use of extra supervision at each scale [51] encouraged the intermediate feature maps to be semantically discriminative at each image scale, which is in line with similar works in the literature [17], [23], [25]. The total segmentation loss therefore becomes:

$$\mathcal{L}_{Seg_{Total}} = \sum_{s=0}^{S} \mathcal{L}^s_{Seg_{F'}} + \sum_{s=0}^{S} \mathcal{L}^s_{Seg_{A}} \quad (10)$$

where the first term refers to the segmentation results obtained from the raw features F'_s and the second term evaluates the segmentation results provided by the attention features. In all cases, the multi-class cross-entropy between the network prediction and the ground truth labels is employed as the segmentation loss. Taking into account all the losses, the final objective function to optimize becomes:

$$\mathcal{L}_{Total} = \alpha \mathcal{L}_{Seg_{Total}} + \beta \mathcal{L}_{G_{Total}} + \gamma \mathcal{L}_{Rec_{Total}} \quad (11)$$

where α, β and γ control the importance of each term in the main loss function.
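The sketch below shows how the per-scale terms could be accumulated into the final objective of Eq. (11). The function signature and the way per-scale predictions and losses are passed in are assumptions; only the weighting follows the text, with the α, β and γ values reported in Section IV used as defaults.

```python
import torch.nn.functional as F


def total_loss(seg_logits_raw, seg_logits_att, targets, guided, recon,
               alpha=1.0, beta=0.25, gamma=0.1):
    """Sketch of Eqs. (8)-(11): sum the per-scale terms and weight them.

    seg_logits_raw / seg_logits_att : per-scale predictions from F'_s and A_s
    targets                         : ground truth label map
    guided / recon                  : per-scale guided / reconstruction losses
    alpha, beta, gamma              : weights reported in Section IV (1, 0.25, 0.1)
    """
    loss_seg = sum(F.cross_entropy(p, targets) for p in seg_logits_raw) \
             + sum(F.cross_entropy(p, targets) for p in seg_logits_att)   # Eq. (10)
    loss_guided = sum(guided)                                             # Eq. (8)
    loss_rec = sum(recon)                                                 # Eq. (9)
    return alpha * loss_seg + beta * loss_guided + gamma * loss_rec       # Eq. (11)
```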

IV. EXPERIMENTS

A. Experimental setting

In this section we present the common setting for all the experiments, including the dataset, network architectures, training details and evaluation metrics.

1) Dataset: The abdominal MRI dataset from the Combined Healthy Abdominal Organ Segmentation (CHAOS) Challenge² [52], [53], [54] is employed to evaluate our method. Particularly, among the five tasks we focus on the segmentation of abdominal organs on MRI (T1-DUAL in phase). This dataset includes scans from 20 subjects for training, with their corresponding ground truth annotations, and 20 for testing without annotations. Scans were acquired with a 1.5T Philips MRI scanner, producing 12-bit DICOM images with a resolution of 256×256 pixels per slice and between 26 and 50 slices per scan. Since testing labels are not provided with the dataset, we employed the training set for our experiments. Particularly, we split it into subsets of 13, 2 and 5 subjects used for training, validation and testing, respectively. We repeated the process three times, selecting different subjects for validation and testing, and report the average results over the three folds. To increase the variability of the data, we randomly rotated, flipped and mirrored the images, but without augmenting the dataset size.

²https://chaos.grand-challenge.org/

2) Network architectures: The multi-scale strategy in the proposed network is based on the recent work in [23], which uses ResNet101 [55] as the backbone architecture. Therefore, this architecture is considered as the lower baseline in our experiments. In the first part of the experiments, we perform an ablation study on the different proposed modules to evaluate the impact of each choice on the segmentation performance. The first two networks, i.e., Proposed (PAM) and Proposed (CAM), extend the baseline by replacing the attention module by either the spatial or the channel self-attention module (Fig. 2), respectively. Then, both modules are combined simultaneously, leading to the Proposed (DualNet) model. In the next model, i.e., Proposed (MS-Dual), the attention features generated by the dual attention module are refined in a multi-step process, where a second dual attention module is included. Last, the proposed architecture, referred to as Proposed (MS-Dual-Guided), extends the Proposed (MS-Dual) model by incorporating the semantic guidance (Fig. 3). Furthermore, we compared the performance of the proposed network to other state-of-the-art architectures, most of them integrating attention: UNet [5], Attention UNet [25], DANet [19] and the Pyramidal Attention Network (PAN) [20].

3) Training and implementation details: We train all the networks using the Adam optimizer with a mini-batch size of 8, and with β₁ and β₂ set to 0.9 and 0.99, respectively. While most of the networks converged during the first 250 epochs, we found that PAN [20] and DANet [19] needed around 400 epochs to achieve their best results. The learning rate is initially set to 0.001 and multiplied by 0.5 after 50 epochs without improvement on the validation set. As the segmentation objective function, we employ the cross-entropy error at each pixel over all the categories for all the networks. Furthermore, as introduced in Section III, we use the objective function in Eq. (11) for the proposed architecture, with α, β and γ set empirically to 1, 0.25 and 0.1, respectively. As input to the networks we employed 2D axial images of size 256 × 256. Experiments were performed on a server equipped with a Titan V GPU. The code of our model, as well as the trained model, are made publicly available at https://github.com/sinAshish/Multi-Scale-Attention.
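For reference, the optimizer and learning-rate schedule described above could be configured as follows in PyTorch; the stand-in network is only there so the snippet runs, and the ReduceLROnPlateau call is one plausible way to implement the "multiply by 0.5 after 50 epochs without improvement" rule.

```python
import torch
import torch.nn as nn

# Minimal sketch of the training setup described above; nn.Conv2d is a
# stand-in so the snippet runs, swap in the real segmentation model.
model = nn.Conv2d(1, 5, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99))
# learning rate multiplied by 0.5 after 50 epochs without validation improvement
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=50)

# After each epoch, the scheduler would be stepped with the validation DSC,
# e.g. scheduler.step(validation_dice), so that it monitors improvement.
```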

4) Evaluation: The similarity between the ground truth and the CNN segmentations is assessed by employing several comparison metrics. First, we resort to the widely used Dice similarity coefficient (DSC) to compare volumes based on their overlap. Given two volumes A and B, their DSC can be defined as:

$$DSC = \frac{2\,|A \cap B|}{|A| + |B|} \quad (12)$$

Further, we also assess the segmentation performance based on the volume similarity, which is formulated as:

$$VS = 1 - \frac{|A - B|}{A + B} \quad (13)$$

However, volume-based metrics generally lack sensitivity to the segmentation outline, and segmentations showing a high degree of spatial overlap might present clinically-relevant differences between their contours. Thus, distance-based metrics, such as the mean surface distance (MSD), were also considered in our evaluation.


TABLE I: Ablation study on the different proposed attention modules on the CHAOS dataset (multi-organ segmentation on MRI). The values show the results averaged over the 3 folds.

DSC
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
Baseline (DAF [23])          91.66 (±2.99)   79.28 (±18.68)  83.63 (±7.56)   75.35 (±20.41)  82.48 (±6.06)
Proposed (PAM)               91.89 (±4.29)   85.47 (±7.04)   86.84 (±6.53)   73.65 (±22.62)  84.46 (±6.68)
Proposed (CAM)               92.58 (±2.65)   84.52 (±9.34)   86.38 (±6.27)   76.84 (±20.56)  85.08 (±5.62)
Proposed (DualNet)           92.60 (±3.20)   85.29 (±7.96)   87.74 (±6.37)   76.44 (±22.17)  85.52 (±5.86)
Proposed (MS-Dual)           92.62 (±3.08)   86.29 (±5.98)   88.82 (±4.84)   76.96 (±19.87)  86.17 (±5.78)
Proposed (MS-Dual-Guided)    92.46 (±2.82)   87.96 (±6.46)   88.01 (±6.16)   78.61 (±18.69)  86.75 (±5.05)

Volume similarity (VS)
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
Baseline (DAF [23])          96.69 (±3.21)   86.75 (±16.41)  90.29 (±8.39)   84.98 (±14.42)  89.68 (±4.48)
Proposed (PAM)               96.62 (±4.62)   92.83 (±7.43)   93.96 (±6.46)   83.93 (±20.54)  91.84 (±4.77)
Proposed (CAM)               97.25 (±2.95)   93.78 (±6.04)   93.98 (±5.48)   83.72 (±20.97)  92.18 (±5.07)
Proposed (DualNet)           97.04 (±3.03)   94.50 (±5.96)   93.43 (±7.03)   83.30 (±22.53)  92.07 (±5.23)
Proposed (MS-Dual)           97.47 (±3.07)   93.30 (±4.11)   95.27 (±4.89)   84.90 (±16.86)  92.74 (±4.76)
Proposed (MS-Dual-Guided)    96.44 (±3.15)   96.14 (±3.15)   94.95 (±4.48)   87.87 (±15.23)  93.85 (±3.50)

Average Surface Distance (MSD)
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
Baseline (DAF [23])          0.64 (±0.29)    0.97 (±1.08)    0.63 (±0.25)    1.45 (±2.04)    0.92 (±0.33)
Proposed (PAM)               0.55 (±0.19)    0.56 (±0.23)    0.55 (±0.21)    1.54 (±2.40)    0.80 (±0.43)
Proposed (CAM)               0.58 (±0.22)    0.57 (±0.24)    0.52 (±0.20)    1.29 (±1.64)    0.74 (±0.32)
Proposed (DualNet)           0.54 (±0.19)    0.56 (±0.19)    0.50 (±0.18)    1.49 (±2.29)    0.77 (±0.41)
Proposed (MS-Dual)           0.53 (±0.18)    0.51 (±0.14)    0.46 (±0.14)    1.19 (±1.42)    0.67 (±0.30)
Proposed (MS-Dual-Guided)    0.54 (±0.16)    0.48 (±0.18)    0.48 (±0.14)    1.13 (±1.24)    0.66 (±0.27)

The MSD between contours A and B is defined as follows:

$$MSD = \frac{1}{|A| + |B|} \left( \sum_{a \in A} d(a, B) + \sum_{b \in B} d(b, A) \right) \quad (14)$$

where d(a, B) is the distance between a point a on the surface A and the surface B, which is given by the minimum of the Euclidean norm:

$$d(a, B) = \min_{b \in B} \left\| a - b \right\|_2^2 \quad (15)$$

Since the inter-slice distances and the x-y spacing of each individual scan are not provided, we report these results in voxels.
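The three metrics can be sketched with NumPy as below; the MSD function operates on surface-voxel coordinates and uses the plain Euclidean distance (a brute-force, non-squared reading of Eq. (15)), so it is an illustrative implementation rather than the evaluation code used in the paper.

```python
import numpy as np


def dice(a, b):
    """Eq. (12): Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


def volume_similarity(a, b):
    """Eq. (13): volume similarity between two binary masks."""
    va, vb = a.astype(bool).sum(), b.astype(bool).sum()
    return 1.0 - abs(int(va) - int(vb)) / (va + vb)


def mean_surface_distance(a_pts, b_pts):
    """Eq. (14): symmetric mean surface distance between two sets of surface
    voxel coordinates of shape (N, 3); brute-force pairwise distances."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(a_pts) + len(b_pts))
```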

B. Results

1) Ablation study on the proposed attention modules: To validate the individual contribution of the different components to the segmentation performance, we perform ablation experiments under different settings. Table I reports the results of the different attention modules. Compared to the baseline, we observe that integrating either the spatial (PAM) or the channel (CAM) attention module at each scale of the baseline architecture improves the performance by 2-3% in terms of overlap and volume similarity, and by 12-18% in terms of surface distances, on average. On the other hand, having both modules in parallel, i.e., Proposed (DualNet), brings slightly better results in terms of DSC, but achieves lower performance on the surface distance metric. However, despite the lower average performance on the MSD, the proposed DualNet model still achieves better results in 3 out of 4 structures compared to the channel attention model. This trend is repeated on the DSC metric, where DualNet surpasses the proposed CAM architecture in the same 3 structures: the liver and both the left and right kidneys. This suggests that, even though both spatial and channel attention bring an improvement in performance, the channel attention module contributes more than the spatial attention when they are combined. If the features generated by the proposed DualNet model are refined in a second step, in the network referred to as Proposed (MS-Dual), the average results are further improved by nearly 0.7% and 10% in volume- and distance-based metrics, respectively. Last, the introduction of the semantic-guided loss, Proposed (MS-Dual-Guided), results in an additional boost in performance, yielding the best values on the three metrics: 86.75% (DSC), 93.85% (VS) and 0.66 voxels (MSD). These results represent an improvement of 4.5%, 4% and 26% in DSC, VS and MSD, respectively, compared to the baseline in [23], showing the efficiency of the proposed attention network compared to individual attention components.

2) Comparison to the state-of-the-art: The experimental results obtained by several state-of-the-art segmentation networks are reported in Table II. Compared to other networks that were proposed in the context of medical image segmentation, i.e., UNet [5], Attention UNet [25] and DAF [23], our network achieves a mean improvement of 5.6%, 4.3% and 2.0% in terms of DSC, 4.9%, 4.2% and 2.1% in terms of VS, and 25%, 26% and 6% in terms of MSD, respectively. This difference in performance could be explained by the fact that the attention modules integrated in [23] and [25] are much simpler than those proposed in our architecture. On the other hand, attention modules for general computer vision tasks have attracted more attention, resulting in more elaborate strategies which typically achieve better segmentation results.


TABLE II: Comparison of the proposed network to other state-of-the-art architectures on the CHAOS dataset. The values show the results averaged over the 3 folds.

DSC
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
UNet [5]                     90.94 (±4.01)   79.14 (±15.23)  82.51 (±7.48)   71.95 (±21.61)  81.14 (±7.88)
DANet [19]                   91.69 (±4.07)   83.85 (±9.40)   84.49 (±8.60)   75.54 (±16.08)  83.89 (±9.54)
PAN (ResNet34) [20]          91.99 (±2.98)   81.51 (±9.03)   83.62 (±6.21)   73.70 (±19.97)  82.70 (±6.51)
PAN (ResNet101) [20]         92.13 (±3.51)   85.02 (±5.16)   85.36 (±4.87)   74.84 (±21.23)  84.34 (±6.17)
DAF [23]                     91.66 (±2.99)   79.28 (±18.68)  83.63 (±7.56)   75.35 (±20.41)  82.48 (±6.06)
UNet Attention [25]          92.02 (±1.93)   84.33 (±5.91)   85.57 (±4.09)   77.18 (±15.95)  84.77 (±5.27)
Proposed (MS-Dual-Guided)    92.46 (±2.82)   87.96 (±6.46)   88.01 (±6.16)   78.61 (±18.69)  86.75 (±5.05)

Volume similarity (VS)
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
UNet [5]                     95.54 (±4.43)   87.68 (±5.77)   89.55 (±4.68)   83.28 (±14.78)  89.01 (±4.82)
DANet [19]                   96.90 (±4.18)   92.88 (±5.12)   91.52 (±6.73)   84.37 (±16.15)  91.42 (±4.52)
PAN (ResNet34) [20]          96.56 (±3.55)   90.89 (±5.64)   91.83 (±7.75)   81.98 (±20.67)  90.32 (±5.27)
PAN (ResNet101) [20]         96.99 (±3.64)   93.77 (±4.63)   92.69 (±6.88)   84.24 (±17.37)  91.93 (±4.71)
DAF [23]                     96.69 (±3.21)   86.75 (±16.41)  90.29 (±8.39)   84.98 (±14.42)  89.68 (±4.48)
UNet Attention [25]          96.95 (±1.89)   92.29 (±6.41)   91.79 (±3.53)   85.94 (±11.88)  91.74 (±3.91)
Proposed (MS-Dual-Guided)    96.44 (±3.15)   96.14 (±3.15)   94.95 (±4.48)   87.87 (±15.23)  93.85 (±3.50)

Average Surface Distance (MSD)
Method                       Liver           Kidney R        Kidney L        Spleen          Mean
UNet [5]                     0.59 (±0.18)    0.69 (±0.38)    0.61 (±0.19)    1.76 (±2.57)    0.91 (±0.49)
DANet [19]                   0.61 (±0.27)    0.65 (±0.31)    0.67 (±0.30)    1.17 (±0.94)    0.78 (±0.23)
PAN (ResNet34) [20]          0.62 (±0.25)    0.75 (±0.31)    0.69 (±0.21)    1.37 (±1.43)    0.86 (±0.29)
PAN (ResNet101) [20]         0.57 (±0.22)    0.61 (±0.19)    0.64 (±0.15)    1.30 (±1.47)    0.78 (±0.31)
DAF [23]                     0.64 (±0.29)    0.97 (±1.08)    0.63 (±0.25)    1.45 (±2.04)    0.92 (±0.33)
UNet Attention [25]          0.57 (±0.25)    0.61 (±0.23)    0.56 (±0.18)    1.15 (±1.01)    0.72 (±0.24)
Proposed (MS-Dual-Guided)    0.54 (±0.16)    0.48 (±0.18)    0.48 (±0.14)    1.13 (±1.24)    0.66 (±0.27)

Among these architectures, the PAN model [20] with ResNet101 as backbone (the same as ours) achieved 84.34%, 91.93% and 0.78 voxels, on average, for DSC, VS and MSD, respectively, which represent the best results among the segmentation networks proposed for natural scenes. Despite these competitive results, the proposed model still outperforms the PAN architecture by 2.4%, 1.9% and 12% in DSC, VS and MSD. As PAN [20] also employs a multi-scale architecture, these differences suggest that the use of dual self-attention and a guided refinement module can actually improve the modelling of contextual dependencies, resulting in increased segmentation performance.

In addition to the values reported in Tables I and II, we also depict the distribution of the DSC, VS and MSD values over the 15 subjects used for evaluation for all the models (Fig. 4). In these plots, we can first observe the impact of the different attention modules on the segmentation performance of the proposed model. As we progressively include the proposed attention modules in the baseline network, the segmentation performance improves, which is reflected in a better distribution of segmentation accuracy values with a smaller variance. This difference in the distribution of results is more prominent when comparing the proposed network with other state-of-the-art networks, which are represented by bluish box plots. We can also observe that this pattern is constant across organs and metrics, suggesting that the proposed attention network achieves better and more robust segmentation results than current state-of-the-art architectures.

3) Convergence: We have also compared the different architectures in terms of convergence, with results depicted in Fig. 5. Particularly, the mean DSC value over the four structures on one of the validation folds is shown for each of the networks. It can be observed that, even though most of the networks achieve results which may be considered 'similar' up to some extent, the convergence behaviour is totally different. While three networks show similar convergence curves, i.e., UNet, DANet and DAF, PAN needs more iterations to converge, ultimately performing better than these networks after nearly 400 epochs. On the other hand, we found that Attention UNet and the proposed network presented the fastest convergence, achieving their best results at epochs 48 and 73, respectively.

4) Qualitative evaluation: To visualize the impact of the different attention modules, Fig. 6 displays the segmentation results of the different networks on several subjects. Even though the quantitative results reported in Table II show that several architectures achieve similar performance, the qualitative results reveal interesting findings. The first thing that we can observe is that UNet, the only network not integrating attention, typically over-segments certain organs and gets confused easily. For example, in the first and third rows it fails to properly segment the liver (in green) and the spleen (in blue), respectively, including many regions that do not belong to the target. Particularly in the third row it confuses the small bowel with the spleen, while the spleen is not even present in that slice. Integrating attention can overcome the limitations shown by UNet and improve the segmentation performance by focusing the attention on relevant areas.


Fig. 4: These plots depict the distributions of the different evaluation metrics for the four segmented organs: (a) Dice similarity coefficient (%), (b) volume similarity (%) and (c) average surface distance (voxels). Bluish colors represent the results obtained by other state-of-the-art networks, whereas the results obtained by our proposed models are displayed with the brownish box plots.

Fig. 5: Evolution of the mean validation DSC over time.

This can be observed in the results obtained by the other networks, which, up to some extent, reduce the false positives in the prediction. Particularly, the PAN model (with ResNet101 as backbone) seems to avoid misclassifications in these ambiguous regions. Nevertheless, it produces smoother segmentations, which results in the loss of fine-grained details. This effect can be observed, for example, in the liver segmentation results (in green) in the first two rows. An interesting result is the segmentation shown in the last row. In this particular case, all the models except the proposed network get confused when segmenting the left kidney. While the DANet and PAN models confuse the left kidney with the right one, DAF is not able to detect any relevant region related to the kidneys in that area. In addition, both the UNet and Attention UNet models generate segmentations of the left kidney that contain three organs, i.e., the left and right kidneys and the spleen, which is anatomically not plausible. Unlike all these models, the proposed architecture does not get distracted by ambiguous regions, and some previously misclassified structures are now correctly classified.

These visual results indicate that our approach can successfully recover finer segmentation details, while avoiding getting distracted by ambiguous regions. The selective integration of spatial information and of interdependencies among channel maps, followed by a guided attention module, helps to capture context information. This demonstrates that the proposed multi-scale guided attention model can efficiently encode complementary information to accurately segment medical images.

V. CONCLUSION

In this work, we introduced a novel attention architecture for the task of medical image segmentation. This model incorporates a multi-scale strategy to combine semantic information at different levels and self-attention modules to progressively aggregate relevant contextual features. Last, a guided refinement module filters noisy regions and helps the network to focus on relevant class-specific regions of the image. To validate our approach, we conducted experiments on MRI scans (T1-DUAL) from the Combined Healthy Abdominal Organ Segmentation (CHAOS) Challenge. We provided extensive experiments to evaluate the impact of the individual components of the proposed architecture. Besides, we compared our model to existing approaches that integrate attention, which have been recently proposed for natural scene [19], [20] and medical image [5], [23], [25] segmentation. Experimental results showed that the proposed model outperformed all previous approaches both quantitatively and qualitatively, which may be explained by its enhanced ability to model rich contextual dependencies over local features. This demonstrates the efficiency of our approach to provide precise and reliable automatic segmentations of medical images.

ACKNOWLEDGMENTS

We wish to thank NVIDIA for its kind donation of the Titan V GPU used in this work.

REFERENCES

[1] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. Van Ginneken, and C. I. Sanchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.

[2] J. Dolz, K. Gopinath, J. Yuan, H. Lombaert, C. Desrosiers, and I. Ben Ayed, “HyperDense-Net: A hyper-densely connected CNN for multi-modal image segmentation,” IEEE Transactions on Medical Imaging, 2018.


Fig. 6: Results on several subjects from the CHAOS Challenge dataset (columns: Input Image, Ground Truth, UNet, DANet, PAN (ResNet34), PAN (ResNet101), DAF, Attention UNet, Proposed). The proposed multi-scale guided attention network achieves qualitatively better results than other state-of-the-art networks that also integrate attention modules.

[3] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?” IEEE Transactions on Medical Imaging, vol. 37, no. 11, pp. 2514–2525, 2018.

[4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[5] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[6] G. Lin, A. Milan, C. Shen, and I. Reid, “RefineNet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.

[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.

[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.

[9] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.

[10] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.

[11] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: Looking wider to see better,” arXiv preprint arXiv:1506.04579, 2015.

[12] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.

[13] A. Gupta, D. Agrawal, H. Chauhan, J. Dolz, and M. Pedersoli, “An attention model for group-level emotion recognition,” in Proceedings of the 2018 International Conference on Multimodal Interaction. ACM, 2018, pp. 611–615.

[14] Z. Huang, Z. Zhong, L. Sun, and Q. Huo, “Mask R-CNN with pyramid attention network for scene text detection,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 764–772.

[15] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.

[16] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, “Tell me where to look: Guided attention inference network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9215–9223.

[17] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3640–3649.

[18] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, “PSANet: Point-wise spatial attention network for scene parsing,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283.

[19] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[20] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” in BMVC, 2018.

[21] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 325–341.

[22] P. Zhang, W. Liu, H. Wang, Y. Lei, and H. Lu, “Deep gated attention networks for large-scale street-level scene segmentation,” Pattern Recognition, vol. 88, pp. 702–714, 2019.

[23] Y. Wang, Z. Deng, X. Hu, L. Zhu, X. Yang, X. Xu, P.-A. Heng, and D. Ni, “Deep attentional features for prostate segmentation in ultrasound,” in MICCAI, 2018.

[24] C. Li, Q. Tong, X. Liao, W. Si, Y. Sun, Q. Wang, and P.-A. Heng, “Attention based hierarchical aggregation network for 3D left atrial segmentation,” in International Workshop on Statistical Atlases and Computational Models of the Heart. Springer, 2018, pp. 255–264.

[25] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical Image Analysis, vol. 53, pp. 197–207, 2019.

[26] D. Nie, Y. Gao, L. Wang, and D. Shen, “ASDNet: Attention based semi-supervised deep networks for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 370–378.

[27] T. Heimann and H.-P. Meinzer, “Statistical shape models for 3D medical image segmentation: a review,” Medical Image Analysis, vol. 13, no. 4, pp. 543–563, 2009.

[28] J. Dolz, L. Massoptier, and M. Vermandel, “Segmentation algorithms of subcortical brain structures on MRI for radiotherapy and radiosurgery: a survey,” IRBM, vol. 36, no. 4, pp. 200–212, 2015.

[29] P. F. Christ, M. E. A. Elshaer, F. Ettlinger, S. Tatavarty, M. Bickel, P. Bilic, M. Rempfler, M. Armbruster, F. Hofmann, M. D'Anastasi et al., “Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016, pp. 415–423.

[30] T. Fechter, S. Adebahr, D. Baltas, I. Ben Ayed, C. Desrosiers, and J. Dolz, “Esophagus segmentation in CT via 3D fully convolutional neural network and random walk,” Medical Physics, vol. 44, no. 12, pp. 6341–6352, 2017.

[31] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes,” IEEE Transactions on Medical Imaging, vol. 37, no. 12, pp. 2663–2674, 2018.

[32] Y. Man, Y. Huang, J. F. X. Li, and F. Wu, “Deep Q learning driven CT pancreas segmentation with geometry-aware U-Net,” IEEE Transactions on Medical Imaging, 2019.

[33] J. Dolz, C. Desrosiers, and I. Ben Ayed, “3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study,” NeuroImage, vol. 170, pp. 456–470, 2018.

[34] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation,” Medical Image Analysis, vol. 36, pp. 61–78, 2017.

[35] A. Carass, J. L. Cuzzocreo, S. Han, C. R. Hernandez-Castillo, P. E. Rasser, M. Ganz, V. Beliveau et al., “Comparing fully automated state-of-the-art cerebellum parcellation from magnetic resonance images,” NeuroImage, 2018.

[36] J. Dolz, X. Xu, J. Rony, J. Yuan, Y. Liu, E. Granger, C. Desrosiers, X. Zhang, I. Ben Ayed, and H. Lu, “Multiregion segmentation of bladder cancer structures in MRI with progressive dilated convolutional networks,” Medical Physics, vol. 45, no. 12, pp. 5482–5493, 2018.

[37] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek, “Areas of attention for image captioning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1242–1250.

[38] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 21–29.

[39] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.

[40] H. Li, Y. Liu, W. Ouyang, and X. Wang, “Zoom out-and-in network with map attention decision for region proposal and object detection,” International Journal of Computer Vision, vol. 127, no. 3, pp. 225–238, 2019.

[41] A. P. Parikh, O. Tackstrom, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in EMNLP, 2016.

[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[43] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv preprint arXiv:1805.08318, 2018.

[44] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.

[45] Z. Ji, Y. Fu, J. Guo, Y. Pang, Z. M. Zhang et al., “Stacked semantics-guided attention model for fine-grained zero-shot learning,” in Advances in Neural Information Processing Systems, 2018, pp. 5995–6004.

[46] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters -- improve semantic segmentation by global convolutional network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4353–4361.

[47] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2010.

[48] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.

[49] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3376–3385.

[50] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.

[51] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.

[52] M. A. Selver, “Exploring brushlet based 3D textures in transfer function specification for direct volume rendering of abdominal organs,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 2, pp. 174–187, 2014.

[53] E. Selvi, M. A. Selver, A. E. Kavur, C. Guzelis, and O. Dicle, “Segmentation of abdominal organs from MR images using multi-level hierarchical classification,” Journal of the Faculty of Engineering and Architecture of Gazi University, vol. 30, no. 3, pp. 533–546, 2015.

[54] M. A. Selver, “Segmentation of abdominal organs from CT using a multi-level, hierarchical neural network strategy,” Computer Methods and Programs in Biomedicine, vol. 113, no. 3, pp. 830–852, 2014.

[55] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.

