
SRDA-Net: Super-Resolution Domain Adaptation Networks for Semantic Segmentation

Bin Pan, Zhenjie Tang, Enhai Liu, Xia Xu, Tianyang Shi, Zhenwei Shi

Abstract—Recently, Unsupervised Domain Adaptation was proposed to address the domain shift problem in the semantic segmentation task, but it may perform poorly when the source and target domains have different resolutions. In this work, we design a novel end-to-end semantic segmentation network, the Super-Resolution Domain Adaptation Network (SRDA-Net), which can simultaneously perform super-resolution and domain adaptation. This characteristic exactly meets the requirements of semantic segmentation for remote sensing images, which usually involve various resolutions. Generally, SRDA-Net includes three deep neural networks: a Super-Resolution and Segmentation (SRS) model focuses on recovering the high-resolution image and predicting the segmentation map; a Pixel-level Domain Classifier (PDC) tries to distinguish which domain each image comes from; and an Output-space Domain Classifier (ODC) discriminates which domain each pixel label distribution comes from. PDC and ODC are considered the discriminators, and SRS is treated as the generator. Through adversarial learning, SRS tries to align the source and target domains at the pixel level and in the output space. Experiments are conducted on two remote sensing datasets with different resolutions. SRDA-Net performs favorably against state-of-the-art methods in terms of accuracy and visual quality. Code and models are available at https://github.com/tangzhenjie/SRDA-Net.

Index Terms—domain adaptation, super resolution, semantic segmentation, remote sensing

I. INTRODUCTION

REMOTE sensing imagery semantic segmentation, which aims at assigning a semantic label to every pixel of an image, has enabled various high-level applications, such as urban planning, land-use surveying and environment monitoring [1]–[3]. Deep convolutional neural networks (CNNs) have recently shown impressive performance in the semantic segmentation task [4]–[7]. However, to guarantee the superior representation ability of CNNs, a large amount of manually labeled data is required for training, and manually annotating every pixel is time-consuming and labor-intensive.

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFC1405605, in part by the Natural Science Foundation of Hebei under Grant F2019202062, in part by the Science and Technology Program of Tianjin under Grant 18YFCZZC00060, and in part by the Science and Technology Program of Tianjin under Grant 18ZXZNGX00100. (Corresponding author: Bin Pan.)

Enhai Liu and Zhenjie Tang are with the School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China, and also with the Hebei Province Key Laboratory of Big Data Calculation, Tianjin 300401, China (e-mail: [email protected]; [email protected]).

Bin Pan is with the School of Statistics and Data Science, Nankai University, Tianjin 300071, China (e-mail: [email protected]).

Xia Xu is with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]).

Tianyang Shi and Zhenwei Shi are with the Image Processing Center, School of Astronautics, Beihang University, Beijing 100191, China (e-mail: [email protected]; [email protected]).


Unsupervised Domain Adaptation (UDA) is one of the most powerful techniques for handling the problem of insufficient labeling. UDA is the field of research that aims at learning a well-performing model for the target domain from source supervision only. Most UDA works for semantic segmentation seek to align the features of the source and target domains inside a deep network by making the features domain invariant [8], [9]. In recent years, many works have attempted to minimize domain shift at the pixel level by turning source domain images into target-like images with adversarial training [8], [10]. In addition, some researchers propose to reduce the spatial-structure domain discrepancies in the output space [11], [12].

However, UDA methods for natural scene images may not transfer directly to remote sensing images, because of the spatial resolution problem. Spatial resolution [13], [14] is one of the most important characteristics of remote sensing images. Unlike natural scene images, the sensors used to acquire remote sensing images usually differ significantly, which results in different spatial resolutions. Moreover, the notion of resolution in remote sensing images is not the same as in natural scenes. For example, a natural scene image may contain both large and small cars, but a car in a 4m-resolution remote sensing image can never be the same size as a car in a 1m-resolution image. On the other hand, if we only considered UDA for remote sensing images with the same resolution [15], [16], the amount of available data would be severely reduced. Therefore, we may conclude that UDA for remote sensing images should not only narrow the gap between the source and target domains, but also address the issue of different resolutions.

To the best of our knowledge, there are few UDA algorithms for remote sensing images that explicitly consider the resolution problem. Most UDA works deal with the resolution problem by simple interpolation [17] or by adjusting the parameters of a kernel function [18]. When the resolution gap between the source and target domains is not serious, some researchers [19], [20] ignore the resolution problem. For instance, Yan et al. [19] proposed a triplet adversarial domain adaptation method to learn a domain-invariant classifier in the output space via a novel domain discriminator, which ignored the resolution difference between the source and target domains. Instead of matching the distributions in the output space, Zhang et al. [17] proposed to eliminate the domain shift by aligning the distributions of the source and target data in the feature space, handling the resolution problem only by interpolation.



Fig. 1. An unsupervised domain adaptation approach for semantic segmentation in remote sensing images. Given a source domain (low-resolution remote sensing data) with labels and a target domain (high-resolution remote sensing data) without labels, our goal is to train a segmentation model to predict the labels of the target domain.

Liu et al. [18] also minimized the distance between the feature distributions of the source and target domains through metrics under different kernel functions, reducing the effect of the resolution problem by adjusting the kernel parameters. In short, existing UDA methods for remote sensing images have not explicitly studied the resolution problem.

Explicitly considering the resolution problem, in this paper we propose a novel end-to-end network that simultaneously conducts Super-Resolution and Domain Adaptation (SRDA-Net), which improves segmentation performance when transferring from low-resolution to high-resolution remote sensing data. Fig. 1 briefly depicts the problem setting: a source domain (low-resolution remote sensing images) with labels and a target domain (high-resolution remote sensing images) without labels. SRDA-Net is motivated by two recent lines of research: 1) adversarial-training-based UDA methods for semantic segmentation; 2) the observation that super-resolution and semantic segmentation can promote each other. In previous works, most UDA methods resort to adversarial training to reduce domain discrepancies. For instance, Zhang et al. [8] apply the adversarial loss to the lower layers of the segmentation network because the lower layers mainly capture the appearance information of the images. Tsai et al. [11] employ adversarial feature learning in the output space of the base segmentation model. Vu et al. [12] also reduce the discrepancies of feature distributions in the output space through adversarial entropy minimization. Moreover, recent research shows that super-resolution and semantic segmentation can boost each other. Some researchers show that super-resolution results can be improved by semantic priors, such as semantic segmentation

probability maps [21] or segmentation labels [22]. In the field of remote sensing, high-resolution images, which contain many details, are important for image segmentation [23]. For instance, Lei et al. [23] propose to combine image super-resolution with the segmentation network to obtain improvements on both super-resolution and segmentation tasks.

To be specific, SRDA-Net consists of three deep neural networks: a multi-task model for Super-Resolution and semantic Segmentation (SRS), a Pixel-level Domain Classifier (PDC) and an Output-space Domain Classifier (ODC). SRS integrates a super-resolution network and a segmentation network into one architecture. The former focuses on recovering a high-resolution image with the style of the target domain, and the latter aims to produce a segmentation result aligned in the output space [11]. PDC is fed with the high-resolution images from the super-resolution network and outputs their domain (source or target) for each pixel. ODC is fed with the predicted label distributions of the segmentation network and outputs the domain class for each pixel label distribution. Similar to generative adversarial networks (GANs) [24], the SRS model can be regarded as a generator, and the PDC/ODC models are treated as two discriminators. Through adversarial training, the SRS model can learn domain-invariant features at the pixel and output-space levels.

In summary, the major contributions of SRDA-Net are as follows:

• A new UDA method named SRDA-Net is proposed for semantic segmentation, adapting from low-resolution remote sensing images to high-resolution remote sensing images.

• Inspired by the mutual promotion of super-resolution and semantic segmentation, we construct a multi-task model composed of super-resolution and segmentation, which not only eliminates the resolution difference between the source and target domains, but also obtains improvements on both super-resolution and segmentation tasks.

• We design two domain classifiers, at the pixel level and in the output space, to align the source and target domains. By adversarial training, the domain gap can be effectively reduced.

II. RELATED WORKS

In this section, we briefly review the works most relevant to ours: semantic segmentation, single image super resolution, and unsupervised domain adaptation.

A. Semantic Segmentation

Semantic segmentation is the task of assigning each pixel of an image a semantic label, and it plays a vital role in many applications, including autonomous driving and urban planning. In 2014, the fully convolutional network [4] demonstrated impressive performance on pixel-wise tasks such as semantic segmentation. Since then, models based on fully convolutional networks have demonstrated significant improvement on several segmentation benchmarks [25]. Several model variants have been proposed to exploit contextual information for segmentation by adopting multi-scale inputs [5], [26] or employing probabilistic graphical models [1].


Fig. 2. Overview of our Super-Resolution Domain Adaptation Network (SRDA-Net). On the top, the asymmetric multi-task model is depicted, which consists of a Super-Resolution model and a Segmentation model (SRS). During the training phase, a source domain image and a downsampled target domain image are fed to the SRS model. The purple and red curved arrows respectively represent the input/output of the source and target domains, and the two-way arrow indicates the data flow involved in the training process. Source images take part in both super-resolution and segmentation training in a supervised manner, while target images only participate in super-resolution training in a supervised manner. On the bottom, the Pixel-level Domain Classifier (PDC) and Output-space Domain Classifier (ODC) are shown. The super-resolved images and the predicted label distributions from SRS flow to PDC and ODC, respectively. By adversarially training SRS and the two classifiers, the final SRS is obtained. During the testing stage, the downsampled test images are fed to the SRS to predict the segmentation maps.

For instance, Chen et al. [5] propose a dilated convolution operation to aggregate multi-scale contextual information. Ding et al. [26] introduce a two-stage multiscale training strategy to incorporate sufficient context information.

B. Single Image Super Resolution

Single image super resolution [27] aims to recover high-resolution images from the corresponding low-resolution ones, and has been widely used in many applications such as security and surveillance imaging, medical imaging and remote sensing image reconstruction. Conventional non-CNN methods mainly rely on domain and feature priors; for example, interpolation methods such as bicubic and Lanczos generate the HR pixels as weighted averages of neighboring LR pixels. CNN-based methods [28], [29], in contrast, treat super resolution as a mapping between the LR and HR spaces learned end to end, and have achieved great breakthroughs. For example, Arun et al. [29] design a 3-D convolutional neural network (CNN)-based super resolution architecture for hyperspectral images. Some researchers also use perceptual loss [30] and adversarial training [31] to improve the perceptual quality of super resolution.

C. Unsupervised Domain Adaptation

Since we are only concerned with visual semantic segmentation in this work, we limit our review of UDA to approaches that target this task as well. Many UDA approaches for segmentation [8], [9] employ adversarial training to minimize the cross-domain discrepancy in the feature space. Other works [11], [12] propose to align the predicted label distributions in the output space: Tsai et al. [11] perform the alignment on the prediction of the segmentation network, and Vu et al. [12] propose to do it via entropy minimization of the prediction probability. In contrast, pixel-level domain adaptation [10], [16], [32] uses generative networks to turn source domain images into target-like images. Li et al. [10] present a bidirectional learning system for semantic segmentation, a closed loop that alternately learns the segmentation adaptation model and the image translation model, gradually reducing the domain gap at the pixel level. Besides, a curriculum learning strategy is proposed in [33] that leverages information from global label distributions and local super-pixel distributions of the target domain.

III. APPROACH

This section describes the detailed methodology of the proposed unsupervised domain adaptation for semantic segmentation.


In order to reduce the domain gap (including the resolution difference), we integrate super-resolution into the segmentation model to eliminate the impact of different resolutions. Moreover, by adversarially optimizing SRS and the two domain classifiers (PDC and ODC), the domain gap at the pixel level and in the output space can be gradually reduced. Fig. 2 shows the entire framework.

Before illustrating the method in detail, it is necessary to define the cross-domain problem we face using mathematical notation. A source domain S from a low-resolution remote sensing dataset provides low-resolution images I_S with pixel-level annotations A_S; a target domain T from a high-resolution remote sensing dataset only provides high-resolution images I_T. Note that S and T share the same label space, where C denotes the number of categories. In a word, given I_S, A_S and I_T, our goal is to learn a segmentation model that predicts the pixel-wise categories of T.

According to the definition above, the purpose of this paper is to reduce the domain gap (including the resolution difference) between S and T. Below, we first describe the asymmetric multi-task (super-resolution and segmentation) model. Then, the adversarial domain adaptation (pixel-level and output-space) is presented in detail.

A. Multi-task Model: Super-Resolution and Segmentation

Over the past few years, methods based on convolutional neural networks have achieved significant progress in semantic segmentation. However, CNN-based methods may not generalize well to unseen images, especially when there is a domain gap between the training (source domain) and test (target domain) images. For remote sensing images, the resolution difference is an important domain gap, which seriously affects the generalization ability of the segmentation model. Thus, it is important to eliminate the resolution difference for cross-domain semantic segmentation in remote sensing.

Recently, some researchers [21], [23] have shown that super-resolution and semantic segmentation can boost each other. Although both are challenging tasks, they are closely related: super-resolution provides images with more details that may help to improve segmentation accuracy, while label maps in the segmentation dataset or semantic segmentation probability maps may help recover textures faithful to the semantic classes during the super-resolution process.

Based on the above discussion, we propose an asymmetric multi-task learning model, consisting of Super-Resolution and Segmentation models (SRS), to eliminate the resolution gap between the source and target domains. In order to make super-resolution and segmentation boost each other, we adopt two strategies: (1) we introduce a pyramid feature fusion structure between the two tasks; (2) for the generated high-resolution images of the source domain, we impose the cross-entropy segmentation loss to train the segmentation network. During the training phase, a source domain image and a downsampled target domain image are fed to the SRS network: the source domain images are involved in the entire training process, while the target images only participate in the super-resolution training process (shown in Fig. 2). At the testing stage, the downsampled test images are fed to the SRS network to obtain the pixel-wise score maps.

To be specific, we employ the Residual ASPP Module [34] as the shared feature extractor. For the super-resolution model, due to GPU memory limitations, we only use a few deconvolutions to recover the high-resolution images, without using pixelshuffle [35]. In order to transfer low-level features from the super-resolution stream to the segmentation stream, we introduce the pyramid feature fusion structure [36] between the two streams. Moreover, the super-resolution results of the source domain are also fed to the segmentation stream. Meanwhile, the segmentation stream also helps the super-resolution stream recover textures faithful to the semantic classes.
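As a concrete illustration, the following PyTorch sketch shows one plausible layout of the SRS generator under the description above. The Residual ASPP extractor [34] is replaced by a placeholder conv stack, the pyramid feature fusion [36] by a single shared feature path, and a 4x upscaling factor is assumed; none of these details are the authors' exact configuration.

```python
import torch.nn as nn

class SRSSketch(nn.Module):
    """Minimal sketch of the asymmetric multi-task SRS model.

    Assumptions (not from the paper): 4x upscaling, a plain conv stack
    standing in for the Residual ASPP extractor [34], and one shared
    feature path standing in for the pyramid feature fusion [36].
    """

    def __init__(self, num_classes: int, feat: int = 64):
        super().__init__()
        # Placeholder for the shared feature extractor E.
        self.extractor = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Super-resolution stream R: a few deconvolutions, no pixelshuffle.
        self.sr_up = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.sr_out = nn.Conv2d(feat, 3, 3, padding=1)
        # Segmentation stream S, fed with the fused upsampled features.
        self.seg_head = nn.Sequential(
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, num_classes, 1),
        )

    def forward(self, x):
        f = self.extractor(x)         # shared features E(x)
        f_hr = self.sr_up(f)          # 4x upsampled features
        sr = self.sr_out(f_hr)        # high-resolution image R(x)
        logits = self.seg_head(f_hr)  # high-resolution logits S(x)
        return sr, logits
```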

The proposed SRS model is trained with the following loss:

L_SRS = α L_seg + β (L_idT + L_idS)    (1)

L_seg = L_cel(S(I_S), ↑A_S) + L_cel(S(↓R(I_S)), ↑A_S)    (2)

L_idT = L_mse(R(↓I_T), I_T)    (3)

L_idS = L_per(R(I_S), ↑I_S) + 0.5 × L_fp    (4)

L_fp = L_L1(E(↓R(I_S)), E(I_S))    (5)

where L_cel is the 2D cross-entropy loss, the standard supervised pixel-wise classification objective; L_mse is the pixel-wise MSE loss, the most widely used optimization target for image super resolution; L_per is the perceptual loss [30]; and L_L1 is the L1 norm loss. The ↑ and ↓ symbols denote upsampling and downsampling operations, respectively. S, R and E denote the Segmentation model, the super-Resolution model and the shared feature Extractor, respectively. Note that, in order to easily superimpose the style of the target domain, we use L_per and L_fp (the fixpoint loss [37]) to train the super-resolution model on source domain images. α and β denote the weighting factors for semantic segmentation and super-resolution.
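A minimal sketch of Eqs. (1)-(5) in PyTorch follows, assuming the SRSSketch model above (so `model.extractor` stands for E), a frozen feature network `vgg_features` for the perceptual loss [30], already-upsampled integer labels `A_S_up`, and a 4x scale; the helper names are illustrative, not the released code.

```python
import torch.nn.functional as F

def srs_loss(model, vgg_features, I_S, A_S_up, I_T, alpha, beta, scale=4):
    """Sketch of the SRS objective, Eqs. (1)-(5)."""
    def down(x):  # bicubic downsampling stands in for the ↓ operator
        return F.interpolate(x, scale_factor=1.0 / scale, mode="bicubic",
                             align_corners=False)

    # Source branch: both SR and segmentation are supervised.
    sr_S, logits_S = model(I_S)
    _, logits_S_cyc = model(down(sr_S))                     # S(↓R(I_S))
    L_seg = (F.cross_entropy(logits_S, A_S_up)
             + F.cross_entropy(logits_S_cyc, A_S_up))       # Eq. (2)

    # Target branch: only the SR identity loss is supervised.  Eq. (3)
    sr_T, _ = model(down(I_T))
    L_idT = F.mse_loss(sr_T, I_T)

    # Source SR: perceptual loss vs. ↑I_S plus fixpoint loss. Eqs. (4)-(5)
    I_S_up = F.interpolate(I_S, scale_factor=scale, mode="bicubic",
                           align_corners=False)
    L_per = F.mse_loss(vgg_features(sr_S), vgg_features(I_S_up))
    L_fp = F.l1_loss(model.extractor(down(sr_S)), model.extractor(I_S))
    L_idS = L_per + 0.5 * L_fp                               # Eq. (4)

    return alpha * L_seg + beta * (L_idT + L_idS)            # Eq. (1)
```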

B. Adversarial Domain Adaptation

Although the proposed asymmetric multi-task model eliminates the resolution difference between the source and target domains, other domain gaps (such as texture and color) remain. Due to the influence of sensors, geographic locations, imaging conditions and other factors, these differences are inherent in remote sensing imagery. In traditional supervised deep learning, the model only learns discriminative features from the given annotated remote sensing data. Therefore, the question is how to learn domain-invariant features for remote sensing imagery.

Adversarial learning provides a good framework to deal with this problem; it consists of a generator network and a discriminator network. The main idea is to train the discriminator to predict the domain of the data, while the segmentation network tries to fool it while also performing the segmentation task on the source. By alternately training the two networks, the feature domain gap can be gradually reduced.


In this paper, we design the Pixel-level and Output-space Domain Classifiers (PDC and ODC) as discriminators, and SRS is treated as the generator. Through adversarial training, SRS learns domain-invariant features that fool the PDC and ODC.

C. Pixel-Level Adaptation

Since the proposed SRS model only eliminates the resolution difference between the source and target domains and does not reduce the gap in appearance, PDC is built to distinguish the domain category of each pixel. It receives the high-resolution images of the source or target domain. To be specific, we apply the PatchGAN [38] as PDC. The bottom-left subfigure in Fig. 2 shows the network architecture of PDC.
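Since the paper names PatchGAN [38] as the PDC but does not list its layers, the following is a standard PatchGAN sketch; the channel widths, normalization and leaky-ReLU slope follow the common recipe and are assumptions. Note that a standard PatchGAN emits a spatially downsampled score map (one score per receptive-field patch), whereas the equations below write the output as an H×W×1 map.

```python
import torch.nn as nn

def patchgan_pdc(in_ch: int = 3, base: int = 64) -> nn.Sequential:
    """A standard PatchGAN discriminator [38] used as a PDC sketch.
    It outputs a spatial map of real/fake scores, one per image patch.
    """
    def block(cin, cout, stride):
        return [nn.Conv2d(cin, cout, 4, stride=stride, padding=1),
                nn.InstanceNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True)]

    layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    layers += block(base, base * 2, 2)
    layers += block(base * 2, base * 4, 2)
    layers += block(base * 4, base * 8, 1)
    layers += [nn.Conv2d(base * 8, 1, 4, stride=1, padding=1)]
    return nn.Sequential(*layers)
```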

The PDC loss is computed as follows:

I_RS = R(I_S) ∈ R^(H×W×3)    (6)

I_fake = D_pdc(I_RS) ∈ R^(H×W×1)    (7)

I_true = D_pdc(I_T) ∈ R^(H×W×1)    (8)

L_PDC = E_{I_fake∼p_data(I_fake)}[(I_fake − 1)²] + E_{I_true∼p_data(I_true)}[(I_true)²]    (9)

where D_pdc is the PDC model, and H and W denote the height and width of the high-resolution target domain image.

At the same time, the inverse PDC loss is defined as:

L_PDCinv = E_{I_true∼p_data(I_true)}[(I_true − 1)²] + E_{I_fake∼p_data(I_fake)}[(I_fake)²]    (10)

Finally, the adversarial objective functions are written as follows:

min_{θ_SRS} L_SRS + L_PDC    (11)

min_{θ_PDC} L_PDCinv    (12)

where θ_SRS and θ_PDC denote the network parameters of SRS and PDC, respectively. During the training phase, the parameters of the two models are updated in turn using Eq. (11) and Eq. (12).
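The least-squares form of Eqs. (9)-(12) can be written as a small helper; a sketch assuming D_pdc returns a per-pixel (or per-patch) score tensor, not the repository implementation.

```python
def pdc_losses(D_pdc, sr_source, real_target):
    """Least-squares adversarial losses of Eqs. (9)-(10).

    L_pdc is added to the SRS objective (Eq. (11)); L_pdc_inv is
    minimized by the discriminator (Eq. (12)). For the discriminator
    update, pass sr_source.detach() so gradients stop at D_pdc.
    """
    fake = D_pdc(sr_source)    # I_fake = D_pdc(R(I_S))
    real = D_pdc(real_target)  # I_true = D_pdc(I_T)

    L_pdc = ((fake - 1) ** 2).mean() + (real ** 2).mean()      # Eq. (9)
    L_pdc_inv = ((real - 1) ** 2).mean() + (fake ** 2).mean()  # Eq. (10)
    return L_pdc, L_pdc_inv
```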

D. Output-Space Adaptation

Unlike the image classification task, which relies on global features, semantic segmentation requires high-dimensional features that encode complex representations. Therefore, adaptation only in the pixel space may not be enough for semantic segmentation. On the other hand, although segmentation outputs lie in a low-dimensional space, they contain rich information, e.g., scene layout and context. Moreover, in remote sensing scenes, whether images come from the source or the target domain, their segmentations should share strong similarities, both spatially and locally. For example, a rectangular road region may be partly covered by cars and pedestrians, and green plants often grow around buildings. Thus, we adapt the low-dimensional softmax outputs of the segmentation predictions via an adversarial learning scheme.

To be specific, we design ODC to distinguish the source domain of each pixel label distribution. It receives the segmentation softmax output P = S(I) ∈ R^(H×W×C), where C is the number of categories. We forward P to ODC and use a cross-entropy loss L_ODC over the two classes (i.e., source and target). The ODC loss can be written as:

P_S = S(I_S) ∈ R^(H×W×C)    (13)

P_T = S(↓I_T) ∈ R^(H×W×C)    (14)

P_true = D_odc(P_S) ∈ R^(H×W×1)    (15)

P_fake = D_odc(P_T) ∈ R^(H×W×1)    (16)

L_ODC = −∑_{h,w} (1 − z) log(P_fake) + z log(P_true)    (17)

where D_odc denotes the ODC model. The inverse ODC loss is defined as:

L_ODCinv = −∑_{h,w} (1 − z) log(P_true) + z log(P_fake)    (18)

In the end, the adversarial objective functions are computed as follows:

min_{θ_SRS} L_SRS + L_ODC    (19)

min_{θ_ODC} L_ODCinv    (20)

where θ_SRS and θ_ODC denote the parameters of the SRS and ODC networks, respectively. During the training stage, the parameters of the two networks are updated in turn by minimizing Eq. (19) and Eq. (20).
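A sketch of the output-space losses follows. Eqs. (17)-(18) leave the domain indicator z implicit; here we follow the convention of [11] (z = 1 for source) and realize the cross-entropy with `binary_cross_entropy_with_logits`, so D_odc is assumed to return raw logits.

```python
import torch
import torch.nn.functional as F

def odc_losses(D_odc, P_S, P_T):
    """Output-space adversarial losses in the spirit of Eqs. (17)-(20)."""
    # Generator-side term (used in Eq. (19)): push the target-domain
    # label distributions to be classified as source.
    logits_tgt = D_odc(P_T)
    L_odc = F.binary_cross_entropy_with_logits(
        logits_tgt, torch.ones_like(logits_tgt))

    # Discriminator-side inverse term (Eq. (20)): classify each domain
    # correctly; detached inputs keep gradients inside the ODC.
    logits_src_d = D_odc(P_S.detach())
    logits_tgt_d = D_odc(P_T.detach())
    L_odc_inv = (
        F.binary_cross_entropy_with_logits(
            logits_src_d, torch.ones_like(logits_src_d))
        + F.binary_cross_entropy_with_logits(
            logits_tgt_d, torch.zeros_like(logits_tgt_d)))
    return L_odc, L_odc_inv
```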

E. Final Objective Function

In order to give the network good initial parameters, we first use the following loss functions for pre-training:

min_{θ_R} β (L_idT + L_idS) + L_PDC    (21)

min_{θ_PDC} L_PDCinv    (22)

where β denotes the weighting factor for super-resolution and θ_R denotes the parameters of the R network. During the training stage, the R and PDC networks are optimized in turn using Eq. (21) and Eq. (22).
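As a sketch, the warm-up of Eqs. (21)-(22) can reuse the hypothetical srs_loss and pdc_losses helpers from the earlier sketches: setting α = 0 leaves exactly β(L_idT + L_idS). Here opt_r is assumed to cover the super-resolution parameters θ_R and opt_d the PDC parameters.

```python
def pretrain_step(srs, d_pdc, vgg_features, opt_r, opt_d,
                  I_S, A_S_up, I_T, beta=10.0):
    """One warm-up iteration for Eqs. (21)-(22), reusing the earlier
    hypothetical helpers srs_loss and pdc_losses (sketch only).
    """
    # Eq. (21): SR identity losses plus L_PDC; alpha = 0 drops L_seg.
    sr_S, _ = srs(I_S)
    L_r = srs_loss(srs, vgg_features, I_S, A_S_up, I_T, alpha=0.0, beta=beta)
    L_pdc, _ = pdc_losses(d_pdc, sr_S, I_T)
    opt_r.zero_grad()
    (L_r + L_pdc).backward()
    opt_r.step()

    # Eq. (22): update PDC on the detached super-resolved image.
    _, L_pdc_inv = pdc_losses(d_pdc, sr_S.detach(), I_T)
    opt_d.zero_grad()
    L_pdc_inv.backward()
    opt_d.step()
```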

For training the full model (including SRS, PDC and ODC), our full objective functions can be formulated as:

min_{θ_SRS} L_SRS + L_PDC + L_ODC    (23)

min_{θ_D} L_PDCinv + L_ODCinv    (24)

where θ_D denotes the network parameters of PDC and ODC. During the training phase, the parameters of SRS, PDC and ODC are optimized in turn by minimizing Eq. (23) and Eq. (24). Algorithm 1 illustrates the training procedure of the proposed SRDA-Net.


Algorithm 1 The proposed SRDA-Net.
Input: source domain low-resolution images I_S, target domain high-resolution images I_T, source domain low-resolution labels A_S, weighting factors for semantic segmentation and super-resolution α, β = 10.
Output: high-resolution source domain images with the style of the target domain I_RS; predicted labels of the target domain A_T.
1: Repeat
2:   % Super-resolve images with the R model
     I_RS, I_RT = R(I_S, ↓I_T) ∈ R^(H×W×3)
3:   % Segmentation softmax outputs from the S model
     P_S, P_T = S(I_S, ↓I_T) ∈ R^(H×W×C)
4:   % Predict labels of the target domain images
     A_T = max(P_T) ∈ R^(H×W×1)
5:   % Distinguish the pixels of the super-resolved images
     I_fake, I_true = D_pdc(I_RS, I_T) ∈ R^(H×W×1)
6:   % Distinguish the pixel distributions of the softmax outputs
     P_true, P_fake = D_odc(P_S, P_T) ∈ R^(H×W×1)
7:   % Adversarial training
     R and S are optimized according to Eq. (23); D_pdc and D_odc are updated by minimizing the inverse losses in Eq. (24).
8: Until convergence
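Algorithm 1 can be condensed into the following training-loop sketch, again reusing the hypothetical helpers (srs_loss, pdc_losses, odc_losses) from the sketches above; the Adam settings mirror Section IV-B, while the loaders and batch handling are illustrative.

```python
import itertools
import torch
import torch.nn.functional as F

def train_srda(srs, d_pdc, d_odc, vgg_features, loader_S, loader_T,
               alpha=2.5, beta=10.0, lr=1.5e-4, device="cuda"):
    """Alternating optimization sketch for Algorithm 1 / Eqs. (23)-(24)."""
    def down(x):
        return F.interpolate(x, scale_factor=0.25, mode="bicubic",
                             align_corners=False)

    opt_g = torch.optim.Adam(srs.parameters(), lr=lr, betas=(0.9, 0.999))
    opt_d = torch.optim.Adam(
        itertools.chain(d_pdc.parameters(), d_odc.parameters()),
        lr=lr, betas=(0.9, 0.999))

    for (I_S, A_S_up), I_T in zip(loader_S, loader_T):
        I_S, A_S_up, I_T = I_S.to(device), A_S_up.to(device), I_T.to(device)

        # Steps 2-3: forward passes shared by both updates.
        sr_S, logits_S = srs(I_S)
        _, logits_T = srs(down(I_T))
        P_S, P_T = logits_S.softmax(1), logits_T.softmax(1)

        # Step 7, generator: minimize Eq. (23) = L_SRS + L_PDC + L_ODC.
        L_srs = srs_loss(srs, vgg_features, I_S, A_S_up, I_T, alpha, beta)
        L_pdc, _ = pdc_losses(d_pdc, sr_S, I_T)
        L_odc, _ = odc_losses(d_odc, P_S, P_T)
        opt_g.zero_grad()
        (L_srs + L_pdc + L_odc).backward()
        opt_g.step()

        # Step 7, discriminators: minimize Eq. (24) = L_PDCinv + L_ODCinv.
        _, L_pdc_inv = pdc_losses(d_pdc, sr_S.detach(), I_T)
        _, L_odc_inv = odc_losses(d_odc, P_S.detach(), P_T.detach())
        opt_d.zero_grad()
        (L_pdc_inv + L_odc_inv).backward()
        opt_d.step()
```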

IV. EXPERIMENTS

In this section, we first report the experimental details. Then, extensive experimental results are presented to demonstrate the effectiveness of the proposed method.

A. Datasets

1) Mass-Inria: We use the following two datasets for single-category UDA semantic segmentation.

• The Massachusetts Buildings Dataset [39] consists of 151 aerial images of the Boston area at 1m spatial resolution. The ground truth provides two semantic classes: building and non-building. The dataset is divided into a training set of 137 images, a test set of 10 images and a validation set of 4 images. We use the training set of this dataset as the source domain.

• The Inria Aerial Image Labeling Dataset [25] comprises 360 ortho-rectified aerial RGB images at 0.3m spatial resolution. The ground truth also provides two semantic classes. We split the training set (images 1 to 5 of each location for validation, 6 to 36 for training) and use its training portion as the target domain. We finally validate the results of the algorithm on the validation portion of this dataset.

2) Vaih-Pots: We use the following two datasets for multi-category UDA semantic segmentation.

• The ISPRS Vaihingen 2D Semantic Labeling Challenge dataset contains 33 images at 9 cm spatial resolution, taken over the city of Vaihingen (Germany). There are 6 labeled categories: impervious surface, building, low vegetation, tree, car and clutter/background. We consider this dataset as the source domain.

• The ISPRS Potsdam 2D Semantic Labeling Challenge dataset comprises 38 ortho-rectified aerial IRRGB images (6000 × 6000 px) at 5 cm spatial resolution, taken over the city of Potsdam (Germany). The ground truth is provided for 24 tiles, labeled in the same way as the Vaihingen dataset. We randomly choose 12 images as the training set and the other 12 images as the test set.

Note that the resolution gap of Mass-Inria (around 3.333×) is greater than that of Vaih-Pots (around 2.0×).

B. Evaluation and Experimental Setup

1) Evaluation: In the semantic segmentation field, the intersection-over-union (IoU) is adopted as the main evaluation metric. It is defined as:

IoU(P_m, P_gt) = |P_m ∩ P_gt| / |P_m ∪ P_gt|    (25)

where P_m is the prediction and P_gt is the ground truth. The mean IoU (mIoU) over all classes is used to evaluate model performance.
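For reference, Eq. (25) with the per-class mean reduces to a few lines of NumPy; a sketch over integer label maps, where classes absent from both maps are skipped via NaN.

```python
import numpy as np

def iou_scores(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Per-class IoU (Eq. (25)) and mIoU from integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else float("nan"))
    return ious, float(np.nanmean(ious))
```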

2) Experimental Setup: Network architectures: In SRS, we choose the Residual ASPP Module [34] as the shared feature extractor to capture contextual information. For the super-resolution stream, due to GPU memory limitations, we only use a few deconvolutions to recover the high-resolution images. As for the two discriminators, we apply the PatchGAN [38] classifier as the PDC network, and the ODC network is similar to [11]: it consists of 5 convolution layers with 4 × 4 kernels and a stride of 2, with channel numbers 64, 128, 256, 512 and 1, respectively.
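The ODC specification above is concrete enough to write down directly; only the leaky-ReLU slope (0.2, taken from [11]) and the padding are assumptions in this sketch.

```python
import torch.nn as nn

def odc_network(num_classes: int) -> nn.Sequential:
    """ODC: five 4x4 convolutions with stride 2 and channel widths
    64, 128, 256, 512, 1, following the description above and [11].
    """
    widths = [num_classes, 64, 128, 256, 512]
    layers = []
    for cin, cout in zip(widths[:-1], widths[1:]):
        layers += [nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
    layers.append(nn.Conv2d(512, 1, kernel_size=4, stride=2, padding=1))
    return nn.Sequential(*layers)
```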

Training details: During training, Adam optimization is applied with a momentum of 0.9. For the Mass-to-Inria experiments, we fix α to 2.5 and β to 10. Due to the different resolutions, the Mass images and labels are cropped to 114 × 114 pixels and the labels are then interpolated to 380 × 380 pixels; the Inria images are cropped to 380 × 380 pixels and resized to 114 × 114 pixels (see the preprocessing sketch after the list below). During testing, Inria images are cropped to 625 × 625 patches without overlap and resized to 188 × 188. In the Vaih-to-Pots experiments, α = 5 and β = 10; during training, the low-resolution images are cropped to 160 × 160 pixels and the high-resolution images to 320 × 320 pixels. During testing, Pots images are cropped to 500 × 500 pixels without overlap and resized to 250 × 250 pixels. In the actual training process, we first pre-train the model with a learning rate of 2 × 10⁻⁴; the full framework is then trained with a learning rate of 1.5 × 10⁻⁴.

Our stepwise experiments:

• SRS: SRS is trained directly, without domain adaptation from the source domain to the target domain.

• SRS + PDC: Based on the SRS model, PDC is added to the training process via adversarial learning.

• SRS + ODC: Based on the SRS model, ODC is added to the training process via adversarial learning.

• SRDA-Net (SRS + PDC + ODC): the proposed SRDA-Net model.
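As referenced above, the Mass-to-Inria size handling can be sketched as follows; the tensor layouts, the bicubic/nearest choices and the helper name are assumptions, with inputs already cropped to the stated sizes.

```python
import torch.nn.functional as F

def prepare_mass_inria(mass_img, mass_lbl, inria_img):
    """Size handling for Mass-to-Inria training (Section IV-B sketch).

    mass_img: 3x114x114 source crop; mass_lbl: 114x114 integer labels;
    inria_img: 3x380x380 target crop.
    """
    # ↑A_S: interpolate labels to 380x380; nearest keeps them integral.
    lbl_up = F.interpolate(mass_lbl[None, None].float(), size=(380, 380),
                           mode="nearest")[0, 0].long()
    # ↓I_T: resize the 380x380 target crop to 114x114 as the SRS input.
    inria_down = F.interpolate(inria_img[None], size=(114, 114),
                               mode="bicubic", align_corners=False)[0]
    return mass_img, lbl_up, inria_down
```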


TABLE I
DOMAIN ADAPTATION FROM MASS TO THE INRIA VAL DATASET: COMPARISON OF STATE-OF-THE-ART METHODS AND OURS

Methods            | BaseNet           | Source domain | Target domain | IoU (%)
-------------------|-------------------|---------------|---------------|--------
NoAdapt [11]       | ResNet-101 [40]   | Mass          | Inria         | 32.9
AdaptSegNet [11]   | ResNet-101 [40]   | Mass          | ↓ Inria       | 35.0
AdaptSegNet [11]   | ResNet-101 [40]   | ↑ Mass        | Inria         | 48.5
NoAdapt [8]        | ResNet-101 [40]   | Mass          | Inria         | 32.9
CycleGan-FCAN [8]  | ResNet-101 [40]   | Mass          | ↓ Inria       | 41.8
CycleGan-FCAN [8]  | ResNet-101 [40]   | ↑ Mass        | Inria         | 49.7
SRS (NoAdapt)      | ResidualASPP [34] | Mass          | Inria         | 36.7
SRS + PDC          | ResidualASPP [34] | Mass          | Inria         | 46.0
SRS + ODC          | ResidualASPP [34] | Mass          | Inria         | 39.4
Full (SRDA-Net)    | ResidualASPP [34] | Mass          | Inria         | 52.8

Fig. 3. Example results on the Inria val dataset (source domain: Massachusetts Buildings). Columns: (a) input image, (b) ground truth, (c) SRS, (d) SRS + PDC, (e) SRS + ODC, (f) SRDA-Net, (g) AdaptSegNet (48.5), (h) CycleGan-FCAN (49.7).

Other comparison experiments:

• AdaptSegNet [11]: This work employs adversarial feature learning in the output space of the base segmentation model. Instead of having only one discriminator over the feature layer, Tsai et al. propose to install another discriminator on one of the intermediate layers as well.

• CycleGan-FCAN: FCAN [8] is a two-stage method. AAN first adapts source-domain images to appear as if drawn from the style of the target domain, then RAN attempts to learn domain-invariant representations. To better adapt the source images to the target style, we replace the AAN in FCAN with CycleGan [41].

In the experiments, their no-adaptation and final results are reported for comparison with our stepwise experiments.

C. Mass → Inria

Table I summarizes the results of several methods for the shift from Mass to Inria, including AdaptSegNet [11], CycleGan-FCAN [8] and our stepwise experiments: SRS, SRS + PDC, SRS + ODC and SRDA-Net. The bold values denote the best scores in each column.

From the results, we can see that our proposed method (SRDA-Net) achieves the best result: an IoU of 52.8%. Under the same training strategy, the result of SRDA-Net (52.8%) outperforms both AdaptSegNet (best result 48.5%) and CycleGan-FCAN (best result 49.7%). As for the three methods without adaptation, SRS improves the IoU significantly (from 32.9% to 36.7%, an increase of 3.8%), which shows the effectiveness of combining image super-resolution with the segmentation network.

Further, the adaptation results of the three methods outperform their corresponding NoAdapt results. Moreover, in order to explore the effect of the resolution difference on domain adaptation, we run each comparison method under two training data settings (source: Mass, target: downsampled Inria; source: upsampled Mass, target: Inria).


TABLE II
DOMAIN ADAPTATION FROM VAIH TO THE POTS VAL DATASET: COMPARISON OF STATE-OF-THE-ART METHODS AND OURS

Methods (%)        | BaseNet           | Source | Target | Impervious | Building | Vegetation | Tree | Car  | Clutter | mIoU
-------------------|-------------------|--------|--------|------------|----------|------------|------|------|---------|-----
NoAdapt [11]       | ResNet-101 [40]   | Vaih   | Pots   | 51.8       | 45.5     | 46.2       | 11.8 | 35.3 | 18.5    | 34.9
AdaptSegNet [11]   | ResNet-101 [40]   | Vaih   | ↓ Pots | 59.4       | 54.2     | 47.0       | 26.3 | 52.2 | 32.2    | 45.2
AdaptSegNet [11]   | ResNet-101 [40]   | ↑ Vaih | Pots   | 55.1       | 55.6     | 43.0       | 31.5 | 60.6 | 1.6     | 41.2
NoAdapt [8]        | ResNet-101 [40]   | Vaih   | Pots   | 51.8       | 45.5     | 46.2       | 11.8 | 35.3 | 18.5    | 34.9
CycleGan-FCAN [8]  | ResNet-101 [40]   | Vaih   | ↓ Pots | 50.1       | 42.5     | 33.1       | 31.6 | 44.1 | 22.6    | 37.3
CycleGan-FCAN [8]  | ResNet-101 [40]   | ↑ Vaih | Pots   | 47.9       | 51.2     | 43.0       | 41.7 | 61.1 | 23.8    | 44.8
SRS (NoAdapt)      | ResidualASPP [34] | Vaih   | Pots   | 26.5       | 32.0     | 35.2       | 17.3 | 32.0 | 17.5    | 26.7
SRS + PDC          | ResidualASPP [34] | Vaih   | Pots   | 58.3       | 51.1     | 51.8       | 27.9 | 62.5 | 20.5    | 45.4
SRS + ODC          | ResidualASPP [34] | Vaih   | Pots   | 51.2       | 21.7     | 17.9       | 12.3 | 54.2 | 13.0    | 28.4
Full (SRDA-Net)    | ResidualASPP [34] | Vaih   | Pots   | 60.2       | 61.0     | 51.8       | 36.8 | 63.4 | 18.3    | 48.6

Fig. 4. Example results on the Pots val dataset (source domain: ISPRS Vaihingen 2D semantic dataset). Columns: (i) input image, (j) ground truth, (k) SRS, (l) SRS + PDC, (m) SRS + ODC, (n) SRDA-Net, (o) AdaptSegNet (45.2), (p) CycleGan-FCAN (44.8).

From the results, we find that the setting (source: upsampled Mass, target: Inria) achieves better results, which confirms that when the resolution difference between the source and target domains is large, the gain from eliminating the resolution difference outweighs the error introduced by interpolation. In comparison, our method does not need to make this choice and obtains better results.

In order to explore the semantic segmentation performance further, Fig. 3 shows the visualization results of our step-by-step experiments and the AdaptSegNet/CycleGan-FCAN methods. The images in the first column are selected from the Inria val dataset. The second column shows the ground truth, and the remaining columns illustrate the predicted results of SRS, SRS + PDC, SRS + ODC, SRDA-Net, AdaptSegNet (48.5%) and CycleGan-FCAN (49.7%). On the whole, after adding PDC or ODC, some segmentation mistakes are removed effectively. According to the results of SRS + PDC and SRS + ODC, PDC plays a more important role in learning domain-invariant features than ODC (an improvement of 9.3% versus 2.7%). When the domain gap is reduced by integrating both PDC and ODC into SRS, an even better segmentation result is obtained. Moreover, the visual segmentation results of SRDA-Net outperform the best results of AdaptSegNet/CycleGan-FCAN.

D. Vaih → Pots

The results of AdaptSegNet [11], CycleGan-FCAN [8] and our stepwise experiments, adapting from Vaih to Pots, are listed in Table II. The bold fonts represent the best scores in the corresponding columns.


Fig. 5. Qualitative super-resolution results with style transfer from CycleGan and SRDA-Net (upsampled source domain images → target domain images).

We observe that the proposed method obtains the best performance (48.6%). Compared with the previous best method, AdaptSegNet (45.2%), SRDA-Net contributes a 3.4% mIoU improvement. Among the three no-adaptation methods, the mIoU of our method is slightly lower; this is because SRDA-Net has fewer than half the parameters of ResNet-101, which limits the learning capacity of the network. According to the mIoUs of SRS + PDC and SRS + ODC, PDC (+18.7%) is more effective at learning domain-invariant features than ODC (+1.7%).

From the comparison experiments, we find that when the resolution gap between the source and target domains is relatively small, it is difficult to determine whether upsampling the source domain or downsampling the target domain yields better results. Our proposed method (SRDA-Net), however, does not need to make this choice.

To illustrate the effect of our algorithm, Fig. 4 shows three typical labeling results. From the visual results, we can see that the segmentation results become progressively more refined across our step-by-step experiments, with SRDA-Net producing the finest segmentations.

E. SRDA-Net vs. CycleGan

Fig. 5 shows the qualitative super-resolution results of source domain images with style transfer from CycleGan and SRDA-Net. Although CycleGan superimposes the style of the target domain, it generates monotonous and unnatural textures, like the buildings in Fig. 5. Moreover, some objects in the CycleGan results get distorted, like the cars in Fig. 5. The reason is that the upsampled source domain images are blurry, which loses information. SRDA-Net employs semantic category priors to help capture the characteristics of each category, leading to more natural and realistic textures. Furthermore, through adversarial training, the super-resolved images of SRDA-Net simultaneously achieve style conversion.

V. CONCLUSION

In this paper, we propose a novel UDA framework named SRDA-Net to explicitly address adaptive semantic segmentation across different resolutions. To be specific, a multi-task model for super-resolution and semantic segmentation, named SRS, is built, which eliminates the resolution difference between the source and target domains. In addition, pixel-level and output-space domain classifiers are designed to guide the SRS model to learn domain-invariant features through adversarial learning, which effectively reduces the domain gap. To prove the effectiveness of our framework, we construct two dataset pairs whose source and target domains have different resolutions: Mass-Inria and Vaih-Pots. Extensive experiments demonstrate the


effectiveness of SRDA-Net when domain adaptation involves a resolution difference.

REFERENCES

[1] C. Zheng, Y. Zhang, and L. Wang, "Semantic segmentation of remote sensing imagery using an object-based Markov random field model with auxiliary label fields," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 3015–3028, 2017.

[2] B. Pan, Z. Shi, X. Xu, T. Shi, N. Zhang, and X. Zhu, "CoinNet: Copy initialization network for multispectral imagery semantic segmentation," IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 5, pp. 816–820, 2019.

[3] L. Mou, Y. Hua, and X. X. Zhu, "Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–13, 2020.

[4] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.

[6] Q. Wang, J. Gao, and X. Li, "Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes," IEEE Transactions on Image Processing, vol. 28, no. 9, pp. 4376–4386, 2019.

[7] B. Pan, X. Xu, Z. Shi, N. Zhang, H. Luo, and X. Lan, "DSSNet: A simple dilated semantic segmentation network for hyperspectral imagery classification," IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2020.

[8] Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei, "Fully convolutional adaptation networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6810–6818.

[9] Z. Wu, X. Wang, J. E. Gonzalez, T. Goldstein, and L. S. Davis, "ACE: Adapting to changing environments for semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2121–2130.

[10] Y. Li, L. Yuan, and N. Vasconcelos, "Bidirectional learning for domain adaptation of semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6936–6945.

[11] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, "Learning to adapt structured output space for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7472–7481.

[12] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Perez, "ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2517–2526.

[13] Z. Pan, W. Ma, J. Guo, and B. Lei, "Super-resolution of single remote sensing image based on residual dense backprojection networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 10, pp. 7918–7933, 2019.

[14] G. Liu, Y. Gousseau, and F. Tupin, "A contrario comparison of local descriptors for change detection in very high spatial resolution satellite images of urban areas," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3904–3918, 2019.

[15] W. Liu, F. Su, and X. Huang, "Unsupervised adversarial domain adaptation network for semantic segmentation," IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2019.

[16] O. Tasar, S. L. Happy, Y. Tarabalka, and P. Alliez, "ColorMapGAN: Unsupervised domain adaptation for semantic segmentation using color mapping generative adversarial networks," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16, 2020.

[17] Z. Zhang, K. Doi, A. Iwasaki, and G. Xu, "Unsupervised domain adaptation of high-resolution aerial images via correlation alignment and self training," IEEE Geoscience and Remote Sensing Letters, pp. 1–5, 2020.

[18] W. Liu and R. Qin, "A multikernel domain adaptation method for unsupervised transfer learning on cross-source and cross-region remote sensing data classification," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–11, 2020.

[19] L. Yan, B. Fan, H. Liu, C. Huo, S. Xiang, and C. Pan, "Triplet adversarial domain adaptation for pixel-level classification of VHR remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 5, pp. 3558–3573, 2020.

[20] J. Zhang, J. Liu, B. Pan, and Z. Shi, "Domain adaptation based on correlation subspace dynamic distribution alignment for remote sensing image scene classification," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–11, 2020.

[21] X. Wang, K. Yu, C. Dong, and C. Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.

[22] M. S. Rad, B. Bozorgtabar, U.-V. Marti, M. Basler, H. K. Ekenel, and J.-P. Thiran, "SROBB: Targeted perceptual loss for single image super-resolution," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2710–2719.

[23] S. Lei, Z. Shi, X. Wu, B. Pan, X. Xu, and H. Hao, "Simultaneous super-resolution and segmentation for remote sensing images," in IGARSS 2019 - IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019, pp. 3121–3124.

[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[25] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, "Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark," in IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2017.

[26] L. Ding, J. Zhang, and L. Bruzzone, "Semantic segmentation of large-size VHR remote sensing images using a two-stage multiscale training architecture," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–10, 2020.

[27] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, "Learning low-level vision," International Journal of Computer Vision, vol. 40, no. 1, pp. 25–47, 2000.

[28] J. M. Haut, R. Fernandez-Beltran, M. E. Paoletti, J. Plaza, A. Plaza, and F. Pla, "A new deep generative network for unsupervised remote sensing single-image super-resolution," IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 11, pp. 6792–6810, 2018.

[29] P. V. Arun, K. M. Buddhiraju, A. Porwal, and J. Chanussot, "CNN-based super-resolution of hyperspectral images," IEEE Transactions on Geoscience and Remote Sensing, pp. 1–16, 2020.

[30] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.

[31] S. Lei, Z. Shi, and Z. Zou, "Coupled adversarial training for remote sensing image super-resolution," IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 5, pp. 3633–3643, 2020.

[32] Z. Zou, T. Shi, W. Li, Z. Zhang, and Z. Shi, "Do game data generalize well for remote sensing image segmentation?" Remote Sensing, vol. 12, no. 2, p. 275, 2020.

[33] Y. Zhang, P. David, H. Foroosh, and B. Gong, "A curriculum domain adaptation approach to the semantic segmentation of urban scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[34] L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo, "Learning parallax attention for stereo image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12250–12259.

[35] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.

[36] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[37] D. Kotovenko, A. Sanakoyeu, P. Ma, S. Lang, and B. Ommer, "A content transformation block for image style transfer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[38] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in European Conference on Computer Vision. Springer, 2016, pp. 702–716.

[39] V. Mnih, "Machine Learning for Aerial Image Labeling," Ph.D. dissertation, University of Toronto, 2013.

[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[41] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

