IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING …

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1

Looking Closer at the Scene: Multi-ScaleRepresentation Learning for Remote Sensing Image

Scene ClassificationQi Wang, Senior Member, IEEE, Wei Huang, Student Member, IEEE, Zhitong Xiong, and Xuelong

Li, Fellow, IEEE

Abstract—Remote sensing image scene classification has at-tracted great attention because of its wide applications. Althoughconvolutional neural network (CNN) based methods for sceneclassification have achieved excellent results, the large scalevariation of the features and objects in remote sensing imageslimits the further improvement of the classification performance.To address this issue, we present multi-scale representation forscene classification, which is realized by a global-local two-stream architecture. This architecture has two branches of globalstream and local stream, which can individually extract theglobal features and local features from the whole image andthe most important area. In order to locate the most importantarea in the whole image using only image-level labels, a weakly-supervised key area detection strategy of structured key arealocalization (SKAL) is specially designed to connect the above twostreams. To verify the effectiveness of the proposed SKAL basedtwo-stream architecture, we conduct comparative experimentsbased on three widely used CNN models, including AlexNet,GoogleNet and ResNet18, on four public remote sensing imagescene classification data sets, and achieve the state-of-the-artresults on all the four data sets. Our codes will be providedin https://github.com/hw2hwei/SKAL.

Index Terms—remote sensing, scene classification, CNN, multi-scale representation, structured key area localization

I. INTRODUCTION

BENEFITING from the remote sensing imaging equip-ments and technologies, in recent years, many semantic-

level tasks of remote sensing images have developed rapidly,such as object detection [1], image retrieval [2], image cap-tioning [3] [4], road extraction [5] and the others. As a basisfor these tasks, remote sensing image scene classification [6]–[10] has become a research hotspot, which classifies remotesensing images into a set of scene classes according to thefeatures and objects in the images.

There are plenty of similar and confusing features andobjects in remote sensing images, and therefore it is cruicalto extract discriminative features of remote sensing scences.According to feature extraction, there are two kinds of super-vised features: handcrafted features and deep features. Com-pared with handcrafted features, deep features contain more

Q. Wang, W. Huang, Z. Xiong and X. Li are with the School of ComputerScience, and with the Center for Optical Imagery Analysis and Learning(OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, Shaanxi,China (e-mail: [email protected], [email protected], [email protected], xuelong [email protected]).

X. Li is the corresponding author.This work was supported by the National Natural Science Foundation of

China under Grant U1864204, 61773316, U1801262, and 61871470.

high-level semantic information, which can be automaticallylearned by convolutional neural networks (CNNs). Due to thepowerful ability of feature extraction, CNN-based methods[6], [8], [9], [11]–[14] have become the mainstream methodsand achieved the state-of-the-art results in the field of remotesensing image scene classification.

Although the performance of CNN-based methods hasimproved significantly, scene classification of remote sensingimages still suffers from the large scale variation of featuresand objects in the images. As shown in the images of Fig.1, the most important areas occupy only a small part of thewhole images, and they are usually surrounded by a largenumber of useless features and objects, which decreases thediscrimination of the extracted features.

(a) airplane (b) lake (c) island (d) intersection

Fig. 1: Some samples of remote sensing scene images withthe bounding boxes labeling the key area.

To tackle this issue, we present joint global and local featurerepresentation for remote sensing image scene classification.As shown in Fig. 2, we design a global-local two-streamarchitecture. In this architecture, the blue branch network is theglobal feature extraction stream and the green branch networkis the local feature extraction stream. The global area containsmore global features such as contour and texture information,while the key local area enlarges the most important objectswhich can provide more fine-grained features and reducebackground noise. Our global-local two-stream architecture


can individually extract global features and local features fromthe input images of different scales, and finally fuse theirclassification scores.

In order to locate the most important local area in thewhole image, we further propose a weakly-supervised key areadetection strategy of structured key area localization (SKAL)as the yellow route in Fig. 2. The proposed SKAL defines anexplicit local key area localization process based on the featureresponse degree of the patches in global feature maps. It canaccurately locate the most important area, i.e. its boundary of[x1, x2, y1, y2], using only image-level labels.

To verify the effectiveness of the proposed SKAL basedglobal-local two-stream architecture in remote sensing imagescene classification, the abundant comparative experimentsbased on three widely used CNN models (Alexnet [15],ResNet18 [16] and GoogleNet [17]) are conducted on fourpublic large-scale remote sensing scene data sets includingUC Merced data set [18], RSSCN7 data set [19], AID dataset [20], and NWPU-RESISC45 data set [21]. We achieve thestate-of-the-arts on all the four data sets.

The contributions of this paper can be summarized as thefollowing four aspects:(1) To deal with the problem of large scale variation in

remote sensing images, we present joint global and localfeature representation for remote sensing image sceneclassification from the perspective of multi-scale feature.Correspondingly, a global-local two-stream architecture isdesigned to individually extract global features and localfeatures from the input images of different scales.

(2) To locate the most important area in the whole remotesensing scene image, a strategy of structured key arealocalization (SKAL) is specially proposed to connect theglobal and local streams. SKAL can accurately calculatethe most important local area in the form of boundingbox [x1, x2, y1, y2].

(3) In order to prove the effectiveness of our SKAL basedglobal-local two-stream architecture in remote sensingimages, plenty of comparative experiments based onseveral widely used CNN models are conducted on fourpublic scene classification data sets of remote sensingimages. The state-of-the-art results demonstrate that ourmethod can significantly improve the performance ofremote sensing scene classification.

(4) As a multi-scale representation learning method, ourglobal-local two-stream architecture can easily be appliedin all kinds of CNN models, as with the advantages ofsimple implementation, fast operation and strong inter-pretability.

II. RELATED WORK

In this section, the related works of remote sensing imagescene classification and weakly-supervised key object detec-tion methods are reviewed in brief.

A. Remote Sensing Image Scene ClassificationAccording to the feature extraction, the methods used in

scene classification of remote sensing images can be roughlysplit into the following three types.

1) Handcrafted Features: The handcrafted feature basedmethods were first applied in remote sensing image scene clas-sification. These methods rely on a series of manually designedfeature descriptors including global feature descriptors (suchas color histograms and texture descriptors [22], [23]), andlocal features descriptors (such as Histogram of Oriented Gra-dients (HOG) [24], [25] and Scale-Invariant Feature Transform(SIFT) [26], [27]). Global feature descriptors can generate theentire representation of a remote sensing image, which canbe directly sent into the classifier. Local feature descriptors,which are usually the mid-level feature descriptors, need tobe integrated into an global representation by feature com-bination technologies like bag-of-visual-words (BoVW) [28].Further, Zhu et al. [29] propose a local-global feature fusionoperation at the histogram level. These handcrafted featuresare well-designed, however, they cannot effectively deal withthe challenges of intraclass diversity, interclass similarity andscale variation.

2) Unsupervised Features: Classification is intrinsically asupervised task but researchers also find ways to interpretunsupervised learning results as classes. Researchers haveattempted unsupervised feature learning based methods [30]–[34], which aim at learning the feature encoding functions.For these unsupervised feature learning based methods, theytake the handcrafted feature descriptors like SIFT as input, andgenerate the fused features. It is crucial to combine multiplefeatures by using some encoding techniques. The typeicalencoding methods used in remote sensing image scene classifi-cation include Principal Component Analysis (PCA), k-meansclustering, sparse coding [32] and autoencoder [35]. What’smore, BoVW [25], which can generate the visual dictionaries(codebooks) from the handcrafted features based on k-meansclustering, is also one of the most popular unsupervised featurelearning based methods. However, overall the unsupervisedmethods cannot generate the same discriminate features ofdifferent scene classes as the supervised features because ofthe lack of labels.

3) Deep Features: In recent years, deep learning methods,especially Convolutional Neural Networks (CNNs) [6], [8],[9], [11]–[14], [36]–[38], have dominated the most fields ofnatural images because of their powerful large-scale featureextraction. Similarly, deep feature based methods methodshave become the mainstream of remote sensing image sceneclassification with the better classification performance thanthe handcrafted and unsupervised features. Compared withhandcrafted feature-based methods that generally are deter-mined by feature engineering skills and domain acknowledge,deep feature based methods can automatically learn the mostdiscriminative semantic-level features from the raw images.Besides, the CNN models are the end-to-end trainable ar-chitectures instead of the complex multi-stage architectures,which are the main workflows of handcrafted feature basedmethods. Although deep feature based methods have obtainedthe excellent performance, the large scale variation is still oneof the most difficult problems to solve.


sam

pli

ng

Global Image ConvBlocks g

SKAL

Feature Map g

Class Score g

Class Score lConvBlocks l Local Image

Feature Map l

[x1, y1, x2, y2]

Feature Vector g

Feature Vector l

Class Score Fusion

Global Stream

Local Stream

Fig. 2: SKAL based global-local two-stream architecture is designed for joint global and local feature representation in remotesensing scene image classification.

B. Weakly-Supervised Object Detection

Weakly-supervised object detection, which locates the mostimportant local area using only image-level labels, is ameaningful research subject. It can not only increase theinterpretability of our scene classification models, but alsobe used for further improving their performance. Pandey et.al [39] achieve weakly-supervised object detection based onthe handcrafted feature of deformable part models (DPM).To make full use of the advantages of convolutional fea-tures, Bilen et. al [40] propose weakly-supervised deep de-tection networks to locate the key objects. And Cinbis et.al [41] introduce the multiple instance learning into weakly-supervised objection detection. Besides, Zhang et. al [42]bring the saliency detection into this field. Further, Fu et.al [43] realize a learnable key object localization networkof Recurrent Attention Convolutional Neural Network (RA-CNN) for fine-grained image recognition and get significantgains. The clustering learning technology is also attemptedin [44]. Recently, Yang et. al [45] proposed spatial prior forthe object dependence for joint object detection and actionclassification.

For remote sensing images, however, there are lots of large-scale features and objects containing background noise. Todeal with these complicated scene images, motivated by theidea of multi-scale feature representation in RA-CNN, wedesign a key area localization strategy of SKAL to generatethe minimum area boundary to sample the most important areain the whole image. There are three main differences betweenthe proposed SKAL and RA-CNN: (1) The proposed SKALis a relatively interpretable key area localization srategy tosome extent while RA-CNN utilizes an attention proposal sub-

network (ATN), which also belongs to the black-box model, topredict the key area. (2) Size of the key area located by SKALcan be controlled artificially by adjusting a hyperparameter,while the RA-CNN is hard to realize it. Because the remotesensing image scene classification is the basic of other furtherremote sensing image processing tasks, the controllable sizemay be more suitable for the further image processing. (3) Itis necessary for RA-CNN to set an extra inter-scale pairwiseranking loss which is used to constrain the location processand an subtle g alternative training strategy. For our SKAL,the training of two streams is independently and is easier torealize.

There is some connection between the proposed SKAL anda series of region-proposal objection detection methods [46]–[48] under the demand of area location. The difference is thatthese object detection methods need accurate bounding boxesof the interest objects while the proposed SKAL has no needfor them.

III. METHOD

In this section, firstly, the compositions of convolutionalneural networks are sequentially introduced. Then the pro-posed SKAL strategy based on the global multi-layer featuremap is introduced in detail in Fig. 3. Finally, as shown inFig. 2, we put forward the SKAL based global-local two-stream architecture to individually extract the global and localfeatures, and fuse their classification scores.

A. CNNGenerally, an entire CNN model for image classification

can be roughly divided into three parts: stacked convolutionallayers, global average pooling layer and fully-connected layer.


GlobalFeature Map

Energy

Aggregating

Map

Structuration

Vw

Energy Map

(x1, x2)

(y1, y2)Vh

Structured Vectors

Greedy-LikeBoundary

Search

SKAL Strategy

Fig. 3: Workflow of the proposed SKAL strategy.

Stacked Convolutional Layers. In CNNs, the multiplestacked convolutional layers are the most important partsused to extract features from low-level texture and colorcharacteristics to high-level semantic information. Each con-volutional layer normally consists of convolutional kernels andnon-linear activation function (in some cases there is alsobatch normalization [49]). Because the convolutional layersof different models are stacked in different ways, it is hardto respectively describe their architectures. Thus, they aredescribed as a unified ConvBlocks to roughly represent theprocess of feature extraction in this paper. The input of thesestacked convolutional layers is a RGB image I ∈ R3×224×224

and the output is the multi-layer feature map M ∈ RC×H×W

(C,H and W are the number of channels, spatial height andspatial width respectively). It is denoted as:

M = ConvBlocks(I), (1)

Global Average Pooling Layer. Because the multi-layerfeature map M distributes in spatial in units of patches, it isnecessary to pool the feature map M into the correspondingfeature vector V ∈ RC for the next classifier (fully-connectedlayer). Thus, global pooling layer, mostly global averagepooling (GAP) layer, is used to connect the convolutionallayers and fully-connected layer, which is calculated by:

V (c) =1

HW

H∑i=0

W∑j=0

M(c, i, j), (2)

GAP can not only change the spatial dimension of features,but also decrease the overfitting of the trainable parameters.

Fully-connected Layer. Fully-connected (FC) layer playsthe role of classifier in CNNs, which gives the classificationscore of each class based on the high-dimension feature vectorV . It takes as input the V and takes as output a score vectorof classification confidence denoted as S ∈ RN , where N isthe number of classes in data set. It is calculated by:

S = WT ∗ V + b, (3)

here W ∈ RC×N and b ∈ RN are the weight and bias of thefeatures V , respectively. The element of S is the classificationscore of each class. In order to scale the sum of S to 1,the operation of softmax is added after the FC layer. It is

formulated as:

Si =eSi∑Cj=1 e

Sj

, (4)

S ∈ RN is the scaled classification score.The cascade of convolutional layers, GAP layer and FC

layer is the entire CNN models for classification task. And itis also the structure of each stream in our global-local two-stream architecture.

B. SKAL

The most critical step is to localize the key area in theglobal image, which plays a role of a bridge between theglobal and local streams. As shown in Fig 3, we proposethe SKAL strategy in detail in this subsection. Based on themulti-layer feature map of global image of Mg , which iscalculated by Eqn (1)-(3) from the global image, the proposedSKAL generates a bounding box of [x1, x2, y1, y2] to guidethe sampling process. SKAL consists of the following threesubsteps: energy aggregation, energy map structuration andgreedy-like boundary search.

1) Energy Aggregation: It is prerequisite to quantitativelydescribe the importance degree of each patch in Mg . In thispaper, the operation of energy aggregation, which takes Mg

as input and takes the energy map of ME ∈ RH×W as output,is applied as:

ME =

C∑i=0

Mg(i,H,W ), (5)

here it is notable that energy aggregation can be regarded as akind of explicit attention mechanism without the requirementfor training.

It is necessary to scale all the elements of ME into therange of [0, 1] by min-max scaling, which can remove theinterference from the negative element.

ME(i) =ME(i)−min(ME)

max(ME)−min(ME)(6)

here max(ME) and min(ME) are the the values of themaximum and minimum elements in ME(i), respectively. ME

are the scaled energy map in the same dimension with ME .For more accurate localization, it is meaningful to upsample

the energy map into a larger spatial size from H×W (normally


6× 6 or 7× 7), which is denoted as:

ME = bilinear(ME) (7)

we used bilinear interpolation as the upsampling technique.And the size of ME is set to 25× 25 in all the CNN modelsin this paper.

2) Energy Map Structuration: After finishing the abovepreparations, we obtain an scaled energy map ME that canquantitively describe the patch-wise importance degree of theglobal image. As well as we know, it is complicated toimplement the search in 2-D space. Therefore, we furtheraggregate the scaled energy map into two 1-D structuredenergy vectors, Vw ∈ RW and Vh ∈ RH , along the spatialheight and width by:

Vw =H∑i=0

ME(i,W ),

Vh =W∑i=0

ME(H, i),

(8)

Vw and Vh are the structured interpretation of ME . Energymap structuration has two advantages. On one hand, it greatlyimproves the search efficiency to translate the boundary searchin the 2-D energy map into the search in two 1-D energyvectors. On the other hand, it can realize decoupling in 2-D space by the separate search along the spatial width andheight.

3) Greedy-Like Boundary Search: Based on Vw and Vh,[x1, x2] and [y1, y2] can be calculated respectively. In orderto quickly and accurately locate the most important 1-D areain the 1-D energy vector, we present a greedy-like boundarysearch method on the basis of energy.

Taking the Vw as an example, we present some conceptscontaining the energy of different elements in the width vectoras:

E[0:W ] =W∑i=0

V (i),

E[x1:x2] =x2∑

i=x1

V (i)(9)

here E[0:W ] is the energy sum of all the elements in the widthvector, and E[x1:x2] contains the energy of the region alongthe spatial width from x1 to x2.

In this paper, the key area in the global image is defined as:the area occupies the smallest area but contains no less than athreshold of the total energy (ETr), i.e., E[x1:x2] / E[0:W ]

> ETr. ETr is a hyper-parameter of energy proportion.On the basis of this definition, we can search the most keyregion using a greedy-like algorithm, which is summarized inAlgorithm 1. The greedy-like boundary search algorithm canbe subdivided into the following three steps.

Step A: Initializing the boundary. From Line 1 to 8, firstly,[x1, x2] are initialized by the boundary of the half of Vw havingthe maximum energy.

Step B: Adjusting the boundary. Then the boundary of[x1, x2] are adjusted iteratively to make its energy converge toa small neighbor of ETr. There are two states after Step A:from Line 9 to 16, energy of the initialized area of [x1, x2]is higher than ETr; from Line 17 to 24, the energy is lower

than ETr. When the energy is higher than ETr, the region of[x1, x2] needs to shrink until the energy is no higher than ETralong the direction of the slowest energy drop. However, whenthe energy is lower than ETr, the region needs to enlarge untilit is no lower than ETr along the direction of the fastestenergy rise. Our greedy-like boundary search can find thesmallest but most informative area quickly.

Step C: Scaling the boundary. Because the above bound-aries are in the range of [0,W ], as shown from Line 26 to 27,it is necessary to scale them to [0, 1] by the dimension of theenergy vector.

The width boundary of the most key area in a globalimage, which is [x1, x2], can be obtained by the above threesteps. And the height boundary of [y1, y2] can be obtainedsimilarly by using the same algorithm. The entire boundaryof [x1, x2, y1, y2] is used to guide the key local area samplingprocess.

Algorithm 1 Greedy-Like Boundary Search.

Input: Structured width vector Vw ∈ RW ;Output: The scaled width boundary of the key area: [x1, x2];

1: x1 ← 02: x2 ←W/23: for i = 0→W/2 do4: if E[x1:x2] < E[i:i+W/2] then5: x1 ← i6: x2 ← i + W/27: end if8: end for9: if E[x1:x2]/E[0:W ] > ETr then

10: while E[x1:x2]/E[0:W ] > ETr do11: if Vw(x1 + 1) < Vw(x2 − 1) then12: x1 ← x1 + 113: else14: x2 ← x2 − 115: end if16: end while17: else18: while E[x1:x2]/E[0:W ] < ETr do19: if Vw(x1 − 1) > Vw(x2 + 1) then20: x1 ← x1 − 121: else22: x2 ← x2 + 123: end if24: end while25: end if26: x1 ← x1/W × 100%27: x2 ← x2/W × 100%28: return [x1, x2]

C. Global-Local Two-Stream ArchitectureAccording to the scaled boundary of [x1, x2, y1, y2], we can

sample the key local area Il ∈ R3×224×224 in the enlargedglobal image I

′

g ∈ R3×448×448 by bilinear interpolationtechnology, which is denoted as:

Il = bilinear(I′

g, (x1, x2, y1, y2)) (10)


The global and local classification scores of Sg and Sl arecalculated from the global image Ig ∈ R3×224×224 and thekey local area Il, which are formulated as:

Sg = CNNg(Ig) (11)

Sl = CNNl(Il) (12)

Both CNNg and CNNl are the cascade of Eqn (1)-(4). Thesetwo streams have the same structure but do not share theparameters in order to extract the features of different scales.The fused calssification score is the average of Sg and Sl as:

Sf =Sg + Sl

2(13)

IV. EXPERIMENTS

In this section, the remote sensing image scene data sets andthe evaluation metrics used in this paper are introduced firstly.Secondly, the experiment setup and training hyper-parametersare provided in detail. Following that, an example of thevisualization of the proposed SKAL is shown for auxiliaryinterpretation. Finally, we report the experimental results oneach data set with the comparison with some state-of-the-artmethods, and analyse the performance of our SKAL basedglobal-local two-stream architecture.

A. Data Sets and Evaluation Metrics

1) UC Merced Land-Use Data Set: The UC Merced (UCM)land-use data set [18] consists of 2,100 images that are splitinto 21 typical land-use scene classes of agricultural, airplane,baseball diamond, beach, buildings, chaparral, dense residen-tial, forest, freeway, golf course, harbor, intersection, mediumresidential, mobile home park, overpass, parking lot, river,runway, sparse residential, storage tanks, and tennis courts.Each class contains 100 optical images measuring 256×256pixels, and each pixel has a spatial solution of 30 cm in theRGB color space.

2) RSSCN7 Data Set: The RSSCN7 data set contains2800 images that are made up of seven typical scene classesincluding the grassland, forest, farmland, industrial region,lake, parking lot, residential region, and river. Each class has400 images collected from the global satellite map in GoogleEarth, which are individually sampled at four different scales.Images in RSSCN7 data set are 400×400 pixels. RSSCN7 isa challenging data set due to the changing seasons, varyingweathers and scale diversity.

3) Aerial Image Data Set: The Aerial Image Data (AID)set [20] have 10,000 images split into 30 classes includ-ing airport, bare land, baseball field, beach, bridge, center,church,commercial, dense residential, desert, farmland, for-est, industrial, meadow, medium residential, mountain, park,parking, playground, pond, port, railway station, resort, river,school, sparse residential, square, stadium, storage tanks, andviaduct. Each class has hundreds of large-scale images mea-suring 600×600 pixels in the RGB space. Each pixel hasthe spatial resolution of the range from 800 cm/pixel to 50cm/pixel.

4) NWPU-RESISC45 Data Set: The NWPU-RESISC45data set contains 31,500 images split into 45 classes, includingairplane, airport, baseball diamond, basketball court, beach,bridge, chaparral, church, circular farmland, cloud, commer-cial area, dense residential, desert, forest, freeway, golf course,ground track field, harbor, industrial area, intersection, island,lake, meadow, medium residential, mobile home park, moun-tain, overpass, palace, parking lot, railway, railway station,rectangular farmland, river, roundabout, runway, sea ice, ship,snowberg, sparse residential, stadium, storage tank, tenniscourt, terrace, thermal power station, and wetland. Each classhas 700 images of 256×256 pixels with the spatial resolutionof the range from about 300 cm/pixel to 20 cm/pixel inthe RGB color space. It is the largest remote sensing sceneclassification data set in terms of the number of images andclasses so far, and is more challenging because of the higherbetween-class similarity.

B. Evaluation Metrics

The following two typical metrics are used to quantitativelyevaluate the experimental results.

1) Overall Accuracy: The overall accuracy (OA) is definedas the number of the correctly classified images divided by thetotal number of images in the data set. The score of OA reflectsthe overall performance of classification models instead of perclass.

2) Confusion Matrix: The confusion matrix (CM) is antwo-dimension informative table which is used to analyzethe between-class classification errors and confusion degree.Each row of the matrix represents all the image samples of apredicted class while each column represents the samples ofa ground-truth class.

To obtain reliable experimental results, on UCM, RSSCN7and AID data sets, we repeated the experiments for five timesusing the same training ratio to randomly split the data set, andreport the mean value and standard deviation of these results.On NWPU-RESISC45 data set, the number of repetitions isthree due to the large number of samples.

C. Experiment Setup

1) CNN Baselines: To evaluate the effectiveness and ro-bustness of the SKAL based global-local two-stream architec-ture in scene classification of remote sensing images, the com-parative experiments are conducted on the aforementioned fourdata sets based on three kinds of widely used CNN models,including AlexNet [15], GoogleNet [17] and ResNet18 [16]pretrained on ImageNet [50]. AlexNet is composed mainlyof convolutional layers, GoogleNet concatenates filters withdifferent sizes, and ResNets have residual connections. Whenthey are used for the key area calculation, their classifiers(fully-connected layers) are removed and the final multi-layerfeature maps are upsampled to 25 × 25 × C (25×25 is thespatial size and C is the dimension of feature map) for moreaccurate location. Here the technology of replacing the lastlayer of a model pretrained on ImageNet is widely usedstrategy [38], [51], [52].


2) Training Parameters: Adam algorithm [53] is selectedas the optimizer and cross entropy loss is used as the lossfunction for all the models. All the models are trained for50 epochs with the batch size set to 64. Learning rate ofAdam is initialized to 1e-4, and it is divided by 10 every 20epochs. As for the size of input images in global-local two-stream architecture, the input of global stream is the globalimage resized to 224×224, while the input of local streamis the local area, which is cropped from the enlarged globalimage of 448×448, and then resized to 224×224. For faircomparison, the state-of-the-art CNN models compared withour two-stream architecture are all based on the input size of224×224.

3) Data Augmentation: Because of the scale difference ofthe global and local images, the data augmentation methodsof these two streams are different. For global stream, weuse random horizontal flipping to augment the global images.For local stream, besides random horizontal flipping, werandomly crop a square area from the global image for localimage augmentation. In order to keep the scale of the localimages consistent during training and testing, the side lengthproportion of the square area is set to 50%.

4) Training and Test Procedure: We adopt two-stage train-ing strategy to individually train the global and local streams.The global stream is trained and tested firstly, and providea preliminary scene classification score based on the globalimage. Then the local stream is trained by the augmented localsamples, and give another score based on the key local area.Finally these two scores are fused by average operation.

In addition, all the experiments are implemented by Pytorch[54] 1.1.0 Version in the computing environment of 64-GBmemory CPU and 1× 12-GB NVIDIA GeForce GTX 1080TiGPU.

D. Visualization of SKAL

For intuitive understanding of our SKAL in Algorithm 1, acomplete process of SKAL based on GoogleNet on a remotesensing image in UCM data set is shown in Fig. 4. Forbetter explanation, all the coordinates of width and height areenlarged to the range of [0, 25] from [0, 1]. ETr here is set to60%.

In Fig. 4, the energy map is extracted from the originalglobal image by Eqn (1), Eqn (5)-(7). And the initializedarea of width and height, i.e. [x1, x2, y1, y2]=[3, 15, 7, 19],are obtained. Their initialized energy ratios are 69% and52%, respectively. Because of the independence of the areasearching process of spatial width and height, width [x1, x2]and height [y1, y2] are separately adjusted. Firstly, height area[y1, y2] is iteratively enlarged to [5, 19] from [7, 19] withthe energy ratio increasing from 52% to 62%. Secondly, thewidth area [x1, x2] is iteratively shrunk from [3, 15] to [4, 14]with the energy ratio decreasing from 69% to 59%. Therefore,the final key area of [xinit, xinit, yinit, yinit] is [4, 14, 5, 19].After being scaled, this boundary is used for the guidance ofthe key area sampling.

Results in Fig. 4 indicate that the local image covers themost informative area in the global image, with the energy

ratio of the structured vectors quickly converging to a smallneighbor of 60%.

More localization samples are shown in Fig. 5 and Fig.6. Fig. 5 shows some samples of discriminative classeswhich reflect the reasonable effect of key area location.Fig. 6 provides some samples of three ambiguous cat-egories of “sparse residential”, “medium residential” and“dense residential”. In Fig. 6, “sparse residential” pays atten-tion to the individual building, “medium residential” focuseson the interface of buildings and trees, and “dense residential”emphasizes a piece of houses next to each other which maybe judged by the junction of houses.

E. In Comparison With Other Methods

1) UCM Data Set: UCM data set is the earliest and themost widely used remote sensing scene classification dataset. Thus we first apply the SKAL based global-local two-stream architecture on UCM to explore the improvement ofclassification performance, and find a reasonable value ofETr. We make a hyperparameter tuning study based onAlexNet, ResNet18 and GoogleNet with ETr set to 60%,70% and 80%, respectively. The ratios of training samplesare 50% and 80%, which follow the splitting convention ofUCM data set. Our results are shown in Table I. In the Table,the CNN models attached by the subscripts of global, localand global+ local represent only the global stream (baseline),only the local stream and both of them, respectively.

According to the results, our SKAL based global-local two-stream architecture can provide a significant improvement forthe classification of UCM data set under two kinds of splittingratios. And it can be found that our method is not limitedby the CNN models. Among the three CNN models, theperformance of AlexNet obviously falls behind GoogleNetand ResNet18. Compared with ResNet18, there are somemulti-scale convolutional blocks in GoogleNet, which is moresuitable for our SKAL based two-stream method. Hence, whenour two-stream architecture is applied, GoogleNet is slightlyahead of ResNet18.

We also make a comparison between the global stream andlocal stream as shown in Table I. Although the performance oflocal stream is far worse than the global stream, their fusionresult is better than only the global stream. This phenomenondemonstrates the independence in feature extraction of theglobal stream and local stream, that is, the former extractsthe global scene features while the latter mines the local keyfeatures. Besides, it is probably that the global stream plays amajor role in the decision-making process and the local streammakes the compensation for it: if global stream has an clearclassification judgment, local stream can enhance the resultof global stream; on the contrary, if global stream hesitatesbetween some ambiguous categories, local stream focusingon the enlarged key area can correct the possible mistakenclassification results of global stream. As we can see in thetable, the promotion effect decreases with the increase of localarea, because the local features tend to be homogenous withthe global features when the local area expands. Thereforeit is necessary to limit the local area in a reasonable range,


w0

719

3 15

Heig

ht E

nerg

y: 52%

Width Energy: 69%

h

w

h

0 3 15

Width Energy: 69%

619

Heig

ht E

nerg

y: 58%

w

h

0

519

Heig

ht E

nerg

y: 62%

3 15

Width Energy: 69%

w

h

0

519

Heig

ht E

nerg

y: 62%

3 14

Width Energy: 65%

w

h

0 4 14

Width Energy: 59%

519

Heig

ht E

nerg

y: 62%

Init Boundary

in Energy Map

Height Adjust Height Adjust

Width Adjust

Width AdjustLocalizationBilinear Sampling

Global Image

Local Image

Fig. 4: Visualization of SKAL with ETr as 60%.

(a) airplane (b) basketbal court (c) beach (d) ground track field (e) sea ice (f) church

Fig. 5: Some samples of key area localization based on SKAL in NWPU-RESISC45 data set. The images from top to bottomare the following: original image, energy map and the fusion image labeled by the bounding box.

which is determined by ETr. Too small threshold leads theloss of useful local features while too big threshold decreasesthe discrimination of the local features. It could be found thatall of 60%, 70% and 80% work well, but 70% is the best inmost cases. In the following experiments, ETr is set to 70%by default.

Based on the comparison experiments of global andglobal + local, the training and test time of the proposedSKAL based two-stream architecture on UCM data set isexplored, and the results are provide in Table II. Because local

stream needs the bounding box of key area which dependson global stream, there is no the corresponding time costof only local. During the training stage when images aretrained in a mini batch of 64 for 50 epochs, the trainingtime of global mainly contains feature extraction and gradientbackward propagation in only global while the training time ofglobal+local includes the feature extraction in global, SKAL,and the feature extraction and gradient backward propagationin local stream. According to this table, it could be foundthat the training time of global + local is just a few more


(a) sparse residential 1 (b) sparse residential 2 (c) medium residential 1 (d) medium residential 2 (e) dense residential 1 (f) dense residential 2

Fig. 6: Some samples of three ambiguous categories, including “sparse residential”, “medium residential” and“dense residential”, in UCM data set. The images from top to bottom are the following: original image, energy map andthe fusion image labeled by the bounding box.

TABLE I: OA(%) based on different CNN baselines with different training ratio and ETr on the UCM data set.

Methods 50% images for training 80% images for trainingETr=60% ETr=70% ETr=80% ETr=60% ETr=70% ETr=80%

AlexNetglobal 91.38±0.52 91.38±0.52 91.38±0.52 96.31±0.12 96.31±0.12 96.31±0.12AlexNetlocal 82.72±0.34 86.81±0.14 86.86±0.09 85.60±0.36 90.36±1.07 88.69±0.12AlexNetglobal+local 93.10±1.00 93.77±1.28 93.48±0.71 97.38±0.24 97.38±0.48 97.02±0.12ResNet18global 97.43±0.19 97.43±0.19 97.43±0.19 99.05±0.24 99.05±0.24 99.05±0.24ResNet18local 92.57±0.19 95.48±0.05 94.66±0.66 93.45±1.31 96.43±0.24 97.03±0.59ResNet18global+local 97.81±0.10 97.95±0.05 98.15±0.05 99.52±0.24 99.52±0.24 99.28±0.24GoogleNetglobal 97.76±0.19 97.76±0.19 97.76±0.19 98.81±0.71 98.81±0.71 98.81±0.71GoogleNetlocal 91.81±0.48 95.53±0.09 95.24±0.19 94.29±0.71 96.08±0.59 96.43±0.48GoogleNetglobal+local 97.90±0.10 98.19±0.05 98.14±0.24 99.41±0.36 99.70±0.30 99.41±0.36

than local. If the time of feature extraction in local is takeninto consideration, the time cost of SKAL in global + localcould be relatively ignored. During the testing stage when theimages are test one by one, the test time of global mainlyconsists of image loading feature extraction of global whilethere are extra time cost of the proposed SKAL and featureextraction in local. Although the model size of global+ localis twice as big as global, test time of the former is less thantwice time cost of the latter. It probably is caused by the imageloading, and it also suggests that the proposed SKAL does nottake the noticeable running time. Such results indicate that theproposed SKAL has a low computational complexity.

To objectively evaluate the performance of the proposedSKAL based global-local two-stream method, as shown in

TABLE II: Training time of 50% images of UCM data set andtest time of the rest 50% images.

Methods Training Time Test Time(Second) (Second)

AlexNetlobal 123 2.62AlexNetglobal+local 144 4.30ResNet18lobal 196 3.75ResNet18global+local 236 6.40GoogleNetlobal 209 3.81GoogleNetglobal+local 255 6.44

Table III, we make a comparison of OA(%) with some state-


of-the-art methods on UCM data set, including handcraftedfeature based methods, unsupervised feature based methodsand deep feature based methods. As we can see in Table III,our method significantly outperforms all the other state-of-the-art scene classification methods. When 50% images areused for training, our two-stream method wins the first placewith an obvious increase of 1.38% over the second methodof ARCNet-VGG16 [8]. When the training ratio increases to80%, compared with the third method of ARCNet-VGG16 [8],our GoogleNet based two-stream architecture has an signifi-cant gain of 0.58%, in consideration of the near 100% OA.Although Resnet101-FSL [6] is much deeper and bigger thanGoogleNet, our global-local two-stream architecture basedon GoogleNet still outperforms Resnet101-FSL. It can alsobe found that our method has huge advantages in terms ofclassification accuracy than the handcrafted feature methods[55], [56] and unsupervised methods [30].

TABLE III: Comparison of OA(%) with some state-of-the-artresults on UCM data set.

Methods 50% training 80% training

AlexNet 91.38±0.52 96.31±0.12AlexNetglobal+local 93.77±1.28 97.38±0.48ResNet18 97.43±0.19 99.05±0.24ResNet18global+local 97.95±0.05 99.52±0.24GoogleNet 97.76±0.19 98.81±0.71GoogleNetglobal+local 98.19±0.05 99.70±0.30

Resnet101-FSL [6] — 99.52ARCNet-VGG16 [8] 96.81±0.14 99.12±0.40DDRL-AM [9] — 99.05±0.08SF-CNN with VGGNet [13] — 99.05±0.27MSCP [57] — 98.36±0.58ELM based Two-Stream [11] 96.97±0.75 98.02±1.03TEX-Net-LF [12] 96.91±0.36 97.72±0.54Combing Scenarios I and II [38] — 98.49Fusion by Addition [58] — 97.42±1.79CNN-NN [59] — 97.19SalM3LBPCLM [55] 94.21±0.75 95.75±0.80VGG-VD-16 [20] 94.14±0.69 95.21±1.20MS-CLBP+FV [56] 88.76±0.76 93.00±1.20Unsupervised Feature Learning [30] — 81.67±1.23

Moreover, as shown in Fig. 7, we make two CMs ofGoogleNet and the corresponding two-stream architecture tofurther analyze the improvement of each class of UCM dataset. As it can be observed in Fig. 7, there are some mis-classified samples among the scenes of medium residential,dense residential, buildings, storage tanks and intersection.When the proposed two-stream architecture is applied, thenumber of the misclassified samples and scenes decreasesrapidly, which proves the effectiveness of our method.

2) RSSCN7 Data Set: RSSCN7 data set is a difficult remotesensing scene data set, affected by the changing seasons,various weathers and scale diversity. These problems challengethe stability and robustness of the proposed global-local two-stream architecture in key local area localization. To study theperformance of our method in dealing with these problems,the comparative experiments are conducted based on AlexNet,

ResNet18 and GoogleNet (with ETr set to the default valueof 70%), and the experimental results of OA(%) are reportedin Table IV. In the table, the subscript of global is removedand the results of the local streams are not provided for moreclear comparison.

Table IV results indicate that the increase of about 2% isobtained when the proposed two-stream architecture is appliedover these three CNN baselines. The wide and meaningfulimprovement powerfully supports the stability and effective-ness of our method. Especially, when the CNN baselines arelimited by the lack of training data, the local images can alsobe regarded as the extra training samples from the perspectiveof data augmentation. Compared with all the state-of-the-artmethods, the proposed method has a huge advantage in theresults of OA. Our global-local two-stream architecture basedon ResNet18 wins the first place and has the accuracy gainof 1.44% under the training ratio of 20%, and 2.04% underthe training ratio of 50%, over the second method of Resnet50based TEX-Net-LF [12].

To study the performance of each class in RSSCN7 dataset based on the proposed two-stream architecture, the CMis made as shown in Fig 8. According to this CM, it can befound that the missclassified samples are mainly distributedin the scenes of industry, parking, grass and field. There arelarge intraclass diversity and high interclass similarity in thesescenes, which need to be solved in the future work.

TABLE IV: Comparison of OA(%) with some state-of-the-artresults on RSSCN7 data set.



Resnet50 [12] 90.23±0.43 93.12Q±0.55Resnet50 based TEX-Net-LF [12] 92.45±0.45 94.00±0.57VGG-M [12] 86.00±0.63 88.80±0.55VGG-M based TEX-Net-LF [12] 88.61±0.46 91.25±0.58Deep filter banks [60] — 90.40±0.60CaffeNet [20] 85.57±0.95 88.25±0.62VGG-VD-16 [20] 83.98±0.87 87.18±0.94GoogleNet [20] 82.55±1.11 85.84±0.92HCV [60] — 84.70±0.70DBN based feature selection [19] — 77.0

3) AID Data Set: AID is a high-resolution large-scale re-mote sensing scene data set covering a lot of background noise.The comparative experiments are conducted on AID data setbased on the aforementioned three CNN baselines under thedefault training settings and ETr. The results achieved byour two-stream architecture of OA(%) are provided in TableV, with the comparison with some state-of-the-art results. It isnotable that the state-of-the-art results of CNN-based methodson AID data set reported in this paper are based on the image


medium

residentialfreewayforeststoragetanksairplanechaparralparkinglotrunwaybeachtenniscourtagriculturalbaseballdiam

onddenseresidentialm

obilehomepark

harborsparseresidentialriveroverpassgolfcoursebuildingsintersection

mediumresidentialfreeway

foreststoragetanks

airplanechaparralparkinglot

runwaybeach

tenniscourtagricultural

baseballdiamonddenseresidentialmobilehomepark

harborsparseresidential

riveroverpass

golfcoursebuildings

intersection

0.90 0.10

1.00

1.00

0.85 0.10 0.05

1.00

1.00

1.00

1.00

1.00

1.00

1.00

0.95 0.05

0.05 0.90 0.05

1.00

1.00

0.05 0.05 0.90

1.00

1.00

1.00

1.00

1.000.0

0.2

0.4

0.6

0.8

1.0

(a) GoogleNet

medium

residentialfreewayforeststoragetanksairplanechaparralparkinglotrunwaybeachtenniscourtagriculturalbaseballdiam

onddenseresidentialm

obilehomepark

harborsparseresidentialriveroverpassgolfcoursebuildingsintersection

mediumresidentialfreeway

foreststoragetanks

airplanechaparralparkinglot

runwaybeach

tenniscourtagricultural

baseballdiamonddenseresidentialmobilehomepark

harborsparseresidential

riveroverpass

golfcoursebuildings

intersection

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

1.00

0.05 0.90 0.05

1.00

1.00

0.05 0.95

1.00

1.00

1.00

1.00

1.000.0

0.2

0.4

0.6

0.8

1.0

(b) GoogleNet based two-stream architecture

Fig. 7: Confusion matrix of UCM Data Set under the training ratio of 80% using the following two methods: GoogleNet(global stream) and the proposed SKAL based global-local two-stream architecture based on GoogleNet.

industryforestgrassfieldriverlakeresidentparking

industryforestgrassfield

riverlakeresidentparking

0.91 0.02 0.06

0.98

0.97 0.03

0.04 0.95

0.01 0.97

0.04 0.96

0.04 0.940.0

0.2

0.4

0.6

0.8

Fig. 8: Confusion matrix of RSSCN7 Data Set under thetraining ratio of 50% using the proposed SKAL based global-local two-stream architecture based on ResNet18.

size of 224×224 within a small variance, because image sizehas an important influence on the classification accuracy.

Results of Table V indicate that the proposed global-localtwo-stream architecture provides two kinds of advantages forCNN baselines. On one hand, our two-stream architecture sig-nificantly and widely improves the performance of all the threeCNN baselines. And our global-local two-stream architecturebased on ResNet18 outperforms all the current state-of-the-art methods, with the advantage of at least 0.78% in OAunder the training ratio of 50% and 0.57% under the trainingratio of 20%. On the other hand, our two-stream architecturereduces the standard deviation of experimental results andprovides more stable classification accuracies although thetraining images are randomly selected from the data set forseveral times.

In order ot evaluate the performance of the proposed methodon AID data set, we make a CM in Fig. 9, which is achieved bythe global-local two-stream architecture based on ResNet18.As shown in this CM, the most easily misclassified scenesare park, resort, railway station, square, school and center. Allof them are strongly related to the plenty of buildings andvegetation cover. These highly similar features and objectslimits the further improvement on AID data set, and this

TABLE V: Comparison of OA(%) with some state-of-the-artresults on Aerial Image Data Set.



Resnet101-FSL [6] — 95.88Resnet50 based TEX-Net-LF [12] 93.81±0.12 95.73±0.16ELM based Two-Stream [11] 92.32±0.41 94.58±0.25VGG-VD16 + MSCP [57] 91.52±0.21 94.42±0.17ARCNet-VGG16 [8] 88.75±0.40 93.10±0.55VGG-M based TEX-Net-LF [12] 90.87±0.11 92.96±0.18Fusion by Addition [58] — 91.87±0.36SalM3LBPCLM [55] 86.92±0.35 89.76±0.45VGG-VD-16 [20] 86.59±0.29 89.64±0.36CaffeNet [20] 86.86±0.47 89.53±0.31MS-CLBP+FV [55] 86.48±0.27 —GoogLeNet [20] 83.44±0.40 86.39±0.55

problem could be improved by deeper and more complexfeature representation.

4) NWPU-RESISC45 Data Set: NWPU-RESISC45 is thebiggest remote sensing data set of scene classification with45 challenging scenes. Benefiting from the large amount oftraining data, the classification results of NWPU-RESISC45are more stable and convincing. We conduct the ablation ex-periments on NWPU-RESISC45 under the same experimentalconditions of CNN baselines, training settings and ETr as theprevious three data sets.

We make the comparative experiments with/without the


BareLandPlaygroundDenseResidentialParkSparseResidentialMeadowFarm

landResortAirportRailwayStationPortBridgeStadiumMountainForestBaseballFieldRiverChurchCom

mercial

SquareViaductIndustrialBeachParkingStorageTanksDesertMedium

ResidentialSchoolCenterPond

BareLandPlayground

DenseResidentialPark

SparseResidentialMeadow

FarmlandResortAirport

RailwayStationPort

BridgeStadium

MountainForest

BaseballFieldRiver

ChurchCommercial

SquareViaduct

IndustrialBeach

ParkingStorageTanks

DesertMediumResidential

SchoolCenter

Pond

0.95 0.04

0.99

0.97 0.01 0.01

0.90 0.06

0.99 0.01

0.99

0.99

0.06 0.820.01 0.01 0.01 0.03

0.98 0.01

0.020.96

1.00

0.020.98

0.01 0.98

1.00

1.00

0.02 0.98

0.99

0.02 0.870.030.02 0.030.03

0.97 0.03

0.05 0.01 0.02 0.85 0.010.04

1.00

0.02 0.02 0.93 0.010.01

0.99

1.00

0.99

0.02 0.98

0.02 0.01 0.96

0.03 0.030.01 0.02 0.89

0.02 0.02 0.94

0.990.0

0.2

0.4

0.6

0.8

1.0

Fig. 9: Confusion matrix of Aerial Image Data Set under thetraining ratio of 50% using the proposed SKAL based global-local two-stream architecture based on ResNet18.

TABLE VI: Comparison of OA(%) with some state-of-the-artresults on NWPU-RESISC45 Data Set.



SF-CNN with VGGNet [13] 89.89±0.16 92.55±0.14ResNet-18 + AM + CL [9] 92.17±0.08 92.46±0.09VGGNet-16 + RIFD [61] 90.12 92.27D-CNN with VGGNet-16 [14] 89.22±0.50 91.89±0.22VGG-VD16 + MSCP + MRA [57] 88.07±0.18 90.81±0.13SAL-TS-Net [62] 85.02±0.25 87.01±0.19TEX-TS-Net [62] 84.77±0.24 86.36±0.19ELM based Two-Stream [11] 80.22±0.22 83.16±0.18AlexNet [21] 76.69±0.21 79.85±0.13BoVW + SPM [21] 27.83±0.61 32.96±0.47LBP [21] 19.20±0.41 21.74±0.18

proposed two-stream architecture, and make a comparison ofOA(%) with some state-of-the-art methods, which are shownin Table VI. According to the results, our GoogleNet basedglobal-local two-stream architecture wins the second placeunder the training ratio of 10% and the first place under 20%.The current best method of ResNet18 + AM + CL [9] has anOA gain of 0.29% when training ratio increases from 10% to20%. However, our GoogleNet based two-stream architecturehas an OA gain of 2.54%, which indicates that our method hasbetter potential of OA with the increase of training samples.

The CM of our ResNet18 based two-stream architecture isreported in Fig 10. The samples most likely to be misclassifiedmostly belong to the classes of freeway, church, railway

station, industrial area, palace, commercial area, wetland, riverand medium residential. There are lots of confusing objectsand features among these scene classes that limits the OA.

V. CONCLUSION

In this paper, we propose a structured key area localization(SKAL) strategy to localize the most important area in remotesensing scene images. Based on SKAL, a global-local two-stream architecture, which can individually extract the globaland local features, is further presented for scene classificationof remote sensing images. To verify the effectiveness and ro-bustness of the proposed SKAL based global-local two-streamarchitecture, we conduct a lot of comparative experimentsbased on three kinds of widely used CNN models, includingAlexNet, ResNet18 and GoogleNet, on four popular remotesensing scene data sets, and achieve all the state-of-the-artresults of these data sets. The experimental results demonstratethe powerful capability of the joint global and local featurerepresentation of the proposed method, which can solve theproblem of large scale variance in remote sensing scene imagesto some extent.

REFERENCES

[1] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolu-tional neural networks for object detection in vhr optical remote sensingimages,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54,no. 12, pp. 7405–7415, 2016.

[2] Y. Li, Y. Zhang, X. Huang, H. Zhu, and J. Ma, “Large-scale remotesensing image retrieval by deep hashing neural networks,” IEEE Trans-actions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 950–965,2017.

[3] X. Lu, B. Wang, X. Zheng, and X. Li, “Exploring models and datafor remote sensing image caption generation,” IEEE Transactions onGeoscience and Remote Sensing, vol. 56, no. 4, pp. 2183–2195, 2017.

[4] X. Zhang, Q. Wang, S. Chen, and X. Li, “Multi-scale cropping mecha-nism for remote sensing image captioning,” in 2019 IEEE InternationalGeoscience and Remote Sensing Symposium (IGARSS), 2019.

[5] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp.749–753, 2018.

[6] W. Huang, Q. Wang, and X. Li, “Feature sparsity in convolutional neuralnetworks for scene classification of remote sensing image,” in 2019 IEEEInternational Geoscience and Remote Sensing Symposium (IGARSS),2019.

[7] L. Yan, R. Zhu, N. Mo, and Y. Liu, “Improved class-specific codebookwith two-step classification for scene-level classification of high resolu-tion remote sensing images,” Remote Sensing, vol. 9, p. 223, 03 2017.

[8] Q. Wang, S. Liu, J. Chanussot, and X. Li, “Scene classification withrecurrent attention of vhr remote sensing images,” IEEE Transactionson Geoscience and Remote Sensing, vol. 57, no. 2, pp. 1155–1167, 2018.

[9] J. Li, D. Lin, Y. Wang, G. Xu, and C. Ding, “Deep discriminativerepresentation learning with attention map for scene classification,”arXiv preprint arXiv:1902.07967, 2019.

[10] N. He, L. Fang, S. Li, J. Plaza, and A. Plaza, “Skip-connected covariancenetwork for remote sensing scene classification,” IEEE transactions onneural networks and learning systems, 2019.

[11] Y. Yu and F. Liu, “A two-stream deep fusion framework for high-resolution aerial scene classification,” Computational intelligence andneuroscience, vol. 2018, 2018.

[12] R. M. Anwer, F. S. Khan, J. van de Weijer, M. Molinier, and J. Laakso-nen, “Binary patterns encoded convolutional neural networks for texturerecognition and remote sensing scene classification,” ISPRS journal ofphotogrammetry and remote sensing, vol. 138, pp. 74–85, 2018.

[13] J. Xie, N. He, L. Fang, and A. Plaza, “Scale-free convolutional neuralnetwork for remote sensing scene classification,” IEEE Transactions onGeoscience and Remote Sensing, 2019.


freewayrectangular_farm

landchurchm

eadowforestshipcircular_farm

landrailway_stationislandairplaneparking_lotindustrial_areatherm

al_power_stationpalacechaparralground_track_fieldcloudrunwaybeachcom

mercial_area

airportwetlandsea_icesparse_residentialdesertrailwayharborriverterracebaseball_diam

ondtennis_courtm

obile_home_park

golf_coursebridgem

ountainlakeoverpassroundaboutm

edium_residential

storage_tankstadiumbasketball_courtsnowbergintersectiondense_residential

freewayrectangular_farmland

churchmeadow

forestship

circular_farmlandrailway_station

islandairplane

parking_lotindustrial_area

thermal_power_stationpalace

chaparralground_track_field

cloudrunway

beachcommercial_area

airportwetlandsea_ice

sparse_residentialdesert

railwayharbor

riverterrace

baseball_diamondtennis_court

mobile_home_parkgolf_course

bridgemountain

lakeoverpass

roundaboutmedium_residential

storage_tankstadium

basketball_courtsnowberg

intersectiondense_residential

0.91 0.03 0.02 0.01

0.94 0.04

0.80 0.010.13 0.03

0.01 0.900.01 0.02 0.03

0.97 0.01

0.97

0.99

0.89 0.01 0.04

0.93 0.05

0.97 0.01

0.95

0.03 0.010.860.03

0.94

0.13 0.01 0.010.72 0.020.02 0.01 0.01 0.01

0.99

0.95 0.04

0.99

0.02 0.01 0.90 0.05

0.04 0.93

0.06 0.01 0.010.04 0.83

0.01 0.94

0.010.01 0.89 0.01 0.04

0.98

0.90 0.02 0.04

0.93 0.04

0.02 0.06 0.89

0.99

0.01 0.01 0.02 0.870.02 0.01 0.01

0.05 0.93

0.96 0.01

0.94 0.02

0.95 0.01 0.02

0.97

0.01 0.95

0.02 0.94

0.06 0.01 0.91

0.02 0.940.01

0.98

0.02 0.030.02 0.010.85 0.05

0.98

0.97

0.03 0.94

0.99

0.02 0.010.01 0.93

0.03 0.04 0.91 0.0

0.2

0.4

0.6

0.8

Fig. 10: Confusion matrix of NWPU-RESISC45 data set under the training ratio of 50% using the proposed SKAL basedglobal-local two-stream architecture based on ResNet18.

[14] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learningmeets metric learning: Remote sensing image scene classification vialearning discriminative cnns,” IEEE transactions on geoscience andremote sensing, vol. 56, no. 5, pp. 2811–2821, 2018.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neural infor-mation processing systems, 2012, pp. 1097–1105.

[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proceedings of the IEEE conference on computer visionand pattern recognition, 2016, pp. 770–778.

[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”in Proceedings of the IEEE conference on computer vision and patternrecognition, 2015, pp. 1–9.

[18] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensionsfor land-use classification,” in Proceedings of the 18th SIGSPATIAL in-ternational conference on advances in geographic information systems.ACM, 2010, pp. 270–279.

[19] Q. Zou, L. Ni, T. Zhang, and Q. Wang, “Deep learning based featureselection for remote sensing scene classification,” IEEE Geoscience andRemote Sensing Letters, vol. 12, no. 11, pp. 2321–2325, 2015.

[20] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu,

“Aid: A benchmark data set for performance evaluation of aerial sceneclassification,” IEEE Transactions on Geoscience and Remote Sensing,vol. 55, no. 7, pp. 3965–3981, 2017.

[21] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classifi-cation: mark and state of the art,” Proceedings of the IEEE, vol. 105,no. 10, pp. 1865–1883, 2017.

[22] J. A. dos Santos, O. A. B. Penatti, and R. da Silva Torres, “Evaluatingthe potential of texture and color descriptors for remote sensing imageretrieval and classification.” in VISAPP (2), 2010, pp. 203–208.

[23] S. Bhagavathy and B. S. Manjunath, “Modeling and detection of geospa-tial objects using texture motifs,” IEEE Transactions on Geoscience andRemote Sensing, vol. 44, no. 12, pp. 3706–3715, 2006.

[24] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial objectdetection and geographic image classification based on collection ofpart detectors,” ISPRS Journal of Photogrammetry and Remote Sensing,vol. 98, pp. 119–132, 2014.

[25] G. Cheng, P. Zhou, J. Han, L. Guo, and J. Han, “Auto-encoder-basedshared mid-level visual dictionary learning for scene classification usingvery high resolution remote sensing images,” IET Computer Vision,vol. 9, no. 5, pp. 639–647, 2015.

[26] Y. Yang and S. Newsam, “Geographic image retrieval using local invari-ant features,” IEEE Transactions on Geoscience and Remote Sensing,


vol. 51, no. 2, pp. 818–832, 2012.[27] V. Risojevic and Z. Babic, “Fusion of global and local descriptors

for remote sensing image classification,” IEEE Geoscience and RemoteSensing Letters, vol. 10, no. 4, pp. 836–840, 2012.

[28] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learningnatural scene categories,” in 2005 IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR’05), vol. 2. IEEE,2005, pp. 524–531.

[29] Q. Zhu, Y. Zhong, B. Zhao, G.-S. Xia, and L. Zhang, “Bag-of-visual-words scene classifier with local and global features for high spatialresolution remote sensing imagery,” IEEE Geoscience and RemoteSensing Letters, vol. 13, no. 6, pp. 747–751, 2016.

[30] A. M. Cheriyadat, “Unsupervised feature learning for aerial sceneclassification,” IEEE Transactions on Geoscience and Remote Sensing,vol. 52, no. 1, pp. 439–451, 2013.

[31] G. Sheng, W. Yang, T. Xu, and H. Sun, “High-resolution satellite sceneclassification using a sparse coding based multiple feature combination,”International journal of remote sensing, vol. 33, no. 8, pp. 2395–2412,2012.

[32] D. Dai and W. Yang, “Satellite image classification via two-layer sparsecoding with biased image representation,” IEEE Geoscience and RemoteSensing Letters, vol. 8, no. 1, pp. 173–176, 2010.

[33] Y. Li, C. Tao, Y. Tan, K. Shang, and J. Tian, “Unsupervised mul-tilayer feature learning for satellite image scene classification,” IEEEGeoscience and Remote Sensing Letters, vol. 13, no. 2, pp. 157–161,2016.

[34] X. Lu, X. Zheng, and Y. Yuan, “Remote sensing scene classification byunsupervised representation learning,” IEEE Transactions on Geoscienceand Remote Sensing, vol. 55, no. 9, pp. 5148–5157, 2017.

[35] W. Zhou, Z. Shao, C. Diao, and Q. Cheng, “High-resolution remote-sensing imagery retrieval using sparse features by auto-encoder,” Remotesensing letters, vol. 6, no. 10, pp. 775–783, 2015.

[36] K. Nogueira, O. A. Penatti, and J. A. dos Santos, “Towards betterexploiting convolutional neural networks for remote sensing sceneclassification,” Pattern Recognition, vol. 61, pp. 539–556, 2017.

[37] E. Li, J. Xia, P. Du, C. Lin, and A. Samat, “Integrating multilayerfeatures of convolutional neural networks for remote sensing sceneclassification,” IEEE Transactions on Geoscience and Remote Sensing,vol. 55, no. 10, pp. 5653–5665, 2017.

[38] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, “Transferring deep convolutionalneural networks for the scene classification of high-resolution remotesensing imagery,” Remote Sensing, vol. 7, no. 11, pp. 14 680–14 707,2015.

[39] M. Pandey and S. Lazebnik, “Scene recognition and weakly supervisedobject localization with deformable part-based models,” in 2011 Inter-national Conference on Computer Vision. IEEE, 2011, pp. 1307–1314.

[40] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,”in Proceedings of the IEEE Conference on Computer Vision and PatternRecognition, 2016, pp. 2846–2854.

[41] R. G. Cinbis, J. Verbeek, and C. Schmid, “Weakly supervised objectlocalization with multi-fold multiple instance learning,” IEEE transac-tions on pattern analysis and machine intelligence, vol. 39, no. 1, pp.189–203, 2016.

[42] D. Zhang, D. Meng, L. Zhao, and J. Han, “Bridging saliency detectionto weakly supervised object detection based on self-paced curriculumlearning,” arXiv preprint arXiv:1703.01290, 2017.

[43] J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent atten-tion convolutional neural network for fine-grained image recognition,”in Proceedings of the IEEE conference on computer vision and patternrecognition, 2017, pp. 4438–4446.

[44] P. Tang, X. Wang, S. Bai, W. Shen, X. Bai, W. Liu, and A. L. Yuille,“Pcl: Proposal cluster learning for weakly supervised object detection,”IEEE transactions on pattern analysis and machine intelligence, 2018.

[45] Z. Yang, D. Mahajan, D. Ghadiyaram, R. Nevatia, and V. Ramanathan,“Activity driven weakly supervised object detection,” in Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition,2019, pp. 2917–2926.

[46] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-timeobject detection with region proposal networks,” in Advances in neuralinformation processing systems, 2015, pp. 91–99.

[47] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE internationalconference on computer vision, 2015, pp. 1440–1448.

[48] T. Hoeser and C. Kuenzer, “Object detection and image segmentationwith deep learning on earth observation data: A review-part i: Evolutionand recent trends,” Remote Sensing, vol. 12, no. 10, p. 1667, 2020.

[49] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deepnetwork training by reducing internal covariate shift,” arXiv preprintarXiv:1502.03167, 2015.

[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:A large-scale hierarchical image database,” in 2009 IEEE conference oncomputer vision and pattern recognition. Ieee, 2009, pp. 248–255.

[51] R. Pires de Lima and K. Marfurt, “Convolutional neural network forremote-sensing scene classification: Transfer learning analysis,” RemoteSensing, vol. 12, no. 1, p. 86, 2020.

[52] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnnfeatures off-the-shelf: an astounding baseline for recognition,” in Pro-ceedings of the IEEE conference on computer vision and patternrecognition workshops, 2014, pp. 806–813.

[53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014.

[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: Animperative style, high-performance deep learning library,” in Advancesin neural information processing systems, 2019, pp. 8026–8037.

[55] X. Bian, C. Chen, L. Tian, and Q. Du, “Fusing local and global featuresfor high-resolution scene classification,” IEEE Journal of Selected Topicsin Applied Earth Observations and Remote Sensing, vol. 10, no. 6, pp.2889–2901, 2017.

[56] L. Huang, C. Chen, W. Li, and Q. Du, “Remote sensing image sceneclassification using multi-scale completed local binary patterns and fishervectors,” Remote Sensing, vol. 8, no. 6, p. 483, 2016.

[57] N. He, L. Fang, S. Li, A. Plaza, and J. Plaza, “Remote sensingscene classification using multilayer stacked covariance pooling,” IEEETransactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp.6899–6910, 2018.

[58] S. Chaib, H. Liu, Y. Gu, and H. Yao, “Deep feature fusion for vhrremote sensing scene classification,” IEEE Transactions on Geoscienceand Remote Sensing, vol. 55, no. 8, pp. 4775–4784, 2017.

[59] E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani, “Usingconvolutional features and a sparse autoencoder for land-use sceneclassification,” International Journal of Remote Sensing, vol. 37, no. 10,pp. 2149–2167, 2016.

[60] H. Wu, B. Liu, W. Su, W. Zhang, and J. Sun, “Deep filter banks for land-use scene classification,” IEEE Geoscience and Remote Sensing Letters,vol. 13, no. 12, pp. 1895–1899, 2016.

[61] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotation-invariant andfisher discriminative convolutional neural networks for object detection,”IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 265–278,2018.

[62] Y. Yu and F. Liu, “Dense connectivity based two-stream deep feature fu-sion framework for aerial scene classification,” Remote Sensing, vol. 10,no. 7, p. 1158, 2018.

Qi Wang (M’15-SM’15) received the B.E. degree inautomation and the Ph.D. degree in pattern recog-nition and intelligent systems from the Universityof Science and Technology of China, Hefei, China,in 2005 and 2010, respectively. He is currently aProfessor with the School of Computer Science, withthe Center for OPTical IMagery Analysis and Learn-ing, Northwestern Polytechnical University, Xi’an,China. His research interests include computer vi-sion and pattern recognition.


Wei Huang received the B.E. degree in controltheory and engineering from the Northwestern Poly-technical University, Xi’an, China, in 2018. He iscurrently working toward the M.S. degree in com-puter science in the Center for OPTical IMageryAnalysis and Learning, Northwestern PolytechnicalUniversity, Xi’an, China. His research interests in-clude deep learning and computer vision.

Zhitong Xiong received the M.E. degree from theNorthwestern Polytechnical University, Xian, China,where he is currently pursuing the Ph.D. degree withthe School of Computer Science and the Centerfor OPTical IMagery Analysis and Learning (OP-TIMAL). His research interests include computervision and machine learning.

Xuelong Li (M’02-SM’07-F’12) is currently a Professor with the Schoolof Computer Science, with the Center for OPTical IMagery Analysis andLearning, Northwestern Polytechnical University, Xi’an, China.

Date post:	15-Oct-2021
Category:	Documents
Upload:	others
View:	6 times
Download:	0 times

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING …

Documents