IEEE TRANSACTIONS ON CYBERNETICS, VOL. 50, NO. 1, JANUARY 2020

Feature Extraction for Classification of Hyperspectral and LiDAR Data Using Patch-to-Patch CNN

Mengmeng Zhang, Student Member, IEEE, Wei Li, Senior Member, IEEE, Qian Du, Fellow, IEEE, Lianru Gao, Senior Member, IEEE, and Bing Zhang, Senior Member, IEEE

Abstract—Multisensor fusion is of great importance in Earth observation related applications. For instance, hyperspectral images (HSIs) provide rich spectral information while light detection and ranging (LiDAR) data provide elevation information, and using HSI and LiDAR data together can achieve better classification performance. In this paper, an unsupervised feature extraction framework, named patch-to-patch convolutional neural network (PToP CNN), is proposed for collaborative classification of hyperspectral and LiDAR data. More specifically, a three-tower PToP mapping is first developed to seek an accurate representation from HSI to LiDAR data, aiming at merging multiscale features between the two different sources. Then, by integrating hidden layers of the designed PToP CNN, the extracted features are expected to possess deeply fused characteristics. Accordingly, features from different hidden layers are concatenated into a stacked vector and fed into three fully connected layers. To verify the effectiveness of the proposed classification framework, experiments are executed on two benchmark remote sensing data sets. The experimental results demonstrate that the proposed method provides superior performance when compared with some state-of-the-art classifiers, such as two-branch CNN and context CNN.

Index Terms—Deep convolutional neural network (CNN), feature extraction, hyperspectral image (HSI) classification, multisensor fusion.

I. INTRODUCTION

SENSOR technology has experienced important advances [1]–[4] lately, allowing us to measure various aspects of the objects on the surface of Earth.

Manuscript received March 27, 2018; revised June 11, 2018 and July 24, 2018; accepted August 2, 2018. Date of publication September 18, 2018; date of current version October 22, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant NSFC-91638201 and Grant 61571033, in part by the Beijing Natural Science Foundation under Grant 4172043, in part by the Beijing Nova Program under Grant Z171100001117050, and in part by the Fundamental Research Funds for the Central Universities under Grant BUCTRC201615. This paper was recommended by Associate Editor P. P. Angelov. (Corresponding author: Wei Li.)

M. Zhang and W. Li are with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China (e-mail: [email protected]).

Q. Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762 USA (e-mail: [email protected]).

L. Gao and B. Zhang are with the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2018.2864670

Remotely sensed hyperspectral images (HSIs) provide rich spectral information to uniquely discriminate various materials of interest, leading to finer classification of land-cover classes [5]–[8]. However, in certain circumstances, it may be necessary to resort to a different source to complement the information provided solely by hyperspectral instruments for further improving and/or refining classification. For this purpose, a series of approaches have been investigated in the literature for fusion of data collected from different sources [9], [10]. Light detection and ranging (LiDAR) data, which provide elevation information about the surveyed area, are a very useful source for complementing the information provided solely by HSI [11], [12]. Collaborative classification of HSI and LiDAR has been extensively employed in various applications, such as complex area classification [13] and forest fire management [14], owing to its good performance. Numerous studies have indicated that classification performance can be improved after integrating HSI and LiDAR data. For example, in [15], LiDAR was used for scene segmentation and HSI data for classifying the segmented regions; in [16], Ghamisi et al. exploited morphological extinction profiles to extract both HSI and LiDAR features; and in [17], Rasti et al. utilized extinction profiles for joint feature extraction, followed by total variation component analysis for further fusion.

On the other hand, Liao et al. [18] emphasized that simple concatenation or stacking of features such as morphological attribute profiles may contain redundant information, and despite the simplicity of such feature fusion methods, the fusion systems may not perform better (or may even perform worse) than using a single type of features. This is due to the fact that the element values of different features may be significantly unbalanced, and the information contained by different features is not equally represented or measured. Khodadadzadeh et al. [19] further pointed out that the increased dimensionality of the stacked features and the limited number of labeled samples may cause the curse of dimensionality. Thus, in [20], a decision-fusion method for HSI and LiDAR data classification was presented; besides, linear and nonlinear features were also combined through a decision fusion strategy in [21]. Although these decision-fusion-based studies have shown excellent performance on classification tasks, they cannot be a feasible solution to deal with limited training samples; they merely avoid the feature extraction process, which demands more training samples.

2168-2267 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



How to extract joint features containing complete information of HSI and LiDAR data, without suffering from the Hughes effect, still faces great challenges. Traditionally, human-engineered features depending on experts' experience and parameter setting have been the main workhorse for classification tasks [22]; nevertheless, Liu et al. further pointed out that it is difficult to find appropriate parameters to generate features for different classification tasks. Recently, deep learning-based methods have broadly replaced hand-engineered approaches in many domains, and have aroused wide attention for their capability of automatically extracting robust and high-level features, which are known to be generally invariant to most local changes of the input, at deeper layers. Zhang et al. [23] explored the saliency features in remote sensing scenes to build networks for scene classification. The general way of constructing deep networks for remote sensing images has been systematically analyzed in [24], which attempted to evaluate the effectiveness of the state-of-the-art deep learning methods on remote sensing images. For the sake of extracting high-level features in HSI, a deep learning architecture with a multilayer stacked auto-encoder (AE) was constructed in an unsupervised manner in [25]. The convolutional neural network (CNN), which needs fewer parameters than fully connected networks with the same number of hidden units, has drawn increasing attention in image analysis. For example, CNN was exploited to extract features for HSI classification and obtained excellent performance in [26]. In addition, a set of improved methods based on CNN were used for remote sensing classification tasks and yielded excellent performance [27], [28].

CNN-based methods can yield promising results only when a sufficient supply of labeled training samples is ensured [22], [29]–[31]; unfortunately, only a small number of labeled samples is available for training in practical situations, especially for remote sensing data. In other words, the supervised CNN generally suffers from either limited training samples or imbalanced data sets. Meanwhile, deep models are also trained to seek feature representations by means of unsupervised learning methods. Unsupervised feature learning, which has quick access to arbitrary amounts of unlabeled data, has become a focus of attention in both academia and industry. In general, the chief aim of unsupervised feature learning is to extract useful features from unlabeled data, detect and eliminate input redundancies, and preserve only essential aspects of the data in robust and discriminative representations [32]. Romero et al. [33] proposed pioneering work moving from the supervised CNN to the unsupervised CNN for learning spectral–spatial features, which was based on sparse learning to estimate the network weights. However, this model was trained in a greedy layer-wise fashion, i.e., it is not an end-to-end network. The image-to-image translation networks [34], which are end-to-end networks, map an image from one domain to another for learning the translation function. Moreover, feature representation with sufficient information can be effectively explored in the automatic learning process of an end-to-end network, the AE, by minimizing the reconstruction error between an input sample and its reconstruction [35].

In this paper, an intuitive yet effective procedure for the joint feature extraction of HSI and LiDAR data via a patch-to-patch (PToP) CNN is investigated. The proposed framework is an unsupervised feature extraction framework that uses a PToP CNN to learn joint features between HSI and LiDAR data without requiring labeled data. The PToP CNN is based on the so-called encoder–decoder translation architecture. Specifically, the input source HSI is first mapped into a hidden subspace via an encoding path (encoder), and then reversed to reconstruct the output source LiDAR data by a decoding path (decoder). Meanwhile, the hidden representation within the translation procedure can be deemed as fused features of HSI and LiDAR data. After that, features derived from different hidden layers of the PToP CNN are integrated by the hierarchical fusion module, then converted into a stacked vector and fed into three fully connected layers (3 FC) to produce the final classification map.

The main contributions can be summarized as follows.

1) A multiscale PToP CNN is proposed for feature extraction, which consists of three tunnels covering a convolutional filter bank with a well-designed structure. To the best of our knowledge, learning such an end-to-end network for joint feature extraction of HSI and LiDAR data has not been studied before. Different hidden nodes in the network are capable of precisely capturing attributes at different information levels, which ensures convenient access to multiscale joint features.

2) The hierarchical fusion module can optimally exploit joint features extracted by the PToP CNN for classification. Through the module, initial spatial–spectral features obtained from the multiscale filter bank are combined together to form a joint spatial–spectral feature map. The feature map, representing rich spectral and spatial properties of HSI and LiDAR data, is then fed into a convolution-based block for the "multilayer block concatenation" operation.

3) The neural weight learning process of the designed PToP CNN is completely unsupervised and therefore independent of labeled samples. In the proposed PToP CNN, training patches are first collected by adopting a sliding window over the HSI and LiDAR data, and the learning process based on these patches ensures the completeness of the features even with small-size labeled sample sets.

The remainder of this paper is organized as follows. Some related works are introduced in Section II. The proposed methodology is described in Section III. The experimental results are discussed in Section IV. The conclusion is summarized in Section V.

II. RELATED WORKS

Several traditional unsupervised end-to-end network architectures have been leveraged for diverse purposes, e.g., single-source feature learning or image segmentation. Here, some basic principles of such models are reviewed.

A. Classic Auto-Encoder

The basic AE takes an input x ∈ ℝ^D and seeks a representation a ∈ ℝ^N of x via a nonlinear mapping by minimizing the reconstruction error [35]. The architecture of the AE involves an encoder and a decoder. The encoder maps the input x into the hidden representation a with encoding parameters W and b

a = f (Wx + b) (1)

where W is a weight matrix to be estimated during the training course, b is a bias vector, and f(·) stands for a nonlinear activation function, such as a sigmoid function or a rectified function. The encoded feature representation a is then used by the decoder to reconstruct the input x based on a reverse mapping. Mathematically, it can be denoted as

y = f(W′a + b′) (2)

where W′ and b′ are the corresponding decoding parameters. The typical AE network [32] is trained by minimizing the error between the input x and the decoded reconstruction of the input vector x, i.e., y, making ‖x − y‖₂² → 0. The parameters of the AE are generally optimized by stochastic gradient descent (SGD) [36]. In particular, if x is reconstructed by y accurately, the hidden representation a can extract sufficient information from x. In this case, the hidden representation a can be deemed as the features of x. Thus, the AE model can realize nonlinear feature extraction by the encoding and decoding steps.
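For concreteness, the AE of (1) and (2) can be written in a few lines of Keras, the library used in Section IV; the sizes D and N and the random training data below are placeholders for illustration, not the configuration used in the paper.

import numpy as np
from tensorflow import keras

D, N = 144, 64                                             # input and hidden sizes (assumed)
x_in = keras.Input(shape=(D,))
a = keras.layers.Dense(N, activation="sigmoid")(x_in)      # encoder: a = f(Wx + b)
y = keras.layers.Dense(D, activation="sigmoid")(a)         # decoder: y = f(W'a + b')
ae = keras.Model(x_in, y)
ae.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
           loss="mse")                                      # drive ||x - y||_2^2 toward 0
X = np.random.rand(1000, D).astype("float32")               # unlabeled samples (placeholder)
ae.fit(X, X, epochs=10, batch_size=32, verbose=0)           # the input doubles as the target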

B. End-to-End Architecture for Image Analysis

As we know, the most common usage of convolutional networks is for classification tasks [37], where the output for an image is a single class label. However, in many visual tasks, the desired output should include localization; for example, a class label is supposed to be assigned to each pixel.

End-to-end architectures of existing networks, which predict dense outputs from arbitrary-sized inputs, are trained end-to-end and pixel-to-pixel in image analysis, e.g., semantic segmentation. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In the end-to-end network, the subsampling pooling operations enable the learning process, while upsampling layers enable pixel-wise prediction [38]. In other words, the end-to-end architecture for image segmentation generally consists of an encoder path for context capturing and a symmetric decoder path for precise localization. Widely used deep architectures for segmentation have an identical end-to-end structure as illustrated in Fig. 1, but differ in the form of the encoder–decoder network design and training strategy. For example, Long et al. [38] built fully convolutional networks that took input of arbitrary size and produced correspondingly sized output with efficient inference and learning. Moreover, in [37], an architecture called U-Net consisted of a contracting path and an expansive path, and could be trained end-to-end using very few sample images.
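A schematic encoder–decoder of the kind illustrated in Fig. 1 can be sketched as follows; the layer widths are assumptions chosen only to show the downsampling/upsampling pattern, not any specific published network.

from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(None, None, 3))                   # arbitrary-sized input
e = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
e = layers.MaxPooling2D(2)(e)                              # encoder path: capture context
e = layers.Conv2D(32, 3, padding="same", activation="relu")(e)
d = layers.UpSampling2D(2)(e)                              # decoder path: restore resolution
out = layers.Conv2D(1, 1, activation="sigmoid")(d)         # dense, pixel-wise prediction
seg_net = keras.Model(inp, out)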

Fig. 1. Illustration of end-to-end network architecture.

III. PROPOSED CLASSIFICATION FRAMEWORK

In the aforementioned methods, feature extraction with an encoder–decoder architecture involves only one type of data source. The encoder–decoder architecture holds two qualities: 1) feature extraction and 2) two-domain translation; for example, image segmentation tasks map an image from one domain to another. Different from existing single-source feature extraction methods, we use the encoder–decoder architecture to integrate two-domain translation into the feature extraction process, thus enabling seamless fusion of HSI and LiDAR data.

In this section, an unsupervised feature extraction method for fusion of HSI and LiDAR data is described in detail, and network training is elaborated. As shown in Fig. 2, the proposed framework includes a three-tower feature extractor called PToP CNN (Part I) and a hierarchical fusion module (Part II), followed by a classifier that consists of fully connected layers with softmax loss (Part III).

A. Feature Extraction by PToP CNN

The unsupervised PToP CNN framework is designed to learn a translation function between two image domains. Let XsourceI and XsourceII be two images in different domains, where XsourceI represents the HSI data and XsourceII represents the LiDAR data. Different from the widely used supervised fashion, we attempt to discover the relationship between XsourceI and XsourceII in an unsupervised way.

First of all, we assume that the relationship between XsourceI and XsourceII exists not only at the level of the entire image but also at the level of local regions [34]. Furthermore, for any given patch xsourceI (a patch derived from XsourceI) and xsourceII (a patch derived from XsourceII), there exists an underlying representation hW,b(xsourceI) such that patch xsourceII can be recovered from this underlying representation together with patch xsourceI, and this underlying representation can be computed from each patch-pair xsourceI–xsourceII. Mathematically, the translation process with weight W and bias b is denoted as

hW,b(xsourceI) ≈ xsourceII (3)

where xsourceI and xsourceII can be viewed as the input and output of the classic AE [35], respectively. xsourceII is the learning target of the PToP CNN, whose value and inherent characteristics are the only criteria for adjusting all hidden units via backpropagation. With a fully trained AE, the translation process in (3) can be viewed as fused feature extraction from xsourceI and xsourceII.
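A hedged sketch of this translation objective follows: a small convolutional encoder–decoder is trained so that its output for an HSI patch approximates the co-registered LiDAR patch. The 11 × 11 patch size follows Section III-C and the 144 bands follow the Houston data; the layer widths are illustrative assumptions, not the exact configuration of Fig. 3.

from tensorflow import keras
from tensorflow.keras import layers

hsi_patch = keras.Input(shape=(11, 11, 144))               # x_sourceI (HSI patch)
h = layers.Conv2D(64, 3, padding="same", activation="relu")(hsi_patch)
h = layers.Conv2D(32, 3, padding="same", activation="relu")(h)   # hidden (fused) features
lidar_hat = layers.Conv2D(1, 3, padding="same")(h)         # reconstruction of x_sourceII
ptop = keras.Model(hsi_patch, lidar_hat)
ptop.compile(optimizer="adam", loss="mse")                  # h_{W,b}(x_sourceI) ~ x_sourceII
# ptop.fit(hsi_patches, lidar_patches, ...)                 # unsupervised: no class labels needed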

As illustrated in Fig. 2, Part I of the architecture reflects the translation process, which is elaborated in more detail in Fig. 3.


Fig. 2. Proposed feature extraction and classification framework of PToP CNN.

Fig. 3. Overall parameter configuration of the designed PToP CNN network.

The common CNN model involves multiple layers of neurons, each of which extracts features at a different level. In the PToP CNN, as shown in Fig. 3, the convolutional layers applied to the input image use a multiscale filter group that simultaneously convolves the input image with three convolution-based towers of different sizes (i.e., 3 × 3 × D, 5 × 5 × D, and 7 × 7 × D, where D is the number of feature bands). The 3 × 3 × D tunnel concentrates on addressing spectral correlations, while the 5 × 5 × D and 7 × 7 × D tunnels are used to exploit local spatial correlations of HSI and LiDAR data. The outputs of the three-scale feature extractor in the PToP CNN are combined together to form a joint feature map for reconstructing Source II (LiDAR data).
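The three-tower multiscale filter group can be sketched as follows; the 3 × 3, 5 × 5, and 7 × 7 kernel sizes follow the text, while the number of filters per tower is an assumption for illustration.

from tensorflow import keras
from tensorflow.keras import layers

D = 144                                                    # number of input bands (assumed)
inp = keras.Input(shape=(11, 11, D))
towers = []
for k in (3, 5, 7):                                        # the 3x3xD, 5x5xD, and 7x7xD tunnels
    towers.append(layers.Conv2D(64, k, padding="same", activation="relu")(inp))
joint = layers.Concatenate()(towers)                       # joint multiscale feature map
extractor = keras.Model(inp, joint)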

Since the dimensionality of the feature maps from the last convolutional layer differs from the number of bands in Source II, a final convolutional operation is implemented to make them consistent. The three-tower feature extractor with a massive number of parameters can be computationally expensive, and merging the outputs of the three convolution tunnels increases the size of the network, which inevitably leads to high computational complexity. As the network size increases, optimizing the network with limited training samples also faces overfitting and divergence.


Fig. 4. Process of constructing training patches from the original image: (left) original image and (right) training patches.

Fortunately, in the unsupervised setting, the number of training samples of the designed PToP network is related to the number of HSI and LiDAR patches rather than to labeled pixels. The patch pre-processing part indicated in Fig. 4 is described in Section III-C.

B. Hierarchical Fusion Module

Under the PToP network, besides high-level features, many low-level and mid-level features are shared between HSI and LiDAR data, e.g., edge locations for some buildings in urban image scenes. In particular, as shown in Fig. 5 (a detailed elaboration of Part II of Fig. 2), the hierarchical fusion module is provided for integrating features of diverse hierarchies, including different convolutional filter scales and different convolutional layers.

The multiscale filter group, which is conceptually similar to the inception module in [39], is adopted to take full advantage of different local structures of the input image. In the proposed framework, the multiscale filter group is adopted in a different manner; both local spatial structures and local spectral correlations are jointly exploited in the translation process for better integration of HSI and LiDAR data information, which is implemented by the "multiscale filter concatenation" operation as illustrated in Fig. 5.

Additionally, in the designed PToP CNN, different hidden layers represent distinct features; for example, the shallower layers may extract features carrying more information from data Source I, while the deeper layers extract abstract features carrying more information from data Source II. That is to say, features extracted via the proposed PToP CNN involve both various spatial scales and different proportions of information from each image domain, i.e., Source I and Source II. Features exploited in different hidden layers are fused by the multilayer block concatenation operation as illustrated in Fig. 5, further ensuring the completeness of information. Specifically, we choose eight distinct layers in the PToP CNN, namely layers 2–9.

Overall, the proposed PToP CNN can not only provide excellent mapping between the two sources of data but also deliver features of multiple scales and different levels from distinct hidden layers. Moreover, the hierarchical fusion module structures the integration hierarchy: the first stage of the module implements the multiscale filter concatenation operation. Then, the multiscale concatenated data in each layer are fed into the network following a convolutional operation with a spatial kernel (e.g., a 3 × 3 kernel). After batch normalization [40], [41], the features are flattened and fed to a fully connected layer for the multilayer block concatenation operation.
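The hierarchical fusion idea can be sketched roughly as follows: feature maps taken from several hidden layers are each passed through a convolution and batch-normalization block, flattened, and concatenated before the fully connected layers. The eight branches (layers 2–9) and the 11 × 11 × 192 feature size follow the text; the filter and unit counts and the 15-class output are assumptions for illustration.

from tensorflow import keras
from tensorflow.keras import layers

def branch(feat):
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(feat)  # spatial kernel
    x = layers.BatchNormalization()(x)
    return layers.Flatten()(x)

hidden_inputs = [keras.Input(shape=(11, 11, 192)) for _ in range(8)]   # features of layers 2-9
fused = layers.Concatenate()([branch(f) for f in hidden_inputs])       # multilayer block concatenation
fc = layers.Dense(256, activation="relu")(fused)
fc = layers.Dense(128, activation="relu")(fc)
out = layers.Dense(15, activation="softmax")(fc)                       # e.g., 15 Houston classes
fusion_classifier = keras.Model(hidden_inputs, out)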

C. Training Process of PToP CNN

For the PToP CNN, training patches are collected by adopting a sliding window in an unsupervised way, as depicted in Fig. 4. Both HSI and LiDAR patches are collected through the process shown in Fig. 4, and each HSI–LiDAR patch-pair is acquired over the same area, thus ensuring high correlation between the two-source data for further joint feature extraction. The value of S (moving step length) is set to 2 with an 11 × 11 window size, and the total number of training samples stands at about (Width × Height)/S² (Width and Height are the spatial size of the image), ensuring sufficient training samples. For the hierarchical fusion stage and final classification, a simple but effective data augmentation method is utilized, which produces additional data without introducing extra labeling costs. Specifically, a random seed is generated to control the counter-clockwise rotation angle (90°, 180°, 270°, or 360°) in the training phase.
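The patch collection and rotation augmentation described above can be sketched as follows, assuming co-registered HSI and LiDAR arrays over the same scene; the helper names are hypothetical.

import numpy as np

def extract_patch_pairs(hsi, lidar, win=11, step=2):
    """hsi: (H, W, D) array; lidar: (H, W, 1) array over the same area."""
    pairs = []
    h, w = hsi.shape[:2]
    for i in range(0, h - win + 1, step):                  # sliding window, step S = 2
        for j in range(0, w - win + 1, step):
            pairs.append((hsi[i:i+win, j:j+win], lidar[i:i+win, j:j+win]))
    return pairs                                           # roughly (Width x Height) / S^2 pairs

def rotate_random(patch, rng=np.random):
    k = rng.randint(1, 5)                                  # 90, 180, 270, or 360 degrees
    return np.rot90(patch, k)                              # counter-clockwise rotation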

Then, the training process is divided into three stages as shown in Fig. 2. In the first stage, training patches derived from Source I (HSI data) are fed into the designed PToP network, whose detailed configuration is depicted in Fig. 3. After that, input samples flow through the PToP network to obtain features of different hierarchies (different filter scales, different layers). Therefore, the input of the "hierarchical fusion module" is acquired, and the detailed parameter setting and network structure are shown in Fig. 5. The right half of Fig. 5 illustrates one of the branches, for a specific layer L, including some fixed operations: convolution and batch normalization. Since optimizing the parameters of the 8 branches simultaneously is difficult, each branch of the hierarchical fusion module is trained separately. When the 8 branches are merged, the pretrained feature extractors extract the corresponding features from the input data with their fully connected layer and softmax prediction layer removed. The remaining layers in the 8 branches are either fixed or trainable with a small learning rate under the SGD rule. All the branches are concatenated to generate the final informative feature vector.

During the learning process, all data are normalized to the range 0–1 to accelerate the convergence of the network. The weights and biases of all the convolutional layers are initialized with Glorot normal initialization and then updated with a small learning rate.
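These conventions might be expressed in Keras as follows; the 0.001 learning rate anticipates the tuning result reported in Section IV-B, and the layer shown is only an example.

from tensorflow import keras
from tensorflow.keras import layers

def normalize01(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)     # scale data to the range 0-1

conv = layers.Conv2D(64, 3, padding="same",
                     kernel_initializer="glorot_normal",   # Glorot normal initialization
                     bias_initializer="zeros")
optimizer = keras.optimizers.Adam(learning_rate=1e-3)      # small learning rate (0.001)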

IV. EXPERIMENTS AND ANALYSIS

In this section, two widely used remote sensing data sets are used to validate the performance of the proposed method. For the proposed PToP CNN, all the programs are implemented in Python, and the networks are constructed using TensorFlow1 with the high-level API Keras.2

1 http://tensorflow.org/
2 https://github.com/fchollet/keras


Fig. 5. Architecture of hierarchical fusion module (Part II of Fig. 2).

TABLE I: NUMBER OF TRAINING AND TESTING SAMPLES FOR THE HOUSTON DATA

TensorFlow is an open source software library for numerical computation using data flow graphs, and Keras can be seen as a simplified interface to TensorFlow. All experiments are conducted on a personal computer equipped with Ubuntu 14.04 and a GTX-1080 GPU.

A. Experimental Data

1) Houston Data: The first scene was acquired by the NSF-funded Center in June 2012 over the University of Houston campus and the neighboring area [19]. The data are composed of HSI and a LiDAR-derived digital surface model (DSM), and both consist of 349 × 1905 pixels with a spatial resolution of 2.5 m. The hyperspectral data were acquired on June 23, 2012; the average height of the sensor above ground was 5500 ft. The HSI scene consists of 144 spectral bands with wavelengths ranging from 0.38 to 1.05 µm and includes 15 classes. The LiDAR data were acquired on June 22, 2012; the average height of the sensor above ground was 2000 ft. Table I lists the available training and testing samples.

2) Trento Data: The second scene was acquired over a rural area in the south of the city of Trento, Italy.

TABLE II: NUMBER OF TRAINING AND TESTING SAMPLES FOR THE TRENTO DATA

The data are composed of 600 × 166 pixels covering six classes with a spatial resolution of 1 m. The HSI data, which were captured by the AISA Eagle sensor [17], comprise 63 spectral bands covering the range from 0.42 to 0.99 µm. The LiDAR data were acquired by the Optech ALTM 3100EA sensor. Table II lists the available training and testing samples.

B. Parameter Tuning

In order to validate the hierarchical fusion strategy described in Section III-B, we compare the classification results using PToP CNN features from different layers and filter scales. As shown in Fig. 3, a unique number has been assigned to each specific set of PToP features. For example, in M-N, M represents the convolutional tunnel and N represents the layer index. The fully connected layer with softmax loss acts as the classifier.

1) Multilayer Comparison: Fig. 6 illustrates the classification performance of PToP features from different layers. More specifically, in each layer, multiscale features with the size of 11 × 11 × 192 are employed, as clarified in the right side of Fig. 5. It is apparent that features from different hidden layers have a great influence on classification performance, and features derived from the shallowest layers (i.e., layers 0 and 1) achieve poor performance.

2) Multiscale Comparison: We further explore the effect of different branches of the PToP feature extractor. Table III lists the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (Kappa) obtained using features extracted by the single-tunnel PToP feature extractor with different filter scales W (i.e., 3 × 3, 5 × 5, and 7 × 7).


Fig. 6. Classification performance of using PToP CNN features in individual layers: (a) Houston and (b) Trento.

Fig. 7. Structure of the single-tunnel PToP feature extractor (D denotes the number of bands of the input data).

Fig. 7 illustrates the detailed network structure of the single-tunnel feature extractor.

TABLE III: CLASSIFICATION PERFORMANCE OF USING PTOP CNN FEATURES IN SINGLE-TUNNEL

TABLE IV: CLASSIFICATION PERFORMANCE OF THE DESIGNED PTOP CNN FEATURES EXTRACTED UNDER DIFFERENT LEARNING RATES

The experimental results indicate that features extracted by the single-tunnel PToP feature extractor with various filter scales yield different classification performance, and the filter scale of 5 × 5 offers the best performance. Moreover, the proposed three-tunnel feature extractor achieves further improvement; the OA is 92.48% and 98.73% for the Houston data and Trento data, respectively.

3) Learning Rate: The learning rate controls the step of gradient descent in the training process and also affects the learning behavior of the network. The parameter is set with an initial value under the Adam policy [42] in the practical implementation. Different learning rates are tested, and the corresponding fusion features are utilized to obtain classification results through the 3 FC. As listed in Table IV, the best learning rate is 0.001 for both experimental data sets.

4) With/Without Fine-Tuning Comparison: The fine-tuning procedure plays a crucial role in achieving better classification performance and building a more robust network [29]. Considering the experimental analysis of Fig. 6, branches related to individual layers 2–9 are trained first, and the fine-tuning strategy is adopted based on the 8 pretrained layers. The low-level features are extracted by the pretrained layers 2–9 and then integrated via multilayer block concatenation for the expected objective. Fig. 8 illustrates the classification performance with and without fine-tuning. It is observed that the fine-tuning strategy achieves better performance than training without fine-tuning for most of the classes of each data set. Since the fine-tuning strategy can efficiently reduce the computational burden and lead to better classification performance, it is adopted in our following experiments.

C. Classification Performance

In order to validate the effectiveness of the proposed method, the model is compared with several classifiers, such as the traditional SVM and ELM, and the recently developed CNN-PPF [26], two-branch CNN [29], and context CNN [31].


Fig. 8. Performance of fine-tuning on class-specific accuracy (%) and overall accuracy (OA%) of (a) Houston and (b) Trento data.

Fig. 9. Dataset visualization and classification maps for the Houston data obtained with different methods: (a) Pseudo-color image for HSI, (b) Gray image for LiDAR, (c) Ground truth map, (d) Legend, (e) SVM (80.49%), (f) ELM (81.92%), (g) CNN-PPF (83.33%), (h) Context CNN (86.90%), (i) Two-Branch CNN (87.98%), and (j) PToP CNN (92.48%).

Note that the SVM-based methods are implemented using the LIBSVM toolbox,3 and all the comparative methods are implemented with optimal parameters. For the purpose of clarity, some notations are defined hereafter: HSI data are expressed as H, the LiDAR data are represented symbolically by L, and H + L indicates that HSI and LiDAR data are concatenated together for classification.

3 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Furthermore, we discuss several collaborative classification frameworks, that is, SVM(H+L), CNN-PPF(H+L), ELM(H+L), Two-Branch CNN(H+L), and Context CNN(H+L).

Tables V and VI list the OA, AA, and Kappa for the two experimental data sets. The proposed PToP CNN is obviously superior to the other methods. Taking the Houston data as an example, the OA of the proposed PToP CNN is 92.48%, which is 4.5% and over 9% higher than that of the two-branch CNN(H+L) [29] and CNN-PPF(H+L), respectively.


Fig. 10. Dataset visualization and classification maps for the Trento data obtained with different methods: (a) Pseudo-color image for HSI, (b) Gray image for LiDAR, (c) Ground truth map, (d) Legend, (e) SVM (92.77%), (f) ELM (91.32%), (g) CNN-PPF (94.76%), (h) Context CNN (96.11%), (i) Two-Branch CNN (97.92%), and (j) PToP CNN (98.34%).

Fig. 11. Classification performance of methods with different sizes of training samples using (a) Houston and (b) Trento data.

Moreover, our approach outperforms the classical ELM(H+L) and SVM(H+L) by approximately 11% and 12%, respectively. It is also clear that, owing to more robust feature representation, the proposed joint feature extraction approach for HSI and LiDAR data gives a significant improvement in classification accuracy when compared to the other baselines.


TABLE V: COMPARISON OF THE CLASSIFICATION ACCURACY (%) USING THE HOUSTON DATA

TABLE VI: COMPARISON OF THE CLASSIFICATION ACCURACY (%) USING THE TRENTO DATA


For a visual evaluation of the classification performance, classification maps are provided in Figs. 9 and 10. Also, the ground truth map and pseudo-color maps of the entire image scenes (including unlabeled pixels) are provided for clarity. It can be clearly noticed that the proposed method produces the most accurate and least noisy classification maps, e.g., for the Vineyard class in Fig. 10. Moreover, the visual results are consistent with those in Tables V and VI.

Fig. 11 further shows the classification performance with different numbers of training samples to evaluate the sensitivity of all methods to the training-sample size. The percentage of training samples (as listed in Tables I and II) is changed from 20% to 100%. Obviously, the proposed method consistently outperforms the other methods. Note that even for an extremely small training data size, such as 20%, the proposed network still provides excellent classification performance. For example, in Fig. 11(a), with 20% of the training samples, the accuracy of the proposed method is approximately 89% while the accuracies of the other methods are all below 80%. This verifies that the proposed framework is robust to small training-sample sizes.

Table VII summarizes the training and testing time of the proposed method. The training procedure takes much longer, while testing for a whole scene is relatively fast for all the methods.

TABLE VII: ELAPSED TIME (H: HOURS, S: SECONDS) OF TRAINING AND TESTING FOR THE PROPOSED METHOD USING THE EXPERIMENTAL DATA SETS

The PToP CNN is more time-consuming, which is mainly attributable to two reasons. First, the execution of the proposed method includes a separate but effective joint feature extraction procedure in addition to the classification process, while the other methods only involve the classification process. Second, the number of iteration epochs for the PToP CNN is 500, which is much higher than that of the other methods.

V. CONCLUSION

In this paper, the PToP CNN model was proposed for joint feature extraction, to take full advantage of the rich spectral information and spatial/contextual information contained in HSI and LiDAR data.


Experimental results demonstrated that the PToP feature extractor in conjunction with the hierarchical fusion module could simultaneously utilize the information of HSI and LiDAR data to achieve excellent collaborative classification performance. Besides, the feature extraction procedure of PToP was carried out in a completely unsupervised way, thus ensuring the completeness of information and the robustness of the extracted features under small training sample sizes. After validation with the experimental data, the proposed method has been demonstrated to provide notably higher accuracy than many state-of-the-art techniques.

ACKNOWLEDGMENT

The authors would like to thank Dr. P. Ghamisi for providing the Trento Data.

REFERENCES

[1] X. Lu, Y. Yuan, and X. Zheng, “Joint dictionary learning for multispectral change detection,” IEEE Trans. Cybern., vol. 47, no. 4, pp. 884–897, Apr. 2017.

[2] Y. Xu, Z. Wu, J. Chanussot, and Z. Wei, “Joint reconstruction and anomaly detection from compressive hyperspectral images using Mahalanobis distance-regularized tensor RPCA,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 5, pp. 2919–2930, May 2018.

[3] S. Jia, L. Shen, J. Zhu, and Q. Li, “A 3-D Gabor phase-based coding and matching framework for hyperspectral imagery classification,” IEEE Trans. Cybern., vol. 48, no. 4, pp. 1176–1188, Apr. 2018.

[4] M. Zhang, W. Li, and Q. Du, “Diverse region-based CNN for hyperspectral image classification,” IEEE Trans. Image Process., vol. 27, no. 6, pp. 2623–2634, Jun. 2018.

[5] W. Li, E. W. Tramel, S. Prasad, and J. E. Fowler, “Nearest regularized subspace for hyperspectral classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 477–489, Jan. 2014.

[6] X. Zheng, Y. Yuan, and X. Lu, “Dimensionality reduction by spatial–spectral preservation in selected bands,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5185–5197, Sep. 2017.

[7] Z. Wu et al., “GPU parallel implementation of spatially adaptive hyperspectral image classification,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 4, pp. 1131–1143, Apr. 2018.

[8] L. Zhang et al., “Simultaneous spectral–spatial feature selection and extraction for hyperspectral images,” IEEE Trans. Cybern., vol. 48, no. 1, pp. 16–28, Jan. 2018.

[9] M. Khodadadzadeh, A. Cuartero, J. Li, A. Felicísimo, and A. Plaza, “Fusion of hyperspectral and LiDAR data using generalized composite kernels: A case study in Extremadura, Spain,” in Proc. IGARSS, Milan, Italy, Jul. 2015, pp. 61–64.

[10] M. Zhang, W. Li, and Q. Du, “Collaborative classification of hyperspectral and visible images with convolutional neural network,” J. Appl. Remote Sens., vol. 11, no. 4, 2017, Art. no. 042607.

[11] J. Jung, E. Pasolli, S. Prasad, J. C. Tilton, and M. M. Crawford, “A framework for land cover classification using discrete return LiDAR data: Adopting pseudo-waveform and hierarchical segmentation,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 2, pp. 491–502, Feb. 2014.

[12] M. Zhang, P. Ghamisi, and W. Li, “Classification of hyperspectral and LiDAR data using extinction profiles with feature fusion,” Remote Sens. Lett., vol. 8, no. 10, pp. 957–966, 2017.

[13] M. Dalponte, L. Bruzzone, and D. Gianelle, “Fusion of hyperspectral and LiDAR remote sensing data for classification of complex forest areas,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 5, pp. 1416–1427, Jun. 2008.

[14] B. Koetz, F. Morsdorf, S. Van der Linden, and B. Allgöwer, “Multi-source land cover classification for forest fire management based on imaging spectrometry and LiDAR data,” Forest Ecol. Manag., vol. 256, no. 3, pp. 263–271, Jul. 2008.

[15] D. Lemp and U. Weidner, “Improvements of roof surface classification using hyperspectral and laser scanning data,” in Proc. ISPRS, Tempe, AZ, USA, Mar. 2005, pp. 14–16.

[16] P. Ghamisi, B. Höfle, and X. Zhu, “Hyperspectral and LiDAR data fusion using extinction profiles and deep convolutional neural network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 6, pp. 3011–3024, Jun. 2017.

[17] B. Rasti, P. Ghamisi, and R. Gloaguen, “Hyperspectral and LiDAR fusion using extinction profiles and total variation component analysis,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3997–4007, Jul. 2017.

[18] W. Liao, R. Bellens, A. Pizurica, S. Gautama, and W. Philips, “Graph-based feature fusion of hyperspectral and LiDAR remote sensing data using morphological features,” in Proc. IGARSS, Melbourne, VIC, Australia, Jul. 2013, pp. 4942–4945.

[19] M. Khodadadzadeh, J. Li, S. Prasad, and A. Plaza, “Fusion of hyperspectral and LiDAR remote sensing data using multiple feature learning,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2971–2983, Jun. 2015.

[20] W. Liao, R. Bellens, A. Pižurica, S. Gautama, and W. Philips, “Combining feature fusion and decision fusion for classification of hyperspectral and LiDAR data,” in Proc. IGARSS, Quebec City, QC, Canada, Jul. 2014, pp. 1241–1244.

[21] C. Zhao, X. Gao, Y. Wang, and J. Li, “Efficient multiple-feature learning-based hyperspectral image classification with limited training samples,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 7, pp. 4052–4062, Jul. 2016.

[22] B. Liu et al., “Supervised deep feature extraction for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 1909–1921, Apr. 2018.

[23] F. Zhang, B. Du, and L. Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, Apr. 2015.

[24] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Trans. Geosci. Remote Sens., vol. 4, no. 2, pp. 22–40, Jun. 2016.

[25] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.

[26] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 2, pp. 844–853, Apr. 2017.

[27] S. Mei, J. Ji, J. Hou, X. Li, and Q. Du, “Learning sensor-specific spatial–spectral features of hyperspectral images via convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 8, pp. 4520–4533, Aug. 2017.

[28] Y. Li, H. Zhang, and Q. Shen, “Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sens., vol. 9, no. 1, p. 67, Jan. 2017.

[29] X. Xu et al., “Multisource remote sensing data classification based on convolutional neural network,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 937–949, Feb. 2018.

[30] W. Li, G. Wu, and Q. Du, “Transferred deep learning for anomaly detection in hyperspectral imagery,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 597–601, May 2017.

[31] H. Lee and H. Kwon, “Going deeper with contextual CNN for hyperspectral image classification,” IEEE Trans. Image Process., vol. 26, no. 10, pp. 4843–4855, Oct. 2017.

[32] L. Mou, P. Ghamisi, and X. Zhu, “Unsupervised spectral–spatial feature learning via deep residual conv–deconv network for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 1, pp. 391–406, Jan. 2018.

[33] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised deep feature extraction for remote sensing image classification,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.

[34] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Proc. NIPS, Long Beach, CA, USA, Dec. 2017, pp. 700–708.

[35] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.

[36] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989.

[37] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention. Cham, Switzerland: Springer, Oct. 2015, pp. 234–241.

[38] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 3431–3440.

[39] C. Szegedy et al., “Going deeper with convolutions,” in Proc. Comput. Vis. Pattern Recognit., Boston, MA, USA, 2015, pp. 1–9.


[40] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML, Lille, France, Jul. 2015, pp. 448–456.

[41] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Honolulu, HI, USA, Jul. 2016, pp. 2261–2269.

[42] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Represent., San Diego, CA, USA, May 2015.

Mengmeng Zhang (S’15) received the B.S. degree from the Qingdao University of Science and Technology, Qingdao, China, in 2014. She is currently pursuing the Ph.D. degree with the Beijing University of Chemical Technology, Beijing, China, under the supervision of Dr. W. Li.

Her current research interests include remote sensing image processing and pattern recognition.

Wei Li (S’11–M’13–SM’16) received the B.E. degree in telecommunications engineering from Xidian University, Xi’an, China, in 2007, the M.S. degree in information science and technology from Sun Yat-sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA, in 2012.

He was a Post-Doctoral Researcher with the University of California at Davis, Davis, CA, USA, for one year. He is currently a Professor and the Vice Dean of the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. His current research interests include hyperspectral image analysis, pattern recognition, and data compression.

Dr. Li was a recipient of the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society for his service to the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. He is an active Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. He is currently serving as an Associate Editor for the IEEE SIGNAL PROCESSING LETTERS. He has served as a Guest Editor for special issues of the Journal of Real-Time Image Processing, Remote Sensing, and the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.

Qian Du (S’98–M’00–SM’05–F’18) received the Ph.D. degree in electrical engineering from the University of Maryland at Baltimore, Baltimore, MD, USA, in 2000.

She is currently the Bobby Shackouls Professor with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS, USA. Her current research interests include hyperspectral remote sensing image analysis and applications, pattern classification, data compression, and neural networks.

Dr. Du was a recipient of the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society. She was the Co-Chair of the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society from 2009 to 2013 and the Chair of the Remote Sensing and Mapping Technical Committee of the International Association for Pattern Recognition from 2010 to 2014. She has served as an Associate Editor for the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the Journal of Applied Remote Sensing, and the IEEE SIGNAL PROCESSING LETTERS. Since 2016, she has been the Editor-in-Chief of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. She was the General Chair of the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, Shanghai, China, in 2012. She is a fellow of the SPIE-International Society for Optics and Photonics.

Lianru Gao (M’12–SM’18) received the B.S. degree in civil engineering from Tsinghua University, Beijing, China, in 2002 and the Ph.D. degree in cartography and geographic information system from the Institute of Remote Sensing Applications, Chinese Academy of Sciences (CAS), Beijing, in 2007.

He is currently a Professor with the Key Laboratory of Digital Earth Science, Institute of Remote Sensing and Digital Earth, CAS. He has also been a Visiting Scholar with the University of Extremadura, Cáceres, Spain, in 2014 and Mississippi State University, Starkville, MS, USA, in 2017. In the last ten years, he was the PI of ten scientific research projects at national and ministerial levels, including projects by the National Natural Science Foundation of China from 2010 to 2012, 2016 to 2019, and 2018 to 2020, and the Key Research Program of the CAS from 2013 to 2015. He has published over 120 peer-reviewed papers, of which 50 are journal papers indexed by the Science Citation Index. He has co-authored the book entitled Hyperspectral Image Classification and Target Detection. He holds 17 National Invention Patents and 4 Software Copyright Registrations in China. His current research interests include models and algorithms for hyperspectral image processing, analysis, and applications.

Dr. Gao was a recipient of the Outstanding Science and Technology Achievement Prize of the CAS in 2016 and the China National Science Fund for Excellent Young Scholars in 2017. He received recognition as one of the Best Reviewers of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING in 2015 and of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING in 2017.

Bing Zhang (M’11–SM’12) received the B.S. degree in geography from Peking University, Beijing, China, and the M.S. and Ph.D. degrees in remote sensing from the Institute of Remote Sensing Applications, Chinese Academy of Sciences (CAS), Beijing.

He is currently a Full Professor and the Deputy Director of the Institute of Remote Sensing and Digital Earth, CAS, where he has been leading key scientific projects in the area of hyperspectral remote sensing for over 20 years. He has developed five software systems for image processing and applications. He has authored over 300 publications, including over 170 journal papers. He has edited six books/contributed book chapters on hyperspectral image processing and subsequent applications. His current research interests include the development of mathematical and physical models and image processing software for the analysis of hyperspectral remote sensing data in many different areas.

Dr. Zhang was a recipient of the National Science Foundation for Distinguished Young Scholars of China in 2013 and the 2016 Outstanding Science and Technology Achievement Prize of the Chinese Academy of Sciences, the highest level of award for CAS scholars. His creative achievements were rewarded with ten important prizes from the Chinese government, as well as special government allowances of the Chinese State Council. He is currently serving as an Associate Editor for the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING and the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS. He has been serving as a Technical Committee Member of the IEEE Workshop on Hyperspectral Image and Signal Processing since 2011 and as the President of the Hyperspectral Remote Sensing Committee of the China National Committee of the International Society for Digital Earth since 2012. He was a Student Paper Competition Committee Member at IGARSS 2015, 2016, and 2017.

