
IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 5, NO. 5, OCTOBER 2012

Evaluation of SVM, RVM and SMLR for Accurate Image Classification With Limited Ground Data

    Mahesh Pal and Giles M. Foody, Senior Member, IEEE

Abstract—The accuracy of a conventional supervised classification is in part a function of the training set used, notably impacted by the quantity and quality of the training cases. Since it can be costly to acquire a large number of high-quality training cases, recent research has focused on methods that allow accurate classification from small training sets. Previous work has shown the potential of support vector machine (SVM) based classifiers. Here, the potential of the relevance vector machine (RVM) and sparse multinomial logistic regression (SMLR) approaches is evaluated relative to SVM classification. With both airborne and spaceborne multispectral data sets, the RVM and SMLR were able to derive classifications of similar accuracy to the SVM but required considerably fewer training cases. For example, from a training set comprising 600 cases, acquired with a conventional stratified random sampling design, from an airborne thematic mapper (ATM) data set, the RVM produced the most accurate classification, 93.75%, and needed only 7.33% of the available training cases. In comparison, the SVM yielded a classification that had an accuracy of 92.50% and needed 4.5 times more useful training cases. Similarly, with a Landsat ETM+ (Littleport, Cambridgeshire, UK) data set, the SVM required 4.0 times more useful training cases than the RVM. For each data set, however, the accuracies of the classifications derived by each classifier were of similar magnitude, differing by no more than 1.25%. Finally, for both the ATM and ETM+ (Littleport) data sets, the useful training cases for the SVM and RVM had distinct and potentially predictable characteristics. Support vectors were generally atypical but lay in the boundary region between classes in feature space, while the relevance vectors were atypical but anti-boundary in nature. The SMLR also tended to mostly, but not always, use extreme cases that lay away from the class boundary. The results, therefore, suggest a potential to design classifier-specific intelligent training data acquisition activities for accurate classification from small training sets, especially with the SVM and RVM.

Index Terms—Ground truth, relevance vector machines, sparse multinomial logistic regression, support vector machines, training data, typicality.

    I. INTRODUCTION

Land cover mapping is one of the most common applications of remote sensing. Land cover maps are produced to meet the needs of a diverse array of users and are typically derived via some form of image classification analysis.

Manuscript received September 30, 2011; revised February 12, 2012; accepted August 02, 2012. Date of publication October 16, 2012; date of current version November 14, 2012. This work was supported in part by the Association of Commonwealth Universities (ACU), London, through a fellowship to M. Pal.

M. Pal is with the Department of Civil Engineering, NIT Kurukshetra, Haryana, 136119 India (e-mail: [email protected]).

G. M. Foody is with the School of Geography, University of Nottingham, Nottingham, NG7 2RD, U.K.

Digital Object Identifier 10.1109/JSTARS.2012.2215310

Image classification is one of the main pattern recognition techniques applied in remote

    sensing. Although remote sensing offers the potential to acquire

    imagery of large areas inexpensively, there are still major costs

    to be incurred in a mapping programme. One major cost to be

met in a mapping application is associated with ground reference data [1]–[3].

    Ground data requirements may vary from study to study but

it is common to find that ground data are required to train a supervised classification analysis and to evaluate map accuracy.

The training data set should be classifier specific. A maximum likelihood classifier might need a large sample acquired with a random sampling design to provide accurate information about the mean and variance of the classes, while a SVM may need only a smaller training set of spectrally extreme cases that lie close to the decision boundaries [1]. Given that ground reference data are expensive and difficult to acquire, many have sought to reduce the ground reference data requirements. While it may

    sometimes be possible to reduce the ground data requirements in

the testing stage for accuracy assessment [4], most attention has focused on the training stage. For example, strategies adopted include the use of unlabeled cases in training [2], [5]–[9], adoption of pre-processing methods such as feature reduction to reduce training data set requirements [10], [11], the use of intelligent training selection strategies to focus on the acquisition of informative training samples [1], [12], [13] and strategies to reduce training set size when attention is focused on a specific class [12], [14], [15]. This article develops aspects of previous

    work and focuses on the potential for accurate classification

with small training sets through the use of contemporary machine learning classifiers that may theoretically require only a few training samples.

    The support vector machine (SVM) has been extensively

used as a state-of-the-art supervised classifier with remote sensing data [16]–[21]. A key reason behind its popularity is its ability to yield highly accurate classifications, often more accurate than those from other contemporary approaches such as neural networks and decision trees [20], [22]–[24]. Moreover, of particular

    concern to this article, research has shown that the SVM may

    be used to produce an accurate classification from a small

    number of useful training cases lying close to the decision

    boundary [1] and that the financial savings to a mapping project

derived from this feature can be large. For example, [13] show a reduction in the total cost of a mapping project by focussing attention on the most informative training samples for classification with a SVM.

    The SVM based approach to classification is, however, not

    problem-free. Concerns include the need to define a set of

    parameters [25], [26], an inability to form a full confusion



    matrix in some strategies to multi-class classification [27] and

a lack of information on per-case classification uncertainty [28].

    Other classifiers may sometimes be attractive alternatives to

    the SVM, especially with regard to the aforementioned con-

    cerns with SVM-based classification. Recently, for example,

[29]–[31] showed that a Bayesian extension of the SVM, called

    the relevance vector machine (RVM; [32]), can be used as an

    alternative to the SVM for image classification and has the

ability to provide per-case uncertainty data in the form of posterior probabilities of class membership. Moreover, comparative

    studies suggest that a RVM may require fewer training cases

    than a SVM in order to classify a data set [29]. It has been

suggested that the useful training cases for classification

    by a RVM are anti-boundary in nature while those for use in

    classification by a SVM tend to lie near the boundary between

    classes [32]. The potential to use a small training set and derive

per-case uncertainty information is also offered by the use of the

    sparse multinomial logistic regression (SMLR; [33]) for image

    classification.

The aim of this study was to evaluate the potential of the RVM and SMLR classifiers for accurate classification from

    small training sets relative to the SVM, which has been eval-

    uated previously [1]. The key focus in the evaluation was

    on the accuracy with which data sets may be classified and

    the number of training cases required. Many studies have

    shown that only a small proportion of a training set acquired

    by conventional sampling methods is actually required for

    accurate classification by classifiers such as the SVM, RVM

and SMLR [1], [24], [29], [33]. A key challenge is finding these

    useful training cases in a way that allows accurate and efficient

    classification. Sometimes researchers have acquired a large

training sample by conventional methods and then from this identified the useful training cases [20], [29], [31], [33]–[35].

    Such approaches can be inefficient, notably in relation to the

    effort required to collect redundant training samples. A popular

    alternative is to adopt approaches such as active learning in

    which useful training sites are identified in an iterative analysis

of the image [8], [9]. While attractive, such approaches have

    limitations [36]. One key concern is that this type of method

    can only be applied post-image acquisition and can be costly

    and inefficient in terms of ground data acquisition as sites for

    labeling are identified iteratively. The realization of the full

    potential of classification methods such as the SVM, RVM

    and SMLR requires an ability to identify the useful training

    cases and predict their location on the ground in advance of

the classification [1], [12], [13]. This would allow an intelligent training programme [1], [13] to be defined. For this, it

    is necessary for the characteristics of useful training sites to

    be predictable. Thus in addition to the accuracy with which

    data may be classified, a key focus of this article is the nature

    of the training set required for an accurate classification and

    especially the characterization of the useful training cases to act

    as a guide to their predictability. The remainder of this article

    is structured such that the three classification algorithms used

    are briefly outlined in Section II before presenting the data sets

    they are applied to in Section III. The results of the analyses

are presented in Section IV and key conclusions are drawn in Section V.

II. CLASSIFICATION ALGORITHMS

    Three classification algorithms were used: SVM, RVM and

    SMLR. All three use the training cases to define the location

    of classification decision boundaries to partition the data space

    such that cases of unknown class membership may be allocated

to a class. The way the training data are used and the nature of the classifiers differ, however, and so a brief summary of each algorithm is given below. In each discussion the focus is on

    the training of the classifier. Particular attention is paid to the

    training cases that are used to form the decision boundaries.

    A subset of the available cases for training is typically used

    in classification by each of the three selected algorithms. These

useful training cases are the support vectors, relevance vectors and retained kernel basis functions in classifications by the

    SVM, RVM and SMLR respectively. In the discussion below a

training set of $N$ cases, represented by $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i = (x_{i1}, \dots, x_{iD})$ is an input vector with $D$ input features (wavebands) and $y_i \in \{1, \dots, c\}$ is the class label drawn from $c$ classes, is available to the classifiers.

    A. SVM

    The SVM is based on statistical learning theory and has the

    aim of determining the location of decision boundaries that pro-

    duce the optimal separation of the classes [37]. In the case of

    a two-class pattern recognition problem in which the classes

    are linearly separable, the SVM selects from among the infinite

    number of linear decision boundaries the one that minimises the

    generalisation error. Thus, the selected decision boundary will

    be one that leaves the greatest margin between the two classes,

    where the margin is defined as the sum of the distances to the

hyperplane from the closest points of the two classes [37]. The problem of maximising the margin can be solved using standard

    quadratic programming optimisation techniques. The training

    cases that are closest to the hyperplane are used to measure the

    margin and these training cases are termed support vectors.

    Only the support vectors are needed to form the classification

    decision boundaries and these typically represent a very small

    proportion of the total training set. If regions likely to furnish

    support vectors can be predicted then only a small training set,

    comprising the support vectors, may be acquired for a classifi-

    cation [1], [13].

For a 2-class classification problem (i.e., $y_i \in \{-1, +1\}$), the training cases are linearly separable if there exists a weight vector $\mathbf{w}$ (determining the orientation of a discriminating plane) and a scalar $b$ (determining the offset of the discriminating plane from the origin) such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for all $i$, and the hypothesis space can be defined by the set of functions given by

$$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b). \qquad (1)$$

The SVM finds the separating hyperplane for which the distance between the classes, measured along a line perpendicular to the hyperplane, is maximised. This can be achieved by solving the following constrained optimization problem:

$$\min_{\mathbf{w},\,b}\;\frac{1}{2}\|\mathbf{w}\|^{2}\quad\text{subject to}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1,\; i = 1,\dots,N. \qquad (2)$$


    If the two classes are not linearly separable, the SVM tries

    to find the hyperplane that maximises the margin while, at the

    same time, minimising a quantity proportional to the number

of misclassification errors. The restriction that all training cases of a given class lie on the same side of the optimal hyperplane can be relaxed by the introduction of a slack variable $\xi_i \geq 0$, and the trade-off between margin and misclassification error is controlled by a positive user-defined constant $C > 0$ [38]. Thus, for non-separable data, (2) can be written as:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{N}\xi_i\quad\text{subject to}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1 - \xi_i,\;\xi_i \geq 0. \qquad (3)$$

    SVM can also be extended to handle non-linear decision sur-

    faces. [39] propose a method of projecting the input data onto

    a high-dimensional feature space through some nonlinear map-

    ping and formulating a linear classification problem in that fea-

    ture space. Kernel functions are used to reduce the computa-

    tional cost of dealing with high-dimensional feature space [37].

A kernel function is defined as $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)\cdot\varphi(\mathbf{x}_j)$ and, with the use of a kernel function, (1) becomes:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{N}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right) \qquad (4)$$

where $\alpha_i$ is a Lagrange multiplier.

    Further and more detailed discussion on SVM can be found

    in [37], [40], [41].
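To make the sparseness property concrete, the following minimal Python sketch (using scikit-learn rather than the LIBSVM/BSVM packages employed in this study) fits an RBF-kernel SVM to two synthetic "spectral" classes, counts the support vectors, and checks that retraining on those cases alone yields essentially the same decision rule. The data and the C and gamma values are illustrative assumptions, not values from this study.

```python
# Minimal sketch: an RBF-kernel SVM typically needs only the support
# vectors, a small subset of the training set, to define its boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two synthetic spectral classes in a 2-band feature space
X = np.vstack([rng.normal(0.3, 0.05, (100, 2)),
               rng.normal(0.5, 0.05, (100, 2))])
y = np.repeat([0, 1], 100)

svm = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
sv_idx = svm.support_            # indices of the useful training cases
print(f"support vectors: {len(sv_idx)} of {len(X)} training cases")

# retraining on the support vectors alone gives a near-identical decision
# rule, which is the basis for 'intelligent' training data acquisition
svm_small = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X[sv_idx], y[sv_idx])
grid = rng.uniform(0.2, 0.6, (1000, 2))
agreement = (svm.predict(grid) == svm_small.predict(grid)).mean()
print(f"agreement with full-set SVM: {agreement:.3f}")
```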

    B. RVM

    The RVM is a recent development in kernel based machine

learning approaches and can be used as an alternative to the SVM for image classification. The RVM is a probabilistic counterpart

    to the SVM, based on a Bayesian formulation of a linear model

    with an appropriate prior that results in a sparser representation

    than that achieved by SVM. The RVM is based on a hierarchical

    prior, where an independent Gaussian prior is defined on the

    weight parameters in the first level, and an independent Gamma

    hyper prior is used for the variance parameters in the second

    level, which leads to model sparseness [32]. An algorithm pro-

    duces sparse results when among all the coefficients defining

the model only a few are non-zero. This property helps in fast

    model evaluation and provides a potential for accurate classi-

fication from small training sets. Key advantages of the RVM over the SVM include a reduced sensitivity to the hyperparameter settings, an ability to use non-Mercer kernels, the provision of a probabilistic output, no need to define the parameter $C$, and

    often a requirement for fewer relevance vectors than support

    vectors for a particular analysis [31], [32].

    In a two class classification by RVM, the aim is, essentially,

    to predict the posterior probability of membership for one of the

    classes for a given input. A case may then be allocated to the

    class with which it has the greatest likelihood of membership.

Using a Bernoulli distribution, the likelihood function for the analysis would be:

$$P(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} \sigma\{y(\mathbf{x}_i; \mathbf{w})\}^{t_i}\,\big[1 - \sigma\{y(\mathbf{x}_i; \mathbf{w})\}\big]^{1 - t_i} \qquad (5)$$

where $\mathbf{w}$ is a set of adjustable weights. For multiclass classification, (5) can be written as:

$$P(\mathbf{t} \mid \mathbf{w}) = \prod_{i=1}^{N} \prod_{k=1}^{c} \sigma\{y_k(\mathbf{x}_i; \mathbf{w}_k)\}^{t_{ik}} \qquad (6)$$

where $\sigma(\cdot)$ is the logistic sigmoid function:

$$\sigma(y) = \frac{1}{1 + e^{-y}} \qquad (7)$$

and an iterative method is used to obtain $\mathbf{w}$. Let $\alpha_{\mathrm{MP}}$ denote the maximum-a-posteriori estimate of the hyperparameter $\alpha$. The maximum-a-posteriori estimate of the weights ($\mathbf{w}_{\mathrm{MP}}$) can be obtained by maximizing the following objective function:

$$J(\mathbf{w}) = \sum_{i=1}^{N}\big[t_i \log y_i + (1 - t_i)\log(1 - y_i)\big] - \frac{1}{2}\,\mathbf{w}^{T} A\, \mathbf{w} \qquad (8)$$

where the first summation term corresponds to the likelihood of the class labels, with $y_i = \sigma\{y(\mathbf{x}_i; \mathbf{w})\}$, and the second term corresponds to the prior on the parameters, with $A = \operatorname{diag}(\alpha)$. In the resulting solution, the gradient of (8) with respect to $\mathbf{w}$ is calculated and only those training cases having non-zero coefficients $w_i$, which are called relevance vectors, will contribute to the generation of a decision function. The posterior is approximated around $\mathbf{w}_{\mathrm{MP}}$ by a Gaussian approximation with covariance

$$\Sigma = \left(\Phi^{T} B \Phi + A\right)^{-1}$$

where $\Phi^{T} B \Phi + A$ is the Hessian of (8), the matrix $\Phi$ has elements $\phi_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ and $B$ is a diagonal matrix with elements defined by $\beta_i = \sigma\{y(\mathbf{x}_i)\}\left[1 - \sigma\{y(\mathbf{x}_i)\}\right]$.

An iterative analysis is followed to find the set of weights that maximizes the value of (8), in which the hyperparameters $\alpha_i$ associated with each weight are updated. During training, the hyperparameter for a large number of training cases will attain a very large value and the associated weights will be reduced to zero. Thus, the training process applied to a typical training set acquired following standard methods will make most of the training cases irrelevant and leave only the useful training cases. As a result, only a small number of training cases are required for the final classification. The assignment of an individual hyperparameter to each weight is the ultimate reason for the sparse property of the RVM. Further details on the RVM are given

    by [32].
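The training loop described above can be sketched in a few lines of Python. The code below is a simplified two-class illustration of Tipping's procedure: a Laplace (Gaussian) approximation around the MAP weights followed by the hyperparameter update $\alpha_i \leftarrow \gamma_i / w_i^2$ with $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, pruning the cases whose $\alpha$ diverges. The pruning threshold, iteration counts and toy data are illustrative assumptions, not the multiclass implementation used in this study.

```python
# Minimal sketch of the two-class RVM training loop of Section II-B:
# MAP weights by Newton/IRLS for fixed alpha, then Tipping's update.
# Survivors of the pruning are the relevance vectors.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -35, 35)))

def rvm_binary(X, t, gamma=1.0, outer=30, prune_at=1e6):
    Phi = rbf_kernel(X, X, gamma)       # one basis function per case
    N = len(X)
    alpha = np.ones(N)                  # one hyperparameter per weight
    w = np.zeros(N)
    active = np.arange(N)
    for _ in range(outer):
        P, a, wa = Phi[:, active], alpha[active], w[active]
        for _ in range(25):             # inner IRLS: MAP weights
            y = sigmoid(P @ wa)
            B = y * (1.0 - y)
            H = (P.T * B) @ P + np.diag(a)   # Hessian: Phi^T B Phi + A
            g = P.T @ (t - y) - a * wa       # gradient of log posterior
            wa = wa + np.linalg.solve(H, g)
        Sigma = np.linalg.inv(H)
        gam = 1.0 - a * np.diag(Sigma)
        alpha[active] = gam / (wa ** 2 + 1e-12)
        w[active] = wa
        active = active[alpha[active] < prune_at]  # prune irrelevant cases
    return active, w[active]

# toy run: relevance vectors are far fewer than the 200 training cases
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.3, 0.05, (100, 2)),
               rng.normal(0.5, 0.05, (100, 2))])
t = np.repeat([0.0, 1.0], 100)
rv, w_rv = rvm_binary(X, t)
print(f"relevance vectors: {len(rv)} of {len(X)}")
```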

    C. SMLR

    The Sparse Multinomial Logistic Regression algorithm [33]

    utilises a Laplacian prior on the weights of the linear combi-

    nation of functions to enforce sparseness. This prior favours a

    few large weights with many of the others set to exactly zero.

    The SMLR algorithm learns a multi-class classifier based on the

multinomial logistic regression. This method simultaneously performs feature selection, to identify a small subset of the most relevant features, and learns the classification decision rules.


TABLE I: THE MEAN AND STANDARD DEVIATION VALUES OF THE SYNTHETIC DATA

If $\mathbf{w}_k$ is the weight vector associated with class $k$, then the probability that a given training case $\mathbf{x}_i$ belongs to class $k$ is given by

$$P(y_i = k \mid \mathbf{x}_i, \mathbf{w}) = \frac{\exp(\mathbf{w}_k^{T}\mathbf{x}_i)}{\sum_{j=1}^{c} \exp(\mathbf{w}_j^{T}\mathbf{x}_i)}. \qquad (9)$$

Usually a maximum likelihood estimation procedure is used to obtain the components of $\mathbf{w}$ from the training data by maximizing the log-likelihood function [42]:

$$l(\mathbf{w}) = \sum_{i=1}^{N}\left[\sum_{k=1}^{c} t_{ik}\,\mathbf{w}_k^{T}\mathbf{x}_i - \log \sum_{k=1}^{c} \exp(\mathbf{w}_k^{T}\mathbf{x}_i)\right]. \qquad (10)$$

In order to achieve the sparsity, a Laplacian prior $p(\mathbf{w})$ is incorporated while estimating $\mathbf{w}$. [33] propose to use a maximum a posteriori (MAP) criterion for multinomial logistic regression. The estimate of $\mathbf{w}$ is then given by:

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}}\left[l(\mathbf{w}) + \log p(\mathbf{w})\right] \qquad (11)$$

in which $p(\mathbf{w})$ is the Laplacian prior on $\mathbf{w}$, which means that $p(\mathbf{w}) \propto \exp(-\lambda\|\mathbf{w}\|_{1})$, where $\lambda$ is a user-defined parameter that affects the level of sparsity with SMLR. Thus, similar to the SVM and RVM, the SMLR uses a small number of training

    cases, called retained kernel basis functions, in model creation.

    Further details about SMLR and modified SMLR may be found

in [10], [33], [43]–[45].
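The effect of the Laplacian prior can be illustrated with an L1-penalized multinomial logistic regression, which is the same MAP problem as (11) with scikit-learn's C playing the role of $1/\lambda$. The sketch below applies it to RBF kernel columns so that each retained (non-zero) column corresponds to a retained training case; it illustrates the principle and is not the SMLR package used in this study, and all data and parameter values are assumptions.

```python
# Minimal sketch of the SMLR idea of Section II-C: an L1 (Laplacian
# prior) penalty on multinomial logistic regression drives most weights
# to exactly zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
means = [(0.2, 0.3), (0.4, 0.5), (0.6, 0.3)]   # three hypothetical classes
X = np.vstack([rng.normal(m, 0.05, (100, 2)) for m in means])
y = np.repeat([0, 1, 2], 100)

# kernel basis: one RBF column per training case
gamma = 1.0
Phi = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

smlr = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
smlr.fit(Phi, y)

# a case is 'retained' if any class weight on its kernel column is non-zero
retained = np.unique(np.nonzero(smlr.coef_)[1])
print(f"retained kernel basis functions: {len(retained)} of {len(X)}")
```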

III. DATA SETS AND METHODS

    The three classifiers were used to undertake a series of clas-

    sifications to highlight the potential for accurate classification

    from small training sets. The support vectors, relevance vectors

    and retained kernel basis functions that are central to the classi-

    fication by SVM, RVM and SMLR algorithms respectively will

    be referred to as useful training cases in all classifications.

    Four data sets were used. First, a simple simulated data set

    was used to aid understanding and interpretation of the useful

    training cases. This data set comprised three classes generated

    randomly from Gaussian normal distributions in two wavebands

    (Table I). Here, a training sample of 100 cases of each class was

    randomly generated and made available to each of the classifiers

    and the analyses undertaken using ten-fold cross validation.
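As the values of Table I are not reproduced here, the following sketch only mirrors the design of this simulated experiment, with placeholder class means and standard deviations: three Gaussian classes in two wavebands, 100 cases per class, assessed by ten-fold cross validation.

```python
# Placeholder reconstruction of the simulated-data design; the class
# means and standard deviations are illustrative, not Table I's values.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
params = [((0.2, 0.3), 0.05), ((0.4, 0.5), 0.05), ((0.6, 0.3), 0.05)]
X = np.vstack([rng.normal(m, s, (100, 2)) for m, s in params])
y = np.repeat([0, 1, 2], 100)

acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)  # ten-fold CV
print(f"mean ten-fold accuracy: {acc.mean():.3f}")
```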

Second, a data set acquired in bands 1 and 5 from the ETM+ for a test site near Boston in Lincolnshire, UK, was used. Atten-

    tion focused on three classes that were abundant at the test site:

    wheat, sugar beet and oilseed rape. One hundred cases of each

    class selected at random were used for training and testing all

three classifiers. These data were used only to extend the evaluation of the characterisation of the useful training cases with the

    simulated data to a real data set. As with the simulated data set,

ten-fold cross validation was used with the ETM+ (Boston) data set.

    More extensive analyses were undertaken with the remaining

    two data sets with the accuracy of the resulting classifications

    evaluated against ground data.

    The third data set was obtained by Daedalus 1268 airborne

    thematic mapper (ATM) for an agricultural test site near

    Feltwell, UK. The ATM data were acquired in 3 spectral wave-

    bands, with a spatial resolution of 5 m [46]. The ATM data

    were used to classify six different crop types: sugar beet, wheat,

    barley, carrot, potato and grass. A map depicting the crop type

    planted in each field produced near the time of the ATM data

    acquisition was used as ground data to inform the training and

testing of the classifications. The training sets comprised

    100 randomly selected pixels of each class for the analyses of

    the ATM data set. The testing set comprised 320 pixels drawn

    at random from the test site.

    The fourth and final data set used was acquired by the

    Landsat ETM+ for an agricultural area near Littleport in

Cambridgeshire, UK. The data in the six non-thermal spectral wavebands with a 30 m spatial resolution were used to classify

    seven agriculture land cover types: wheat, sugar beet, potato,

onion, peas, lettuce and beans [47]. A map depicting the crop

    type planted in each field produced near the time of the ETM+

    (Littleport) data acquisitions was used as ground data. For

    each class, 100 randomly selected pixels were used to train the

    classifiers. The accuracy of the classifications was evaluated

    using an independent testing set that comprised 1,400 randomly

    selected pixels.

    For each classification undertaken with the ATM and ETM+

    (Littleport) data sets, accuracy was assessed with the aid of a

confusion matrix and expressed as the percentage of the testing cases correctly allocated. As the potential for accurate classifi-

    cation by the SVM from small training sets has been demon-

    strated, a desire was to determine if the RVM and SMLR ap-

    proaches were at least as accurate as the SVM classification,

    which may be assessed by a test of non-inferiority. For both the

    RVM and SMLR methods, this was evaluated by using the con-

    fidence interval of the difference in accuracy obtained from that

    observed with the SVM in a test of non-inferiority, which fo-

    cuses on the lower limit of the defined confidence interval [48],

    [49]. In this evaluation it was assumed that the zone of indiffer-

    ence was 2.00%; this value was selected arbitrarily but ensures

that small differences in accuracy are treated as inconsequential. For all

    experiments, a personal computer with a Pentium IV processor

    and 3 GB of RAM was used.
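The non-inferiority check described above can be sketched as follows: form the 95% confidence interval on the difference in accuracy and compare its lower limit with the -2.00% margin. The sketch uses the normal approximation for a difference between two independent proportions, a simplification of the paired comparison made in the paper, and the counts are hypothetical rather than taken from the paper's tables.

```python
# Minimal sketch of a non-inferiority test on classification accuracy,
# assuming two independent test-set proportions (a simplification).
import math

def noninferior(k_new, k_svm, n, margin=0.02, z=1.96):
    p_new, p_svm = k_new / n, k_svm / n
    diff = p_new - p_svm
    # standard error of a difference between two independent proportions
    se = math.sqrt(p_new * (1 - p_new) / n + p_svm * (1 - p_svm) / n)
    lower, upper = diff - z * se, diff + z * se
    return diff, (lower, upper), lower > -margin

# e.g. 1,313 vs 1,299 correct out of 1,400 testing cases (hypothetical)
diff, ci, ok = noninferior(1313, 1299, 1400)
print(f"difference {diff:+.4f}, 95% CI ({ci[0]:+.4f}, {ci[1]:+.4f}), "
      f"non-inferior: {ok}")
```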

    SVM were initially designed for binary classification prob-

    lems. A range of methods have been suggested for multi-class

classification [20], [50], [51]. Here, the one-against-rest approach was used with the ATM data set [17], [24] and the one-against-one approach with the simulated and ETM+ data sets [51]. Throughout, a radial basis function kernel with a kernel-specific parameter ($\gamma$) was used with the SVM, RVM and SMLR algorithms. The software packages LIBSVM and BSVM [50], [52] were used to implement the SVM, whereas the SMLR software [33] was used to implement the sparse multinomial logistic regression classifier. A multiclass implementation of the original RVM code [32], [53] was used to implement the RVM classifier. Similar to the parameter required


TABLE II: USER-DEFINED PARAMETERS WITH ALL FOUR DATA SETS USED IN THIS STUDY

TABLE III: MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE SIMULATED DATA

in the design of the SVM classifier, the values of the user-defined parameters of the RVM and SMLR algorithms influence the accuracy of the classifications they produce. In order to find a suitable value for each of the user-defined parameters with the different classification algorithms, cross validation and trial-and-error methods were used. Specifically, five-fold cross validation was used with the SVM, while trial-and-error procedures were used with the RVM and SMLR to find suitable values for the user-defined parameters for the classifications of both the simulated and real remote sensing data sets. For classification by the RVM, the trials involved varying one parameter from 0.1 to 2.0 with a step size of 0.1 and the other parameter over a separate range of values. For classification by the SMLR, the two parameter values were varied from 0.1 to 15.0 and from 0.1 to 2.5, each with a step size of 0.1. For the analyses of all four data sets, the optimal values of the user-defined parameters are provided in Table II.
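A five-fold cross-validated parameter search of the kind used for the SVM can be expressed compactly with scikit-learn. The grid below (gamma from 0.1 to 2.0 in steps of 0.1, plus a handful of C values) and the toy data are illustrative assumptions; the RVM and SMLR parameters were tuned by analogous trial-and-error sweeps rather than this routine.

```python
# Minimal sketch of a five-fold cross-validated grid search over the
# RBF-SVM parameters; grid values and data are illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0.2, 0.3), 0.05, (100, 2)),
               rng.normal((0.4, 0.5), 0.05, (100, 2))])
y = np.repeat([0, 1], 100)

grid = {"gamma": np.arange(0.1, 2.01, 0.1),   # 0.1 to 2.0, step 0.1
        "C": [1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print("selected parameters:", search.best_params_)
```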

    The position of the useful training cases in feature space was

    evaluated visually and quantitatively characterised with mea-

    sures based on their Mahalanobis distance to the centroid of

each class. The Mahalanobis distance between a case and a class

    centroid is inversely related to the typicality of the case to the

    class [54], [55]. Thus, a low distance indicates that the case lies

    close to the class centroid and so is typical of the class while

    a large distance indicates that the case is atypical of the class.

    As well as providing a simple guide to the typicality of a case

    to a class, the set of Mahalanobis distances computed over all

    classes for a case may be used to provide a simple descriptorof the location of the case relative to the class centroids and,

    more critically, the decision boundaries. For example, a decision

    boundary may be expected to lie between two class centroids

    and so at a similar Mahalanobis distance from each centroid.

    Thus, if the difference between the two smallest Mahalanobis

    distances computed for a case was small this would indicate

    that the case lies close to the border region between two classes

    and near the location of a decision boundary [56]. Conversely, if

    the difference between the two smallest Mahalanobis distances

    was large the case lies away from the border region between

    two classes and the decision boundary that separates them [56].

Here, the Mahalanobis distances and the difference between the two smallest Mahalanobis distances for each case were com-

    puted to indicate the typicality of each case to a class and its

    position relative to inter-class transition regions respectively.
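The two measures just described are straightforward to compute. The sketch below returns, for a case, its Mahalanobis distance to the nearest class centroid (inversely related to typicality) and the difference between its two smallest distances (small near a decision boundary, large away from it); the data are illustrative.

```python
# Minimal sketch of the typicality and border-proximity measures.
import numpy as np

def mahalanobis_profile(x, centroids, covariances):
    d = []
    for mu, cov in zip(centroids, covariances):
        diff = x - mu
        d.append(float(np.sqrt(diff @ np.linalg.inv(cov) @ diff)))
    d = np.sort(np.array(d))
    return d[0], d[1] - d[0]   # typicality distance, border-proximity gap

rng = np.random.default_rng(5)
classes = [rng.normal(m, 0.05, (100, 2)) for m in [(0.2, 0.3), (0.4, 0.5)]]
mus = [c.mean(axis=0) for c in classes]
covs = [np.cov(c.T) for c in classes]

for x in (np.array([0.3, 0.4]),    # roughly between the two centroids
          np.array([0.1, 0.2])):   # extreme, away from the border
    nearest, gap = mahalanobis_profile(x, mus, covs)
    print(f"case {x}: nearest-class distance {nearest:.2f}, gap {gap:.2f}")
```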

    IV. RESULTS

    The classifications of the simulated data set allowed the gen-

    eral characteristics of the useful training cases for each classi-

    fier to be determined. Two major attributes of the useful training

    cases were apparent. First, all three classifiers used only a small

    proportion of the available training data set in classifying the

    data (Table III). The total number of useful training cases ranged

    from 6 for the RVM to 76 for the SMLR, representing 2.00%

    and 25.00% of the total sample size respectively. Second, the


useful training cases were distributed in feature space in a relatively systematic fashion (Fig. 1). The location of the useful training cases, however, varied between the three classifiers.

Fig. 1. Location of the useful training cases for classifications of the simulated data by (a) SVM, (b) RVM and (c) SMLR.

    The trends were visually most apparent for class 2. For this

    class, the support vectors were a set of extreme cases that lay at

    the edge of the class distribution and between the distributions

    of the other classes (Fig. 1(a)). As expected, the support vectors,

therefore, lay in a region close to where a classification decision boundary would be fitted. With the RVM, the relevance vectors were also extreme cases but located away from the boundary region (Fig. 1(b)). Note that for all three classes the support vec-

    tors have a relatively large Mahalanobis distance to the actual


TABLE IV: MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE ETM+ (BOSTON) DATA SET

Fig. 2. Location of the useful training cases for classifications of the ETM+ (Boston) data by (a) SVM, (b) RVM and (c) SMLR.

    class of membership and a small difference between the two

    smallest Mahalanobis distances, which indicate that they are

    extreme cases located in the region of a classification decision

    boundary (Table III). The atypical nature of the support vectors

    is perhaps most apparent if the Mahalanobis distance to the ac-

    tual class of membership is expressed as a typicality probability

    [54], with the mean typicality of the support vectors for each

    class being 0.05. The relevance vectors also show a relatively

large Mahalanobis distance to the actual class of membership but a large difference between the two smallest Mahalanobis

    distances, indicating that they are extreme but anti-boundary

    in nature (Table III). Again, for each class, the mean Maha-

    lanobis distance to the actual class for the selected relevance

    vectors equated to typicality probabilities of 0.05 or less. With

    the SMLR, the location of the retained kernel basis functions

    varied from relatively typical (class 2) to atypical and near/anti

decision boundary (classes 1 and 3) (Fig. 1(c)). While the Maha-

lanobis distances to the actual class were generally smaller than

    for the SVM and RVM, the useful training cases for classes 1

and 3 were still highly atypical, with small mean typicality probabilities. These trends are also apparent in the Mahalanobis distance based metrics that characterise the location of

    the useful training cases (Table III).

Given the uncertainty in the location of the useful training cases provided by the SMLR classifier with the simulated data set, further analyses with the ETM+ (Boston) data set were undertaken. The results summarised in Table IV and Fig. 2 suggest similar trends to those observed with the simulated

data set with regard to the location of support vectors and

    relevance vectors. The total number of useful training cases

    with this data set varied from 7 for the RVM to 19 for the

    SVM representing about 2.00% and 6.00% of the total training

sample respectively. The results indicate that the useful training cases with SMLR are located away from the class boundary

    for all three classes (Fig. 2). A comparison of Mahalanobis

distances and the difference between the two smallest Mahalanobis distances (Table IV) suggests that they are extreme cases lying away from the class boundary.

    Similar trends were observed with the classifications of the

    ATM data set. Of the 600 training cases available, the SVM,

    RVM and SMLR used only 202, 44 and 101 respectively, repre-

senting between 7.33% and 33.66% of the total set. In the case of the analyses of the ETM+ (Littleport) data set, the SVM, RVM and SMLR used 314, 79 and 172 training cases, representing between 11.29% and 44.90% of the total set of 700 training cases.

    The difference between the two smallest Mahalanobis distances

    was generally small for the support vectors but generally large

    for the relevance vectors and the retained kernel basis functions

    for both the ATM (Table V) and ETM+ (Table VI) data sets. The

    results again suggest that useful training cases for the SVM are

    atypical and lie in the border region between classes while for

    the RVM and SMLR the useful training cases are atypical but

    located away from the border region.

    Together, the two attributes of the useful training cases, their

    small number and systematic location in feature space, indicate

a potential to use small training sets for classification by each of the three classifiers. Critically, the systematic nature of their

    location in feature space suggests a potential to predict their lo-

    cation on the ground in advance. That is, the systematic location

    of the useful training cases in feature space can be re-projected

    into geographical space to allow intelligent training [1], [13].

    This has been achieved with SVM, for example by deliberately

    focusing training data collection activities on extreme cases that

    are expected to have most spectral similarity to other classes

    [1], [13]. It should also be possible, however, to design intel-

    ligent training data acquisition programmes in ways to focus

    on potentially useful training cases for the RVM and SMLR.

For example, like the SVM, attention might focus on spectrally extreme cases, but not those in the border region, when using

    the RVM and SMLR classifiers. The operation of an intelli-


TABLE V: MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE ATM DATA

    gent training scheme requires moving between feature and ge-

    ographical space. For example, the approach used in [13] was

    based on using fundamental knowledge of the variables that in-

    fluence the spectral response to aid the selection of training sites

    on the ground that would be expected to lie at extreme posi-

    tions in feature space. For example, with a crop, extreme cases

    might be expected to occur in regions of differing growth stage

    and cover as well as with differing soil backgrounds. Moreover,

    different extremities can be defined. For example, sites of ex-

tremely high and low plant cover would be expected to lie in dif-

    ferent locations in feature space. Similarly, crops grown on dif-

ferent soil types or perhaps growing on wet and dry soils would be expected to lie in different, potentially predictable, locations

    of feature space [13], [57]. The precise approach will depend

    on the specific data sets used but provided the useful training

    cases have a potentially predictable nature an intelligent training

    scheme should be feasible. Finally, it is apparent that the results

    also highlight that training data collection programmes should

    be designed in a classifier-specific manner. Note, for example

    with both the ATM and ETM+ (Littleport) data sets, that few of

    the training cases selected as useful by one classifier were also

    selected as useful by another classifier (Table VII).

    The results above indicate that all three classifiers use mostly

    different training cases and so point to a desire for classifier-spe-

    cific training data acquisition programmes. The importance of

    this can be seen in the results of classifications of the ATM data

    derived using a classifier trained upon data useful for another

    classifier. For example, the useful training cases for classifica-

    tion by a SVM (support vectors) were used to train the RVM and

    SMLR classifiers. The resulting classifications had an accuracy

    of 91.00% and 85.00% for the RVM and SMLR respectively;

    both less than the accuracy of 92.50% derived when the sup-

    port vectors identified from the entire training set were used.

    Similarly, the SMLR and SVM yielded classifications with an

    accuracy of 42.50% and 70.31% when trained with the useful

    training cases defined for the RVM; both substantially less than

    the 93.75% obtained when relevance vectors identified from the

entire training set were used. Lastly, when trained with the useful training cases for a SMLR classification, the accuracy of the

    SVM and RVM classifications were 87.18% and 78.00% re-

    spectively; again both substantially less than the 92.81% ob-

    tained when the retained kernel basis functions identified from

    the entire training set were used. These results indicate a decline

    in classification accuracy, by all three classification algorithms,

    when trained with useful training cases defined for another clas-

    sifier. Thus, a training set defined for one classifier and able to

    yield an accurate classification may yield a low accuracy if used

    with a different classifier. Taken together, these results highlight

    the impact of the training data on the accuracy of a classification

    and the desire for classifier specific training data acquisition.

    The potential to characterise useful training sites and so to

    design an intelligent training data collection programme offers


TABLE VI: MEAN MAHALANOBIS DISTANCE MEASURES COMPUTED OVER ALL USEFUL TRAINING CASES FOR A CLASS BASED ON ANALYSES OF THE ETM+ (LITTLEPORT) DATA

TABLE VII: NUMBER OF COMMON USEFUL TRAINING CASES

    attractive benefits relative to alternative approaches for efficient

    training such as active learning. The ideal training set is ac-

    quired at or close to the time of image acquisition. Intelligent

    training allows the location of potentially useful training sites to

    be predicted in advance of an analysis, as in [13], and so allow

    close temporal coincidence of ground and image data acquisi-

    tions. As noted above, methods such as active learning can only

    be applied post-image acquisition. In some situations the time

gap between the acquisition of ground and image data may be a source of error and uncertainty (e.g., the image may have been acquired just before crop harvesting and so show the presence of a mature crop, but post-image acquisition ground data surveys might find bare fields, etc.). Additionally, as the active learning

methods highlight useful training sites iteratively, their use could require a series of ground data collection programmes. Such a situation does not allow efficient design of ground data collection, with multiple, possibly overlapping, field programmes required. The magnitude of these concerns will vary as a function of

    factors such as class temporal variability and the source of the

    ground data. It should also be noted that the various approaches

    to efficient training may be complementary and so could per-

    haps be usefully combined. For example, intelligently selected

training sites could perhaps act as seeds or starting points for the

    selection of other potentially useful but unlabeled pixels.

    A key result is that smaller training sets than required for

    the SVM may be used by the RVM and SMLR with ATM and


TABLE VIII: CONFUSION MATRICES FOR THE CLASSIFICATIONS OF THE ATM DATA BY (A) SVM, (B) RVM AND (C) SMLR. THE OVERALL ACCURACY OF THE CLASSIFICATIONS WAS 93.75% FOR RVM, 92.50% FOR SVM AND 92.81% FOR SMLR. PER-CLASS ACCURACY (%) SHOWN FROM USER'S AND PRODUCER'S PERSPECTIVES

TABLE IX: NON-INFERIORITY TEST RESULTS RELATIVE TO SVM BASED ON THE 95% CONFIDENCE INTERVAL ON THE ESTIMATED DIFFERENCE IN ACCURACY. NOTE THAT THE DIFFERENCES IN ACCURACY WERE ALL VERY SMALL AND INSIDE THE DEFINED ZONE OF INDIFFERENCE

    ETM+ (Littleport) data sets, making them attractive alternatives

to the established SVM for image classification. The value of this

    attribute, however, is a function of the accuracy and compu-

    tational cost of the classifications. In terms of classification

    accuracy, all three classifiers produced highly accurate classi-

    fications of the ATM data set (Table VIII). Critically, the lower

    limit of the derived 95% confidence interval for the difference in

    accuracy from the SVM classification was above 0 for both the

RVM and SMLR classifications and the entire interval lay within the zone of indifference, indicating that the RVM and SMLR classifications were statistically non-inferior to that from the SVM at the 97.5% level of confidence with the ATM data set. Indeed, their

    estimated accuracies of 93.75% and 92.81% respectively were

    marginally higher than that from the SVM (92.50%; Table IX).

TABLE X: VARIATION OF CLASSIFICATION ACCURACY AND NUMBER OF RELEVANCE VECTORS WITH THE KERNEL PARAMETER, USING THE ATM DATA SET

    It is evident that the RVM produced the highest accuracy yet

    required the smallest training set, although it should be noted

    that the results of the trial analyses highlighted variation in the

    number of relevance vectors needed and classification accuracy

with the value of the kernel parameter; at large values the entire set of available training cases was required, highlighting the importance of careful parameter value selection (Table X). In

    the case of analyses with the ETM+ (Littleport) data set, the

    classifications were of similar accuracy with the classification

    obtained by RVM (80.21%) slightly lower than that from

the SVM (81.36%) and SMLR (81.71%), though critically

    the RVM and SMLR classifications were not inferior to the


TABLE XI: COMPUTATIONAL COST AND THE NUMBER OF USEFUL TRAINING CASES USED BY THE CLASSIFIERS

    SVM with the confidence intervals lying within the zone of

    indifference. Additionally, the SVM required approximately

    4.0 and 1.8 times the useful training cases used by RVM and

SMLR respectively. In terms of the training and testing time used by the SVM, RVM and SMLR, precise values for the computational cost cannot be compared exactly because all three algorithms

    were implemented using different programming languages.

    Nevertheless, a comparison of computational cost suggests that

    the RVM and, to a lesser degree, the SMLR were computation-

    ally more demanding than the SVM (Table XI), which may be

    a concern for analyses of large data sets.

    V. CONCLUSIONS

    The potential of SVM for accurate classification from small

    training sets has been established in previous research. Other

classifiers such as the RVM and SMLR, however, offer additional features, such as information on per-case classification uncer-

    tainty that may sometimes be useful. Here, it has been shown

    that the RVM and SMLR are able to classify data to similar

    accuracies to the SVM. Moreover, both RVM and SMLR re-

    quire fewer training cases than a SVM when used with remotely

    sensed data. Additionally, the useful training cases for SVM and

    RVM classifiers have different but well-defined characteristics

    which may make them easily predictable. The training cases

for the SMLR were also mostly well characterised, being of an

    extreme nature and lying away from class boundaries. Conse-

    quently, it may be possible to predict potentially useful training

    sites, especially for the SVM and RVM.

    ACKNOWLEDGMENT

    Dr. Pal wishes to thank the Association of Commonwealth

    Universities for this fellowship. The authors thank the School

    of Geography, University of Nottingham, for use of computing

facilities. The ATM data were acquired as part of the European AgriSAR campaign. For the SVM, the LIBSVM and BSVM packages

    were made available by C.-J. Lin of National Taiwan Univer-

sity, the SMLR package was provided by A. Hartemink, Duke University, and the multiclass RVM code was provided by Y.-F. Mao, Electronics and Information Department, SCUT, Guangzhou, China.

    The authors are also grateful to the editors and the referees for

    their helpful comments on the original manuscript.

    REFERENCES

[1] G. M. Foody and A. Mathur, "Toward intelligent training of supervised image classifications: Directing training data acquisition for SVM classification," Remote Sens. Environ., vol. 93, no. 1–2, pp. 107–117, Oct. 2004.

[2] M. Chi and L. Bruzzone, "A semilabeled-sample-driven bagging technique for ill-posed classification problems," IEEE Geosci. Remote Sens. Lett., vol. 2, no. 1, pp. 69–73, Jan. 2005.

[3] P. Mantero, G. Moser, and S. B. Serpico, "Partially supervised classification of remote sensing images through SVM-based probability density estimation," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 559–570, Mar. 2005.

[4] G. M. Foody, "Assessing the accuracy of land cover change with imperfect ground reference data," Remote Sens. Environ., vol. 114, no. 10, pp. 2271–2285, Oct. 2010.

[5] L. Bruzzone, M. Chi, and M. Marconcini, "A novel transductive SVM for the semisupervised classification of remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 11, pp. 3363–3373, Nov. 2006.

[6] M. Marconcini, G. Camps-Valls, and L. Bruzzone, "A composite semisupervised SVM for classification of hyperspectral images," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 234–238, Apr. 2009.

[7] L. Bruzzone and C. Persello, "A novel context-sensitive semisupervised SVM classifier robust to mislabeled training samples," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2142–2154, Jul. 2009.

[8] S. Rajan, J. Ghosh, and M. M. Crawford, "An active learning approach to hyperspectral data classification," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 4, pp. 1231–1242, Apr. 2008.

[9] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Active learning methods for remote sensing image classification," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2218–2232, Jul. 2009.

[10] P. Zhong, P. Zhang, and R. Wang, "Dynamic learning of SMLR for feature selection and classification of hyperspectral data," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 280–284, Apr. 2008.

[11] M. Pal and G. M. Foody, "Feature selection for classification of hyperspectral data by SVM," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, May 2010.

[12] G. M. Foody and A. Mathur, "The use of small training sets containing mixed pixels for accurate hard image classification: Training on mixed spectral responses for classification by a SVM," Remote Sens. Environ., vol. 103, no. 2, pp. 179–189, Jul. 2006.

[13] A. Mathur and G. M. Foody, "Crop classification by support vector machine with intelligently selected training data for an operational application," Int. J. Remote Sens., vol. 29, no. 8, pp. 2227–2240, Apr. 2008.

[14] C. Sanchez-Hernandez, D. S. Boyd, and G. M. Foody, "One-class classification for mapping a specific land cover class: SVDD classification of fenland," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 4, pp. 1061–1073, Apr. 2007.

[15] W. Li, Q. Guo, and C. Elkan, "A positive and unlabeled learning algorithm for one-class classification of remote-sensing data," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 2, pp. 717–725, Feb. 2011.

[16] J. A. Gualtieri and R. F. Cromp, "Support vector machines for hyperspectral remote sensing classification," in Proc. 27th AIPR Workshop: Advances in Computer Assisted Recognition, Washington, DC, Oct. 27, 1998, pp. 221–232.

[17] C. Huang, L. S. Davis, and J. R. G. Townshend, "An assessment of support vector machines for land cover classification," Int. J. Remote Sens., vol. 23, no. 4, pp. 725–749, Feb. 2002.

[18] G. Zhu and D. G. Blumberg, "Classification using ASTER data and SVM algorithms; The case study of Beer Sheva, Israel," Remote Sens. Environ., vol. 80, no. 5, pp. 233–240, May 2002.

[19] M. Pal and P. M. Mather, "Assessment of the effectiveness of support vector machines for hyperspectral data," Future Gen. Comput. Syst., vol. 20, no. 7, pp. 1215–1225, Oct. 2004.

[20] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[21] D. Lu and Q. Weng, "A survey of image classification methods and techniques for improving classification performance," Int. J. Remote Sens., vol. 28, no. 5, pp. 823–870, Mar. 2007.


[22] B. Waske and J. A. Benediktsson, "Fusion of support vector machines for classification of multisensor data," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3858–3866, Dec. 2007.

[23] M. Pal and P. M. Mather, "Some issues in the classification of DAIS hyperspectral data," Int. J. Remote Sens., vol. 27, no. 14, pp. 2895–2916, Jul. 2006.

[24] G. M. Foody and A. Mathur, "A relative evaluation of multiclass image classification by support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 6, pp. 1335–1343, Jun. 2004.

[25] M. Pal, "Kernel methods in remote sensing: A review," ISH J. Hydraul. Eng. (Special Issue), vol. 15, no. 1, pp. 194–215, May 2009.

[26] G. Mountrakis, J. Im, and C. Ogole, "Support vector machines in remote sensing: A review," ISPRS J. Photogramm. Remote Sens., vol. 66, no. 3, pp. 247–259, May 2011.

[27] A. Mathur and G. M. Foody, "Multiclass and binary SVM classification: Implications for training and classification users," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 241–245, Feb. 2008.

[28] J. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds. Cambridge, MA: MIT Press, 2000, pp. 61–74.

[29] B. Demir and S. Ertürk, "Hyperspectral image classification using relevance vector machines," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 586–590, Apr. 2007.

[30] G. M. Foody, "RVM-based multi-class classification of remotely sensed data," Int. J. Remote Sens., vol. 29, no. 6, pp. 1817–1823, Mar. 2008.

[31] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2100–2112, Jun. 2011.

[32] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Jun. 2001.

[33] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink, "Sparse multinomial logistic regression: Fast algorithms and generalization bounds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 957–968, Jun. 2005.

[34] G. Camps-Valls, L. Gómez-Chova, J. Calpe-Maravilla, J. D. Martín-Guerrero, E. Soria-Olivas, L. Alonso-Chordá, and J. Moreno, "Robust support vector method for hyperspectral data classification and knowledge discovery," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 7, pp. 1530–1542, Jul. 2004.

[35] G. Camps-Valls and L. Bruzzone, "Kernel-based methods for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, Jun. 2005.

[36] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Muñoz-Marí, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 606–617, Jun. 2011.

[37] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[38] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, Mar. 1995.

[39] B. E. Boser, I. M. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annual Workshop on Computational Learning Theory (COLT '92), New York, Jul. 27–29, 1992, pp. 144–152.

[40] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press, 2000.

[41] G. Camps-Valls and L. Bruzzone, Kernel Methods for Remote Sensing Data Analysis. Chichester, UK: Wiley, 2009.

[42] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.

[43] J. Borges, J. Bioucas-Dias, and A. Marçal, "Fast sparse multinomial regression applied to hyperspectral data," presented at the Int. Conf. Image Analysis and Recognition (ICIAR 2006), Póvoa de Varzim, Portugal, 2006.

[44] J. Borges, J. Bioucas-Dias, and A. Marçal, "Bayesian hyperspectral image segmentation with discriminative class learning," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 6, pp. 2151–2164, Jun. 2011.

[45] J. Li, J. M. Bioucas-Dias, and A. Plaza, "Hyperspectral image segmentation using a new Bayesian approach with active learning," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3947–3960, Oct. 2011.

[46] G. M. Foody and M. K. Arora, "An evaluation of some factors affecting the accuracy of classification by an artificial neural network," Int. J. Remote Sens., vol. 18, no. 4, pp. 799–810, Mar. 1997.

[47] M. Pal and P. M. Mather, "An assessment of the effectiveness of decision tree methods for land cover classification," Remote Sens. Environ., vol. 86, no. 4, pp. 554–565, Oct. 2003.

[48] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, 3rd ed. New York: Wiley-Interscience, 2003.

[49] G. M. Foody, "Classification accuracy comparison: Hypothesis tests and the use of confidence intervals in evaluations of difference, equivalence and non-inferiority," Remote Sens. Environ., vol. 113, no. 8, pp. 1658–1663, Aug. 2009.

[50] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Feb. 2002.

[51] M. Pal, "Multiclass approaches for support vector machine based land cover classification," in Proc. 8th Annual Int. Conf. Map India, 2005.

[52] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.

[53] X.-M. Xu, Y.-F. Mao, J.-N. Xiong, and F.-L. Zhou, "Classification performance comparison between RVM and SVM," in Proc. IEEE Int. Workshop on Anti-Counterfeiting, Security, Identification, Xiamen, China, Apr. 16–18, 2007, pp. 208–211.

[54] G. M. Foody, N. A. Campbell, N. M. Trodd, and T. F. Wood, "Derivation and applications of probabilistic measures of class membership from the maximum-likelihood classification," Photogramm. Eng. Remote Sens., vol. 58, no. 9, pp. 1335–1341, Sep. 1992.

[55] N. A. Campbell, "Some aspects of allocation and discrimination," in Multivariate Statistical Methods in Physical Anthropology, G. N. Van Vark and W. W. Howells, Eds. Dordrecht, The Netherlands: Reidel, 1984, pp. 177–192.

[56] G. M. Foody, "The significance of border training patterns in classification by a feed-forward neural network using back propagation learning," Int. J. Remote Sens., vol. 20, no. 18, pp. 3549–3562, Dec. 1999.

[57] G. M. Foody, "On training and evaluation of SVM for remote sensing applications," in Kernel Methods for Remote Sensing Data Analysis, G. Camps-Valls and L. Bruzzone, Eds. Chichester, UK: Wiley, 2009, pp. 85–109.

Mahesh Pal received the Ph.D. degree from the University of Nottingham, U.K., in 2002.

He is presently an Associate Professor in the Department of Civil Engineering, NIT Kurukshetra, Haryana, India. His major research areas are land cover classification, feature selection and the application of artificial intelligence techniques in various civil engineering applications.

Dr. Pal is on the editorial board of Remote Sensing Letters. Part of the research work reported in this paper was carried out when Dr. Pal was on a Commonwealth fellowship at the University of Nottingham during the period October 2008–March 2009.

Giles M. Foody (M'01) received the B.Sc. and Ph.D. degrees in geography from the University of Sheffield, Sheffield, U.K., in 1983 and 1986, respectively.

He is currently Professor of Geographical Information Science at the University of Nottingham, U.K. His main research interests focus on the interface between remote sensing, ecology and informatics.

Dr. Foody is currently Editor-in-Chief of the International Journal of Remote Sensing and of the recently launched journal Remote Sensing Letters, holds editorial roles with Landscape Ecology and Ecological Informatics, and serves on the editorial board of several other journals. He was awarded the Remote Sensing and Photogrammetry Society's Award, its highest award, for services to remote sensing in 2009.