A novel ensemble algorithm for biomedical classification based on Ant Colony Optimization

Lei Shi a,b,c, Lei Xi c, Xinming Ma a,b,c,*, Mei Weng c, Xiaohong Hu c

a Agronomy College, HeNan Agricultural University, Zhengzhou 450002, China
b The Incubation Base of National Key Laboratory for Physiological Ecology and Genetic Improvement of Food Crops in Henan Province, HeNan Agricultural University, Zhengzhou 450002, China
c College of Information and Management Science, HeNan Agricultural University, Zhengzhou 450002, China

Applied Soft Computing 11 (2011) 5674–5683. Received 5 December 2010; received in revised form 17 March 2011; accepted 23 March 2011; available online 2 April 2011. doi:10.1016/j.asoc.2011.03.025. © 2011 Elsevier B.V. All rights reserved.

* Corresponding author at: Agronomy College, HeNan Agricultural University, Zhengzhou 450002, China. E-mail address: [email protected] (X. Ma).

Keywords: Ant Colony Optimization; Rough set; Ensemble learning; Biomedical classification

Abstract

One of the major tasks in biomedicine is the classification and prediction of biomedical data. Ensemble learning is an effective method for significantly improving the generalization ability of classifiers and has therefore attracted growing attention in the biomedicine community. However, most existing ensemble learning techniques employ all the trained component classifiers to constitute an ensemble, which is sometimes unnecessarily large and can incur extra memory costs and computation time. To improve the generalization ability and efficiency of ensembles for biomedical classification, an ensemble approach based on Ant Colony Optimization and rough sets is proposed in this paper. Ant Colony Optimization and rough set theory are incorporated to select a subset of all the trained component classifiers for aggregation. Experimental results show that, compared with existing methods, the approach not only decreases the size of the ensemble but also obtains higher prediction performance.

1. Introduction

One of the major tasks in biomedicine is the classification and prediction of biomedical data. It could lead us to the elucidation of the secrets of life or to ways of preventing certain currently incurable diseases such as HIV. Although laboratory experimentation is the most effective method for investigating the data, it is very expensive in money and labor. With the rapid increase in the size of biomedical databases, it is essential to use computational algorithms and tools to automate the classification process. Many machine learning algorithms have therefore been widely used for the classification analysis of biomedical data, such as decision trees, k-nearest neighbor and artificial neural networks [1].

Ensemble methods are one of the major advances in machine learning in recent years. An ensemble method is a learning algorithm that trains a set of component classifiers and then combines their predictions to classify new examples [2]. As an effective way to improve classification performance, the ensemble technique is well suited to the classification analysis of biomedical data and is thus gaining more and more attention in the biomedicine community. However, ensemble methods have two important drawbacks. Firstly, they require much more memory to store all the learning models in the ensemble; secondly, they take much more computation time to produce a prediction for an unlabeled example. The storage and computation time increase with the number of component classifiers in the ensemble. Most existing ensemble learning techniques employ all the trained component classifiers to constitute the ensemble, which is sometimes unnecessarily large and can incur extra memory costs and computation time. These problems frequently limit the application of ensemble methods to the classification of biomedical data.

Rough set theory, introduced by Pawlak in 1982, is a formal mathematical tool for dealing with imprecision, uncertainty and vagueness [3]. As an important feature selection method, rough sets can preserve the meaning of the features. The essence of the rough set approach to feature selection is to find a subset of the original features. However, the number of possible subsets is very large when N (the number of features) is large, because there are 2^N subsets, and exhaustively examining all subsets of features to select the optimal one is an NP-hard problem. Therefore, it is necessary to investigate fast and effective approximate algorithms. Previous methods employed an incremental hill-climbing algorithm to select features; however, this often led to a non-minimal feature combination.

Ant Colony Optimization (ACO) is a population-based paradigm that can be used to find approximate solutions to difficult optimization problems. The first ACO algorithm was introduced in the early 1990s by Colorni and Dorigo [4], and since then many diverse variants of the basic principle have been reported in the literature.


The ACO algorithm is inspired by the social behavior of ant colonies in their search for the shortest path to food sources. Although they have no sight, ants are capable of finding the shortest route between a food source and their nest by means of chemical materials called pheromone that they deposit as they move. As an important branch of the newly developed form of artificial intelligence called Swarm Intelligence, the ACO algorithm has been shown to be an effective tool for finding good solutions. It has an advantage over simulated annealing and Genetic Algorithm approaches when the graph may change dynamically, because it can be run continuously and adapt to changes in real time [5]. The ACO algorithm was first used to solve the traveling salesman problem (TSP) [6] and has since been successfully applied to a large number of difficult problems, such as the quadratic assignment problem (QAP), routing in telecommunication networks, graph coloring and feature selection [7]. ACO is particularly attractive for feature selection, since there is no heuristic information that can guide the search to the optimal minimal subset every time, and ants can discover the best feature combinations as they traverse the graph when the features are represented as a graph.

To improve the prediction ability and efficiency of classifying biomedical data, an ACO and rough set based ensemble algorithm is proposed in this paper. ACO and rough set theory are incorporated to select a subset of all the trained component classifiers for aggregation. Experimental results show that, compared with existing methods, it not only decreases the size of the ensemble but also obtains higher prediction performance on biomedical data.

The remainder of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 introduces the basic background of ensemble learning, rough sets and ACO for the sake of further discussion. Section 4 introduces the incorporation of ACO with rough sets for feature selection. Section 5 describes the proposed novel ensemble algorithm in detail. Section 6 discusses experimental results. Finally, Section 7 presents concluding remarks and directions for future work.

2. Related work

Machine learning is the subfield of artificial intelligence which focuses on methods to construct computer programs that learn from experience with respect to some class of tasks and a performance measure [8]. Machine learning methods are suitable for biomedical data due to the learning algorithms' ability to construct classifiers that can explain complex relationships in the data. Recently, the use of machine learning has become widely accepted in biomedical applications, and much research has been conducted in the literature to address the classification of biomedical data.

In [9], the shrunken centroid method is proposed to classify biomedical data. It relies on nearest-class-centroid classification, but uses the centroids of the classes shrunken towards the centroid of all classes by a threshold-controlled amount. The threshold can be determined by a cross-validation process. The method is similar to linear discriminant analysis, but assumes a diagonal pooled covariance matrix. In [10], the classification methods Fisher Linear Discriminant Analysis and Least Squares SVM (linear and radial kernel) are studied for the classification of biomedical data. The performance of the two methods was compared by leave-one-out cross validation; the experimental results indicate the importance of regularizing the classifiers and suggest that the LS-SVM with the RBF kernel is prone to overfitting in biomedical classification.

Artificial neural networks have been identified as an effective approach in biomedical classification. In [11], an artificial neural network is used to develop a method of classifying cancers into specific diagnostic categories. The experiment demonstrates the potential applications of artificial neural networks for tumor diagnosis and the identification of candidate targets for therapy.

Ensemble learning has increasingly gained attention in biomedical research. In [12], a comparison of single supervised machine learning methods and ensemble methods is performed in classifying seven publicly available cancer datasets. The experimental results indicate that ensemble methods consistently perform well over all the datasets in terms of their specificity. A combinational feature selection and ensemble neural network method is introduced for the classification of biomedical data in [13]. However, these studies of ensemble learning employ all the trained component classifiers to constitute the ensembles, which are sometimes unnecessarily large and can incur extra computation time. In some scenarios of biomedical data classification, such as dynamically mining large repositories, such ensemble learning is not suitable due to its lower efficiency.

Intelligent algorithms such as Genetic Algorithms (GA) [14] and Particle Swarm Optimization (PSO) [15] have been used to design ensembles in a number of studies. In [16], GA is employed for selecting the features as well as selecting the types of individual classifiers to design the fusion strategy of the classifier. In [17], GA is used to optimize the weights of the feature vector that represent the importance of the features, in order to obtain better prediction accuracy. In [18], a PSO based ensemble classifier is proposed and evaluated. Each nearest prototype classifier of the ensemble is generated sequentially using PSO. PSO is used to find the prototypes' locations with the objective of reducing the error rate, and diversity among the members of the ensemble is enforced through different initializations of PSO. Simulation experiments on different classification problems show that the PSO based ensemble classifier performs better than a single classifier. Like GA and PSO, ACO is a relatively new evolutionary computation technique and has been applied to many combinatorial optimization problems. ACO has an advantage over PSO and GA approaches on similar problems when the graph may change dynamically, since the ant colony algorithm can be run continuously and adapt to changes in real time [5]. Compared with GA and PSO, ACO requires only primitive and simple mathematical operators, and does not need complex operators such as crossover and mutation; it is therefore inexpensive in terms of both memory cost and computation time. Furthermore, ACO is particularly attractive for feature selection, as there seems to be no heuristic that can guide the search to the optimal minimal subset every time; additionally, ants can discover the best feature combinations as they proceed through the search space. In this paper, ACO is combined with rough set theory to select a subset of all the trained component classifiers for aggregation, and a novel improved ensemble algorithm based on ACO and rough sets is proposed. Experiments are carried out on several public biomedical datasets. The experimental results indicate that the proposed approach achieves significant performance improvement.

3. Preliminaries

3.1. Ensemble learning

Ensemble learning is a method that trains a set of individual classifiers and then combines their predictions in some way to classify new examples [2]. As one of the major advances in inductive learning in recent years, ensemble learning is gaining more and more attention in the machine learning and data mining communities. The basic framework of ensemble learning is illustrated in Fig. 1. Each classifier (classifier 1 through classifier N) is first trained using the training examples. Then, for each example, the predicted output of each of these classifiers (Oi in Fig. 1) is combined to produce the final prediction of the ensemble (O in Fig. 1).

Fig. 1. A schematic view of ensemble learning.


Bagging, introduced by Breiman [19], is one of the most famous ensemble learning techniques and has achieved great success in building ensembles of unstable classifiers. It utilizes the bootstrap sampling technique to generate multiple training sets from the original dataset. A bootstrap sample is obtained by uniformly sampling with replacement from the training set. The size of each sample equals the size of the original set, so each example may appear multiple times or not at all in any particular sample. Every training set is used to learn an individual classifier; Bagging then combines the predictions of all individual classifiers via majority voting. Theoretically, if the bootstrap can induce significant differences in the constructed individual classifiers, the accuracy of Bagging will improve greatly.

Given the parameter T, the number of repetitions, T bootstrap samples S1, S2, ..., ST are generated. From each sample Si a classifier Ci is induced by the same learning algorithm, and the final classifier C is formed by aggregating the T classifiers. A final classification of an example x is obtained by a uniform voting scheme on C1, C2, ..., CT, i.e., x is assigned to the class predicted most often by these component classifiers. The detailed Bagging algorithm is described as Algorithm 1.

Algorithm 1 (The Bagging algorithm).
Input: original training set S, number of bootstrap samples T;
Output: Bagging classifier.
(1) For t = 1 to T
(2)   Create a training set St via a bootstrap sample from S;
(3)   Build a classifier Ct on St;
(4) Output C(x) = \arg\max_y \sum_{t=1}^{T} I(C_t(x) = y),
where I is an indicator function such that I(true) = 1 and I(false) = 0.
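As a concrete illustration of Algorithm 1 (not the authors' implementation), the following minimal Python sketch assumes a scikit-learn style component learner with fit/predict, here a decision tree:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=10, rng=np.random.default_rng(0)):
    """Train T component classifiers on bootstrap samples of (X, y)."""
    n = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, x):
    """Uniform majority vote over the component predictions (step (4))."""
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]
```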

ds [20]. It explicitly alters the distribution of training data fedo every individual classifier, specifically weights of each trainingxample. Initially the weights are uniform for all the training exam-les. During the Boosting procedure, they are adjusted after theraining of each classifier is completed. For misclassified examples

he weights are increased, while for correctly classified exampleshey are decreased. The final ensemble is constructed by com-ining individual classifiers according to their own accuracies.

ing 11 (2011) 5674–5683

AdaBoost extends Boosting to multi-class and regression problems.The detailed AdaBoost algorithm is described as Algorithm 2.

Algorithm 2 (The AdaBoost algorithm).
Input: original training set S = {(x_1, y_1), ..., (x_N, y_N)}, number of iterations T;
Output: AdaBoost classifier.
(1) Initialize d_n^{(1)} = 1/N for all n = 1, ..., N;
(2) For t = 1 to T
(3)   Train a classifier with respect to the weighted example set {S, d^{(t)}} and obtain the hypothesis h_t : x \to \{-1, +1\}, i.e., h_t = L(S, d^{(t)});
(4)   Calculate the weighted training error of h_t: \varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} I(y_n \neq h_t(x_n));
(5)   Set \alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t};
(6)   Update the weights: d_n^{(t+1)} = d_n^{(t)}\exp\{-\alpha_t y_n h_t(x_n)\}/Z_t, where Z_t is a normalization constant such that \sum_{n=1}^{N} d_n^{(t+1)} = 1;
(7)   Break if \varepsilon_t = 0 or \varepsilon_t \geq \frac{1}{2}, and set T = t - 1;
(8) Output f_T(x) = \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T}\alpha_r} h_t(x).
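A compact sketch of Algorithm 2 under simple assumptions (binary labels in {-1, +1}; a depth-1 decision tree from scikit-learn, which accepts per-example weights, standing in for the weak learner L); this is illustrative, not the paper's code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """AdaBoost with labels y in {-1, +1}; returns (alphas, hypotheses)."""
    N = len(X)
    d = np.full(N, 1.0 / N)                    # step (1): uniform weights
    alphas, hyps = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=d)
        pred = h.predict(X)
        eps = np.sum(d * (pred != y))          # step (4): weighted error
        if eps == 0 or eps >= 0.5:             # step (7): stopping rule
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # step (5)
        d = d * np.exp(-alpha * y * pred)      # step (6): reweight and
        d = d / d.sum()                        # renormalize (Z_t)
        alphas.append(alpha)
        hyps.append(h)
    return np.array(alphas), hyps

def adaboost_predict(alphas, hyps, X):
    """Sign of the alpha-weighted vote; the normalization in step (8) does not change the sign."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, hyps))
    return np.sign(scores)
```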

3.2. Rough set

Rough set theory provides a methodology for handling vagueness and uncertainty in data analysis [3], and has been successfully applied in many fields, such as text categorization [21] and feature selection [22]. In rough set theory, information is represented by a table called an information system, which contains objects and attributes.

Definition 1 (Information system). In rough set theory, an information system is formulated as a 4-tuple as follows:

IS = (U, A, V, f)   (1)

where U, called the universe, is a nonempty finite set of objects; A is a nonempty finite set of attributes characterizing the objects; V is the domain of attribute values; and f : U \times A \to V is the information function. If the attribute set A is divided into a condition attribute set C and a decision attribute set D, the information system is also called a decision table.

Definition 2 (Indiscernibility relation). For a given subset of attributes P \subseteq A, the indiscernibility relation IND(P) is defined as:

IND(P) = \{(x, y) \in U^2 \mid \forall a \in P,\ f(x, a) = f(y, a)\}   (2)

The partition of U generated by IND(P) is denoted U/P. If (x, y) \in IND(P), then x and y are indiscernible with respect to P. The indiscernibility relation is an equivalence relation, satisfying reflexivity, symmetry and transitivity.

Definition 3 (Equivalence class). The equivalence classes of the indiscernibility relation IND(P) are denoted by

[x]_P = \{y \in U \mid (x, y) \in IND(P)\}   (3)

For an arbitrary set X \subseteq U and attribute subset P \subseteq A, X can be approximated by the P-lower approximation and the P-upper approximation.

Definition 4 (Lower and upper approximation). The P-lower approximation of X is the set of all elements of U which can be certainly classified as elements of X based on the attribute set P, defined as:

\underline{P}X = \{x \mid [x]_P \subseteq X\}   (4)

The P-upper approximation of X is the set of all elements of U which can possibly be classified as elements of X based on the attribute set P, defined as:

\overline{P}X = \{x \mid [x]_P \cap X \neq \emptyset\}   (5)
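To make Definitions 2-4 concrete, here is a small illustrative Python sketch (an assumption of this edit, not part of the paper) that computes equivalence classes and the lower and upper approximations over a toy decision table:

```python
def equivalence_classes(table, objects, P):
    """Group object indices that agree on every attribute in P (Eqs. (2)-(3))."""
    classes = {}
    for x in objects:
        classes.setdefault(tuple(table[x][a] for a in P), set()).add(x)
    return list(classes.values())

def lower_upper(table, objects, P, X):
    """P-lower and P-upper approximations of X (Eqs. (4)-(5))."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, objects, P):
        if cls <= X:        # whole class inside X: certainly in X
            lower |= cls
        if cls & X:         # class intersects X: possibly in X
            upper |= cls
    return lower, upper

# Toy decision table: condition attributes a, b; decision d.
table = [
    {'a': 0, 'b': 0, 'd': 'no'},
    {'a': 0, 'b': 0, 'd': 'yes'},   # indiscernible from object 0 on {a, b}
    {'a': 1, 'b': 0, 'd': 'yes'},
    {'a': 1, 'b': 1, 'd': 'no'},
]
U = set(range(len(table)))
X = {i for i in U if table[i]['d'] == 'yes'}   # the concept "d = yes"
print(lower_upper(table, U, ['a', 'b'], X))    # -> ({2}, {0, 1, 2})
```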


Fig. 2. Lower and upper approximation.

Fig. 2 provides a schematic diagram of a rough set X with its upper and lower approximations.

Definition 5 (Positive, negative and boundary regions). Let P, Q \subseteq A be equivalence relations over U. Then the positive, negative and boundary regions are defined as

POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X   (6)

NEG_P(Q) = U - \bigcup_{X \in U/Q} \overline{P}X   (7)

BND_P(Q) = \bigcup_{X \in U/Q} \overline{P}X - \bigcup_{X \in U/Q} \underline{P}X   (8)

POS_P(Q), the positive region of the partition U/Q with respect to P, contains all objects in U that can be uniquely classified into blocks of the partition U/Q by means of the knowledge in the attributes P.

Definition 6 (Dependency degree). Let P, Q \subseteq A. We say that Q depends on P in a degree \gamma (0 \leq \gamma \leq 1), if

\gamma = \gamma_P(Q) = \frac{|POS_P(Q)|}{|U|}   (9)

where |Y| is the cardinality of a set Y. An important issue in data analysis is discovering dependencies between attributes. The quantity \gamma can be used to measure the degree of dependency between Q and P. If \gamma = 1, Q depends totally on P; if 0 < \gamma < 1, Q depends partially on P; and if \gamma = 0, Q does not depend on P. When P is a set of condition attributes and Q is the decision attribute set, \gamma_P(Q) is called the quality of approximation of the classification.

Attribute reduction in rough set theory removes redundant attributes so that the reduced set provides the same quality of classification as the original. For R \subseteq C, the set of all reducts can be defined, based on the dependency degree, as

Red = \{R \mid \gamma_R(D) = \gamma_C(D),\ \forall B \subset R,\ \gamma_B(D) \neq \gamma_C(D)\}   (10)

In particular, a reduct with minimal cardinality is called a minimal reduct, and it is what rough set attribute reduction searches for:

Red_{min} = \{R \in Red \mid \forall R' \in Red,\ |R| \leq |R'|\}   (11)

The intersection of all reducts is called the core; the elements of the core are those attributes that cannot be eliminated. The core is defined as

Core(C) = \bigcap Red   (12)
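Similarly, the dependency degree of Eq. (9) and a brute-force search for minimal reducts per Eqs. (10) and (11) can be sketched as follows (illustrative only; the exhaustive enumeration is exponential in the number of attributes, which is exactly why Section 4 replaces it with an ACO search):

```python
from itertools import combinations

def dependency(table, P, D):
    """gamma_P(D) = |POS_P(D)| / |U| (Eq. (9)): the fraction of objects whose
    P-equivalence class is consistent on the decision attributes D."""
    groups = {}
    for i, row in enumerate(table):
        groups.setdefault(tuple(row[a] for a in P), []).append(i)
    pos = sum(len(g) for g in groups.values()
              if len({tuple(table[i][d] for d in D) for i in g}) == 1)
    return pos / len(table)

def minimal_reducts(table, C, D):
    """Enumerate subsets of C by size; return the smallest ones
    with full dependency, per Eqs. (10)-(11)."""
    full = dependency(table, C, D)
    for size in range(1, len(C) + 1):
        hits = [list(R) for R in combinations(C, size)
                if dependency(table, list(R), D) == full]
        if hits:
            return hits
    return [list(C)]

# Example on a toy decision table with condition attributes a, b, c and decision d:
table = [
    {'a': 0, 'b': 0, 'c': 1, 'd': 'no'},
    {'a': 0, 'b': 1, 'c': 1, 'd': 'yes'},
    {'a': 1, 'b': 0, 'c': 0, 'd': 'yes'},
    {'a': 1, 'b': 1, 'c': 0, 'd': 'no'},
]
print(minimal_reducts(table, ['a', 'b', 'c'], ['d']))   # -> [['a', 'b'], ['b', 'c']]
```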

ing 11 (2011) 5674–5683 5677

3.3. Ant Colony Optimization

ACO algorithms were introduced in the early 1990s and have become one of the most successful Swarm Intelligence techniques. Swarm Intelligence describes the collective behavior of decentralized, self-organized systems, which are typically made up of a population of social insects interacting locally with one another and with their environment. Social insects such as ants, wasps and bees cooperate to accomplish complex, difficult tasks. This cooperation is distributed among the entire population, without any centralized control. Each individual follows very simple rules influenced by locally available information. This emergent behavior results in great achievements that no single member could accomplish by itself. Swarm intelligent systems possess important additional properties, such as robustness against individual misbehavior or loss, the flexibility to change quickly in a dynamic environment, and an inherent parallelism or distributed action.

As a part of Swarm Intelligence, the ACO algorithm was inspired by observations of the behavior of real ants in finding paths from the colony to food. In the real world, ants exhibit a strong ability to find the shortest routes from the colony to food by depositing pheromone as they travel. Ants wander randomly and, upon finding food, return to their colony while laying down pheromone trails. Each ant probabilistically prefers to follow a direction rich in this chemical. If other ants find such a path, they are likely not to keep traveling at random, but instead to follow the trail, returning and reinforcing it if they eventually find food. The pheromone decays over time, resulting in much less pheromone on less popular paths. Thus, when one ant finds a short path from the colony to a food source, this route will have a higher rate of ant traversal, and other ants are more likely to follow it; the shortest path is thereby reinforced and the others diminished. Eventually, all ants follow the same, shortest path. Inspired by this behavior of real ants, Dorigo and Di Caro proposed an artificial ant colony algorithm [6,23], called Ant Colony Optimization (ACO), to solve hard combinatorial optimization problems. The idea of the ACO algorithm is to mimic this behavior with "simulated ants" exploring the problem graph to search for optimal solutions. Every ant has an initial state and one or more terminating conditions. The next move is selected by a probabilistic decision rule that is a function of the locally available pheromone trails. An ant can update the pheromone trail associated with the link it travels; once it has built a solution, it can retrace the same path backward and update the pheromone trails. The ACO algorithm makes probabilistic decisions in terms of the artificial pheromone trails and local heuristic information, which allows ACO to explore a larger number of solutions than greedy heuristics. A further characteristic of the ACO algorithm is pheromone trail evaporation, i.e., a process that decreases the pheromone trail intensity over time; evaporation helps avoid rapid convergence of the algorithm towards a sub-optimal region. In general, an ACO algorithm can be applied to any optimization problem for which the following can be defined [24]:

(1) An appropriate graph representation of the problem. The problem must be described as a graph with a set of nodes and edges between nodes. The graph should accurately represent all states and transitions between states.
(2) Heuristic desirability. A suitable heuristic measure can be defined to describe the goodness of edges from one node to every other linked node in the graph.
(3) Solution construction mechanism. A method must be defined for building possible solutions.
(4) An autocatalytic feedback process. A suitable method of updating the pheromone levels on edges is required.
(5) A constraint-satisfaction method. A mechanism is needed to ensure that only feasible solutions are constructed.


Recently, the ACO algorithm has been successfully applied in many fields, such as vehicle routing and data mining. It is particularly attractive for feature selection, as there seems to be no heuristic that can guide the search to the optimal minimal subset every time [25,26].

4. Incorporation of ACO with rough set for feature selection

Attribute reduction in rough set theory is an important feature selection method, which can preserve the meaning of the features. Since the calculation of attribute reduction is an NP-hard problem with regard to computational complexity and memory requirements, it is necessary to investigate fast and effective approximate algorithms. Because hill-climbing algorithms often fail to find optimal solutions, ACO is introduced and combined with rough sets to obtain a minimal reduct. Following the standard ACO algorithmic scheme for static combinatorial optimization problems, several aspects need to be addressed: graph representation, heuristic information, construction of feasible solutions and the pheromone update rule [26].

4.1. Graph representation

To reformulate reduction as an ACO-suitable problem, the task can be described on a complete graph, where nodes represent features and edges between nodes denote the choice of the next feature. The search for the optimal feature subset is then an ant traversal through the graph in which a minimum number of nodes is visited while satisfying the traversal stopping criterion. Fig. 3 illustrates a typical construction process, where C = {a, b, c, d, e, f}. In the first step, feature a is chosen at random; then features b and c are selected. The last selected feature is d.

Fig. 3. ACO problem representation for feature selection.

4.2. Heuristic information

Each edge is assigned a pheromone trail and heuristic information. The dependency degree is used as the heuristic desirability of traversing between features. Formally, the heuristic information is given as follows:

\eta(a, b) = \frac{|POS_{\{a,b\}}(D)|}{|U|}   (13)

4.3. Construction of feasible solutions

While constructing a solution, each ant starts from a randomly selected feature and then selects the next feature from the unselected features with a given probability. That probability is calculated by [6]:

p_{ij}^{k}(t) = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}(t)}{\sum_{l \in allowed_k} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}(t)}   (14)

where j \in allowed_k; k and t index the ant and the iteration, respectively; allowed_k denotes the set of condition features that have not yet been selected; \tau(i, j) and \eta(i, j) are the pheromone value and heuristic information of choosing feature j when at feature i; and \alpha and \beta are two parameters which determine the relative importance of the pheromone trail and the heuristic information.

A construction process is stopped when one of the following two conditions is met:

(1) The cardinality of the current solution is larger than that of the temporary minimal feature reduct.
(2) \gamma_R(D) = \gamma_C(D), where R is the current solution constructed by an ant.

The first condition implies that no better solution will be constructed, so it is unnecessary to continue the construction process. The second means that a better solution has been constructed and a reduct has been found, so the construction process can be terminated. The first reduct found is temporarily regarded as the minimal reduct.
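The probabilistic choice of Eq. (14) is straightforward to implement. Below is a small illustrative Python sketch of one ant's next-feature selection (the parameter defaults are arbitrary assumptions of this edit; `tau` and `eta` are pheromone and heuristic matrices):

```python
import numpy as np

def choose_next_feature(tau, eta, current, allowed, alpha=1.0, beta=0.5,
                        rng=np.random.default_rng()):
    """Sample the next feature j with probability proportional to
    tau[i, j]**alpha * eta[i, j]**beta over the unselected features (Eq. (14))."""
    allowed = list(allowed)
    weights = np.array([tau[current, j] ** alpha * eta[current, j] ** beta
                        for j in allowed])
    return rng.choice(allowed, p=weights / weights.sum())  # normalize over allowed_k
```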

4.4. Pheromone updating

After each ant has constructed a solution, the pheromone trails are updated by the following rule:

\tau_{ij}(t + 1) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij}(t)   (15)

where \tau_{ij}(t) is the amount of pheromone on a given edge (i, j) at iteration t, \tau_{ij}(t + 1) is the amount of pheromone on that edge at the next iteration, \rho (0 < \rho < 1) is a decay constant used to simulate the evaporation of pheromone, and \Delta\tau_{ij}(t) is the amount of pheromone deposited: \Delta\tau_{ij}(t) = \sum (q/|R^{(t)}|) over the solutions R^{(t)} whose ants traversed edge (i, j), and \Delta\tau_{ij}(t) = 0 otherwise.

The feature selection algorithm based on ACO and rough set performs as follows: in each cycle, every ant constructs a solution and then the pheromone trails are updated. The algorithm stops iterating when one of the two termination conditions is met. The structure of the feature selection algorithm based on ACO and rough set is shown in Fig. 4, and its detailed process is described as Algorithm 3.

Fig. 4. Structure view of the feature selection algorithm.
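The update of Eq. (15) can be sketched as follows (again illustrative; `solutions` is assumed to hold each ant's reduct as the ordered list of feature indices it traversed, and q is a deposit constant):

```python
import numpy as np

def update_pheromone(tau, solutions, rho=0.9, q=1.0):
    """Evaporate all trails, then deposit q/|R| on every edge of each ant's reduct R (Eq. (15))."""
    tau *= rho                                  # evaporation term: rho * tau_ij(t)
    for reduct in solutions:
        deposit = q / len(reduct)               # shorter reducts deposit more per edge
        for i, j in zip(reduct, reduct[1:]):    # consecutive features define traversed edges
            tau[i, j] += deposit
            tau[j, i] += deposit                # symmetric: the graph is undirected
    return tau
```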


Algorithm 3 (Feature selection algorithm based on ACO and rough set).
Inputs: a decision table IS = (U, C \cup D, V, f) and parameters;
Outputs: a minimal feature reduct Rmin and its cardinality Lmin.
(1) Initialize Rmin = C and Lmin = |C|;
(2) Calculate the pheromone trails and the heuristic information \eta and \gamma_C(D);
(3) Do
(4)   For every ant k
(5)     Construct a solution according to the transition rule and the dependency degree, obtaining a reduct Rk and its cardinality lk;
(6)     If (lk < Lmin) then Rmin = Rk and Lmin = lk;
(7)   Update the pheromone trails;
(8) While (the termination condition is not satisfied)
(9) Output Rmin and Lmin.
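Putting Sections 4.1-4.4 together, the following is a minimal end-to-end sketch of Algorithm 3 (illustrative assumptions throughout: the heuristic of Eq. (13) is precomputed pairwise, the parameter values are arbitrary, and the decision table is a list of attribute-value dictionaries as in the earlier rough set sketches):

```python
import numpy as np

def dependency(table, P, D):
    """gamma_P(D) of Eq. (9): fraction of objects whose P-class is pure on D."""
    groups = {}
    for i, row in enumerate(table):
        groups.setdefault(tuple(row[a] for a in P), []).append(i)
    pos = sum(len(g) for g in groups.values()
              if len({tuple(table[i][d] for d in D) for i in g}) == 1)
    return pos / len(table)

def aco_reduct(table, C, D, n_ants=10, n_iters=20,
               alpha=1.0, beta=0.5, rho=0.9, q=1.0, seed=0):
    """Algorithm 3: ACO search for a small feature reduct of the decision table."""
    rng = np.random.default_rng(seed)
    n = len(C)
    full = dependency(table, C, D)              # gamma_C(D), the target quality
    tau = np.ones((n, n))                       # uniform initial pheromone
    eta = np.array([[dependency(table, [C[i], C[j]], D) + 1e-9
                     for j in range(n)] for i in range(n)])   # Eq. (13)
    best, best_len = list(range(n)), n          # step (1): R_min = C, L_min = |C|
    for _ in range(n_iters):
        solutions = []
        for _ in range(n_ants):
            R = [int(rng.integers(n))]          # each ant starts at a random feature
            while dependency(table, [C[i] for i in R], D) < full:
                if len(R) >= best_len:          # stopping condition (1)
                    break
                allowed = [j for j in range(n) if j not in R]
                w = np.array([tau[R[-1], j] ** alpha * eta[R[-1], j] ** beta
                              for j in allowed])              # Eq. (14)
                R.append(int(rng.choice(allowed, p=w / w.sum())))
            else:                               # stopping condition (2): a reduct was found
                solutions.append(R)
                if len(R) < best_len:           # step (6): keep the shortest reduct
                    best, best_len = R, len(R)
        tau *= rho                              # Eq. (15): evaporation ...
        for R in solutions:
            for i, j in zip(R, R[1:]):
                tau[i, j] += q / len(R)         # ... plus deposit on traversed edges
                tau[j, i] += q / len(R)
    return [C[i] for i in best]
```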

5. The novel ensemble algorithm based on ACO and rough set

In this section, a novel ensemble algorithm based on ACO and rough set is proposed for classifying biomedical data effectively and efficiently. The proposed algorithm comprises five key steps. In the first step, the original training set S is randomly partitioned into two sets, S1 and S2. In the second step, bootstrap samples are generated from S1, and a collection of component classifiers is built from these bootstrap samples. In the third step, all component classifiers are used to classify the examples in S2, and a decision table is constructed from the predicted class labels and the real class labels of the examples in S2. Assume that there are M examples in S2 and N component classifiers obtained from the second step. Each example in S2 is classified by each component classifier, yielding N classification results per example. The classification results for the M examples in S2 can be represented by an M x N matrix W = [l_ij], where l_ij is the class label of the i-th example predicted by the j-th component classifier. Together with the decision attribute, i.e., the real class labels of the examples, this matrix can be considered a decision table. In the fourth step, the decision table is reduced using the feature selection algorithm based on ACO and rough set, which selects a subset of the component classifiers. In the fifth step, a decision tree learner is trained on the reduced decision table to build the final classifier. When classifying a new example, the selected component classifiers are used to classify the example, and their predictions are then used as input to the final decision tree classifier, which classifies the example. The novel ensemble algorithm based on ACO and rough set is described as Algorithm 4.

Algorithm 4 (The novel ensemble algorithm based on ACO and rough set).
Inputs: training set S, number of trials T;
Outputs: final classifier.
(1) Partition S into two sets, S = S1 \cup S2;
(2) Create a pool of T component classifiers:
(2.1) For t = 1 to T
(2.2)   St = bootstrap sample from S1;
(2.3)   Train a component classifier on St;
(3) Construct the decision table:
(3.1) Apply the T component classifiers in the pool to classify the examples in S2;
(3.2) Construct a decision table from the predicted class labels and the real class labels of the examples in S2;
(4) Select the component classifiers:
(4.1) Process the decision table using the feature selection algorithm based on ACO and rough set and obtain a reduct;
(5) Construct the final classifier for classification:
(5.1) Train a decision tree classifier L on the reduced decision table;
(5.2) Use the decision tree classifier L to classify new examples: for a new example, only the component classifiers in the reduct are used to predict it, and the decision tree classifier L aggregates their predictions to classify the example.
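A condensed sketch of Algorithm 4 (a hypothetical arrangement for illustration: scikit-learn's MLPClassifier stands in for the paper's neural network component learner, numeric class labels and NumPy arrays are assumed, and `aco_reduct` is the Algorithm 3 sketch above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def fit_aco_ensemble(X, y, T=10, s1_frac=0.75, seed=0):
    """Algorithm 4 sketch. X, y are NumPy arrays; class labels are numeric."""
    rng = np.random.default_rng(seed)
    # Step (1): partition the training data into S1 and S2.
    X1, X2, y1, y2 = train_test_split(X, y, train_size=s1_frac, random_state=seed)
    # Step (2): train T component classifiers on bootstrap samples of S1.
    pool = []
    for _ in range(T):
        idx = rng.integers(0, len(X1), size=len(X1))
        pool.append(MLPClassifier(max_iter=500).fit(X1[idx], y1[idx]))
    # Step (3): the decision table = component predictions on S2 plus true labels.
    W = np.column_stack([clf.predict(X2) for clf in pool])
    table = [dict(list(enumerate(row)) + [('d', lab)]) for row, lab in zip(W, y2)]
    # Step (4): ACO + rough set reduction picks a subset of the classifiers.
    keep = aco_reduct(table, C=list(range(T)), D=['d'])
    # Step (5): a decision tree aggregates the selected classifiers' outputs.
    meta = DecisionTreeClassifier().fit(W[:, keep], y2)
    return [pool[j] for j in keep], meta

def predict_aco_ensemble(selected, meta, X):
    """Only the selected component classifiers vote; the tree aggregates."""
    votes = np.column_stack([clf.predict(X) for clf in selected])
    return meta.predict(votes)
```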


6. Experiments

In this section, we study the performance of the proposed ensemble method based on ACO and rough set and demonstrate its effectiveness for the classification of biomedical data.

6.1. Experimental setup

In order to evaluate the proposed algorithm, we employed 4 biomedical datasets from the UCI machine learning repository [27]: the Wisconsin breast cancer dataset, the Pima Indians diabetes dataset, the heart disease dataset and the hepatitis domain dataset (abbreviated here as breast, diabetes, heart and hepatitis, respectively). Some details of the studied datasets are introduced as follows.

Wisconsin breast cancer dataset: it consists of 699 examples and 9 attributes. The goal is to discriminate between benign and malignant breast cancer.

Pima Indians diabetes dataset: it consists of 768 examples and 8 attributes. The goal is to discriminate between patients who tested positive and those who tested negative for diabetes, indicating whether the patient shows signs of diabetes according to World Health Organization criteria.

Heart disease dataset: it consists of 270 examples and 13 attributes. The goal is to discriminate between the presence and absence of heart disease in a patient.

Hepatitis domain dataset: it consists of 155 examples and 19 attributes. The goal is to discriminate between hepatitis patients who died and those who survived.

6.2. Performance metrics

To analyze the classification performance, we adopt the accuracy and AUC measures. As shown in Table 1, four cases are considered as the result of a classifier on an example [28].

Table 1
Cases of the classification for one class.

                              Result of classifier
Class C                       Belong      Not belong
Real      Belong              TP          FN
class     Not belong          FP          TN

TP (True Positive): the number of examples correctly classified to that class.
TN (True Negative): the number of examples correctly rejected from that class.
FP (False Positive): the number of examples incorrectly classified to that class.
FN (False Negative): the number of examples incorrectly rejected from that class.

Using these quantities, the performance of the classification can be evaluated in terms of accuracy:

accuracy = \frac{TP + TN}{TP + TN + FP + FN}   (16)

The Receiver Operating Characteristic (ROC) curve is a technique for summarizing a classifier's performance over a range of tradeoffs between the TP rate and the FP rate, which are calculated as:

TP\ rate = \frac{TP}{TP + FN}   (17)

FP\ rate = \frac{FP}{FP + TN}   (18)
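Eqs. (16)-(18) in code form (trivial helpers; the AUC discussed next can be computed with scikit-learn's roc_auc_score, assuming binary labels and positive-class scores):

```python
from sklearn.metrics import roc_auc_score

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Eq. (16)

def tp_rate(tp, fn):
    return tp / (tp + fn)                    # Eq. (17), a.k.a. sensitivity/recall

def fp_rate(fp, tn):
    return fp / (fp + tn)                    # Eq. (18)

# AUC: probability that a random positive scores above a random negative.
# y_true in {0, 1}; y_score = the classifier's score for the positive class:
# auc = roc_auc_score(y_true, y_score)
```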

In the ROC curve, the TP rate is plotted on the Y-axis and the FP rate on the X-axis.


Table 2
Comparison of accuracy values on the datasets (size of full ensemble = 10).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed  Num
Breast     95.85  95.92    95.70     96.70  96.81  97.43     3.2
Diabetes   75.52  75.65    73.95     74.73  75.89  76.04     4.7
Heart      81.49  84.07    78.88     81.28  84.44  89.70     5.2
Hepatitis  85.82  87.01    85.75     85.61  87.64  89.74     4.6
avg        84.67  85.66    83.57     84.58  86.20  88.23     4.4

Table 3
Comparison of accuracy values on the datasets (size of full ensemble = 40).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed  Num
Breast     95.85  96.28    96.42     96.70  96.81  97.65     4.2
Diabetes   75.52  75.91    74.16     74.73  75.89  77.76     6.2
Heart      81.49  85.23    80.37     81.28  84.44  88.23     6.3
Hepatitis  85.82  86.49    85.16     85.61  87.64  88.13     4.3
avg        84.67  85.98    84.03     84.58  86.20  87.94     5.3


ROC analysis originated from signal processing and is increasingly recognized in the machine learning and data mining research communities. The area under the ROC curve (AUC) represents the expected performance as a single scalar and has been shown to be an important criterion for discriminating between competing classifier models [29,30]. AUC measures the probability that the classifier output for a randomly chosen positive example is greater than the classifier output for a randomly chosen negative example. Furthermore, AUC has a known statistical meaning: it is equivalent to the Wilcoxon test of ranks, and is also equivalent to several other statistical measures for evaluating classification and ranking models [31]. AUC has been found to be more sensitive to differences in classifiers' performances and able to distinguish between classifiers whose classification accuracies are tied. Rosset [29] recommends using both the accuracy and AUC measures in evaluating classifier models. In our study, accuracy and AUC are employed as performance indexes to evaluate the classification performance.

6.3. Results and discussion

To evaluate the performance of the proposed novel ensemble algorithm based on ACO and rough set, two popular ensemble algorithms, Bagging [19] and Boosting, are implemented and used as benchmarks in the experiments. Boosting has many variations. AdaBoost.M1 [20], the most popular of AdaBoost's variations, is intended for classification problems where each classifier can attain a weighted error of no more than 1/2, and it is adopted in this paper. We use an artificial neural network as the component classifier for Bagging, AdaBoost and the proposed ACO and rough set based ensemble algorithm. For comparison, we also include the classification results of an artificial neural network, a Support Vector Machine (SVM) and k-nearest neighbor (kNN). In the experiments, we use a radial basis function network as the artificial neural network implementation and LIBSVM as the SVM implementation, with the linear kernel as the default kernel function of the SVM. For the kNN classifier used as a benchmark, we investigate values of the parameter k from 1 to 20 in increments of 1. The tuning results suggest that the kNN achieves satisfactorily high performance when k is set to 5 on the breast and hepatitis datasets, 7 on the diabetes dataset and 10 on the heart dataset; these values are used in our subsequent experiments. We range the ensemble sizes for Bagging and Boosting from 10 to 50 to create a wide range of settings.

In the experiments, classification performance is evaluated by 10-fold cross validation. Each dataset is first partitioned into 10 equal-sized sets. We then use nine parts to create the training set for training the classification models and the remaining tenth to create the test set for evaluating the performance of each technique. In each training-test procedure of the 10-fold cross validation, we repeat the algorithms used for comparison five times with different random seeds in order to ensure that the comparison among different classifiers does not happen by chance. In this way we obtain 50 experimental results for each technique, and the mean of the 50 performances is used as the final result for the artificial neural network, kNN, SVM, Bagging and Boosting algorithms, respectively. For the proposed approach, the training set is further partitioned into S1 and S2: for each dataset, a given percentage (the partition parameter) of the examples from the training set is randomly selected as the set S1, and the remaining examples form the set S2. The sets S1 and S2 are then used for training the proposed approach according to Algorithm 4, and the test set is employed for evaluating the performance. To minimize potential biases from the randomized sampling process and obtain more reliable performance evaluations for the proposed approach, we perform the partition process five times in each training-test procedure. In this way, we obtain 250 experimental results for the proposed approach, and its effectiveness is estimated using the mean of the performances obtained from the 250 individual validation trials. The value of the partition parameter is tuned from 15% to 95% in increments of 10% to design a wide range of scenarios. From the tuning results, we observe that as the partition percentage increases, the proposed algorithm at first performs better and afterward worse: too small a value leaves too few examples to train the component classifiers, while too large a value leaves too few examples to construct the decision table for selecting the component classifiers, and both cases lead to worse performance. The tuning results suggest that the proposed algorithm achieves its best results when the parameter is set to 75% on breast, heart and hepatitis, and 65% on diabetes; these values are used in our subsequent experiments.
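The evaluation protocol just described can be sketched as follows (a hypothetical driver reusing the Algorithm 4 sketch above; `fit_aco_ensemble` and `predict_aco_ensemble` are assumptions of this edit):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def evaluate(X, y, n_repeats=5, s1_frac=0.75):
    """10-fold CV, with several random S1/S2 partitions per fold, as in Section 6.3."""
    scores = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                                  random_state=0).split(X, y):
        for seed in range(n_repeats):   # five random partitions per fold
            sel, meta = fit_aco_ensemble(X[tr], y[tr], s1_frac=s1_frac, seed=seed)
            scores.append(accuracy_score(y[te],
                                         predict_aco_ensemble(sel, meta, X[te])))
    return np.mean(scores)              # mean over the individual validation trials
```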

Table 2 shows the classification results of the various techniques in terms of mean accuracy when the ensemble sizes of both Bagging and Boosting are 10 and the proposed ensemble algorithm selects a subset from 10 trained component classifiers. The final column of the table gives the mean number of component classifiers selected and used by the proposed algorithm. Artificial neural network is abbreviated as ANN in Table 2. According to the experimental results presented in Table 2, the proposed ensemble algorithm yields the best performance among these algorithms on all four datasets. For example, on the breast dataset, the accuracy of the proposed algorithm is 97.43%, which is approximately 1.58% higher than that of the artificial neural network, 1.51% higher than Bagging, 1.73% higher than Boosting, 0.73% higher than kNN and 0.62% higher than SVM. In Table 2, avg shows the summarized result calculated by averaging the accuracy values over all datasets. The average classification performance of the proposed algorithm beats the artificial neural network by about 3.56%, Bagging by 2.57%, Boosting by approximately 4.66%, kNN by 3.65% and SVM by about 2.03%. The average size of the ensembles generated by the proposed algorithm is only about 44% (4.4/10.0) of the full ensemble.

The mean experimental results when the ensemble sizes of both Bagging and Boosting are 40 and the proposed algorithm selects a subset from 40 trained component classifiers are shown in Table 3. The proposed algorithm again yields higher performance than the other algorithms on all datasets. Here, the mean size of the ensembles generated by the proposed approach is only about 13% (5.3/40.0) of the full ensemble.


Fig. 5. Performance curves of methods on breast dataset.


Fig. 6. Performance curves of methods on the diabetes dataset.


Table 4
Comparison of AUC values on the datasets (size of full ensemble = 10).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed
Breast     0.986  0.990    0.985     0.989  0.972  0.993
Diabetes   0.783  0.809    0.790     0.766  0.727  0.820
Heart      0.893  0.887    0.866     0.845  0.841  0.899
Hepatitis  0.835  0.834    0.821     0.782  0.735  0.873

Table 5
Comparison of AUC values on the datasets (size of full ensemble = 20).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed
Breast     0.986  0.990    0.984     0.989  0.972  0.993
Diabetes   0.783  0.812    0.789     0.766  0.727  0.820
Heart      0.893  0.891    0.867     0.845  0.841  0.900
Hepatitis  0.835  0.848    0.794     0.782  0.735  0.873

Fig. 7. Performance curves of methods on the heart dataset.

Fig. 8. Performance curves of methods on the hepatitis dataset.

Figs. 5–8 display the comparison of mean accuracy for the artificial neural network, Bagging, Boosting, kNN, SVM and the proposed algorithm over different ensemble sizes on each dataset. Note that Bagging and Boosting are full ensembles, using all trained component classifiers at each size, whereas the proposed algorithm uses a subset of the full ensemble at each size. The figures show that the proposed algorithm is consistently better than the artificial neural network, kNN, SVM, Bagging and Boosting on all datasets. They also show that SVM, one of the state-of-the-art algorithms, outperforms the artificial neural network on all four datasets, and Bagging is also consistently better than the artificial neural network on all datasets, but the performance of Boosting is not as stable: on the breast dataset Boosting is slightly better than the artificial neural network, while on the diabetes, heart and hepatitis datasets it is worse. Overall, the proposed algorithm is generally better than both Bagging and Boosting for the classification of biomedical data.

Table 4 lists the mean AUC values of the six methods (ANN, Bagging, Boosting, kNN, SVM and the proposed ACO and rough set based algorithm) on each dataset when the size of the full ensemble is equal to 10.

The results show that the proposed ACO and rough set based algorithm improves the AUC values on all datasets. The performance of the ensembles produced by the proposed algorithm is better than that of the artificial neural network, kNN and SVM, and it even outperforms the two popular ensemble algorithms, Bagging and Boosting. Note that the proposed algorithm uses only about 4 classifiers, compared to 10 for the Bagging and Boosting ensembles. Two conclusions can be drawn from this result. Firstly, it is not always true that the larger the ensemble, the better it is. Secondly, the way the proposed algorithm incorporates ACO and rough sets to select and aggregate a subset of all the trained component classifiers is quite effective.

Tables 5–8 display the comparison of mean AUC values for the artificial neural network, Bagging, Boosting, kNN, SVM and the proposed algorithm when the sizes of the full ensemble are equal to 20, 30, 40 and 50, respectively.


Table 6
Comparison of AUC values on the datasets (size of full ensemble = 30).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed
Breast     0.986  0.990    0.982     0.989  0.972  0.994
Diabetes   0.783  0.812    0.789     0.766  0.727  0.821
Heart      0.893  0.892    0.866     0.845  0.841  0.903
Hepatitis  0.835  0.862    0.830     0.782  0.735  0.875

Table 7
Comparison of AUC values on the datasets (size of full ensemble = 40).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed
Breast     0.986  0.892    0.981     0.989  0.972  0.994
Diabetes   0.783  0.812    0.789     0.766  0.727  0.819
Heart      0.893  0.892    0.866     0.845  0.841  0.903
Hepatitis  0.835  0.857    0.829     0.782  0.735  0.876

Table 8
Comparison of AUC values on the datasets (size of full ensemble = 50).

Data       ANN    Bagging  Boosting  kNN    SVM    Proposed
Breast     0.986  0.991    0.981     0.989  0.972  0.994
Diabetes   0.783  0.811    0.789     0.766  0.727  0.819
Heart      0.893  0.891    0.868     0.845  0.841  0.902
Hepatitis  0.835  0.868    0.826     0.782  0.735  0.876


Fig. 9. Variation of the number of selected component classifiers.

The AUC results also confirm the effectiveness of the proposed method. Thus, the experimental comparison clearly shows that the performance improvement of the proposed algorithm is significant both in accuracy and in AUC.

Fig. 9 illustrates the variation of the mean number of component classifiers selected and used by the proposed ensemble algorithm with the number of component classifiers in the full ensemble on the breast dataset. From the figure, we can see that as the number of component classifiers in the full ensemble increases from 10 to 50, the number of component classifiers selected by the proposed algorithm remains only a small fraction of all component classifiers and increases only slightly. For the other datasets the behavior is similar. This is an obvious improvement over the full ensemble both in computational speed and in storage requirements.

Table 9 summarizes the comparison of efficiency on the breast dataset when the ensemble sizes of both Bagging and Boosting are 40 and the proposed ensemble algorithm selects a subset from 40 trained component classifiers. The values in the columns represent the mean computation time in seconds and the memory in MB spent on classifying the test set by the different algorithms. All algorithms are coded in Java and run on a Pentium 3 PC with a single 1.0 GHz CPU and 512 MB memory.

Table 9
Comparison of efficiency on the breast dataset (size of full ensemble = 40).

Cost resource  ANN   Bagging  Boosting  kNN   SVM   Proposed
Time (s)       1.2   28.6     20.8      0.9   1.3   4.6
Memory (MB)    9     43       31        3     8     15

From the experimental results, we make the following observations. The proposed approach is slower than kNN, SVM and the artificial neural network, but significantly faster than the Bagging and Boosting algorithms. The total time cost of the proposed algorithm is only about 16.1% (4.6/28.6) and 22.1% (4.6/20.8) of the costs of the Bagging and Boosting algorithms, respectively. The total memory cost of the proposed algorithm is only about 34.9% (15/43) and 48.4% (15/31) of those of Bagging and Boosting, respectively. For the diabetes, heart and hepatitis datasets, the proposed approach likewise improves the memory requirements and computation time of classification compared with the Bagging and Boosting algorithms; due to space limitations, we do not list those results here.

Above all, the proposed algorithm outperforms Bagging and Boosting in classification performance on the breast, diabetes, heart and hepatitis datasets, and it is more efficient than the Bagging and Boosting algorithms. Thus, the proposed approach is a good alternative to Bagging and Boosting in scenarios of biomedical data classification where Bagging and Boosting are not suitable due to their lower efficiency.

7. Conclusion

With the rapid increase in the size of biomedical databases, classifying the data effectively and efficiently has become critical. Ensemble learning is a hot topic in machine learning and has high potential for application in biomedicine. To improve the efficiency and effectiveness of ensembles for biomedical data, a novel ensemble algorithm is proposed based on ACO and rough set theory. A comparison of the proposed algorithm with popular methods is conducted on four biomedical datasets. The experimental results indicate that the proposed algorithm yields much better performance than the artificial neural network, Bagging, Boosting, kNN and SVM. Our future effort is to combine the proposed algorithm with feature selection to improve the classification performance further.

Acknowledgements

The authors would like to thank the anonymous reviewers fortheir helpful comments and suggestions.

References

[1] J.W. Lee, J.B. Lee, M. Park, S. Song, An extensive comparison of recent classification tools applied to microarray data, Computational Statistics and Data Analysis (2005) 869–885.
[2] T.G. Dietterich, Ensemble methods in machine learning, in: Proceedings of the First International Workshop on Multiple Classifier Systems, Cagliari, 2000, pp. 1–15.
[3] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (5) (1982) 341–356.
[4] A. Colorni, M. Dorigo, V. Maniezzo, Distributed optimization by ant colonies, in: Proceedings of ECAL'91 European Conference on Artificial Life, Paris, France, 1991, pp. 134–142.
[5] M. Dorigo, L.M. Gambardella, Ant Colony System: a cooperative learning approach to the traveling salesman problem, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 53–66.
[6] M. Dorigo, V. Maniezzo, A. Colorni, Ant system: optimization by a colony of cooperating agents, IEEE Transactions on Systems, Man and Cybernetics – Part B 26 (1) (1996) 29–41.
[7] M.E. Basiri, N. Ghasem-Aghaee, M.H. Aghdam, Using ant colony optimization-based selected features for predicting post-synaptic activity in proteins, in: Proceedings of the 6th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Napoli, Italy, 2008, pp. 12–23.
[8] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[9] R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression, PNAS 99 (10) (2002) 6567–6572.
[10] N. Pochet, F.D. Smet, J. Suykens, B.L.R.D. Moor, Systematic benchmarking of microarray data classification: assessing the role of non-linearity and dimensionality reduction, Bioinformatics 20 (17) (2004) 3185–3195.
[11] J. Khan, J.S. Wei, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine 7 (6) (2001) 673–679.
[12] A. Tan, D. Gilbert, Ensemble machine learning on gene expression data for cancer classification, Applied Bioinformatics 2 (3) (2003) 75–83.
[13] B. Liu, Q. Cui, T. Jiang, S. Ma, A combinational feature selection and ensemble neural network method for classification of gene expression data, BMC Bioinformatics 5 (136) (2004) 51–131.
[14] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan, Ann Arbor, 1975.
[15] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 1995, pp. 1942–1948.
[16] L.I. Kuncheva, L.C. Jain, Designing classifier fusion systems by genetic algorithms, IEEE Transactions on Evolutionary Computation 4 (4) (2000) 327–336.
[17] B. Minaei-Bidgoli, G. Kortemeyer, W.F. Punch, Optimizing classification ensembles via a genetic algorithm for a web-based educational system, in: Proceedings of the International Workshop on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition, Lisbon, Portugal, 2004, pp. 397–406.
[18] A. Mohemmed, M. Johnston, M. Zhang, Particle swarm optimization based multi-prototype ensembles, in: Proceedings of the Genetic and Evolutionary Computation Conference, Montreal, Quebec, Canada, 2009, pp. 57–64.
[19] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.
[20] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.
[21] L. Shi, X.M. Ma, L. Xi, et al., Rough set and ensemble learning based semi-supervised algorithm for text classification, Expert Systems with Applications 38 (5) (2011) 6300–6306.
[22] R.W. Swiniarski, A. Skowron, Rough set methods in feature selection and recognition, Pattern Recognition Letters (2003) 833–849.
[23] M. Dorigo, G. Di Caro, Ant colony optimization: a new meta-heuristic, in: Proceedings of the IEEE Congress on Evolutionary Computation, Washington, DC, USA, 1999, pp. 1470–1477.
[24] E. Bonabeau, M. Dorigo, G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press, New York, 1999.
[25] L. Ke, Z. Feng, Z. Ren, An efficient ant colony optimization approach to attribute reduction in rough set theory, Pattern Recognition Letters 29 (9) (2008) 1351–1357.
[26] M.H. Aghdam, N. Ghasem-Aghaee, M.E. Basiri, Text feature selection using ant colony optimization, Expert Systems with Applications (2009) 6843–6853.
[27] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/mlearn/MLSummary.html/, 1998.
[28] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval 1 (1–2) (1999) 69–90.
[29] S. Rosset, Model selection via the AUC, in: Proceedings of the Twenty-first International Conference on Machine Learning, Banff, Alberta, Canada, 2004, pp. 89–97.
[30] A. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (7) (1997) 1145–1159.
[31] D.J. Hand, Construction and Assessment of Classification Rules, Wiley, New York, 1997.

