Empirical Investigation of Decision Tree Ensembles for Monitoring Cardiac Complications of Diabetes

Andrei V. Kelarev, Deakin University, Burwood, VIC, Australia
Jemal Abawajy, Deakin University, Burwood, VIC, Australia
Andrew Stranieri, University of Ballarat, Ballarat, VIC, Australia
Herbert F. Jelinek, Khalifa University, Abu Dhabi, UAE, & Charles Sturt University, Albury, NSW, Australia

International Journal of Data Warehousing and Mining, 9(4), 1-18, January-March 2013
Copyright © 2013, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
DOI: 10.4018/ijdwm.2013100101

ABSTRACT

Cardiac complications of diabetes require continuous monitoring since they may lead to increased morbidity or sudden death of patients. In order to monitor clinical complications of diabetes using wearable sensors, a small set of features has to be identified and effective algorithms for their processing need to be investigated. This article focuses on detecting and monitoring cardiac autonomic neuropathy (CAN) in diabetes patients. The authors investigate and compare the effectiveness of classifiers based on the following decision trees: ADTree, J48, NBTree, RandomTree, REPTree, and SimpleCart. The authors perform a thorough study comparing these decision trees as well as several decision tree ensembles created by applying the following ensemble methods: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking, and two multi-level combinations of AdaBoost and MultiBoost with Bagging for the processing of data from diabetes patients for pervasive health monitoring of CAN. This paper concentrates on the particular task of applying decision tree ensembles for the detection and monitoring of cardiac autonomic neuropathy using these features. Experimental outcomes presented here show that the authors’ application of the decision tree ensembles for the detection and monitoring of CAN in diabetes patients achieved better performance parameters compared with the results obtained previously in the literature.

Keywords: Cardiac Autonomic Neuropathy (CAN), Decision Tree Ensembles, Decision Trees, Diabetes, Receiver Operating Characteristic (ROC) Area

INTRODUCTION

Cardiac complications of diabetes may lead to increased morbidity and a higher probability of sudden death of the patients. Many patients suffering from diabetes develop complications that require continuous cardiac monitoring that can be performed by state-of-the-art mobile phones and commercial-off-the-shelf wearable bio-sensors (Tayebi, Krishnaswamy, Waluyo, Sinha & Gaber, 2011). In order to be able to monitor cardiac complications of diabetes using wearable sensors, a small set of features needs to be determined and effective algorithms for their processing have to be investigated.

The investigation and development of algorithms for pervasive healthcare systems has attracted serious attention in the literature and continues to grow. Many novel techniques for use in new data acquisition systems have been studied. These advances, relying on wireless communication and sensor technologies, have opened up a new paradigm in the healthcare industry. Examples of recent results in this field include a flexible, efficient and lightweight Wireless Body Area Network (WBAN) middleware. Song, Xiao, Waluyo, Chen and Wu (2008) investigated a service-specific middleware architecture bridging the gap between application development and the underlying network sensor devices. The proposed middleware provides functions including network initialization and registration, service announcement, sensor actuation and control for flexible multi-modality data acquisition, and real-time network service management. The middleware was implemented and tested with a healthcare monitoring test-bed using Imperial College sensor nodes. An extension of the middleware developed by Chen, Waluyo, Pek, and Yeoh (2010) overcame the constraints placed on WBAN application development by the limited network resources and bridged the communication gap between sensor nodes and a mobile device. It shielded the underlying sensor and OS/protocol stack away from the WBAN application layer and was implemented as a lightweight dynamic link library, which allows an application developer to simply incorporate the library into their application and call the required functions.

Within the context of individual healthcare, an architectural framework for wellness management was implemented by Biswas, Jayachandran, Shue, Xiao, and Yap (2007). The key benefit of this framework was in enabling incremental incorporation of new sensors and sensing modalities as well as other hardware devices. Software modules, such as signal processing algorithms and self-help oriented user interfaces, may be added easily, and the responses can be personalized and customized to suit the needs of a patient, caregiver or doctor. The extensibility and personalization are particularly valuable for home based healthcare and wellness management. Sensor technology and continuous monitoring increase the amount of data that needs to be efficiently sorted for point-of-care decision making. This has led to developments in the application of data warehousing, including analysis of the defining features of a clinical data warehouse (Nealon, Rahayu, & Pardede, 2009). The analysis centered on the opportunities for and threats to the optimization of individual performance solutions based on the structure and merits of the data within the data warehouse in order to obtain the optimal configuration. This focus was motivated by the long periods of time commonly required to process query results in data warehousing and on-line analytical processing (OLAP) as crucial elements of decision support in healthcare. A windowing data structure architecture which manages a collection of popular windows was introduced for increasing the performance of OLAP queries on a clinical data warehouse.

An effective application for resource-aware time-series analysis of ECG data on mobile devices using Symbolic Aggregate Approximation (SAX) was investigated by Sinha, Tayebi, Krishnaswamy, Waluyo, and Gaber (2011). Pek, Waluyo, Yeoh, and Chen (2009) proposed and investigated a motion-based wake-up scheme combining motion detection with existing power preservation techniques to achieve a balance between energy saving and data acquisition timeliness. It minimizes possible disruptions to a patient’s daily activities, which is crucial since wearable sensors attached to patients for continuous real-time medical monitoring typically need to remain operational for periods of up to 24 hours before a battery change or recharge. The scheme was integrated in a healthcare application demonstrating its capability to deal with critical events. This showcase confirmed the effectiveness of the proposed motion-based scheme.

Ambulatory sensing can take a number of forms that have been explored. A preventative health sensing system, Footpaths, was developed by Waluyo, Pek, Yeoh, Kok, and Chen (2009). It uses remote sensor technology to integrate a wireless body sensor network with a walking route navigation system in order to provide the most suitable walking path to the wearer, taking into account the wearer's health condition. Footpaths measures the user’s cardio-respiratory fitness level (CRF) via a wearable wireless sensor network, and the result is used to determine the most suitable walking route. MobiSense is another mobile system for ambulatory patients, introduced by Waluyo, Yeoh, Pek, Yong, and Chen (2010). It resides in a mobile device and communicates with a set of body sensors attached to the wearer. MobiSense is able to detect body postures such as lying, sitting, and standing, as well as walking speed, by utilizing a rule-based heuristic activity classification scheme based on the extended Kalman (EK) filtering algorithm. It controls each of the sensor devices and performs resource reconfiguration and management via the sensor sleep/wake-up mode. The accuracy of the activity classification scheme has been evaluated with several human subjects. Tayebi, Krishnaswamy, Waluyo, Sinha, and Gaber (2011) proposed, developed and evaluated RA-SAX, a resource-aware and energy-efficient time series analysis technique for real-time ECG analysis on mobile devices based on the Symbolic Aggregate Approximation (SAX) representation for time series. It was created to address the growing need for continuous cardiac monitoring that leverages state-of-the-art mobile phones and commercial-off-the-shelf (COTS) wearable bio-sensors. This device can now be used to obtain raw ECG features to identify pathology such as atrial fibrillation, and in combination with data mining techniques it allows ECG recordings to be incorporated into a mobile decision support system as a preventative healthcare tool.

The present article focuses on the particular question of detecting and monitoring cardiac autonomic neuropathy (CAN) from ECG recordings in diabetes patients. We use a set of four features identified previously by Huda, Jelinek, Ray, Stranieri, and Yearwood (2010); for more details please refer to the section “Cardiac Autonomic Neuropathy” below. It is known that these features form a small but effective combination with high accuracy in the detection of CAN. These features can all be collected using wearable sensors, which is particularly beneficial for continuous monitoring. Since diabetes and its complications require continuous everyday monitoring of health related tests to administer medication, adjust the diet, update treatment plans and provide further interventions, the development of pervasive healthcare systems for the monitoring of diabetes patients is particularly valuable. The aim of this paper is to perform a systematic investigation of decision tree ensembles for the processing of these features. Thus the paper is devoted to an experimental investigation and comparison of the performance of various decision tree ensembles in a novel application to the processing of data from diabetes patients for the detection and monitoring of CAN.

To simulate data collected by the sensors, we used a data set derived from a large and comprehensive database created by the Diabetes Complications Screening Research Initiative (DiScRi) at Charles Sturt University and concentrated on the particular task of the detection and monitoring of CAN. This database is discussed in the next section.

Here let us only briefly mention that many of the parameters recorded in the DiScRi database can now be collected using novel mobile health monitoring systems like MobiSense. Using state-of-the-art mobile phones, commercial-off-the-shelf (COTS) wearable bio-sensors and resource-aware, energy-efficient techniques for real-time ECG analysis, diverse features of ECG recordings can be obtained (Sinha, Tayebi, Krishnaswamy, Waluyo, & Gaber, 2011; Tayebi, Krishnaswamy, Waluyo, Sinha, & Gaber, 2011).

The results of the present paper demonstrate that the novel application of decision tree ensembles for the detection and monitoring of CAN achieves substantially better performance parameters compared with the outcomes obtained previously in the literature by Huda, Jelinek, Ray, Stranieri, and Yearwood (2010).

DIABETES COMPLICATIONS SCREENING RESEARCH INITIATIVE

The data set used in our article and published online is derived from a large database of test results and health-related parameters collected at the Diabetes Complications Screening Research Initiative, DiScRi, organized at Charles Sturt University. It was used, for example, by Cornforth and Jelinek (2007) and Huda, Jelinek, Ray, Stranieri, and Yearwood (2010). In order to investigate machine learning algorithms and the attributes that have to be collected for the monitoring of diabetes patients, it makes sense to start by using large data sets already collected in this area. There are no other data sets containing comparable collections of test outcomes. Parameters recorded in the DiScRi database can be extracted from ECG data using state-of-the-art mobile phones, commercial-off-the-shelf (COTS) wearable bio-sensors and resource-aware, energy-efficient real-time ECG analysis techniques, and can now be routinely collected using MobiSense, discussed in the preceding section. The most important set of features recorded for CAN determination is the Ewing battery, originally considered by Ewing, Campbell, and Clarke (1980) and Ewing, Martyn, Young, and Clarke (1985). There are five Ewing tests in the battery: changes in heart rate associated with lying to standing, deep breathing and the Valsalva manoeuvre, and changes in blood pressure associated with hand grip and lying to standing. In addition, features from the ten-second samples of 12-lead ECG for all participants were extracted from the database.

The DiScRi database has been actively investigated in the literature and many different questions have been considered. Cornforth and Jelinek (2007) applied several machine learning techniques and automated classification methods for building effective predictive models. The whole database contains over 200 features. In this paper we use exactly the same set of features as those considered by Huda, Jelinek, Ray, Stranieri, and Yearwood (2010). They can be collected using off-the-shelf mobile devices that monitor such parameters in real time. These devices are readily available for purchase on the internet and keep improving in performance every year. Let us refer to ECG for iPhone (2013), Polar RCX5 Heart Rate Monitor (2013), and Wireless iPhone, Android, iPad Heart Rate Chest Belt with Receiver (2013) for examples of such devices.

The DiScRi data contained missing values and five class labels: 'atypical', 'normal', 'definite', 'severe', and 'undefined'. Since the number of 'atypical' and 'severe' instances was relatively small, to obtain a file with two class labels we deleted all rows with the 'undefined' and 'atypical' classes for CAN, kept all instances with the 'normal' label, and combined all other subclasses of CAN into one 'definite' class for the detection of CAN. To handle the remaining missing values we could have applied effective data mining techniques recently developed by Qin, Zhang, and Zhang (2010), Williams, Soares, and Gilbert (2012) and Zhang (2010). Following advice from the experts maintaining the database, we managed to interpolate the missing values in the dataset with the small number of features selected for experiments by applying expert editing rules. These rules were collected during several discussions with the experts. Most of them invoke interpolation based on the fact that once a certain medical condition occurs, it seldom vanishes, and so the value of the corresponding feature can be filled in between any two known instances. The application of expert editing eliminated almost all missing values, and it turned out possible to proceed without further applications of advanced and specialized data mining techniques for handling them.
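
To make the expert editing rule concrete, the following is a minimal illustrative sketch in Python/pandas, not the authors' actual preprocessing code; the column names (patient_id, visit_date, neuropathy_flag) and the sample records are hypothetical placeholders.

```python
import pandas as pd

# Hedged sketch of the expert editing rule described above: a missing value is
# filled only when it lies between two known observations that agree, on the
# assumption that an established condition seldom vanishes. Column names are
# hypothetical placeholders, not the actual DiScRi field names.
def fill_between_known(df, col):
    def _fill(group):
        g = group.sort_values("visit_date").copy()
        before = g[col].ffill()   # last known value before the gap
        after = g[col].bfill()    # next known value after the gap
        agree = g[col].isna() & (before == after)
        g.loc[agree, col] = before[agree]
        return g
    return df.groupby("patient_id", group_keys=False).apply(_fill)

records = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2008-01-10", "2009-02-11", "2010-03-12", "2009-05-01", "2010-06-02"]),
    "neuropathy_flag": [1.0, None, 1.0, 0.0, None],
})
print(fill_between_known(records, "neuropathy_flag"))
```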


CARDIAC AUTONOMIC NEUROPATHY

Cardiac autonomic neuropathy (CAN) is a condition associated with damage to the autonomic nervous system innervating the heart and is highly prevalent in people with diabetes. It has been considered by many authors, including clinicians and engineers, for example Ewing, Campbell, and Clarke (1980), Ewing, Martyn, Young, and Clarke (1985), and Khandoker, Jelinek, and Palaniswami (2009).

Ewing, Campbell, and Clarke (1980) studied seventy-three diabetics (62 males and 11 females) with symptoms of CAN. The subjects were followed for up to five years. Thirty patients presented with impotence alone, while the other 43 presented with one or more of the following: postural hypotension, intermittent diarrhea, hypoglycemic unawareness, sweating abnormalities and gastric fullness. Tests from the Ewing battery were used (responses to the Valsalva manoeuvre and sustained handgrip). Most subjects with impotence alone had normal autonomic function tests, whereas the majority with other symptoms had abnormal tests. Twenty-six subjects (20 males and six females) died during the follow-up period. Of the 33 with initially normal tests in the Ewing battery, five (15 per cent) died, whereas of the 40 with initially abnormal tests, 21 (53 per cent) died. Diabetics with symptoms of CAN and abnormal tests from the Ewing battery had a calculated mortality rate of 44 per cent after two-and-a-half years and 56 per cent after five years. Half the deaths in those with abnormal tests were from renal failure, and the remainder were either sudden and unexpected, or from other causes which may have been associated with the autonomic neuropathy. Testing repeated during the follow-up period showed that some normal tests later became abnormal, but once tests were abnormal, they usually remained abnormal. The results show that symptoms of CAN, particularly postural hypotension, gastric symptoms and hypoglycemic unawareness, together with abnormal tests from the Ewing battery, carry a very poor prognosis. Diarrhea and impotence, on their own, cannot be relied on as symptoms of CAN. Testing using simple cardiac reflexes from the Ewing battery of tests gave a good guide to the prognosis of CAN.

A follow-up study by Ewing, Martyn, Young and Clarke (1985) investigated 543 diabetic subjects completing all five noninvasive cardiac reflex tests from the Ewing battery to assess CAN. Abnormalities of heart rate tests occurred in 40%, while abnormal blood pressure tests occurred in less than 20%. Their results were grouped as normal (39%), early (15%), definite (18%), and severe (22%) involvement. Six percent had an atypical pattern of results. Two hundred and thirty-seven diabetic subjects had the tests repeated three or more months apart: 26% worsened, 71% were unchanged, and only 3% improved. The worsening followed a sequential pattern, with heart rate abnormalities appearing first and additional blood pressure abnormalities later. Comparison between a single test (heart rate response to deep breathing) and the full battery in 360 subjects showed that one test alone does not distinguish the degree or severity of autonomic damage.

Thus, the classification of disease progression associated with CAN is important, because it has implications for the planning of timely treatment, which can lead to improved well-being of the patients and a reduction in the morbidity and mortality associated with cardiac arrhythmias in diabetes. CAN is known as one of the causes of mortality among type 2 diabetes patients. The most important tests required for the identification of CAN evaluate patient responses in heart rate and blood pressure to various activities, consisting of the tests described by Ewing, Campbell, and Clarke (1980) and Ewing, Martyn, Young and Clarke (1985): lying to standing heart rate change (LSHR), deep breathing heart rate change (DBHR), Valsalva manoeuvre heart rate change (VAHR), hand grip blood pressure change (HGBP), and lying to standing blood pressure change (LSBP).

Khandoker, Jelinek, and Palaniswami (2009) have shown that early subclinical detection of CAN and intervention are of prime importance for risk stratification in preventing sudden death due to silent myocardial infarction. Their study presents the usefulness of heart rate variability (HRV) and complexity analyses from short-term ECG recordings as a screening tool for CAN. The application of data mining to classify CAN was investigated by Huda, Jelinek, Ray, Stranieri and Yearwood (2010), who explored a novel approach to finding features that can be used for the detection of CAN. They studied MR-ANNIGMA, a hybrid of the Maximum Relevance (MR) filter and the Artificial Neural Net Input Gain Measurement Approximation (ANNIGMA) wrapper approaches. The combined heuristics in the hybrid MR-ANNIGMA exploit the complementary advantages of both filter and wrapper heuristics, and the method has been shown to be able to find significant features.

In this paper we used exactly the same set of features as those considered by Huda, Jelinek, Ray, Stranieri and Yearwood (2010): the traditional attributes of the Valsalva manoeuvre (VAHR), deep breathing (DBHR), and hand-grip (HGBP) tests, and QRS width, which has also been shown to be indicative of CAN in the research by Fang, Prins, and Marwick (2004) that presented evidence associating diabetes with a cardiomyopathy. It was demonstrated that metabolic disturbances, myocardial fibrosis, small vessel disease, cardiac autonomic neuropathy, and insulin resistance may all contribute to the development of diabetic heart disease. Their work clarified possible mechanisms responsible for diabetic cardiomyopathy and evaluated the evidence associating diabetes with heart failure, including clinical studies confirming the association of diabetes with left ventricular dysfunction independent of hypertension, coronary artery disease, and other heart disease, and experimental evidence of myocardial structural and functional changes.

BASE CLASSIFIERS USING DECISION TREES

In medical applications of data mining it is important to consider classifiers producing models that can be expressed in a clear form facilitating their application in clinical practice. The present article concentrates on the investigation of various decision tree ensembles based on the following decision trees: ADTree, J48, NBTree, RandomTree, REPTree, and SimpleCart.

Let us briefly recall that ADTree is a classifier generating an alternating decision tree for two-class problems, using optimized induction and heuristic search methods to speed up learning, as explained by Freund and Mason (1999). Alternating decision trees are a common generalization of decision trees, voted decision trees and voted decision stumps. At the same time, classifiers of this type are relatively easy to interpret. ADTree generates rules that are usually smaller in size and thus easier to interpret, and these rules yield a natural measure of classification confidence.

J48 builds a pruned or unpruned decision tree based on the well-known C4.5 algorithm developed by Ross Quinlan, extending the earlier ID3 algorithm. C4.5 uses the concept of information entropy to build decision trees from a set of training data in the same way as ID3. At each node of the tree, C4.5 chooses the attribute that splits the set of samples most effectively. The splitting criterion is the normalized information gain, i.e., the reduction in entropy achieved by the split. The attribute with the highest normalized information gain is chosen to create the next split.
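
As an illustration of this splitting criterion, here is a small Python sketch of entropy and the normalized information gain (gain ratio) for a candidate split; it is a restatement of the idea, not the WEKA J48 source, and the toy labels are made up.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution, in bits.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, split_groups):
    # Information gain of the split, normalized by the split information,
    # as in the C4.5 criterion described above.
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in split_groups)
    gain = entropy(labels) - conditional
    split_info = entropy(np.concatenate(
        [np.full(len(g), i) for i, g in enumerate(split_groups)]))
    return gain / split_info if split_info > 0 else 0.0

# Toy example: class labels before the split and after a two-way split.
y = np.array(["CAN", "CAN", "normal", "normal", "normal", "CAN"])
print(entropy(y), gain_ratio(y, [y[:3], y[3:]]))
```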

NBTree is a classifier generating a decision tree with naive Bayes classifiers at the leaves (Kohavi, 1996). Naive Bayes induction algorithms are surprisingly accurate for small samples even in situations where the conditional independence assumption does not hold. However, the accuracy of naive Bayes does not scale up as well as that of decision trees. NBTree creates a hybrid of decision trees and naive Bayes classifiers, where the leaves of a traditional decision tree contain naive Bayes classifiers. This approach combines the advantages of naive Bayes and decision trees.

RandomTree constructs a tree with randomly chosen attributes at each node, employing simple pre-pruning that stops at a fixed depth (Witten & Frank, 2011).

REPTree considers all attributes and builds a decision tree using information gain. It prunes the tree using reduced-error pruning with backfitting. The values of numeric attributes are sorted only once. Missing values are handled in the same way as in the C4.5 algorithm, by splitting the instances into pieces (Witten & Frank, 2011).

SimpleCart creates a tree utilizing a heuristic for binary splits and applying minimal cost-complexity pruning (Breiman et al., 1984). To deal with missing values, fractional instances are introduced instead of a surrogate split.
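
To connect these base learners with the interpretability requirement mentioned above, the following is a minimal sketch using scikit-learn's CART-style DecisionTreeClassifier as a stand-in for SimpleCart (the study itself used the WEKA implementations); the data and the feature names are synthetic placeholders, not DiScRi values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: four numeric features and a binary CAN label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # 1 = CAN, 0 = normal

# Cost-complexity pruning keeps the tree small enough to read as rules.
cart = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(export_text(cart, feature_names=["VAHR", "DBHR", "HGBP", "QRS_width"]))
```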

DECISION TREE ENSEMBLES

Ensemble techniques are very well known in data mining and artificial intelligence. To furnish just a few recent examples, let us point out that effective ensembles were utilized by Dazeley et al. (2010) and Yearwood et al. (2009) for the study of phishing, and by Kang et al. (2006) and Yearwood et al. (2008, 2009) for the study of DNA data. Our experiments in this paper on CAN classification are devoted to the investigation and comparison of the performance of the following ensemble techniques: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost and Stacking.

Bagging, also known as bootstrap aggregating, generates a collection of new sets by resampling the given training set at random and with replacement. These sets are called bootstrap samples. New classifiers are then trained, one for each of these new training sets, and are amalgamated via a majority vote (Breiman, 1996). Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class.
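
The following is a minimal Python sketch of this procedure with decision trees as base learners, an illustration only (the study used WEKA's Bagging meta learner); X_train, y_train and X_test are assumed to be already prepared NumPy arrays with binary 0/1 labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    # Train one tree per bootstrap sample and combine them by majority vote.
    rng = np.random.default_rng(seed)
    votes = np.zeros((n_estimators, len(X_test)), dtype=int)
    for m in range(n_estimators):
        # Bootstrap sample: draw |training set| instances with replacement.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        tree = DecisionTreeClassifier(random_state=m)
        tree.fit(X_train[idx], y_train[idx])
        votes[m] = tree.predict(X_test)
    # Majority vote over the ensemble members (binary 0/1 labels assumed).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```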

AdaBoost trains several classifiers in succession. Each subsequent classifier is trained on the instances that have turned out to be more difficult for the preceding classifiers. To this end all instances are assigned weights, and if an instance turns out difficult to classify, then its weight increases. We used the highly successful AdaBoost classifier described by Freund and Schapire (1996). Theoretically, AdaBoost can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. Freund and Schapire (1996) also introduced the related notion of a “pseudo-loss”, which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate.
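
A hedged scikit-learn analogue of boosting a decision tree is sketched below (the study used the WEKA AdaBoost implementation); X_train and y_train are assumed to be prepared arrays, and the parameter values are illustrative only.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Boosting a shallow tree: instances misclassified in earlier rounds receive
# larger weights, so later trees concentrate on the harder cases.
boosted = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),  # base decision tree
    n_estimators=50,
    random_state=0,
)
# boosted.fit(X_train, y_train) would train the boosted ensemble.
```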

MultiBoost extends the approach of AdaBoost with the wagging technique (Webb, 2000). MultiBoosting is an extension of the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness the high bias and variance reduction achieved by AdaBoost together with the superior variance reduction accomplished by wagging. Wagging is a variant of bagging where the weights of training instances generated during boosting are utilized in the selection of the bootstrap samples (Bauer & Kohavi, 1999).

Stacking is a meta classifier that can be regarded as a generalization of voting, where a meta-learner aggregates the outputs of several base classifiers, as explained by Wolpert (1992). Stacking minimizes the error rate of one or more base classifiers and also deduces the biases of the classifiers with respect to a learning set. This deduction is followed by a second stage in which the guesses of the original classifiers are used as inputs, and the output is the desired correct guess.
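
A scikit-learn sketch of the same idea is shown below (the study used WEKA's Stacking); the choice of two differently configured trees and a logistic regression meta learner is illustrative, not the paper's configuration.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# The meta learner is trained on cross-validated predictions of the base
# classifiers, i.e. on their "guesses" rather than on the raw features.
stack = StackingClassifier(
    estimators=[
        ("shallow_tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("leafy_tree", DecisionTreeClassifier(min_samples_leaf=10, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
# stack.fit(X_train, y_train) would train both levels.
```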

Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (Decorate), developed by Melville and Mooney (2005), constructs special artificial training examples to build diverse ensembles of classifiers. A comprehensive collection of tests has established that Decorate consistently creates ensembles more accurate than the base classifier, Bagging, and Random Forests; these ensembles are also more accurate than Boosting on small training sets and comparable to Boosting on larger training sets.

Dagging is useful in situations where the base classifiers are slow. It divides the training set into a collection of disjoint (and therefore smaller) stratified samples, trains copies of the same base classifier on them, and averages their outputs using a vote (Ting & Witten, 1997). This ensemble technique is useful for base classifiers that are quadratic or worse in time behavior with respect to the number of instances in the training data.
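
A minimal sketch of dagging in Python terms is given below (the study used WEKA's Dagging); X_train, y_train and X_test are assumed to exist as NumPy arrays with binary 0/1 labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def dagging_predict(X_train, y_train, X_test, n_chunks=4):
    # Each "test" index set of StratifiedKFold is one disjoint stratified chunk.
    skf = StratifiedKFold(n_splits=n_chunks, shuffle=True, random_state=0)
    votes = []
    for _, chunk_idx in skf.split(X_train, y_train):
        member = DecisionTreeClassifier(random_state=0)
        member.fit(X_train[chunk_idx], y_train[chunk_idx])
        votes.append(member.predict(X_test))
    # Combine the members by majority vote (binary 0/1 labels assumed).
    return (np.asarray(votes).mean(axis=0) >= 0.5).astype(int)
```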

Grading trains meta-classifiers, which grade the output of the base classifiers as correct or wrong labels. The graded outcomes are then combined (Seewald & Fuernkranz, 2001). Grading is a meta-classification technique that tries to identify incorrect predictions at the base level. It grades the predictions of the base classifiers by marking them as correct or incorrect. For each base classifier, one meta classifier is trained to predict when the base classifier will make an error.

AdaBoost of Bagging and MultiBoost of Bagging are new combined multi-level decision tree ensembles where boosting is applied on top of bagging based on decision trees. They belong to the well-known and broad area of multi-tier constructions of classifiers considered, for instance, by Islam and Abawajy (2013), Islam, Abawajy, and Warren (2009), and Raahemi and Mumtaz (2010).
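
In scikit-learn terms the two-level construction can be sketched as follows (the authors used the WEKA implementations, and the parameter values here are illustrative, not the paper's settings):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# First level: bagging of decision trees.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, random_state=0)

# Second level: boosting applied on top of the bagged ensemble,
# i.e. the "AdaBoost of Bagging" construction described above.
adaboost_of_bagging = AdaBoostClassifier(
    bagged_trees, n_estimators=10, random_state=0)
# adaboost_of_bagging.fit(X_train, y_train) would train the multi-level ensemble.
```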

EXPERIMENTAL RESULTS

Our experiments used combinations of algorithms implemented in the Waikato Environment for Knowledge Analysis (WEKA) presented by Hall, Frank, Holmes, Pfahringer, Reutemann and Witten (2009), and Witten and Frank (2011). WEKA is a free suite of machine learning software written in Java and available online under the GNU General Public License. WEKA is distributed with an archive, weka-src.jar, that contains the complete source code of the whole system. We used the Simple CLI command line interface to run all algorithms.

The dataset used in our experiments was recorded in the WEKA Attribute-Relation File Format (ARFF). An ARFF file contains two sections: the header and the data section. The first line of the header supplies the relation name, followed by the list of the attributes. Each attribute is associated with a unique name and a type. The type describes the kind of data contained in the attribute and what values it can have; the attribute types are numeric, nominal, string and date. The class attribute is by default the last one in the list. After the header comes the data itself, where each line stores the attribute values of a single instance, separated by commas.
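
For illustration, the snippet below builds a tiny ARFF file in this layout and reads it with SciPy's ARFF loader; the attribute names and values are placeholders, not the actual DiScRi fields.

```python
import io
from scipy.io import arff

arff_text = """\
@relation can_screening
@attribute DBHR numeric
@attribute VAHR numeric
@attribute HGBP numeric
@attribute QRS_width numeric
@attribute class {normal,definite}
@data
15.2,1.21,16.0,88.0,normal
6.4,1.05,9.0,104.0,definite
"""

data, meta = arff.loadarff(io.StringIO(arff_text))
print(meta)            # attribute names and types parsed from the header
print(data["class"])   # the class attribute, last in the attribute list
```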

All our experiments used 10-fold cross-validation as a standard measure to prevent overfitting of models. Ten-fold cross-validation is a standard technique for preventing overfitting and assessing how well the results obtained by a classifier will generalize to an independent data set. It is mainly used in settings where the goal is prediction, to estimate how accurately a predictive model will perform in practice. The sample is partitioned into ten disjoint and approximately equal subsets. In each of the ten rounds one of the subsets is used as a hold-out testing set, also called a validating set, whereas the nine remaining subsets are combined into a training set. The model is trained on the training set and its performance is evaluated on the testing set. Ten rounds are performed to reduce variability, and the validation results are averaged over the rounds.
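
A hedged sketch of this protocol, with a scikit-learn tree standing in for the WEKA classifiers, is shown below; X and y are assumed to be the prepared feature matrix and class labels, and stratification of the folds is an assumption of this sketch.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Ten rounds: each fold is held out once for testing, the other nine train.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean())   # ROC area averaged over the ten held-out folds
```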

Our experiments used standard measures of performance: accuracy, precision, recall and ROC area. These measures of performance are associated with each other, and in our experiments all of them behaved consistently: for our dataset, an algorithm that was better in one metric was also better with respect to all other measures of performance. We include diagrams with the ROC area for all outcomes of our experiments in this text, since this measure is the one most often used in the literature devoted to medical applications of data mining and machine learning. The appendix of this article, published online as a supplement, contains complete WEKA output files for all experiments, where the readers can view all other parameters.

For the convenience of the readers, let us include a brief overview of these measures here. An ROC curve is a plot of True Positive Rates against False Positive Rates. The False Positive Rate is the ratio of false positive results to all negative samples. The Area Under Curve, or ROC area, can be interpreted as the probability that the classifier ranks a randomly chosen positive instance above a randomly chosen negative one.

The accuracy of a classifier is the percentage of all instances classified correctly. It is equal to the probability that a prediction of the classifier for an individual instance is correct. Precision is the ratio of true positives to combined true and false positives. Recall is the ratio of true positives to the number of all positive samples (i.e., to the combined true positives and false negatives). Sensitivity is the proportion of positives (patients with CAN) that are identified correctly. Specificity is the proportion of negatives (patients without CAN) that are identified correctly. Sensitivity is also called the True Positive Rate. The False Positive Rate is equal to 1 - specificity.

In assessing the performance of classifiers, precision and recall refer to their weighted average values. This means that they are calculated for each class separately, and a weighted average is then found. For instance, looking at the class of patients with CAN, the precision is the ratio of the number of patients correctly identified as having CAN to the number of all patients identified as having CAN. The recall calculated for the class of patients with CAN is equal to the sensitivity of the whole classifier. For the cohort of patients without CAN, the precision is the ratio of the number of patients correctly identified as having no CAN to the number of all patients identified as free from CAN. The precision of the classifier as a whole is a weighted average of its precisions for these classes.

For the class of patients with CAN, the recall is the ratio of the number of patients correctly identified as having CAN to the number of all patients with CAN. For the cohort of patients without CAN, the recall is the ratio of the number of patients correctly identified as being free from CAN to the number of all patients without CAN. The recall of the classifier is a weighted average of its recalls for both classes.
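
The snippet below computes these measures on a hypothetical set of labels (1 = CAN, 0 = no CAN) and classifier scores, as a sketch of the definitions above rather than the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])     # hypothetical ground truth
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])     # hypothetical predictions
y_score = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.3, 0.7, 0.6])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)    # recall for the CAN class
specificity = tn / (tn + fp)    # recall for the non-CAN class
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred, average="weighted"),
      recall_score(y_true, y_pred, average="weighted"),
      sensitivity, specificity,
      roc_auc_score(y_true, y_score))   # area under the ROC curve
```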

First, we used 10-fold cross-validation to assess the performance of ADTree, J48, NBTree, RandomTree, REPTree and SimpleCart for the diabetes patients in our dataset. Experimental results comparing the performance of these classifiers are given in Figure 1. They demonstrate that SimpleCart achieved the best performance, with an ROC area of 0.947, compared with the outcomes of ADTree, J48, NBTree, RandomTree, and REPTree.

Second, our experiments compared decision tree ensembles based on ADTree and created by applying the following meta learners: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking and two multi-level combinations of AdaBoost and MultiBoost with Bagging. The experimental results comparing the performance of these ensemble classifiers based on ADTree are presented in Figure 2.

Third, we performed experiments comparing decision tree ensembles based on J48 and created by applying the following meta learners: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking and two multi-level combinations of AdaBoost and MultiBoost with Bagging based on J48. The experimental results comparing the performance of these ensemble classifiers based on J48 are presented in Figure 3.

Fourth, we performed experiments comparing decision tree ensembles based on NBTree and created by applying the following meta learners: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking and two multi-level combinations of AdaBoost and MultiBoost with Bagging based on NBTree. The experimental results comparing the performance of these decision tree ensembles based on NBTree are presented in Figure 4.

Fifth, we conducted experiments comparing decision tree ensembles based on RandomTree and created by applying the following meta learners: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking and two multi-level combinations of AdaBoost and MultiBoost with Bagging based on RandomTree. The experimental results comparing the performance of these decision tree ensembles based on RandomTree are presented in Figure 5.

Figure 1. ROC area of decision tree classifiers for diabetes patients

Figure 2. ROC area of decision tree ensembles based on ADTree

Figure 3. ROC area of decision tree ensembles based on J48

Figure 4. ROC area of decision tree ensembles based on NBTree

We have also performed experiments comparing decision tree ensembles based on REPTree and created by the same meta learners. The experimental results comparing the performance of these decision tree ensembles based on REPTree are presented in Figure 6. The best outcome, with an ROC area of 0.984, was achieved by the decision tree ensemble generated by Decorate based on RandomTree.

We have also carried out experiments comparing decision tree ensembles based on SimpleCart and created by applying the following meta learners: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking and two multi-level combinations of AdaBoost and MultiBoost with Bagging based on SimpleCart. The experimental results comparing the performance of these decision tree ensembles based on SimpleCart are presented in Figure 7.

Figure 5. ROC area of decision tree ensembles based on RandomTree

Figure 6. ROC area of decision tree ensembles based on REPTree

Figure 7. ROC area of decision tree ensembles based on SimpleCart

DISCUSSION

Our experiments presented in this paper investigated and compared the effectiveness of decision tree ensembles based on ADTree, J48, NBTree, RandomTree, REPTree, and SimpleCart. Among these base decision tree classifiers, SimpleCart achieved the best performance with an ROC area of 0.947. Further, we performed a thorough empirical study comparing several ensemble methods in their ability to generate decision tree ensembles based on the decision trees listed above. We investigated the following ensemble methods: AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking, and two multi-level combinations of AdaBoost and MultiBoost with Bagging for the processing of data from diabetes patients for the monitoring of CAN. The best outcome, with an ROC area of 0.984, was achieved by the decision tree ensemble generated by Decorate based on RandomTree. We used the Wilcoxon signed rank test to compare the effectiveness of ensemble methods for improving the performance of the decision trees over the set of all experiments. The Wilcoxon signed rank test concluded that the multi-level combination of AdaBoost and Bagging outperformed the other ensembles, with the following p-values: 0.018, 0.172, 0.109, 0.016, 0.109, 0.016, 0.156, 0.016, 0.053 in comparison with the use of no ensemble and the uses of AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking, and MultiBoost of Bagging, respectively.
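
For readers who wish to reproduce this kind of comparison, a minimal sketch of a paired one-sided Wilcoxon signed rank test is given below; the ROC areas are made-up placeholders, not the values from our experiments.

```python
from scipy.stats import wilcoxon

# Paired ROC areas of two ensemble methods over the same experiments
# (placeholder numbers for illustration only).
roc_adaboost_of_bagging = [0.970, 0.960, 0.980, 0.950, 0.970, 0.960]
roc_plain_bagging       = [0.952, 0.948, 0.965, 0.941, 0.956, 0.950]

stat, p = wilcoxon(roc_adaboost_of_bagging, roc_plain_bagging,
                   alternative="greater")   # one-sided alternative
print(p)
```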

Other techniques turned out to be less effective. Dagging benefits mainly base classifiers of high complexity, since it uses disjoint stratified training sets to create an ensemble; our experiments show that decision trees are fast enough, so this benefit is not essential here. Stacking and Grading use a meta classifier to combine the outcomes of base classifiers. These methods are best applied to combine diverse collections of base classifiers. In this paper we were interested only in decision trees as base classifiers, and so Stacking and Grading turned out to be less effective.

DiScRi is a very large and unique data set containing a comprehensive collection of tests related to CAN. Our new results show that the decision tree ensembles achieved substantially higher accuracies and other performance parameters compared with the previous outcomes obtained by Huda, Jelinek, Ray, Stranieri, and Yearwood (2010) and can be used in conjunction with remote sensing technology. Notice that Huda, Jelinek, Ray, Stranieri, and Yearwood (2010) did not use ten-fold cross-validation. Overall, the outcomes of the present paper also compare well with recent results obtained for other data sets using different methods, for example, by Jelinek, Khandoker, Palaniswami, and McDonald (2010), Jelinek, Rocha, Carvalho, Goldenstein, and Wainer (2011), Kelarev, Kang, and Steane (2006), and Yearwood, Bagirov, and Kelarev (2012).

CONCLUSION

Comparing the effectiveness of ensemble methods for improving the performance of the decision trees on the basis of these experiments, the Wilcoxon signed rank test concluded that the multi-level combination of AdaBoost and Bagging outperformed the other ensembles, with the one-sided alternative hypotheses yielding p-values of 0.018, 0.172, 0.109, 0.016, 0.109, 0.016, 0.156, 0.016, 0.053 in comparison with the use of no ensemble and the uses of AdaBoost, Bagging, Dagging, Decorate, Grading, MultiBoost, Stacking, and MultiBoost of Bagging, respectively. In a single test, the best outcome, with an ROC area of 0.984, was achieved by the decision tree ensemble generated by Decorate based on RandomTree. Therefore Decorate and the AdaBoost of Bagging can be recommended for practical implementations of pervasive systems monitoring diabetes patients. On the other hand, the base decision tree classifiers also performed very well and, since they consume less energy, can be employed for real-time preliminary assessment of sensor data in situations where the level of energy consumption remains a concern. SimpleCart achieved the best ROC area of 0.947 among all base decision tree algorithms.

ACKNOWLEDGMENT

The authors are grateful to three referees for comments and corrections that have helped to improve the text, and for suggesting several interesting directions for future research work. H. F. Jelinek is on leave from Charles Sturt University. The source codes, data, screenshots and complete outputs of all tests can be downloaded from the IJDWM website at http://users.monash.edu/~dtaniar/IJDWM

REFERENCES

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–139. doi:10.1023/A:1007515423169


Biswas, J., Jayachandran, M., Shue, L., Xiao, W., & Yap, P. (2007). An extensible system for sleep activity pattern monitoring. In Proceedings of the Third International Conference on Intelligent Sensors, Sensor Networks and Information (ISSNIP 2007) (pp. 561-565). IEEE Xplore Digital Library. doi: 10.1109/ISSNIP.2007.4496904.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140. doi:10.1007/BF00058655

Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

Chen, X., Waluyo, A., Pek, I., & Yeoh, W. S. (2010). Mobile middleware for wireless body area network. In Proceedings of the 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC2010) (pp. 5504-5507). doi: 10.1109/IEMBS.2010.5626575.

Cornforth, D., & Jelinek, H. F. (2007). Automated classification reveals morphological factors associated with dementia. Applied Soft Computing, 8, 182–190. doi:10.1016/j.asoc.2006.10.015

Dazeley, R., Yearwood, J., Kang, B., & Kelarev, A. (2010). Consensus clustering and supervised classification for profiling phishing emails in internet commerce security. In Proceedings of Knowledge Management and Acquisition for Smart Systems and Services (PKAW 2010), Lecture Notes in Computer Science, 6232, 235-246. doi: 10.1007/978-3-642-15037-1_20.

ECG for iPhone (2013). ECG app for phone. Retrieved January 21, 2013, from www.alibaba.com/product-gs/521812907/new_product_ECG_for_Iphone_under.html

Ewing, D., Campbell, J., & Clarke, B. (1980). The natural history of diabetic autonomic neuropathy. The Quarterly Journal of Medicine, 49, 95–100. PMID:7433630

Ewing, D., Martyn, C., Young, R., & Clarke, B. (1985). The value of cardiovascular autonomic function tests: 10 years experience in diabetes. Diabetes Care, 8, 491–498. doi:10.2337/diacare.8.5.491 PMID:4053936

Fang, Z., Prins, J., & Marwick, T. (2004). Diabetic cardiomyopathy: evidence, mechanisms, and therapeutic implications. Endocrine Reviews, 25, 543–567. doi:10.1210/er.2003-0012 PMID:15294881

Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In Proceedings of the 16th International Conference on Machine Learning (pp. 124-133). ACM Digital Library. doi: 10.1.1.116.2945.

Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of 13th International Conference on Machine Learning (pp. 148-156). doi: 10.1.1.51.6252.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18. doi:10.1145/1656274.1656278

Huda, S., Jelinek, H. F., Ray, B., Stranieri, A., & Yearwood, J. (2010). Exploring novel features and decision rules to identify cardiovascular autonomic neuropathy using a hybrid of wrapper-filter based feature selection. In Proceedings of the Sixth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2010) (pp. 297-302). IEEE Xplore Digital Library. doi: 10.1109/ISSNIP.2010.5706769.

Islam, R., & Abawajy, J. (2013). A multi-tier phishing detection and filtering approach. Journal of Network and Computer Applications, 36(1), 324–335. doi:10.1016/j.jnca.2012.05.009

Islam, R., Abawajy, J., & Warren, M. (2009). Multi-tier phishing email classification with an impact of classifier rescheduling. In Proceedings of 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN 2009) (pp. 789-793). doi: 10.1109/I-SPAN.2009.142.

Jelinek, H. F., Khandoker, A., Palaniswami, M., & McDonald, S. (2010). Heart rate variability and QT dispersion in a cohort of diabetes patients. Computers in Cardiology, 37, 613–616.

Jelinek, H. F., Rocha, A., Carvalho, T., Goldenstein, S., & Wainer, J. (2011). Machine learning and pattern classification in identification of indigenous retinal pathology. In Proceedings of the Annual International IEEE Conference of the Engineering in Medicine and Biology Society (pp. 5951-5954). doi: 10.1109/IEMBS.2011.6091471.

Kang, B., Kelarev, A., Sale, A., & Williams, R. (2006). A new model for classifying DNA code inspired by neural networks and FSA. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, Advances in Knowledge Acquisition and Management (AI06), Lecture Notes in Computer Science (Vol. 4303, pp. 187-198). doi: 10.1007/11961239_17.

Kelarev, A., Kang, B., & Steane, D. (2006). Clustering algorithms for ITS sequence data with alignment metrics. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, Advances in Artificial Intelligence (AI06), Lecture Notes in Artificial Intelligence (Vol. 4304, pp. 1027-1031). doi: 10.1007/11941439_116.


Khandoker, A., Jelinek, H. F., & Palaniswami, M. (2009). Identifying diabetic patients with cardiac autonomic neuropathy by heart rate complexity analysis. BioMedical Engineering OnLine, 8(3), 1-12. doi: 10.1186/1475-925X-8-3. Retrieved January 21, 2013, from http://www.biomedical-engineering-online.com/content/8/1/3

Kohavi, R. (1996). Scaling up the accuracy of Naive-Bayes classifiers: A decision-tree hybrid. In Proceedings of the Second International Confer-ence on Knowledge Discovery and Data Mining (pp. 202-207).

Melville, P., & Mooney, R. (2005). Creating diversity in ensembles using artificial data. Information Fusion, 6, 99–111. doi:10.1016/j.inffus.2004.04.001

Nealon, J., Rahayu, W., & Pardede, E. (2009). Improving clinical data warehouse performance via a windowing data structure architecture. In Proceedings of International Conference on Computational Science and Its Applications (ICCSA09) (pp. 243-253). IEEE Xplore Digital Library. doi: 10.1109/ICCSA.2009.23.

Pek, I., Waluyo, A., Yeoh, W.-S., & Chen, X. (2009). Motion-based wake-up scheme for ambulatory monitoring in wireless body sensor networks. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC2009) (pp. 2454-2457). doi: 10.1109/IEMBS.2009.5334702.

Polar RCX5 heart rate monitor. (2013). Retrieved January 21, 2013, from www.fitshop.com.au/Polar Triathlon Heart Rate Monitors/Polar RCX5

Qin, Y., Zhang, S., & Zhang, C. (2010). Combining kNN imputation and bootstrap calibrated: Empirical likelihood for incomplete data analysis. International Journal of Data Warehousing and Mining, 6(4), 1–13. doi:10.4018/jdwm.2010100104

Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Raahemi, B., & Mumtaz, A. (2010). Classification of peer-to-peer traffic using a two-stage window-based classifier with fast decision tree and IP layer attributes. International Journal of Data Warehousing and Mining, 6(3), 1–15. doi:10.4018/jdwm.2010070103

Seewald, A., & Fuernkranz, J. (2001). An evaluation of grading classifiers. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science (Vol. 2189, pp. 115-124). doi:10.1007/3-540-44816-0_12.

Sinha, A., Tayebi, H., Krishnaswamy, S., Waluyo, A., & Gaber, M. (2011). Resource-aware ECG analysis on mobile devices. In Proceedings of the 2011 ACM Symposium on Applied Computing (SAC11) (pp. 1012-1013). ACM Digital Library. doi:10.1145/1982185.1982407.

Song, Y., Xiao, W., Waluyo, A., Chen, X., & Wu, J. (2008). A service-specific middleware for flexible deployment of wireless body area network applications. In Proceedings of the 2008 IEEE International Conference on Multimedia and Expo (pp. 1041-1044). IEEE Xplore Digital Library. doi: 10.1109/ICME.2008.4607616.

Tayebi, H., Krishnaswamy, S., Waluyo, A., Sinha, A., & Gaber, M. (2011). RA-SAX: Resource-aware symbolic aggregate approximation for mobile ECG analysis. In Proceedings of the 12th IEEE International Conference on Mobile Data Management (MDM11) (Vol. 1, pp. 289-290). IEEE Xplore Digital Library. doi: 10.1109/MDM.2011.67.

Ting, K., & Witten, I. (1997). Stacking bagged and dagged models. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 367-375).

Waluyo, A., Pek, I., Yeoh, W.-S., Kok, T., & Chen, X. (2009). Footpaths: Fusion of mobile outdoor personal advisor for walking route and health fitness. In Proceedings of the 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC2009) (pp. 5155-5158). IEEE Xplore Digital Library. doi: 10.1109/IEMBS.2009.5332736.

Waluyo, A., Yeoh, W.-S., Pek, I., Yong, Y., & Chen, X. (2010). MobiSense: Mobile body sensor network for ambulatory monitoring. ACM Transactions on Embedded Computing Systems, 10, Art. 13. ACM Digital Library.

Webb, G. (2000). MultiBoosting: A technique for combining boosting and wagging. Machine Learning, 40, 159–196. doi:10.1023/A:1007659514849

Williams, P. K., Soares, C. V., & Gilbert, J. E. (2012). A clustering rule based approach for classification problems. International Journal of Data Warehousing and Mining, 8(1), 1–23. doi:10.4018/jdwm.2012010101

Wireless iPhone, Android, iPad heart rate chest belt with receiver. (2013). Retrieved January 21, 2013, from www.alibaba.com/product-gs/570468219/wireless_iPhone_Android_ipad_matched_heart.html


Witten, I., & Frank, E. (2011). Data mining: Practical machine learning tools and techniques. Amsterdam, Netherlands: Elsevier/Morgan Kaufmann.

Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–259. doi:10.1016/S0893-6080(05)80023-1

Yearwood, J., Bagirov, A., & Kelarev, A. (2012). Machine learning algorithms for analysis of DNA data sets. In S. Kulkarni (Ed.), Machine learning algorithms for problem solving in computational applications: Intelligent techniques (pp. 47–58). Hershey, PA: IGI Global. doi:10.4018/978-1-4666-1833-6.ch004

Yearwood, J., Kang, B., & Kelarev, A. (2008). Experimental investigation of classification algorithms for ITS dataset. In Proceedings of the Pacific Rim Knowledge Acquisition Workshop (PKAW 2008) (pp. 262-272).

Yearwood, J., Webb, D., Ma, L., Vamplew, P., Ofoghi, B., & Kelarev, A. (2009). Applying clustering and ensemble clustering approaches to phishing profiling. In Proceedings of the 8th Australasian Data Mining Conference (AusDM 2009).

Yearwood, J. L., Kang, B. H., & Kelarev, A. V. (2009). Experimental investigation of three machine learning algorithms for ITS dataset. In Proceedings of the 1st International Conference on Future Generation Information Technology (FGIT ‘09) (Lecture Notes in Computer Science Volume 5899, pp. 308-316). doi: 10.1007/978-3-642-10509-8_34.

Zhang, S. (2010). Estimating semi-parametric missing values with iterative imputation. International Journal of Data Warehousing and Mining, 6(3), 1–10. doi:10.4018/jdwm.2010070101

Andrei Kelarev is the author of two books, a volume of refereed conference proceedings, and over 180 journal articles. Andrei has ten years of full-time teaching experience at the University of Wisconsin, the University of Nebraska, and the University of Tasmania, and has supervised two PhD students to completion. Andrei Kelarev was a Chief Investigator on a large Discovery grant from the Australian Research Council, has served on the program committees of several conferences, and has worked on many research grants at the University of Ballarat, Charles Sturt University, and Deakin University.

Jemal H. Abawajy is a Professor and the Director of the Parallel and Distributed Computing Lab at Deakin University, Australia. Prof. Abawajy is a senior member of the IEEE and has been a member of the organizing committees for over 100 international conferences, serving in various capacities including chair, general co-chair, vice-chair, best paper award chair, publication chair, session chair, and program committee member. Prof. Abawajy has published more than 200 refereed articles, supervised numerous PhD students to completion, and is on the editorial boards of many journals.

Andrew Stranieri is an Associate Professor and the Director of the Centre for Informatics and Applied Optimisation at the University of Ballarat. His research into cognitive models of argumentation and artificial intelligence was instrumental in modelling decision making in refugee law, copyright law, eligibility for legal aid, and sentencing. His research in health informatics spans data mining in health, complementary and alternative medicine informatics, telemedicine, and intelligent decision support systems. Andrew Stranieri is the author of over 120 peer-reviewed journal and conference articles and has published two books.


Herbert F. Jelinek is a Clinical Associate Professor with the Australian School of Advanced Medicine, Macquarie University, Sydney, Australia, and a member of the Centre for Research in Complex Systems, Charles Sturt University, Albury, Australia. Dr Jelinek is currently a visiting Associate Professor at Khalifa University of Science, Technology and Research, Abu Dhabi, UAE. Herbert Jelinek received the B.Sc. (Hons.) degree in human genetics from the University of New South Wales, Sydney, Australia, in 1984, followed by the Graduate Diploma in neuroscience from the Australian National University, Canberra, Australia, in 1986, and the Ph.D. degree in medicine from the University of Sydney, Sydney, Australia, in 1996. He is a member of the IEEE Biomedical Engineering Society and the Australian Diabetes Association.

