Expert Systems with Applications 39 (2012) 1474–1483

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Clinical charge profiles prediction for patients diagnosed with chronic diseases using Multi-level Support Vector Machine

Wei Zhong a,*, Rick Chow a, Jieyue He b

a Division of Mathematics and Computer Science, University of South Carolina Upstate, SC 29303, USA
b School of Computer Science and Engineering, Southeast University, Nanjing 210096, China

Article info

Keywords: Support Vector Machine; Classification problem; Multi-level clustering algorithm; Chronic disease and parallel algorithm

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.08.036

* Corresponding author. Tel.: +1 864 503 5785. E-mail address: [email protected] (W. Zhong).

Abstract

This research utilizes the national Healthcare Cost & Utilization Project (HCUP-3) databases to construct Support Vector Machine (SVM) classifiers to predict clinical charge profiles, including hospital charges and length of stay (LOS), for patients diagnosed with heart and circulatory disease, diabetes and cancer, respectively. Clinical charge profile predictions can provide relevant clinical knowledge for healthcare policy makers to effectively manage healthcare services and costs at the national, state, and local levels. Despite its solid mathematical foundation and promising experimental results, SVM is not favorable for large-scale data mining tasks since its training time complexity is at least quadratic in the number of samples. Furthermore, traditional SVM classification algorithms cannot build an effective SVM when different data distribution patterns are intermingled in a large dataset. In order to enhance SVM training for large, complex and noisy healthcare datasets, we propose the Multi-level Support Vector Machine (MLSVM), which organizes the dataset as clusters in a tree to produce better partitions for more effective SVM classification. The MLSVM model utilizes multiple SVMs, each of which learns the local data distribution patterns in a cluster efficiently. A decision fusion algorithm is used to generate an effective global decision that incorporates local SVM decisions at different levels of the tree. Consequently, MLSVM can handle complex and often conflicting data distributions in large datasets more effectively than single-SVM based approaches and multiple-SVM systems. Both the combined 5 × 2-fold cross validation F test and the independent test show that the classification performance of MLSVM is much superior to that of CVM, ACSVM and CSVM based on three popular performance evaluation metrics. In this work, CSVM and MLSVM are parallelized to speed up the slow SVM training process for very large and complex datasets. Running time analysis shows that MLSVM can accelerate SVM's training process noticeably when the parallel algorithm is employed.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Chronic diseases are among the leading causes of disability and death in the United States. This project focuses on the three most prevalent chronic diseases: heart and circulatory disease, diabetes and cancer. Chronic diseases account for 70% of deaths and approximately 78% of total healthcare spending. Despite dramatic improvements in therapies and treatments, the rate of chronic diseases has risen dramatically. The rising rate of chronic diseases is a crucial but frequently ignored contributor to rising medical expenditures. Current strategies to address the escalating costs in healthcare for chronic diseases are based on small and localized data sets. Healthcare models developed from such localized data sets are used by individual healthcare systems to compare costs and to apply cost avoidance/reduction protocols. Typically, only local benchmarks are used in these models, reducing their applicability to the larger and more general population (Breault, Goodall, & Fos, 2002). These localized approaches for predicting comprehensive costs and outcomes within a single healthcare system often fail to produce valid and robust results at the national level. In contrast, this research utilizes the national Healthcare Cost & Utilization Project (HCUP-3) databases (http://www.ahrq.gov/data/hcup/#hcup) to construct Support Vector Machine (SVM) (Vapnik, 1998) classifiers to predict clinical charge profiles, including hospital charges and length of stay (LOS), for patients diagnosed with heart disease, diabetes and cancer, respectively. Prediction results generated from this research can provide relevant clinical knowledge for healthcare policy makers to effectively manage healthcare services and costs at the national, state and local levels.

SVM (Vapnik, 1998) has shown superior classification performance in various bioinformatics applications as compared to other classifiers. Despite its solid mathematical foundation and promising experimental results, SVM is not favorable for large-scale data mining tasks since its training time complexity is at least quadratic in the number of samples (Vapnik, 1998). The task of building an effective SVM becomes more challenging when different data distribution patterns are intermingled in real-world large healthcare datasets. These large healthcare datasets are usually noisy, with erroneous information caused by human errors. Many SVM training algorithms have been proposed to enhance the efficiency of SVM training for large datasets while maintaining reasonable performance. These algorithms can be categorized into two major classes. The first class of algorithms is decomposition algorithms. The decomposition algorithms divide a large Quadratic Programming (QP) problem into a series of smaller QP subproblems, which can be easily and efficiently solved. Sequential Minimal Optimization (SMO) (Platt, 1999), chunking (Vapnik, 1998), SVMlight (Joachims, 1999), Core Vector Machine (Tsang, Kwok, & Cheung, 2005) and a fast SVM training algorithm for large datasets as proposed in Dong, Krzyzak, and Suen (2005) are five representative decomposition algorithms. The success of such decomposition algorithms depends on appropriate criteria for active working set selection and an efficient strategy to cache the kernel matrix. Although the decomposition algorithms can speed up the training process, they do not scale well with large and complex datasets. The kernel matrix may grow beyond the available memory during the optimization process because the selection of an active working set becomes increasingly more difficult as the dataset becomes very large and complex. Consequently, both the effectiveness and the efficiency of decomposition algorithms may be reduced greatly.

The second class of algorithms to deal with large datasets is selective sampling techniques, which intelligently select a small number of high-quality training samples from the entire dataset to maximize the learning performance of SVM. For example, hierarchical clustering and adaptive clustering are utilized to choose a number of important samples, which are later fed into a single SVM for efficient training (Award et al., 2004; Daniael, 2004; Khan, Awad, & Thuraisingham, 2007; Yu, Yang, & Han, 2003). In another selective sampling approach, a small subset is randomly selected to identify possible samples in the boundary of each class. These initially identified important samples and their neighbors are later fed to SVM for fast training (Li, Cervante, & Yu, 2007; Li, Cervante, & Yu, 2008). Additionally, an adaptive recursive partitioning algorithm was proposed to recursively subdivide a large dataset into smaller subsets in order to yield smaller subsets of refined prototypes (Kim & Oommen, 2004). The effectiveness of selective sampling techniques relies on efficient identification of the support vectors, which define the decision boundary of a SVM. The selective sampling techniques may hurt the classification performance of SVM when a single effective classification decision boundary is difficult to form in complex datasets with different data distribution patterns.

In our previous attempt to enhance SVM training for a large dataset, a one-level Clustering Support Vector Machines (CSVM) was proposed (Zhong, He, Harrison, Tai, & Pan, 2007). The one-level CSVM divides a large dataset into multiple clusters and trains a SVM for each cluster in order to reduce the size of training samples for each SVM. CSVM has shown promising results for some datasets. However, the one-level clustering produced by CSVM may not reflect optimal partitioning, especially for very large and complex datasets, so that CSVM trainings for some clusters may be hindered due to improper clustering.

In this work, a new Multi-level Support Vector Machine (MLSVM) is proposed to overcome the deficiencies of the previous approaches. The construction of MLSVM is divided into three phases. In the first phase, a scalable multi-level clustering algorithm is used to partition a large dataset into different layers of clusters that capture distribution patterns in different partitions. Each level of partitions can capture different data distribution patterns for different subspaces of the data. In the second phase, a SVM is trained for each cluster in the multi-level tree. Each SVM focuses on its cluster at a particular level of the tree so that a specific classifier is trained utilizing a particular high-dimensional hyperspace. In the final phase, SVMs from different levels of the tree, operating in different hyperspaces, cooperatively decide the class assignment of a given sample based on an advanced decision fusion algorithm.

The MLSVM model utilizes multiple SVMs, each of which learns the local data distribution patterns in a cluster effectively. A decision fusion algorithm is used to generate an effective global decision that incorporates local SVM decisions at different levels of the tree. Consequently, MLSVM can handle complex and often conflicting data distributions in large datasets more effectively than single-SVM based approaches such as decomposition algorithms and selective sampling techniques. Furthermore, MLSVM resolves the deficiencies of other multiple-SVM based systems, such as CSVM and SVM ensembles (Valentini, 2005), by using a multi-level cluster tree to produce better partitions for more effective SVM classifications.

In order to evaluate the effectiveness of MLSVM in handling large datasets, the performance of MLSVM is compared to three existing SVM models: (1) Core Vector Machine (CVM) (Tsang et al., 2005), a single SVM designed to handle a very large dataset, (2) adaptive clustering based SVM (ACSVM) (Daniael, 2004), which is based on selective sampling techniques, and (3) CSVM, which is based on one-level clustering. The classification performance is measured by three metrics: accuracy, the Area Under the Receiver Operating Characteristic Curve (AUC) (Baldi, Brunak, Chauvin, Andersen, & Nielsen, 2000) and the Matthews Correlation Coefficient (MCC) (Baldi et al., 2000). Both the combined 5 × 2-fold cross validation F test (Alpaydin, 1999) and the independent test show that the accuracy, AUC and MCC of MLSVM are superior to those of CVM, ACSVM and CSVM on these three datasets. Improved AUC and MCC values for MLSVM indicate that MLSVM is more capable of dealing with imbalanced datasets. Furthermore, the combined 5 × 2-fold cross validation F test (Alpaydin, 1999) is conducted to demonstrate that the performance improvement of MLSVM is statistically significant as compared to the other three models. To speed up the SVM training process for very large healthcare datasets, CSVM and MLSVM are parallelized with Pthreads (Butenhof, 1997) and their running times are compared with those of CVM and ACSVM.
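Accuracy and MCC are standard metrics computable from confusion-matrix counts. The following is a sketch using the textbook definitions, not code from the paper:

```python
import math

def accuracy(tp, tn, fp, fn):
    # fraction of samples classified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient; ranges from -1 to 1,
    # with 0 returned by convention when the denominator vanishes
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

Unlike accuracy, MCC stays informative on imbalanced datasets, which is why the paper reports it alongside AUC.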

This paper has been divided into seven sections. In Section 2, an adaptive clustering based SVM is described. In Section 3, the one-level CSVM is introduced. In Section 4, the different phases of the MLSVM construction are described. In Section 5, the experimental setup and datasets are presented. In Section 6, experimental results are presented to show that MLSVM is better than a single SVM, ACSVM and CSVM. In Section 7, the major highlights of this research are summarized.

2. Adaptive clustering based SVM (ACSVM)

Adaptive Clustering based SVM (ACSVM) (Daniael, 2004) is a representative selective sampling technique for SVM training on large datasets. In this work, ACSVM is compared with MLSVM in terms of classification performance for large datasets. ACSVM is based on the idea that only support vectors determine the classification decision boundary and non-support vectors have no effect on the formation of the decision boundary. ACSVM first partitions the training dataset into several pair-wise disjoint clusters. An initial Support Vector Machine (SVM) is constructed using representatives of these pair-wise disjoint clusters to approximately identify the support vectors and non-support vectors (Daniael, 2004). After samples in the clusters containing only non-support vectors are replaced by their cluster representatives, the number of training samples can be reduced substantially and the SVM training process can be sped up greatly.

The multi-level clustering algorithm proposed in this work is utilized to generate pair-wise disjoint clusters for positive samples and negative samples separately. Determining the number of initial clusters and the representative for each cluster is important for the training of the initial SVM. The number of initial clusters needs to be large enough so that the initial SVM can reasonably approximate the original SVM trained with the full dataset. At the same time, the initial number of clusters should not be so large that the benefits of SVM training speedup are lost. As a result, the square root heuristic is suggested for determining the initial number of clusters. The number of initial clusters for positive samples and the number of initial clusters for negative samples are defined as:

k+ = round(√n+) and k− = round(√n−)    (1)

respectively, where k+ is the number of clusters for positive samples, n+ is the number of positive samples, k− is the number of clusters for negative samples and n− is the number of negative samples. The cluster representative for cluster_i is defined as:

representative_i = arg min_{x ∈ cluster_i} dist(x, centroid_i)    (2)

where x is one of the samples in cluster_i, and centroid_i for cluster_i is defined as:

centroid_i = (1/n_i) Σ_{k=1}^{n_i} x_k    (3)

where n_i is the number of samples in cluster_i and x_k is one of the samples in cluster_i (Daniael, 2004).
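Eqs. (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration, assuming samples are plain lists of floats and Euclidean distance; the helper names are not from the paper:

```python
import math

def initial_cluster_counts(n_pos, n_neg):
    # Eq. (1): square-root heuristic for the initial numbers of clusters
    return round(math.sqrt(n_pos)), round(math.sqrt(n_neg))

def centroid(cluster):
    # Eq. (3): component-wise mean of the samples in a cluster
    n, dim = len(cluster), len(cluster[0])
    return [sum(x[d] for x in cluster) / n for d in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def representative(cluster):
    # Eq. (2): the member sample closest to the cluster centroid
    c = centroid(cluster)
    return min(cluster, key=lambda x: euclidean(x, c))
```

The representative, rather than the centroid itself, stands in for a cluster so that the initial SVM is trained on actual samples.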

3. One-level clustering SVM (CSVM)

Although ACSVM has shown promising results for large dataset training, this single-SVM based model may not capture complex data distribution patterns effectively. In order to further improve SVM's capability to deal with large and complex datasets, we proposed the one-level CSVM in our previous work (Zhong et al., 2007). CSVM partitions a large dataset into multiple clusters and individual SVMs are trained for each cluster. Localized data distribution patterns captured by a cluster can potentially facilitate the local SVM training.

In the first step of constructing this model, a clustering algorithm is used to divide the large dataset into multiple clusters. Subsequently, a SVM is trained to learn unique data distribution patterns for each cluster. After the entire CSVM is built, a given sample is assigned to the closest cluster based on the sample-cluster distance, which is the distance between a sample x and a given cluster C_i. The sample-cluster distance is defined as:

dist(C_i, x) = (1/n_i) Σ_{q ∈ C_i} dist(x, q)    (4)

where C_i is cluster i, x is the given data sample, q is one of the samples in C_i, n_i is the number of samples in cluster C_i, and dist(x, q) is the distance between sample x and sample q. Samples in the three datasets from the HCUP databases are encoded as binary vectors. For example, an encoding value of one indicates a given procedure is present in the sample whereas an encoding value of zero indicates a given procedure is absent in the sample. Since all features of sample x and sample q are coded as binary numbers for the three datasets, dist(x, q) is formulated as:

dist(x, q) = (Match01 + Match10) / (Match11 + Match01 + Match10)    (5)

where Match11 is the number of features where sample x is 1 and sample q is 1, Match01 is the number of features where sample x is 0 and sample q is 1, and Match10 is the number of features where sample x is 1 and sample q is 0 (Han & Kamber, 2006). dist(x, q) is equivalent to the asymmetric binary dissimilarity. Hence, the function for assigning a testing sample x to a selected cluster C_j is formulated as:

dist(C_j, x) = min_{i=1,...,n} dist(C_i, x)    (6)

where n is the number of clusters. The SVM classification function for the selected cluster C_j to classify a sample x is formulated as:

f_svm_j(x) = Σ_{i=1}^{sv} α_i y_i K_svm_j(x, x_i) + b    (7)

where sv is the number of support vectors and K_svm_j(x, x_i) is the kernel function of svm_j trained for cluster C_j.
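The distance computations in Eqs. (4)–(6) can be sketched as follows, assuming samples are 0/1 feature lists. The function names are illustrative, and a guard is added for the degenerate case where no feature is 1 in either sample, which Eq. (5) leaves undefined:

```python
def binary_dissimilarity(x, q):
    # Eq. (5): asymmetric binary dissimilarity over binary feature vectors
    m11 = sum(1 for a, b in zip(x, q) if a == 1 and b == 1)
    m01 = sum(1 for a, b in zip(x, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(x, q) if a == 1 and b == 0)
    den = m11 + m01 + m10
    return (m01 + m10) / den if den else 0.0  # 0.0 by convention for all-zero pairs

def sample_cluster_dist(cluster, x):
    # Eq. (4): mean dissimilarity between x and the members of a cluster
    return sum(binary_dissimilarity(x, q) for q in cluster) / len(cluster)

def closest_cluster(clusters, x):
    # Eq. (6): index of the cluster with the smallest sample-cluster distance;
    # the SVM trained for this cluster would then classify x via Eq. (7)
    return min(range(len(clusters)),
               key=lambda i: sample_cluster_dist(clusters[i], x))
```

Matching 0-0 feature pairs are deliberately ignored, which is what makes the measure asymmetric: the absence of a procedure in both samples carries no information.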

4. Multi-level SVM (MLSVM)

The success of the one-level CSVM depends heavily on appropriate partitioning of the large dataset. It is conjectured that the partitioning by the one-level CSVM reflects the underlying distribution patterns of large datasets. However, it is quite challenging and difficult for one-level clustering to identify all the potential data subsets that can enhance SVM trainings since the number of clusters and the cluster sizes are determined by a particular partitioning process. Some clusters may hinder SVM trainings because samples in these clusters could be non-separable. To overcome the weakness of the one-level CSVM, the Multi-level SVM (MLSVM) is proposed to capture the complex underlying distribution patterns more effectively and to enhance the classification performance of SVM for large healthcare datasets by utilizing multiple SVMs embedded in a multi-level cluster tree. The main advantage of multi-level clusters is that in certain scenarios, clusters at lower levels may provide more suitable sample subspaces for SVM trainings than clusters at the upper levels, while in other scenarios the opposite could be true. As a result, intelligent cooperation of SVMs in the lower levels and upper levels of the tree structure can provide an effective classification decision for a given sample.

The construction of the MLSVM model includes three phases for classifying samples in large datasets. In the first phase, the large dataset is divided into multi-level partitions within a tree structure using a newly designed multi-level clustering algorithm. In the second phase, a SVM is trained for each cluster in the tree structure. In the third phase, an advanced decision fusion algorithm is used to intelligently merge the SVMs' decisions from different levels of the tree.

4.1. Scalable multi-level clustering algorithm for large datasets

The hierarchical clustering algorithm can generate a tree-like structure preserving both the cluster–subcluster relationships and the order in which clusters are combined (Han & Kamber, 2006). Thus, the hierarchical clustering algorithm is the ideal tool for generating the desired multi-level clustering structure for MLSVM. However, since the time complexity of a typical hierarchical clustering algorithm is O(n² log n), where n is the size of the dataset, the hierarchical clustering algorithm is not scalable for large datasets. In order to process large datasets efficiently, a three-step multi-level clustering algorithm based on domain information is designed. The pseudo code for the three-step algorithm is given in the following:

1. Partition the whole dataset into multiple data subsets using an improved K-means clustering algorithm.


2. Apply the agglomerative hierarchical clustering algorithm to each of the data subsets in parallel using a proximity measure based on the centroids of clusters. The hierarchical clustering stops when the desired qualities of the clusters have been reached in each data subset. In the end, this step produces a forest of cluster trees for each data subset.

3. Merge the two closest cluster trees among all cluster trees from Step 2 using proximity measures based on the centroids of the roots of the cluster trees until the resulting clusters reach the desired cluster quality. This step essentially transforms the forests of cluster trees into a single cluster tree for the entire dataset.

In Step 1 of this scalable clustering algorithm, the whole dataset is divided into multiple data subsets using the improved K-means algorithm proposed in our previous work (Zhong, Altun, Harrison, Tai, & Pan, 2005). The improved K-means clustering algorithm adopts a new greedy initialization method, which selects suitable initial points so that the final partitions can represent the underlying distribution of the data samples more consistently and accurately (Zhong et al., 2005). Experimental results indicate that this new greedy initialization method can overcome potential problems of random initialization. The initial number of clusters for the improved K-means algorithm is determined by the number of hospital regions in the dataset. In Step 2, the agglomerative hierarchical clustering algorithm is conducted on each data subset in parallel to speed up the processing time. In Step 3, the cluster trees produced in Step 2 are merged hierarchically using a proximity measure based on the cluster centroids of the tree roots only. Since this hierarchical clustering step utilizes only centroids, instead of every sample in a cluster, the processing time for large datasets is much reduced.

The newly designed multi-level clustering algorithm can produce a multi-level clustering structure effectively while avoiding the scalability issue of the traditional hierarchical clustering algorithm. Clusters at different levels are capable of capturing different levels of abstraction of the sample space. As demonstrated in the experimental results, the multi-level clustering algorithm is more effective in capturing the complex distribution patterns in large healthcare datasets than the one-level clustering approach.

Fig. 1. Clusters organized in a tree structure.
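The centroid-based tree merging of Step 3 can be sketched as below. Step 1's improved K-means and the parallel Step 2 are omitted, and the dictionary tree representation is an assumption of this sketch, not the paper's data structure:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def mean_point(points):
    n, dim = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def leaf(points):
    # a cluster tree produced by Step 2, reduced here to its root summary
    return {"centroid": mean_point(points), "children": [], "points": points}

def merge_trees(trees):
    # Step 3 (sketch): repeatedly merge the two cluster trees whose ROOT
    # centroids are closest, until a single tree covers the whole dataset.
    # Only centroids are compared, never individual samples, which is what
    # keeps this step cheap for large datasets.
    trees = list(trees)
    while len(trees) > 1:
        d, i, j = min(
            (euclidean(trees[a]["centroid"], trees[b]["centroid"]), a, b)
            for a in range(len(trees)) for b in range(a + 1, len(trees))
        )
        right = trees.pop(j)
        left = trees.pop(i)
        pts = left["points"] + right["points"]
        trees.append({"centroid": mean_point(pts),
                      "children": [left, right], "points": pts})
    return trees[0]
```

A real implementation would stop merging once the desired cluster quality is reached instead of always producing one root, but the tree-shaped output is the same structure the SVMs are attached to.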

4.2. Advanced decision fusion algorithm

After the SVM for each cluster in the multi-level cluster tree is trained, a new recursive tree decision fusion algorithm called treeClassify is designed to take advantage of the strength of each SVM in the tree. The treeClassify algorithm adopts an adaptive strategy to select the most suitable SVM in the tree structure to perform robust and effective classifications.

First, the classification value, f_svm_j(x), of a SVM, svm_j, is normalized using the z-score for fair comparison of classification values from different SVMs. This normalization step is necessary because the decision boundaries of SVMs for different clusters in the tree structure are obtained in different high-dimensional feature spaces for tackling the classification problem in different sample subspaces. The decision value of svm_j for a sample x is defined as the z-score of svm_j's classification value for the given sample x:

Decision_value_svm_j(x) = (f_svm_j(x) − mean_svm_j) / σ_svm_j    (8)

where mean_svm_j is the mean of the classification values for svm_j in cluster j and σ_svm_j is the standard deviation of the classification values for svm_j in cluster j. The higher the magnitude of the decision value of svm_j, |Decision_value_svm_j(x)|, the higher the SVM's confidence level for classifying a sample x will be.

The confidence of the SVM decision value can be strongly affected by the distance between the sample and the cluster associated with this SVM. Hence, the SVM decision value is weighted by the cluster-sample distance as defined in Eq. (9). In Eq. (9), the cluster-sample distance between the sample x and cluster C_i is smoothed by the logistic function:

smooth_dist(C_i, x) = 1 / (1 + e^(−dist(C_i, x)))    (9)

where C_i is cluster i and x is the given data sample. As a result, the weighted decision value of svm_j for a sample x is defined as:

w(svm_j, x) = Decision_value_svm_j(x) × smooth_dist(C_i, x)    (10)
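Eqs. (8)–(10) amount to only a few lines of code. The sketch below assumes the per-cluster mean and standard deviation of the SVM outputs have been precomputed; function names are illustrative:

```python
import math

def decision_value(f_x, mean_j, std_j):
    # Eq. (8): z-score normalization of a raw SVM classification value,
    # making outputs of SVMs trained in different feature spaces comparable
    return (f_x - mean_j) / std_j

def smooth_dist(d):
    # Eq. (9): logistic smoothing of the sample-cluster distance
    return 1.0 / (1.0 + math.exp(-d))

def weighted_decision(f_x, mean_j, std_j, dist_to_cluster):
    # Eq. (10): normalized decision value weighted by the smoothed distance
    return decision_value(f_x, mean_j, std_j) * smooth_dist(dist_to_cluster)
```

It is the magnitude |weighted_decision| that the fusion step compares across SVMs; the sign carries the class prediction.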

The tree classification process, treeClassify, is a recursive, bottom-upprocess that operates on tree structures. Recall that the clusters areorganized in a tree structure as shown in Fig. 1. The classification pro-cess treats a subtree of clusters as a computing group that consists ofa root cluster, Croot, and its children clusters, Ci’s. In turn, each of thechildren clusters is the root for its own subtree or computing group.-The treeClassify algorithm is shown in Fig. 2. The following notationsare used in the pseudo codes in Fig. 2. Given a sample x, the weightedSVM decision value for the root cluster is defined as w(svm_root,x).The magnitude of the most confident weighted SVM decision valuefor the children clusters is defined as |Children_decisionk|. The inputparameters of the recursive function treeClassify( ) is the root cluster,Croot, and a sample x whereas the output of this function is the mostconfident weighted SVMs’ decision value.When a cluster Croot thathas no children receives a sample x to be classified, it simply reportsto its parent cluster its own weighted decision value, w(svm_root,x),as the decision value for its subtree. However, if Croot has some chil-dren clusters, the decision value for its subtree is the most confidentweighted decision value, w(�), among all children clusters and Croot.This process is recursive because the children clusters generateweighted decision values for their own subtrees using the same steps.The whole process terminates when the topmost cluster generatesthe final decision value.

4.3. Parallel algorithm for CSVM and MLSVM

Training SVM is a slow and time consuming task especially for alarge dataset because optimal parameters for SVM need besearched in the large space. However, the multi-level clusteringalgorithm is inherently parallelizable since the agglomerativehierarchical clustering algorithm to each of the data subsets canbe applied in parallel. Furthermore, SVM for each cluster in themulti-level cluster tree can be trained in parallel. As a result, con-struction of MLSVM can be speed up substantially based on parallel

TreeDecision treeClassify(TreeNode Croot, Sample x) {
    if (Croot has no children)
        return ψ(svm_root, x);
    else {
        for each child cluster Cj
            Children_decision_j = treeClassify(Cj, x);
        select Children_decision_k such that |Children_decision_k| = max_j |Children_decision_j|;
        if (|Children_decision_k| ≥ |ψ(svm_root, x)|)
            return Children_decision_k;
        else
            return ψ(svm_root, x);
    }
}

Fig. 2. treeClassify: a recursive tree classification algorithm.
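The pseudocode in Fig. 2 can be transcribed into a short runnable sketch. Here `TreeNode` and the decision function stored in each node are illustrative stand-ins for the paper's cluster tree and ψ(svm_root, x); this is not the authors' implementation.

```python
class TreeNode:
    def __init__(self, psi, children=None):
        self.psi = psi              # weighted SVM decision function for this cluster
        self.children = children or []

def tree_classify(root, x):
    """Return the most confident weighted SVM decision value in root's subtree."""
    own = root.psi(x)
    if not root.children:
        return own
    # Recursively obtain each child subtree's most confident decision value.
    child_decisions = [tree_classify(c, x) for c in root.children]
    best_child = max(child_decisions, key=abs)
    # Keep the child decision only if it is at least as confident (in magnitude)
    # as the root cluster's own decision.
    return best_child if abs(best_child) >= abs(own) else own
```

A leaf simply returns its own weighted decision; an internal node returns the largest-magnitude decision among itself and its subtrees, which is exactly the fusion rule the text describes.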

1478 W. Zhong et al. / Expert Systems with Applications 39 (2012) 1474–1483

computational techniques, since the clustering process and SVM training are the most time-consuming steps. The parallel algorithm can also be applied to CSVM since the SVM for each cluster can be trained in parallel as well. In this work, the running times of SVM, ACSVM, CSVM and MLSVM are compared.
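Because the per-cluster SVMs are independent, the parallel training step can be sketched with a simple worker pool. The `train_svm_on_cluster` routine below is a hypothetical placeholder (it returns the cluster size instead of fitting a real SVM) so the sketch runs without an SVM library; real CPU-bound training would use processes or the paper's Pthreads/OpenMP setup rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def train_svm_on_cluster(cluster):
    # Placeholder for fitting an SVM on one cluster's samples;
    # returns the cluster size so the sketch is runnable as-is.
    return len(cluster)

def train_all(clusters, n_workers=4):
    # Clusters are independent, so their models can be trained concurrently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(train_svm_on_cluster, clusters))
```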

5. Datasets and experimental setup

In this section, the two SVM classifiers and the datasets for the cross validation test and the independent test are described first. Then, details of the features for each sample, the performance evaluation metrics and the programming environments are explained.

5.1. SVM classifiers to predict length of stay and clinical charge

Two SVM classifiers, the LOS classifier and the clinical charge classifier, are used to predict the clinical charge profile of patient samples. The LOS SVM classifier classifies patient samples based on diagnosis, treatment and personal profile with respect to the length of stay. For the LOS classifier, samples with a length of stay of less than six days are labeled as "positive"; otherwise, they are labeled as "negative". The clinical charge SVM classifier classifies patient samples based on diagnosis, treatment and personal profile with respect to the clinical charge. For the clinical charge classifier, samples with a charge of less than $10,000 are labeled as "positive"; otherwise, they are labeled as "negative".
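The two labeling rules above are simple threshold tests; a minimal sketch, with the cutoffs taken directly from the text (the function and constant names are illustrative):

```python
LOS_CUTOFF_DAYS = 6        # stays shorter than six days are "positive"
CHARGE_CUTOFF_USD = 10_000 # charges under $10,000 are "positive"

def los_label(length_of_stay_days):
    return "positive" if length_of_stay_days < LOS_CUTOFF_DAYS else "negative"

def charge_label(total_charge_usd):
    return "positive" if total_charge_usd < CHARGE_CUTOFF_USD else "negative"
```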

5.2. Three datasets for the combined 5 × 2-fold cross validation F test

The 2004 HCUP-3 databases are used to generate three datasets for the combined 5 × 2-fold cross validation test. The HCUP-3 databases are the largest US inpatient databases (http://www.ahrq.gov/data/hcup/#hcup). These three datasets are the heart and circulatory disease dataset, the cancer dataset and the diabetes dataset. In this cross validation F test set, the heart and circulatory disease dataset has 693,632 samples, the diabetes dataset has 104,164 samples and the cancer dataset has 289,310 samples. Table 1 shows the number of positive samples and the number of negative samples

Table 1
Statistics for the three datasets for the combined 5 × 2-fold cross validation F test.

Datasets          No. of positive samples    No. of negative samples
Heart-LOS         497,486                    196,146
Cancer-LOS        192,691                    96,619
Diabetes-LOS      64,851                     39,313
Heart-Charge      242,771                    450,861
Cancer-Charge     86,793                     202,517
Diabetes-Charge   41,665                     62,499

for each dataset based on the LOS cutoff and the clinical charge cutoff. For example, Heart-LOS indicates that samples are labeled as positive or negative based on the LOS cutoff for the heart and circulatory disease dataset, while Heart-Charge indicates that samples are labeled as positive or negative based on the clinical charge cutoff for the heart and circulatory disease dataset.

5.3. Three datasets for independent testing

To evaluate the performance of the new model more rigorously, the datasets generated from the 2004 HCUP-3 databases for the combined 5 × 2-fold cross validation F test are used as the training sets, and datasets generated from the 2005 HCUP-3 databases are used as the independent testing sets. In the independent testing set, the heart and circulatory disease dataset has 745,742 samples, the diabetes dataset has 144,258 samples and the cancer dataset has 338,234 samples. Table 2 shows the number of positive samples and the number of negative samples for each dataset based on the LOS cutoff and the clinical charge cutoff for the independent testing datasets.

5.4. Encoding scheme for features of patient sample

Each patient sample has a maximum of 15 diagnosis codes, a maximum of 15 procedure codes and a personal profile including age, admission type, sex and race. For the diagnosis and procedure codes, an encoding value of one indicates that a given code is present in the sample, whereas an encoding value of zero indicates that it is absent. Since age, admission type, sex and race are categorical, these features are also encoded as binary codes. A detailed explanation of each feature is available at http://faculty.uscupstate.edu/wzhong/feature_explanation.pdf. A sample diabetes dataset is available at http://faculty.uscupstate.edu/wzhong/sample_dataset.txt.
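The encoding described above amounts to presence/absence bits for the codes plus one-hot encoding of the categorical profile fields. A minimal sketch follows; the code vocabulary and category levels passed in are illustrative, not HCUP's actual dictionaries.

```python
def encode_sample(codes_present, category_value, category_levels, code_vocab):
    # One 0/1 feature per known diagnosis/procedure code.
    code_bits = [1 if c in codes_present else 0 for c in code_vocab]
    # One-hot encoding for a single categorical field (e.g. sex or race).
    onehot = [1 if category_value == level else 0 for level in category_levels]
    return code_bits + onehot
```

In practice one such one-hot block would be produced for each categorical field (age group, admission type, sex, race) and concatenated with the code bits.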

5.5. Performance evaluation metrics

The 5 × 2 cross-validation test (Alpaydin, 1999) and the independent test are used to compare the performance of a single CVM,

Table 2
Statistics for the three datasets for independent testing.

Datasets          No. of positive samples    No. of negative samples
Heart-LOS         559,306                    186,436
Cancer-LOS        236,763                    101,471
Diabetes-LOS      93,767                     50,491
Heart-Charge      216,265                    529,477
Cancer-Charge     118,381                    219,853
Diabetes-Charge   51,932                     92,326

Fig. 4. Average accuracy of the models on three datasets.


ACSVM, CSVM and MLSVM. In the 5 × 2 cross-validation test, five replications of twofold cross-validation are performed (Alpaydin, 1999). In each replication, the dataset is randomly divided into two equal-sized subsets; one subset is used for training and the other for testing, and the process is repeated until both subsets have been used for testing. This project uses accuracy, the Area Under the Receiver Operating Characteristic Curve (AUC) and the Matthews Correlation Coefficient (MCC) as performance measures. AUC and MCC are effective in evaluating the performance of a binary classifier on imbalanced datasets.
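The 5 × 2 protocol can be sketched as a split generator: five replications, each producing two folds in which the halves swap training and testing roles (the function name and seeding are illustrative).

```python
import random

def five_by_two_splits(n_samples, seed=0):
    """Return 10 (train_idx, test_idx) pairs: 5 replications x 2 folds."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        half = n_samples // 2
        a, b = idx[:half], idx[half:]
        splits.append((a, b))  # fold 1: train on a, test on b
        splits.append((b, a))  # fold 2: train on b, test on a
    return splits
```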

The first performance measure, Accuracy, is defined as:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (11)

where TP is the number of positive samples that are classified correctly, TN is the number of negative samples that are classified correctly, FP is the number of negative samples that are misclassified as positive and FN is the number of positive samples that are misclassified as negative. Accuracy indicates how many samples are correctly classified.

Accuracy can be misleading when the dataset is imbalanced; therefore, the second measure for classification evaluation is MCC, which is defined in Baldi et al. (2000) as:

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (12)

MCC returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and −1 an inverse prediction. MCC takes TP, TN, FP and FN into consideration and is an effective performance measure for imbalanced datasets. The third measure is AUC, which is another effective performance measure for imbalanced datasets and is defined in Baldi et al. (2000) as:

AUC = (Σ_{i=1}^{N+} r_i − N+(N+ + 1)/2) / (N+ × N−)    (13)

where r_i is the rank of the ith positive sample in the ranking list generated by sorting the classifier outputs for all samples in ascending order, N+ is the number of positive samples and N− is the number of negative samples. AUC estimates the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample by a specific classifier (Baldi et al., 2000). Finally, a fourth measure, the combined 5 × 2-fold cross validation F test, is conducted to evaluate the significance of the performance improvement of MLSVM as compared with the other models.
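Eqs. (11)-(13) translate directly into code. The sketch below is a plain transcription of the three formulas, not the authors' implementation; `ranks_of_positives` holds the 1-based ranks r_i described above.

```python
import math

def accuracy(tp, tn, fp, fn):
    # Eq. (11)
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, tn, fp, fn):
    # Eq. (12); returns 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(ranks_of_positives, n_pos, n_neg):
    # Eq. (13): ranks are 1-based positions of positive samples after
    # sorting classifier outputs in ascending order.
    return (sum(ranks_of_positives) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For example, if both positive samples receive the two highest scores among four samples (ranks 3 and 4), AUC is 1; if they receive the two lowest, AUC is 0.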

5.6. Programming environment

An Intel OpenMP C++/Fortran compiler for Hyper-Threading technology (Marr et al., 2002) is used to parallelize CSVM and MLSVM using Pthreads in our experiments. A Dell PowerEdge 6600 server with four processors is adopted for this project. Because of Hyper-Threading technology, it behaves like eight logical processors. Eight threads are used, as shown in Fig. 3.


Fig. 3. Four physical processors behaving like eight logical processors.

6. Experimental results

In this section, experimental results for the 5 × 2 cross-validation test and the independent test are presented. The running times of the four models are also compared.

6.1. Experimental results for the 5 × 2 cross-validation test

6.1.1. Comparing accuracies of the models on three datasets

First, experimental results for comparing the accuracies of the four models are discussed. In Fig. 4, the average accuracies on test folds for the four models using the 5 × 2 cross-validation test are compared. The label "Heart-L" denotes the LOS classifier for the heart and circulatory disease dataset, "Diabetes-L" denotes the LOS classifier for the diabetes dataset, "Cancer-L" denotes the LOS classifier for the cancer dataset, "Heart-C" denotes the clinical charge classifier for the heart and circulatory disease dataset, "Diabetes-C" denotes the clinical charge classifier for the diabetes dataset, and "Cancer-C" denotes the clinical charge classifier for the cancer dataset. Compared with the best of the other three models, MLSVM improves the average accuracy by 9, 6 and 5 percentage points for Heart-L, Diabetes-L and Cancer-L, respectively. Similar patterns can be observed for the clinical charge classifier.

Fig. 5 compares the standard deviation of the accuracies of the models for the 5 × 2 cross-validation test. Fig. 5 indicates that MLSVM is more stable than the other models, since the standard deviation for MLSVM is the smallest in each case.

The combined 5 × 2-fold cross validation F test is conducted to verify that the performance improvement of MLSVM over the other models is statistically significant. The combined 5 × 2-fold cross validation F test generates a p-value, which represents the significance level at which the null hypothesis that the algorithms have the

Fig. 5. Standard deviation of accuracy for the models on three datasets.

1480 W. Zhong et al. / Expert Systems with Applications 39 (2012) 1474–1483

same error rate can be rejected. A lower p-value implies a higher significance level for rejecting the null hypothesis, or a more statistically significant improvement of MLSVM over the other models. In this paper, the significance threshold for the p-value is set to 1%, which is more stringent than the 5% commonly used by statisticians. Table 3 compares the p-values from the F test in terms of accuracy for the models on the three datasets. The results in Table 3 indicate that the accuracy improvement of MLSVM over the other three models is significant on all three datasets.
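The test statistic behind these p-values is Alpaydin's (1999) combined 5 × 2 cv F statistic. A sketch of its computation, assuming `p[i][j]` is the difference in error rates between two classifiers on fold j of replication i:

```python
def combined_5x2cv_f(p):
    """Alpaydin's combined 5x2 cv F statistic from a 5x2 array of
    error-rate differences; compared against an F(10, 5) distribution."""
    num = sum(p[i][j] ** 2 for i in range(5) for j in range(2))
    s2 = []
    for i in range(5):
        mean = (p[i][0] + p[i][1]) / 2
        # Per-replication variance estimate of the difference.
        s2.append((p[i][0] - mean) ** 2 + (p[i][1] - mean) ** 2)
    return num / (2 * sum(s2))
```

The resulting statistic is referred to an F distribution with (10, 5) degrees of freedom to obtain the p-values reported in Tables 3 and 4.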

Fig. 7. Standard deviation of AUC of the models on three datasets.

6.1.2. Comparing AUC of the models on three datasets

Fig. 6 compares the average AUC for the models. Compared with the best of the other three models, MLSVM improves the average AUC by 13, 4 and 6 percentage points for Heart-L, Diabetes-L and Cancer-L, respectively. Similar patterns can be observed for the clinical charge classifier.

Fig. 7 compares the standard deviation of the AUC for the models. Since the standard deviation of the AUC for MLSVM is the smallest in all but one case, we can conclude that MLSVM is generally more stable than the other three models in terms of AUC.

Table 4 compares the p-values from the F test for the AUC of the models. Table 4 indicates that the improvement in AUC for MLSVM is highly significant compared to the other three models in most cases. In the diabetes test comparing MLSVM and CSVM, the p-value of 4.3% is still statistically significant, as most statisticians consider a p-value of 5% or lower adequate.

Fig. 8 compares the ROC curves of the four models for the heart and circulatory disease dataset based on clinical charge classification. Fig. 9 compares the ROC curves of the four models for the heart and

Table 3
P-values from the F test in terms of accuracy for the models on three datasets.

Models   Heart-L   Heart-C   Diabetes-L   Diabetes-C   Cancer-L   Cancer-C
CVM      <0.1%     <0.1%     <0.1%        0.3%         <0.1%      <0.1%
ACSVM    <0.1%     <0.1%     <0.1%        <0.1%        <0.1%      <0.1%
CSVM     <0.1%     0.5%      0.3%         <0.1%        0.8%       0.6%
MLSVM    N/A       N/A       N/A          N/A          N/A        N/A

Table 4
P-values from the F test in terms of AUC for the models on three datasets.

Models   Heart-L   Heart-C   Diabetes-L   Diabetes-C   Cancer-L   Cancer-C
CVM      <0.1%     0.3%      <0.1%        0.1%         0.3%       <0.1%
ACSVM    <0.1%     0.2%      <0.1%        0.9%         <0.1%      <0.1%
CSVM     <0.1%     0.4%      4.3%         0.8%         <0.1%      <0.1%
MLSVM    N/A       N/A       N/A          N/A          N/A        N/A

Fig. 6. Average AUC of the models on three datasets.

Fig. 8. Comparison of the ROC curves of the four models for the heart and circulatory disease dataset based on clinical charge classification.

Fig. 9. Comparison of the ROC curves of the four models for the heart and circulatory disease dataset based on LOS classification.

circulatory disease dataset based on LOS classification. Both figures show that the ROC curve of MLSVM dominates those of the other three models.

6.1.3. Comparing MCC of the models on three datasets

Fig. 10 compares the average MCC of the models. Compared with the best of the other three models, MLSVM improves the average MCC by 14, 11 and 6 percentage points for Heart-L, Diabetes-L and Cancer-L, respectively. Fig. 11 compares the standard deviation of the MCC of the models. The smaller standard deviation of the MCC for MLSVM indicates that MLSVM is more stable than the other three models.

Fig. 10. Average MCC of the models on three datasets.

Fig. 11. Standard deviation of MCC of the models on three datasets.

Fig. 13. Accuracy of the models on independent testing datasets.

Fig. 14. AUC of the models on independent testing datasets.


6.1.4. Comparing average performance of the models

To assess the overall performance of the models, the performance metrics, including accuracy, AUC and MCC, are averaged over the heart and circulatory disease, diabetes and cancer datasets for the LOS classifier and the clinical charge classifier.

Fig. 12 indicates that when compared to CVM, MLSVM improves the average accuracy, AUC and MCC by 8, 7 and 14 percentage points, respectively, across the three datasets. When compared to ACSVM, MLSVM improves the average accuracy, AUC and MCC by 11, 10 and 16 percentage points, respectively. Finally, MLSVM improves the average accuracy, AUC and MCC by 5, 5 and 8 percentage points, respectively, over CSVM.

6.2. Experimental results for independent testing datasets

To assess the performance of our new model rigorously, three datasets generated from the 2004 HCUP databases are used as the

Fig. 12. Average performance of the models on three datasets.

training sets and three datasets from the 2005 HCUP databases are used as the independent testing datasets. Fig. 13 compares the accuracies of the models on the independent testing set. Compared with the best of the other three models, MLSVM improves the accuracy by 6, 6 and 7 percentage points for Heart-L, Diabetes-L and Cancer-L, respectively.

Fig. 14 compares the AUC of the models on the independent testing set. Compared with the best of the other three models, MLSVM improves the AUC by 8, 5 and 6 percentage points for Heart-L, Diabetes-L and Cancer-L, respectively.

7. Comparing running times for four models

Since SVM training is a slow process for very large datasets, CSVM and MLSVM are parallelized to speed up the training process. In this section, both the running times and the speedup values for the four models are presented.

Fig. 15 shows the total program execution time (in hours) when different numbers of threads are used for the heart and circulatory disease dataset. The labels "HC-1", "HC-4" and "HC-8" denote the clinical charge classifier for the heart and circulatory disease dataset when one, four and eight threads are used, respectively. Similarly, the labels "HL-1", "HL-4" and "HL-8" denote the LOS classifier for the heart and circulatory disease dataset when one, four and eight threads are used. For CVM and ACSVM, only one thread is used since no parallel algorithm is needed. For CSVM and MLSVM, the performances with four threads and eight threads are

Fig. 15. Execution time of two classifiers for the heart and circulatory disease dataset.

Fig. 18. Speedup values of two classifiers for the cancer dataset.


compared. The running times for CVM and ACSVM are 230 h and 108 h, respectively, for the heart and circulatory disease dataset based on the clinical charge classifier. The running times for CSVM and MLSVM are 71 h and 84 h, respectively, for the clinical charge classifier when four threads are used. The running time for MLSVM is higher than that for CSVM since the parallel clustering algorithm is more costly. Fig. 16 shows the speedup values when different numbers of threads are used for the heart and circulatory disease dataset. Compared with CVM, the speedup value for ACSVM is 2.1, and the speedup values for CSVM and MLSVM are 3.2 and 2.7 for the clinical charge classifier when four threads are used.
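The quoted speedups follow directly from the raw running times, taking CVM's single-thread time as the serial baseline:

```python
def speedup(t_serial, t_parallel):
    # Speedup = serial running time / parallel running time.
    return t_serial / t_parallel

# Heart and circulatory disease charge classifier, CVM baseline = 230 h:
print(round(speedup(230, 108), 1))  # ACSVM, one thread: 2.1
print(round(speedup(230, 71), 1))   # CSVM, four threads: 3.2
print(round(speedup(230, 84), 1))   # MLSVM, four threads: 2.7
```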

Fig. 17 shows the total program execution time (in hours) when different numbers of threads are used for the cancer dataset. The labels "CC-1", "CC-4" and "CC-8" denote the clinical charge classifier for the cancer dataset when one, four and eight threads are used, respectively, and the labels "CL-1", "CL-4" and "CL-8" denote the LOS classifier for the cancer dataset when one, four and eight threads are used. The running times for CVM and ACSVM are 97 h and 52 h, respectively, for the cancer dataset based on the clinical charge classifier. The running times for CSVM and MLSVM are 33 h and 37 h, respectively, based on the clinical charge classifier when four threads are used. Fig. 18 shows the speedup values when different numbers of threads are used. Compared with CVM, the speedup value for ACSVM is 1.8, and the speedup values for CSVM and MLSVM are 2.9 and 2.6 for the clinical charge classifier when four threads are used. Our experimental results indicate that the LOS classifier and the clinical charge classifier can predict clinical charge profiles for patients diagnosed with chronic diseases with high accuracy and short running time. Clinical charge profiles are critical indicators of the level and quality of patient care. Accurate and efficient predictions by the MLSVM model are very important for healthcare providers and third party payers to contain costs while supporting the highest quality of inpatient hospital care.

Fig. 16. Speedup values of two classifiers for the heart and circulatory disease dataset.

Fig. 17. Execution time of two classifiers for the cancer dataset.

8. Conclusion

In this study, four SVM modeling techniques, including our novel MLSVM model, are implemented and evaluated. The performance of these four models is compared on three large and complex real healthcare datasets using 5 × 2-fold cross validation testing and independent testing. MLSVM utilizes multiple SVMs organized in a multi-level clustering tree to learn the local data distribution in each cluster. Since multi-level clustering is capable of capturing complex distribution patterns by using multiple classifiers in different hyperspaces, MLSVM is more effective in capturing complex distribution patterns in large datasets than the single CVM, ACSVM and CSVM models. Experimental results demonstrate that the classification performance of MLSVM is much superior to that of the single CVM, ACSVM and CSVM models. Experimental results also indicate that MLSVM is more stable and more capable of handling large imbalanced datasets than the other three models. Consequently, MLSVM is more effective in dealing with large and complex healthcare datasets than traditional single SVM approaches and multiple SVM systems. Furthermore, CSVM and MLSVM are parallelized to speed up the training process. The experimental analysis indicates that the running time of the traditional SVM is reduced significantly for large healthcare datasets when the parallel algorithm is applied to CSVM and MLSVM.


Funding policies at national, state and local levels depend on accurate predictions of the costs of hospital stays. This research contributes to the body of knowledge on accurately predicting both the length of stay and the charges associated with those stays for patients diagnosed with chronic diseases. The newly proposed MLSVM can be applied to explore important patterns in other large and complex healthcare datasets.

Acknowledgements

This research was supported in part by the Student Research Assistant Program and a Research Incentive Award from the University of South Carolina Upstate and by the Science Foundation of Jiangsu Province of China (BK2007105). This research was also supported in part by a Healthy Living Initiative Faculty Research Grant from the ReGenesis Community Health Center (RCHC) and a Magellan Scholars Award through USC Columbia.

References

Alpaydin, E. (1999). Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8), 1885–1892.

Awad, M., Khan, L., Bastani, F., & Yen, I. (2004). An effective support vector machines (SVMs) performance using hierarchical clustering. In Proceedings of the 16th IEEE international conference on tools with artificial intelligence (pp. 663–667).

Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., & Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics, 16(5), 412–424.

Breault, J., Goodall, C., & Fos, P. (2002). Data mining a diabetic data warehouse. Artificial Intelligence in Medicine, 26(1), 37–54.

Butenhof, D. (1997). Programming with POSIX threads. Addison-Wesley professional computing series.

Daniael, B., & Cao, D. (2004). Training support vector machines using adaptive clustering. In Proceedings of the SIAM international conference on data mining (pp. 126–137).

Dong, J. X., Krzyzak, A., & Suen, C. Y. (2005). Fast SVM training algorithm with decomposition on very large datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4), 603–618.

Han, J. W., & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). Morgan Kaufmann.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In Advances in kernel methods: Support vector learning (pp. 169–184). MIT Press.

Khan, L., Awad, M., & Thuraisingham, B. (2007). A new intrusion detection system using support vector machines and hierarchical clustering. International Journal on Very Large Data Bases, 16(4), 507–521.

Kim, S. W., & Oommen, B. J. (2004). Enhancing prototype reduction schemes with recursion: A method applicable for large data sets. IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics, 34(3), 1184–1397.

Li, X. O., Cervante, J., & Yu, W. (2007). Two-stage SVM classification for large data sets via randomly reducing and recovering training data. In Proceedings of the 2007 IEEE international conference on systems, man and cybernetics (pp. 3633–3638).

Li, X. O., Cervante, J., & Yu, W. (2008). Support vector classification for large data sets by reducing training data with change of classes. In Proceedings of the 2008 IEEE international conference on systems, man and cybernetics (pp. 2609–2614).

Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A., Miller, J. A., & Upton, M. (2002). Hyper-threading technology architecture and microarchitecture. Intel Technology Journal.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Kernel methods: Support vector learning (pp. 185–208).

Tsang, I. W., Kwok, J. T., & Cheung, P. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 363–392.

Valentini, G. (2005). An experimental bias-variance analysis of SVM ensembles based on resampling techniques. IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics, 35(6), 1252–1271.

Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons, Inc.

Yu, H., Yang, J., & Han, J. (2003). Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 306–315).

Zhong, W., Altun, G., Harrison, R., Tai, P. C., & Pan, Y. (2005). Improved K-means clustering algorithm for exploring local protein sequence motifs representing common structural property. IEEE Transactions on NanoBioscience, 4(3), 255–265.

Zhong, W., He, J., Harrison, R., Tai, P. C., & Pan, Y. (2007). Clustering support vector machines for protein local structure prediction. Expert Systems with Applications, 32(2), 518–526.

