Research Article
Missing Values and Optimal Selection of an Imputation Method and Classification Algorithm to Improve the Accuracy of Ubiquitous Computing Applications
Jaemun Sim,1 Jonathan Sangyun Lee,2 and Ohbyung Kwon2
1 SKKU Business School, Sungkyunkwan University, Seoul 110-734, Republic of Korea
2 School of Management, Kyung Hee University, Seoul 130-701, Republic of Korea
Correspondence should be addressed to Ohbyung Kwon; obkwon@khu.ac.kr
Received 18 June 2014; Revised 29 September 2014; Accepted 11 October 2014
Academic Editor: Jong-Hyuk Park
Copyright © 2015 Jaemun Sim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users' concerns about privacy or a lack of obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of differences in the performance of classification algorithms based on various factors. The characteristics of missing values, datasets, and imputation methods are examined. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.
1. Introduction
Ubiquitous computing has been the central focus of research and development in many studies; it is considered to be the third wave in the evolution of computer technology [1]. In ubiquitous computing, data must be collected and analyzed accurately in real time. For this process to be successful, data must be well organized and uncorrupted. Data preprocessing is an essential but time- and effort-consuming step in the process of data mining. Several preprocessing methods have been developed to overcome data inconsistencies [2].
Data incompleteness due to missing values is very common in datasets collected in real settings [3]; it presents a challenge in the data preprocessing phase. Data is often missing when user input is required. For example, in human-centric computing, systems often require user profile data for the purpose of personalization [4]. In the case of Twitter, text data is used for sentiment analysis in order to analyze user behaviors and attitudes [5]. As a final example, in ubiquitous commerce, customer data has been used to personalize services for users [6]. Values may be missing when users are reluctant to provide their personal data due to privacy concerns or lack of motivation. This is especially true for optional data requested by the system.
Missing values can also be present in sensor data. Sensor data is usually in quantitative form; sensors provide physical information regarding temperature, sound, or trajectory. Sensor technology has advanced over the years; it is an essential source of data for ubiquitous computing and is used for situation awareness and circumstantial decision-making. For example, human interaction sensors read and react to current situations [7]. Analysis of image files for face recognition and object detection using sensors is widely used in ubiquitous computing [8]. However, incorrect data and missing values are possible even with advanced sensor technology, due to mechanical and network errors. Missing values can interfere with decision-making and personalization, which can ultimately lead to user dissatisfaction. In many cases, the impact of missing data is costly to users of data analysis methods such as classification algorithms.
Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2015, Article ID 538613, 14 pages. http://dx.doi.org/10.1155/2015/538613
Data incompleteness may have negative effects on data preprocessing and decision-making accuracy. Extra time and effort are required to compensate for missing data. Using uncertain or null data results in fatal errors in the classification algorithm, and deleting all records that contain missing data (i.e., using the listwise deletion method) reduces the sample size, which might decrease statistical power and introduce potential bias into the estimation [9]. Finally, unless the researcher can be sure that the data values are missing completely at random (MCAR), the conclusions resulting from a complete-case analysis are most likely to be biased.
In order to overcome issues related to data incompleteness, many researchers have suggested methods of supplementing or compensating for missing data. Missing data imputation is the most frequently used statistical method developed to deal with missing data problems. It is defined as "a procedure that replaces the missing values in a dataset by some plausible values" [3]. Missing values occur when no data is stored for a given variable in the current observation.
Many studies have attempted to validate the missing data imputation method of supplementing or compensating for missing data by testing it with different types of data; other studies have attempted to develop the method further. Studies have also compared the performance of various imputation methods based on benchmark data. For example, Kang investigated the ratio of missing to complete data in various datasets and compared the average accuracy of several imputation methods, such as MNR, k-NN, CART, ANN, and LLR [10]. The results demonstrated that k-NN performed best on datasets with less than 10% of data missing and LLR performed best on those with more than 10% of data missing.
However, after multiple tests using complete datasets, not much difference in performance was observed, and some datasets were linearly inferior. In Kang's study [10], many datasets with equivalent conditions yielded different results. Thus, the fit between the dataset characteristics and the imputation method must also be considered. Previous studies have compared imputation methods by varying the ratio of missing to complete data or by evaluating performance differences between complete and incomplete datasets. However, the reasons for these different results between datasets under equivalent conditions remain unexplained. Various factors may affect the performance of classification algorithms. For example, the interrelationship, or fitness, between the dataset, the imputation method, and the characteristics of the missing values may be important to the success or failure of the analytical process.
The purpose of this study is to examine the influence of dataset characteristics and patterns of missing data on the performance of classification algorithms using various datasets. The moderating effects of different imputation methods, classification algorithms, and data characteristics on performance are also analyzed. The results are important because they can suggest which imputation method or classification algorithm to use depending on the data conditions. The goal is to improve the performance, accuracy, and time required for ubiquitous computing.
2. Treating Datasets Containing Missing Data
Missing information is an unavoidable aspect of data analysis. For example, responses may be missing for items on survey instruments intended to measure cognitive and affective factors. Various imputation methods have been developed and used for the treatment of datasets containing missing data. Some popular methods are listed below.
(1) Listwise Deletion. Listwise deletion (LD) involves the removal of all individuals with incomplete responses for any items. However, LD reduces the effective sample size (sometimes greatly, when large amounts of data are missing), which can in turn reduce statistical power for hypothesis testing to unacceptably low levels. LD assumes that the data are MCAR (i.e., their omission is unrelated to all measured variables). When the MCAR assumption is violated, as is often the case in real research settings, the resulting estimates will be biased.
(2) Zero Imputation. When data are omitted as incorrect, the zero imputation method is used, in which missing responses are assigned an incorrect value (or zero, in the case of dichotomously scored items).
(3) Mean Imputation. In this method, the mean of all values within the same attribute is calculated and then imputed into the missing data cells. The method works only if the examined attribute is not nominal.
(4) Multiple Imputation. Multiple imputation can incorporate information from all variables in a dataset to derive imputed values for those that are missing. This method has been shown to be an effective tool in a variety of scenarios involving missing data [11], including incomplete item responses [12].
(5) Regression Imputation. A linear regression function is calculated with the attribute containing missing values as the dependent variable and the other attributes (except the decision attribute) as independent variables. The estimated value of the dependent variable is then imputed into the missing data cells. This method works only if all considered attributes are not nominal.
(6) Stochastic Regression Imputation. Stochastic regression imputation involves a two-step process in which the distribution of relative frequencies for each response category for each member of the sample is first obtained from the observed data.
In this paper, the details of the seven imputation methods used herein are as follows.
(i) Listwise Deletion. All instances that contain one or more missing cells in their attributes are deleted.
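As a minimal sketch of this rule in Python (the `None` marker for a missing cell and the function name are assumptions of this illustration; the study's actual implementation used Java and the Weka library):

```python
def listwise_delete(rows):
    """Listwise deletion: discard every instance (row) that has at
    least one missing cell. Missing cells are represented as None."""
    return [row for row in rows if all(v is not None for v in row)]

# Example: the second instance has a missing attribute and is dropped.
data = [[5.1, 3.5, 1.4], [4.9, None, 1.5], [6.3, 2.8, 5.1]]
complete_cases = listwise_delete(data)
```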
(ii) Mean Imputation. The missing values from each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \frac{\sum_{k \in I(\text{complete})} X_k^j}{n_{|I(\text{complete})|}}, \quad (1)$$

where $I(\text{complete})$ is a set of indices that are not missing in $X_i$ and $n_{|I(\text{complete})|}$ is the total number of instances for which the $j$th attribute is not missing.
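A minimal Python sketch of equation (1), applied to one attribute column (the `None` missing marker and the function name are assumptions of this illustration, not the paper's implementation):

```python
def mean_impute(column):
    """Replace missing cells (None) in one attribute column with the
    mean of the observed values of that attribute, as in (1)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```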
(iii) Group Mean Imputation. The process for this method is the same as that for mean imputation; however, the missing values are replaced with the group (or class) mean of all known values of that attribute, where each group represents a target class among the instances that have missing values. Let $X_{mi}^j$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X_{mi}^j = \frac{\sum_{k \in I(m\text{th class, complete})} X_{mk}^j}{n_{|I(m\text{th class, complete})|}}, \quad (2)$$

where $I(m\text{th class, complete})$ is a set of indices that are not missing in $X_{mi}^j$ and $n_{|I(m\text{th class, complete})|}$ is the total number of instances for which the $j$th attribute of the $m$th class is not missing.
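Group mean imputation can be sketched the same way, with the mean taken per target class as in equation (2) (again a hedged illustration with assumed names, not the paper's Java implementation):

```python
def group_mean_impute(column, labels):
    """Replace each missing cell (None) with the mean of the observed
    values that share the same class label, as in (2)."""
    groups = {}
    for value, label in zip(column, labels):
        if value is not None:
            groups.setdefault(label, []).append(value)
    means = {c: sum(vs) / len(vs) for c, vs in groups.items()}
    return [means[c] if v is None else v for v, c in zip(column, labels)]
```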
(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and single or multiple target variables of the given data is represented in the form of a linear equation. This method sets attributes that have missing values as dependent variables and other attributes as independent variables, allowing the missing values to be predicted by a regression model built from those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \quad (3)$$

This can be rewritten in matrix form such that $y = X\beta$, and the coefficient $\beta$ can be obtained explicitly by taking a derivative of the squared error function, as follows:

$$\min E(\beta) = \tfrac{1}{2}(y - X\beta)^T (y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^T X \beta - X^T y = 0,$$
$$\beta = (X^T X)^{-1} X^T y. \quad (4)$$
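For illustration, here is the single-predictor special case of equations (3)-(4), where the closed-form OLS coefficients are used to fill the missing cells of one attribute from one complete attribute (a deliberate simplification of the multivariate regression the method actually uses; all names are assumptions of this sketch):

```python
def ols_impute(y, x):
    """Impute missing cells (None) of attribute y from a complete
    attribute x via ordinary least squares: the single-predictor
    special case of (3)-(4), where beta has a closed form."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mx = sum(xi for xi, _ in pairs) / n
    my = sum(yi for _, yi in pairs) / n
    # Closed-form slope and intercept from the normal equations.
    beta1 = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
             / sum((xi - mx) ** 2 for xi, _ in pairs))
    beta0 = my - beta1 * mx
    return [beta0 + beta1 * xi if yi is None else yi
            for xi, yi in zip(x, y)]
```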
(v) Hot-Deck. This method is the same in principle as case-based reasoning: for an attribute that contains missing values, the most similar instance with nonmissing values is found, and its value is used to replace the missing value. Each missing value is therefore replaced with the value of that attribute in the most similar instance, as follows:

$$X_i^j = X_k^j, \quad k = \arg\min_{P} \sqrt{\sum_{j \in I(\text{complete})} \mathrm{Std}_j \left( X_i^j - X_P^j \right)^2}, \quad (5)$$

where $\mathrm{Std}_j$ is the standard deviation of the $j$th attribute over the values that are not missing.
(vi) $k$-NN. Similar instances are found via a search among the nonmissing attributes using the 3-NN method. Missing values are imputed based on the attribute values of the $k$ most similar instances, as follows:

$$X_i^j = \sum_{P \in k\text{-NN}(X_i)} k\left( X_i^{I(\text{complete})}, X_P^{I(\text{complete})} \right) \cdot X_P^j, \quad (6)$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_P)$ is a kernel function that is proportional to the similarity between the two instances $X_i$ and $X_P$ ($k = 4$).
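Equation (6) can be sketched with a simple normalized inverse-distance kernel 1/(1+d) standing in for the unspecified kernel function $k(\cdot,\cdot)$ (the kernel choice, the `None` marker, and the complete-row donor pool are assumptions of this illustration):

```python
def knn_impute(rows, k=3):
    """Impute each missing cell (None) as a similarity-weighted
    average over the k nearest complete rows, in the spirit of (6),
    with weights 1/(1+d) normalized to sum to one."""
    donors = [r for r in rows if all(v is not None for v in r)]
    out = []
    for r in rows:
        if all(v is not None for v in r):
            out.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]
        dist = lambda d: sum((r[j] - d[j]) ** 2 for j in observed) ** 0.5
        nearest = sorted(donors, key=dist)[:k]
        row = list(r)
        for j, v in enumerate(r):
            if v is None:
                weights = [1.0 / (1.0 + dist(d)) for d in nearest]
                row[j] = (sum(w * d[j] for w, d in zip(weights, nearest))
                          / sum(weights))
        out.append(row)
    return out
```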
(vii) $k$-Means Clustering. Clusters are formed from the nonmissing data, after which missing values are imputed. The entire dataset is partitioned into $k$ clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, as follows:

$$\arg\min_{C^{I(\text{complete})}} \sum_{h=1}^{k} \sum_{X_j^{I(\text{complete})} \in C_h^{I(\text{complete})}} \left\| X_j^{I(\text{complete})} - C_h^{I(\text{complete})} \right\|^2, \quad (7)$$

where $C_h^{I(\text{complete})}$ is the centroid of the $h$th cluster and $C^{I(\text{complete})} = C_1^{I(\text{complete})} \cup \cdots \cup C_k^{I(\text{complete})}$ is the union of all clusters. For a missing value $X_i^j$, the mean value of that attribute over the instances in the same cluster as $X_i^{I(\text{complete})}$ is then imputed, as follows:

$$X_i^j = \frac{1}{\left| C_h^{I(\text{complete})} \right|} \sum_{X_P^{I(\text{complete})} \in C_h^{I(\text{complete})}} X_P^j, \quad \text{s.t. } h = \arg\min_{h'} \left\| X_i^{I(\text{complete})} - C_{h'}^{I(\text{complete})} \right\|. \quad (8)$$
3. Model
In this paper, we hypothesize an association between the performance of classification algorithms and the characteristics of missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.
3.1. Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data may be univariate, monotone, or arbitrary [11]. A univariate pattern of missing data occurs when missing values are observed for a single variable only; all other data are complete for all variables.
Table 1: The characteristics of missing data.

| Variables | Meaning | Calculation |
|---|---|---|
| Missing data ratio | The number of missing values in the entire dataset as compared to the number of nonmissing values | The number of empty data cells / total cells |
| Patterns of missing data (univariate, monotone, arbitrary) | How the missing values are arranged across features | Ratio of missing to complete values for an existing feature compared to the values for all features |
| Horizontal scatteredness | Distribution of missing values within each data record | Determine the number of missing cells in each record and calculate the standard deviation |
| Vertical scatteredness | Distribution of missing values for each attribute | Determine the number of missing cells in each feature and calculate the standard deviation |
| Missing data spread | Larger standard deviations indicate stronger effects of missing data | Determine the weighted average of the standard deviations of features with missing data (weight: the ratio of missing to complete data for each feature) |
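The ratio and scatteredness rows of Table 1 can be computed directly; a sketch follows (the `None` missing marker and function name are assumptions, and the "missing data spread" statistic is left out because Table 1 describes its weighting only informally):

```python
import statistics

def missing_profile(rows):
    """Compute missing data ratio, horizontal scatteredness (std of
    per-record missing counts), and vertical scatteredness (std of
    per-feature missing counts) for a dataset with None as missing."""
    n_rows, n_cols = len(rows), len(rows[0])
    per_row = [sum(v is None for v in r) for r in rows]
    per_col = [sum(r[j] is None for r in rows) for j in range(n_cols)]
    return {
        "missing_ratio": sum(per_col) / (n_rows * n_cols),
        "horizontal_scatteredness": statistics.pstdev(per_row),
        "vertical_scatteredness": statistics.pstdev(per_col),
    }
```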
Figure 1: Research model. Missing data characteristics and dataset features are modeled as influencing classification performance, with the imputation method as a moderator.
A monotone pattern occurs if the variables can be arranged such that all $Y_{j+1}, \ldots, Y_k$ are missing for cases where $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data has a greater influence on the results of the analysis (Figure 2).
3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].
3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.
Table 2: Dataset features.

| Variables | Description |
|---|---|
| Number of cases | Number of records in the dataset |
| Number of attributes | Number of features characteristic of the dataset |
| Degree of class imbalance | Ratio |
3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas. For example, the decision tree is known to be the best algorithm for arrhythmia classification [16]. Table 4 describes six types of representative classification algorithms for supervised learning: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, $k$-nearest neighbor classifier, and regression.
4. Method
We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorders, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.
Table 5 provides the names of the datasets, the numbers of cases, and descriptions of the features and classes. The numbers in parentheses in the last two columns represent the number of features and classes for the decision attributes. For example, for the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes and "Categorical (3)" means that there are three classes in the decision attribute.
Since UCI datasets have no missing data, target values in each dataset were randomly omitted [10]. Based on
Figure 2: Missing data patterns. Univariate pattern: all missing values are in the last feature. Monotone pattern: all missing values are in the last three features (last, last−1, and last−2). Arbitrary pattern: missing values are in random features and records.
Table 3: Imputation methods.

| Imputation methods | Description |
|---|---|
| Listwise deletion | Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available |
| Mean imputation | Involves replacing missing data with the overall mean for the observed data |
| Group mean imputation | A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data |
| Predictive mean imputation | Also called regression imputation. Predictive mean imputation involves imputing a missing value using an ordinary least-squares regression method to estimate missing data |
| Hot-deck | The values of the most similar records are imputed to missing values |
| $k$-NN | The attribute value of the $k$ most similar instances from the nonmissing data is imputed |
| $k$-means clustering | $k$ sets are created that are homogeneous on the inside and heterogeneous on the outside |
Table 4: Classification algorithms.

| Algorithms | Description |
|---|---|
| C4.5 | Estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method, until it comes to the end node |
| SVM | Classifies the unknown class by finding the optimal hyperplane with the maximum margin that reduces the estimation error |
| Bayesian network | A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset |
| Logistic classifier | Takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13] |
| $k$-nearest neighbor classifier | Simple instance-based learner that uses the class of the nearest $k$ training instances for the class of the test instances |
| Regression | The class is binarized, and one regression model is built for each class value [14] |
Table 5: Datasets used in the experiments.

| Dataset | Number of cases | Features | Decision attributes |
|---|---|---|---|
| Iris | 150 | Numeric (4) | Categorical (3) |
| Wine | 178 | Numeric (13) | Categorical (3) |
| Glass | 214 | Numeric (9) | Categorical (7) |
| Liver Disorders | 345 | Numeric (6) | Categorical (2) |
| Ionosphere | 351 | Numeric (34) | Categorical (2) |
| Statlog Shuttle | 57999 | Numeric (7) | Categorical (7) |
the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 1000 times in order to minimize errors and bias; thus, 5400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
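The omission step can be sketched as follows. The exact procedure used in the experiments is not published, so the placement choices (last feature for the univariate pattern, last three features for the monotone pattern, per Figure 2) and the uniform cell-sampling scheme are assumptions of this illustration:

```python
import random

def inject_missing(rows, ratio, pattern, seed=0):
    """Randomly omit cells from a complete dataset until the target
    missing ratio is reached, under one of the three patterns of
    Figure 2. Assumes at least 3 columns for the monotone pattern."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    n_rows, n_cols = len(out), len(out[0])
    budget = int(ratio * n_rows * n_cols)
    if pattern == "univariate":      # all misses in the last feature
        cells = [(i, n_cols - 1) for i in range(n_rows)]
    elif pattern == "monotone":      # misses in the last three features
        cells = [(i, j) for i in range(n_rows)
                 for j in range(n_cols - 3, n_cols)]
    else:                            # arbitrary: any feature, any record
        cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    for i, j in rng.sample(cells, min(budget, len(cells))):
        out[i][j] = None
    return out
```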
There are various indicators for measuring performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
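RMSE itself is straightforward to compute (a minimal sketch; the function name is an assumption):

```python
def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    n = len(predicted)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5
```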
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equivalent to 1 when there are no missing data [10]. The no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original no-missing-data dataset and then applied an imputation method to replace the null data. Then, a classification algorithm was run to estimate the results of the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted, using the following equation, to understand how the input factors, namely, the characteristics of the missing data and those of the datasets, affected the performance of the selected classification algorithms:
$$y_p = \sum_{\forall j \in M} \beta_{pj} x_j + \sum_{\forall k \in D} \chi_{pk} z_k + \varepsilon_p. \quad (9)$$
In this equation, $x_j$ is the value of a characteristic of the missing data ($M$), $z_k$ is the value of a dataset characteristic in the set of dataset characteristics ($D$), and $y_p$ is a performance parameter. Note that $M$ = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and $D$ = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to conduct the automated experiment repeatedly.
5. Results
In total, 32,400 datasets (3 missing ratios × 3 missing patterns × 6 imputation methods × 6 datasets × 100 trials) were imputed. Each imputed dataset was then tested with each classifier; thus, in total, we tested 226,800 cases (32,400 imputed datasets × 7 classifier methods). The results were broken down by dataset, classification algorithm, and imputation method for comparison in terms of performance.
5.1. Datasets. Figure 3 shows the performance of each imputation method for the six different datasets. On the $x$-axis, the three missing ratios represent the characteristics of the missing data, and on the $y$-axis, performance is indicated using the RMSE. The results for the three variations of the missing data patterns and all tested classification algorithms were merged for each imputation method.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best results.

For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For the Liver Disorders data (Figure 3(c)), $k$-NN was the least effective, and once again the predictive mean imputation method yielded the best results.

For the Ionosphere data (Figure 3(d)), hot-deck was the worst and $k$-NN the best.

For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied based on the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.
Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases with any given dataset. For example, the $k$-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was the lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. Therefore, the results were inconsistent, and determining the best imputation method is impossible. Thus, the imputation method cannot be used as an accurate predictor of performance. Rather, the performance must be influenced by other factors, such as the interaction between the characteristics of the dataset in terms of missing data and the chosen imputation method.
Figure 3: Comparison of performances of imputation methods for each dataset: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, (f) Statlog Shuttle. ($x$-axis: missing data ratio, 0.01/0.05/0.10; $y$-axis: RMSE; series: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, $k$-NN, $k$-means clustering.)
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.076** | −.075** | −.178** | −.072** | .115** | .007 |
| N cases | −.079** | −.049** | .012 | −.017 | −.032 | −.048** |
| C imbalance | .117** | .239** | .264** | .525** | .163** | .198** |
| R missing | .051* | .078** | .040 | .080** | .076** | .068** |
| SE HS | .249** | .285** | .186** | .277** | .335** | .245** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.016 | −.010 |
| Spread | −.382** | −.430** | −.261** | −.436** | −.452** | −.363** |
| P missing dum1 | −.049 | −.038 | −.038 | −.037 | −.045 | −.038 |
| P missing dum2 | −.002 | .014 | .002 | .011 | .001 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.068** | −.072** | −.179** | −.068** | .115** | .010 |
| N cases | −.082** | −.050** | .011 | −.018 | −.034* | −.047** |
| C imbalance | .115** | .228** | .260** | .517** | .156** | .197** |
| R missing | .050** | .085** | .043 | .084** | .095** | .066** |
| SE HS | .230** | .268** | .178** | .273** | .300** | .248** |
| SE VS | −.008 | −.012 | −.006 | −.013 | −.013 | −.010 |
| Spread | −.296** | −.439** | −.264** | −.443** | −.476** | −.382** |
| P missing dum1 | −.043 | −.032 | −.034 | −.035 | −.035 | −.041 |
| P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary depending on the ratio of missing data, except for listwise deletion. For listwise deletion, as the ratio of missing to complete data increased, the performance deteriorated. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records increases as the ratio of missing data increases. The low performance of this method can be explained by this fact.
The differences in performance between imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
In Figure 4, the results suggest the following rules:

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
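The rules above can be encoded directly as a lookup; the sketch below uses our own naming and is not code from the study:

```python
# Imputation method recommended when the missing rate increases,
# keyed by the classification algorithm in use (rules (i)-(v)).
RECOMMENDED_IMPUTATION = {
    "IBk": "GROUP MEAN IMPUTATION",
    "Logistic": "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet": "GROUP MEAN IMPUTATION",
    "trees.J48": "k-NN",
}

def imputation_for(classifier):
    # Classifiers not covered by the rules (e.g., SMO) return None.
    return RECOMMENDED_IMPUTATION.get(classifier)

print(imputation_for("Logistic"))  # HOT DECK
```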
5.3. Regression. The results of the regression analysis are presented in Tables 6–11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing
Mathematical Problems in Engineering 9
Figure 4: Comparison of classifiers in terms of classification performance. Each panel plots RMSE (y-axis) against the missing data ratio (0.01, 0.05, and 0.10 on the x-axis), with one line per imputation method (hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, and k-means clustering). Panels: (a) decision tree (J48); (b) BayesNet; (c) SMO (SVM); (d) regression; (e) logistic; (f) IBk (k-nearest neighbor classifier).
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): Predictive Mean Imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −0.76** | −0.76** | −1.78** | −0.63** | 1.23** | 0.16
N cases | −0.84** | −0.49** | 0.12 | −0.17 | −0.34* | −0.47**
C imbalance | 1.17** | 2.42** | 2.63** | 5.23** | 1.53** | 1.98**
R missing | 0.50* | 0.79** | 0.43 | 0.85** | 0.80** | 0.68**
SE HS | 2.23** | 2.79** | 1.82** | 2.68** | 3.22** | 2.42**
SE VS | −0.08 | −0.13 | −0.06 | −0.13 | −0.15 | −0.09
Spread | −3.28** | −4.32** | −2.62** | −4.34** | −4.65** | −3.61**
P missing dum1 | −0.42 | −0.35 | −0.34 | −0.28 | −0.44 | −0.36
P missing dum2 | 0.08 | 0.12 | 0.04 | 0.18 | 0.07 | 0.11
Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): Hot deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −0.80** | −0.73** | −1.76** | −0.71** | 1.15** | 0.07
N cases | −0.81** | −0.49** | 0.12 | −0.18 | −0.34* | −0.47**
C imbalance | 1.35** | 2.37** | 2.61** | 5.24** | 1.33** | 2.11**
R missing | 0.62** | 0.83** | 0.44 | 0.84** | 0.75** | 0.70**
SE HS | 2.25** | 2.75** | 1.83** | 2.71** | 3.13** | 2.54**
SE VS | −0.09 | −0.13 | −0.06 | −0.13 | −0.14 | −0.10
Spread | −3.65** | −4.28** | −2.65** | −4.27** | −4.41** | −3.61**
P missing dum1 | −0.35 | −0.37 | −0.34 | −0.33 | −0.48 | −0.38
P missing dum2 | 0.12 | 0.15 | 0.04 | 0.12 | −0.04 | 0.09
Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −0.85** | −0.79** | −1.81** | −0.68** | 1.22** | 0.06
N cases | −0.83** | −0.49** | 0.11 | −0.18 | −0.34* | −0.47**
C imbalance | 1.43** | 2.49** | 2.60** | 5.21** | 1.52** | 2.11**
R missing | 0.54* | 0.78** | 0.41 | 0.85** | 0.75** | 0.71**
SE HS | 2.34** | 2.90** | 1.82** | 2.69** | 3.28** | 2.55**
SE VS | −0.10 | −0.13 | −0.06 | −0.13 | −0.14 | −0.11
Spread | −3.32** | −4.27** | −2.64** | −4.31** | −4.50** | −3.69**
P missing dum1 | −0.38 | −0.41 | −0.35 | −0.29 | −0.57 | −0.35
P missing dum2 | 0.03 | 0.08 | 0.05 | 0.17 | 0.00 | 0.11
Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-MEANS CLUSTERING.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N attributes | −0.80** | −0.78** | −1.81** | −0.68** | 1.17** | 0.09
N cases | −0.79** | −0.49** | 0.12 | −0.17 | −0.33 | −0.47**
C imbalance | 1.36** | 2.40** | 2.63** | 5.24** | 1.45** | 2.06**
R missing | 0.57* | 0.79** | 0.41 | 0.84** | 0.79** | 0.57*
SE HS | 2.36** | 2.89** | 1.83** | 2.71** | 3.15** | 2.64**
SE VS | −0.09 | −0.13 | −0.06 | −0.13 | −0.14 | −0.11
Spread | −3.62** | −4.39** | −2.62** | −4.40** | −4.74** | −3.63**
P missing dum1 | −0.37 | −0.42 | −0.36 | −0.32 | −0.38 | −0.46
P missing dum2 | 0.02 | 0.13 | 0.01 | 0.14 | 0.09 | 0.04
Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables; control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing patterns were treated as two dummy variables (P missing dum1/dum2 = 00, 01, or 10). Tables 6–11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected.
(i) IF N attributes increases, THEN use SMO.

(ii) IF N cases increases, THEN use trees.J48.

(iii) IF C imbalance increases, THEN use trees.J48.

(iv) IF R missing increases, THEN use SMO.

(v) IF SE HS increases, THEN use SMO.

(vi) IF Spread increases, THEN use Logistic.
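As with the rules in Section 5.2, these can be expressed as a lookup keyed by the dominant dataset characteristic (a sketch with our own naming, not code from the study):

```python
# Classifier recommended when a given dataset characteristic increases,
# regardless of the imputation method (rules (i)-(vi)).
RECOMMENDED_CLASSIFIER = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def classifier_for(characteristic):
    return RECOMMENDED_CLASSIFIER[characteristic]

print(classifier_for("C_imbalance"))  # trees.J48
```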
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns seemed similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier method had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, with the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness standard error (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant; for example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE based on the ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification in terms of RMSE, on the other. In total, 226,800 datasets (3
Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); the y-axis shows regression coefficients from about −0.4 to 0.3, with one line per imputation method (mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, k-NN).
Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). Same layout as Figure 5 (without the missing-pattern dummies), with regression coefficients ranging from about −0.6 to 0.8.
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.
Figure 7: RMSE by ratio of missing data. The x-axis shows the missing data ratio (0.05 to 0.50) and the y-axis RMSE (about 0.30 to 0.49), with one line per imputation method (hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering).
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic | B | Data characteristic | B
(constant) | 0.60** | M imputation dum1 | 0.12**
R missing | 0.83** | M imputation dum2 | −0.01*
SE HS | −0.05** | M imputation dum4 | 0.00
SE VS | 0.00** | M imputation dum5 | 0.00
Spread | 0.17** | M imputation dum6 | 0.01**
N attributes | −0.08** | M imputation dum7 | −0.01*
C imbalance | −0.03** | P missing dum1 | −0.06**
N cases | 0.02** | P missing dum3 | 0.00
Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standard beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
6. Conclusion

So far, prior research does not fully inform us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides classification/recommender system developers in selecting the best classification algorithm based
on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, imputation data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affected classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm must be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of classification algorithms. In particular, SMO was second to none in predicting SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.

Ubiquitous environments include a variety of forms of sensor data from limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values. This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate, as well as the TP rate, is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
Data incompleteness may have negative effects on data preprocessing and decision-making accuracy. Extra time and effort are required to compensate for missing data. Using uncertain or null data results in fatal errors in the classification algorithm, and deleting all records that contain missing data (i.e., using the listwise deletion method) reduces the sample size, which might decrease statistical power and introduce potential bias to the estimation [9]. Finally, unless the researcher can be sure that the data values are missing completely at random (MCAR), the conclusions resulting from a complete-case analysis are most likely to be biased.
In order to overcome issues related to data incompleteness, many researchers have suggested methods of supplementing or compensating for missing data. The missing data imputation method is the most frequently used statistical method developed to deal with missing data problems. It is defined as "a procedure that replaces the missing values in a dataset by some plausible values" [3]. Missing values occur when no data are stored for a given variable in the current observation.
Many studies have attempted to validate the missing data imputation method of supplementing or compensating for missing data by testing it with different types of data; other studies have attempted to develop the method further. Studies have also compared the performance of various imputation methods based on benchmark data. For example, Kang investigated the ratio of missing to complete data in various datasets and compared the average accuracy of several imputation methods, such as MNR, k-NN, CART, ANN, and LLR [10]. The results demonstrated that k-NN performed best on datasets with less than 10% of data missing and LLR performed best on those with more than 10% of data missing.
However, after multiple tests using complete datasets, not much difference in performance was observed, and some datasets were linearly inferior. In Kang's study [10], many datasets with equivalent conditions yielded different results. Thus, the fit between the dataset characteristics and the imputation method must also be considered. Previous studies have compared imputation methods by varying the ratio of missing to complete data or by evaluating performance differences between complete and incomplete datasets. However, the reasons for these different results between datasets under equivalent conditions remain unexplained. Various factors may affect the performance of classification algorithms. For example, the interrelationship, or fitness, between the dataset, imputation method, and characteristics of the missing values may be important to the success or failure of the analytical process.
The purpose of this study is to examine the influence of dataset characteristics and patterns of missing data on the performance of classification algorithms, using various datasets. The moderating effects of different imputation methods, classification algorithms, and data characteristics on performance are also analyzed. The results are important because they can suggest which imputation method or classification algorithm to use depending on the data conditions.
The goal is to improve the accuracy of, and reduce the time required for, ubiquitous computing.
2. Treating Datasets Containing Missing Data
Missing information is an unavoidable aspect of data analysis. For example, responses may be missing from items on survey instruments intended to measure cognitive and affective factors. Various imputation methods have been developed and used for the treatment of datasets containing missing data. Some popular methods are listed below.
(1) Listwise Deletion. Listwise deletion (LD) involves the removal of all individuals with incomplete responses for any items. However, LD reduces the effective sample size (sometimes greatly, when there are large amounts of missing data), which can in turn reduce the statistical power of hypothesis testing to unacceptably low levels. LD assumes that the data are MCAR (i.e., their omission is unrelated to all measured variables). When the MCAR assumption is violated, as is often the case in real research settings, the resulting estimates will be biased.
(2) Zero Imputation. When data are omitted as incorrect, the zero imputation method is used, in which missing responses are assigned an incorrect value (or zero, in the case of dichotomously scored items).
(3) Mean Imputation. In this method, the mean of all values within the same attribute is calculated and then imputed into the missing data cells. The method works only if the attribute examined is not nominal.
(4) Multiple Imputation. Multiple imputation can incorporate information from all variables in a dataset to derive imputed values for those that are missing. This method has been shown to be an effective tool in a variety of scenarios involving missing data [11], including incomplete item responses [12].
(5) Regression Imputation. A linear regression function is calculated using the values within the attribute containing missing data as the dependent variable and the other attributes (except the decision attribute) as independent variables. The estimated value of the dependent variable is then imputed into the missing data cells. This method works only if none of the considered attributes are nominal.
(6) Stochastic Regression Imputation. Stochastic regression imputation involves a two-step process in which the distribution of relative frequencies for each response category, for each member of the sample, is first obtained from the observed data.
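As a concrete illustration, listwise deletion and mean imputation, plus the group mean variant detailed later in this section, can be sketched with pandas on a toy table (the column names and values are ours, purely for illustration):

```python
import pandas as pd

# Toy records; None marks a missing cell (values are illustrative only).
df = pd.DataFrame({
    "age":   [23.0, None, 31.0, None, 27.0, 35.0],
    "score": [10.0, 12.0, None, 14.0, 11.0, 15.0],
    "label": ["a", "a", "b", "b", "a", "b"],
})

# Listwise deletion: drop every record with any missing cell.
listwise = df.dropna()

# Mean imputation: replace each missing cell with its column mean.
mean_imputed = df.fillna(df[["age", "score"]].mean())

# Group mean imputation: the mean is computed within each class label.
group_imputed = df.copy()
for col in ("age", "score"):
    group_imputed[col] = df.groupby("label")[col].transform(
        lambda s: s.fillna(s.mean())
    )
```

Note how listwise deletion keeps only three of the six records, while the imputation variants preserve the full sample size.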
In this paper, the details of the seven imputation methods used herein are as follows.

(i) Listwise Deletion. All instances are deleted that contain one or more missing cells in their attributes.
(ii) Mean Imputation. The missing values of each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \frac{\sum_{k \in I(\text{complete})} X_k^j}{n_{|I(\text{complete})|}}, \tag{1}$$

where $I(\text{complete})$ is the set of indices that are not missing in $X_i$ and $n_{|I(\text{complete})|}$ is the total number of instances in which the $j$th attribute is not missing.
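Equation (1) amounts to a column-wise mean over the nonmissing entries; a NumPy sketch (toy values, ours):

```python
import numpy as np

# Instances x attributes, with np.nan marking missing cells.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Mean of each attribute over its nonmissing values, as in (1).
col_means = np.nanmean(X, axis=0)

# Impute each missing cell with its attribute's mean.
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```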
(iii) Group Mean Imputation. The process is the same as that for mean imputation, except that the missing values are replaced with the group (or class) mean of all known values of that attribute, where each group corresponds to a target class of the instances. Let $X^j_{mi}$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X^j_{mi} = \frac{\sum_{k \in I(m\text{th class, complete})} X^j_{mk}}{n_{|I(m\text{th class, complete})|}}, \quad (2)$$

where $I(m\text{th class, complete})$ is the set of indices of the $m$th class for which the $j$th attribute is not missing and $n_{|I(m\text{th class, complete})|}$ is the total number of instances of the $m$th class in which the $j$th attribute is not missing.
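Eq. (2) can be sketched with a pandas group-by over the class column (column names are hypothetical):

```python
import numpy as np
import pandas as pd

def group_mean_impute(df: pd.DataFrame, class_col: str) -> pd.DataFrame:
    """Replace each missing cell with the mean of the observed values of the
    same attribute within the same target class, as in Eq. (2)."""
    out = df.copy()
    feature_cols = [c for c in df.columns if c != class_col]
    out[feature_cols] = df.groupby(class_col)[feature_cols].transform(
        lambda s: s.fillna(s.mean()))
    return out

# Toy dataset: the missing x of class 'a' gets the class-'a' mean (1.0),
# and the missing x of class 'b' gets the class-'b' mean (5.0).
df = pd.DataFrame({
    "x":     [1.0, np.nan, 5.0, np.nan],
    "label": ["a", "a", "b", "b"],
})
imputed = group_mean_impute(df, "label")
```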
(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and one or more target variables of the given data is represented as a linear equation. Attributes that have missing values are set as dependent variables and the other attributes as independent variables, so that missing values can be predicted from a regression model built on those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \quad (3)$$

This can be rewritten in matrix form as $y = X\beta$, and the coefficient vector $\beta$ can be obtained explicitly by setting the derivative of the squared error function to zero:

$$\min E(\beta) = \tfrac{1}{2}\,(y - X\beta)^{T}(y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^{T}X\beta - X^{T}y = 0,$$
$$\beta = (X^{T}X)^{-1} X^{T}y. \quad (4)$$
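The closed-form solution of Eq. (4) can be checked numerically; the sketch below (toy data, not from the study) recovers the coefficients of $y = 1 + 2x$:

```python
import numpy as np

def ols_coefficients(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve beta = (X^T X)^(-1) X^T y as in Eq. (4); a column of ones
    is prepended so that beta_0 acts as the intercept."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # np.linalg.solve is used instead of forming the explicit inverse.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Toy data generated from y = 1 + 2x, so the recovered coefficients
# should be [1, 2]; the fitted model then predicts the missing cells.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = ols_coefficients(x, y)
```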
(v) Hot-Deck. This method follows the same principle as case-based reasoning. For attributes that contain missing values, the most similar instance among those with nonmissing values is found, and its values are used to replace the missing values. Each missing value is therefore replaced with the value of the corresponding attribute of the most similar instance:

$$X^j_i = X^j_k, \quad k = \arg\min_{P} \sqrt{\sum_{j \in I(\text{complete})} \text{Std}_j \, (X^j_i - X^j_P)^2}, \quad (5)$$

where $\text{Std}_j$ is the standard deviation of the $j$th attribute, which is not missing.
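A sketch of Eq. (5), with distances over the observed attributes weighted by the per-attribute standard deviation as the equation specifies (toy data; this is an illustration, not the authors' Java implementation):

```python
import numpy as np

def hot_deck_impute(X: np.ndarray) -> np.ndarray:
    """Replace each incomplete instance's missing cells with the values of the
    most similar complete instance (the 'donor'); distances over the observed
    attributes are weighted by the attribute standard deviation, as in Eq. (5)."""
    X = X.astype(float).copy()
    weights = np.nanstd(X, axis=0)                     # Std_j over observed values
    donors = X[~np.isnan(X).any(axis=1)]               # fully observed instances
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        observed = ~np.isnan(X[i])
        d = np.sqrt((weights[observed]
                     * (donors[:, observed] - X[i, observed]) ** 2).sum(axis=1))
        X[i, ~observed] = donors[np.argmin(d)][~observed]
    return X

# The last row (x1 = 1.08) is closest to the donor [1.1, 11.0].
X = np.array([[1.0, 10.0],
              [1.1, 11.0],
              [9.0, 90.0],
              [1.08, np.nan]])
X_imp = hot_deck_impute(X)
```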
(vi) $k$-NN. The most similar instances are found via a search among the nonmissing attributes using the 3-NN method. Missing values are imputed based on the values of the attributes of the $k$ most similar instances:

$$X^j_i = \sum_{P \in k\text{-NN}(X_i)} k\bigl(X^{I(\text{complete})}_i, X^{I(\text{complete})}_P\bigr) \cdot X^j_P, \quad (6)$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_j)$ is a kernel function proportional to the similarity between the two instances $X_i$ and $X_j$ ($k = 4$).
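A sketch of Eq. (6). The paper only requires the kernel to be proportional to similarity, so a Gaussian kernel is assumed here, and the weights are normalized to sum to one (an assumption; Eq. (6) leaves the normalization implicit):

```python
import numpy as np

def knn_impute_cell(X: np.ndarray, i: int, j: int, k: int = 3) -> float:
    """Impute X[i, j] as a kernel-weighted average of the j-th attribute of
    the k nearest complete instances, in the spirit of Eq. (6)."""
    observed = ~np.isnan(X[i])
    donors = X[~np.isnan(X).any(axis=1)]               # fully observed instances
    d = np.sqrt(((donors[:, observed] - X[i, observed]) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    w = np.exp(-d[nearest] ** 2)                       # kernel ~ similarity (assumed Gaussian)
    return float((w * donors[nearest, j]).sum() / w.sum())

# The three nearest donors share x1 = 1.0, so their x2 values are averaged.
X = np.array([[1.0, 10.0],
              [1.0, 12.0],
              [1.0, 14.0],
              [50.0, 99.0],
              [1.0, np.nan]])
value = knn_impute_cell(X, i=4, j=1, k=3)
```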
(vii) $k$-Means Clustering. The entire dataset is partitioned into $k$ clusters formed from the nonmissing data, after which missing values are imputed. The partition maximizes the homogeneity within each cluster and the heterogeneity between clusters:

$$\arg\min_{C^{I(\text{complete})}} \sum_{i=1}^{k} \; \sum_{X^{I(\text{complete})}_j \in C^{I(\text{complete})}_i} \bigl\| X^{I(\text{complete})}_j - C^{I(\text{complete})}_i \bigr\|^2, \quad (7)$$

where $C^{I(\text{complete})}_i$ is the centroid of cluster $C^{I(\text{complete})}_i$ and $C^{I(\text{complete})}$ is the union of all clusters ($C^{I(\text{complete})} = C^{I(\text{complete})}_1 \cup \cdots \cup C^{I(\text{complete})}_k$). For a missing value $X^j_i$, the mean value of the attribute over the instances in the same cluster as $X^{I(\text{complete})}_i$ is imputed:

$$X^j_i = \frac{1}{\bigl| C^{I(\text{complete})}_k \bigr|} \sum_{X^{I(\text{complete})}_P \in C^{I(\text{complete})}_k} X^j_P, \quad \text{s.t. } k = \arg\min_i \bigl\| X^{I(\text{complete})}_j - C^{I(\text{complete})}_i \bigr\|. \quad (8)$$
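A sketch of Eqs. (7) and (8). As a simplifying assumption, clustering is performed only on attributes that are fully observed, and each missing cell then receives the mean of its cluster:

```python
import numpy as np

def kmeans_impute(X: np.ndarray, k: int = 2, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Cluster instances on the fully observed attributes (Eq. (7)), then
    replace each missing cell with the attribute mean of its cluster (Eq. (8))."""
    rng = np.random.default_rng(seed)
    complete_cols = ~np.isnan(X).any(axis=0)           # attributes with no missing values
    Z = X[:, complete_cols]
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):                             # standard Lloyd iterations
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([Z[labels == c].mean(axis=0) for c in range(k)])
    X = X.astype(float).copy()
    for i, j in zip(*np.where(np.isnan(X))):
        X[i, j] = np.nanmean(X[labels == labels[i], j])
    return X

# The last row clusters with [9.0, 90.0], so its missing x2 becomes 90.0.
X = np.array([[1.0, 10.0],
              [1.2, 12.0],
              [9.0, 90.0],
              [9.2, np.nan]])
X_imp = kmeans_impute(X, k=2)
```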
3 Model
In this paper we hypothesize an association between the performance of classification algorithms and the characteristics of the missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.
3.1. Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data may be univariate, monotone, or arbitrary [11]. A univariate pattern occurs when missing values are observed for a single variable only; all other data are complete for all variables.
Table 1: The characteristics of missing data.

| Variable | Meaning | Calculation |
|---|---|---|
| Missing data ratio | The number of missing values in the entire dataset as compared to the number of nonmissing values | The number of empty data cells / total cells |
| Patterns of missing data (univariate, monotone, arbitrary) | How the missing values are arranged across features | Ratio of missing to complete values for an existing feature compared to the values for all features |
| Horizontal scatteredness | Distribution of missing values within each data record | Count the missing cells in each record and calculate the standard deviation |
| Vertical scatteredness | Distribution of missing values for each attribute | Count the missing cells in each feature and calculate the standard deviation |
| Missing data spread | Larger standard deviations indicate stronger effects of missing data | Weighted average of the standard deviations of features with missing data (weights: the ratio of missing to complete data for each feature) |
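The Calculation column of Table 1 can be expressed directly in code; the sketch below follows that column, with the exact weighting of the spread measure interpreted from the table (an assumption, since the paper gives no formula):

```python
import numpy as np

def missing_characteristics(X: np.ndarray) -> dict:
    """Compute the missing-data characteristics listed in Table 1."""
    mask = np.isnan(X)
    per_record = mask.sum(axis=1)                  # missing cells per record
    per_feature = mask.sum(axis=0)                 # missing cells per feature
    ratio_per_feature = per_feature / X.shape[0]
    stds = np.nanstd(X, axis=0)                    # std dev of observed values
    has_missing = per_feature > 0
    spread = (float(np.average(stds[has_missing],
                               weights=ratio_per_feature[has_missing]))
              if has_missing.any() else 0.0)
    return {
        "missing_ratio": mask.sum() / X.size,      # empty cells / total cells
        "horizontal_scatteredness": float(per_record.std()),
        "vertical_scatteredness": float(per_feature.std()),
        "missing_data_spread": spread,
    }

# Toy matrix with 2 of 9 cells missing.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])
chars = missing_characteristics(X)
```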
Figure 1: Research model. Missing data characteristics and dataset features influence classification performance, with the imputation method as a moderator.
A monotone pattern occurs if the variables can be arranged such that $Y_{j+1}, \ldots, Y_k$ are all missing for cases in which $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data have greater influence on the results of the analysis (Figure 2).
3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].
3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that cannot accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.
Table 2: Dataset features.

| Variable | Description |
|---|---|
| Number of cases | Number of records in the dataset |
| Number of attributes | Number of features characteristic of the dataset |
| Degree of class imbalance | Ratio |
3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas; for example, the decision tree is known as the best algorithm for arrhythmia classification [16]. Table 4 describes the six types of representative classification algorithms for supervised learning used here: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, $k$-nearest neighbor classifier, and regression.
4 Method
We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method relative to cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorder, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.
Table 5 provides the names of the datasets, the numbers of cases, and descriptions of the features and classes. The numbers in parentheses in the last two columns represent the number of features and the number of classes of the decision attribute. For example, for the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that there are three classes in the decision attribute.
Since the UCI datasets have no missing data, target values in each dataset were randomly omitted [10]. Based on
Figure 2: Missing data patterns. In the univariate pattern, all missing values are in the last feature; in the monotone pattern, missing values are in the last feature, the second-to-last, and the third-to-last; in the arbitrary pattern, missing values appear in random features and records.
Table 3: Imputation methods.

| Imputation method | Description |
|---|---|
| Listwise deletion | Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available |
| Mean imputation | Replaces missing data with the overall mean of the observed data |
| Group mean imputation | A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data |
| Predictive mean imputation | Also called regression imputation; imputes a missing value using an ordinary least-squares regression estimate |
| Hot-deck | The values of the most similar record are imputed to the missing values |
| $k$-NN | The attribute values of the $k$ most similar instances from the nonmissing data are imputed |
| $k$-means clustering | $k$ sets are created that are homogeneous on the inside and heterogeneous on the outside |
Table 4: Classification algorithms.

| Algorithm | Description |
|---|---|
| C4.5 | Estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method, until it reaches the end node |
| SVM | Classifies the unknown class by finding the optimal hyperplane with the maximum margin, which reduces the estimation error |
| Bayesian network | A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset |
| Logistic classifier | Takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13] |
| $k$-nearest neighbor classifier | A simple instance-based learner that uses the class of the nearest $k$ training instances as the class of the test instances |
| Regression | The class is binarized, and one regression model is built for each class value [14] |
Table 5: Datasets used in the experiments.

| Dataset | Number of cases | Features | Decision attributes |
|---|---|---|---|
| Iris | 150 | Numeric (4) | Categorical (3) |
| Wine | 178 | Numeric (13) | Categorical (3) |
| Glass | 214 | Numeric (9) | Categorical (7) |
| Liver disorder | 345 | Numeric (6) | Categorical (2) |
| Ionosphere | 351 | Numeric (34) | Categorical (2) |
| Statlog Shuttle | 57999 | Numeric (7) | Categorical (7) |
the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 1000 times in order to minimize errors and bias; thus 5400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
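The degradation of a complete dataset to a target missing ratio can be sketched as follows; only the univariate and arbitrary patterns are shown, and the function name and interface are illustrative, not the study's Java code:

```python
import numpy as np

def make_missing(X: np.ndarray, ratio: float, pattern: str = "arbitrary",
                 seed: int = 0) -> np.ndarray:
    """Randomly mask cells of a complete dataset to a target missing ratio.
    'univariate' confines missing cells to the last feature; 'arbitrary'
    scatters them over random features and records (cf. Figure 2)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    n_cells = int(round(ratio * X.size))
    if pattern == "univariate":
        rows = rng.choice(X.shape[0], size=min(n_cells, X.shape[0]), replace=False)
        X[rows, -1] = np.nan
    else:  # arbitrary
        flat = rng.choice(X.size, size=n_cells, replace=False)
        X[np.unravel_index(flat, X.shape)] = np.nan
    return X

X = np.arange(100, dtype=float).reshape(20, 5)
X_missing = make_missing(X, ratio=0.10)  # 10 of 100 cells become NaN
```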
There are various indicators to measure performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
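RMSE, the performance indicator adopted here, can be computed directly (variable names are illustrative):

```python
import numpy as np

def rmse(observed: np.ndarray, predicted: np.ndarray) -> float:
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

obs = np.array([1.0, 2.0, 3.0])
pred = np.array([1.0, 2.0, 5.0])
error = rmse(obs, pred)  # sqrt((0 + 0 + 4) / 3)
```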
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equal to 1 when there are no missing data [10]. The no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. A classification algorithm was then run to estimate the results on the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors (the characteristics of the missing data and those of the datasets) affected the performance of the selected classification algorithms:
$$y_p = \sum_{\forall j \in M} \beta_{pj} x_j + \sum_{\forall k \in D} \chi_{pk} z_k + \epsilon_p. \quad (9)$$
In this equation, $x_j$ is the value of a characteristic of the missing data (the set $M$), $z_k$ is the value of a characteristic of the dataset (the set $D$), and $y_p$ is a performance parameter. Note that $M$ = {missing data ratio, pattern of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and $D$ = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents
RMSE, and $p = 3$ means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to run the automated experiment repeatedly.
5 Results
In total, 32,400 datasets (3 missing ratios × 3 missing data patterns × 6 imputation methods × 100 trials × 6 datasets) were imputed, and each was applied to the classifiers; in total we tested 226,800 dataset-classifier combinations (32,400 imputed datasets × 7 classification methods). The results were divided by dataset, classification algorithm, and imputation method for comparison of performance.
5.1. Datasets. Figure 3 shows the performance of each imputation method on the six datasets. The $x$-axis plots the three missing ratios, which represent the characteristics of the missing data, and the $y$-axis indicates performance using the RMSE. The results for the three variations of the missing data patterns and all tested classification algorithms were merged for each imputation method.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.

For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For the Liver Disorders data (Figure 3(c)), $k$-NN was the least effective and, once again, predictive mean imputation yielded the best results.

For the Ionosphere data (Figure 3(d)), hot-deck was the worst and $k$-NN the best.

For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.

Figure 3 illustrates that predictive mean imputation yielded the best results overall and hot-deck imputation the worst. However, no imputation method was uniformly superior across all cases for any given dataset. For example, the $k$-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was lowest. Similarly, the group mean imputation method performed best for the Iris and Wine datasets but only average for the others. The results are therefore inconsistent, and determining a single best imputation method is impossible; the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction among the characteristics of the dataset, the missing data, and the chosen imputation method.
Figure 3: Comparison of the performance of the imputation methods (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, $k$-NN, and $k$-means clustering) for each dataset, plotted against the missing data ratio (0.01, 0.05, 0.10): (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, (f) Statlog Shuttle.
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.076** | −.075** | −.178** | −.072** | .115** | .007 |
| N cases | −.079** | −.049** | .012 | −.017 | −.032 | −.048** |
| C imbalance | .117** | .239** | .264** | .525** | .163** | .198** |
| R missing | .051* | .078** | .040 | .080** | .076** | .068** |
| SE HS | .249** | .285** | .186** | .277** | .335** | .245** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.016 | −.010 |
| Spread | −.382** | −.430** | −.261** | −.436** | −.452** | −.363** |
| P missing dum1 | −.049 | −.038 | −.038 | −.037 | −.045 | −.038 |
| P missing dum2 | −.002 | .014 | .002 | .011 | .001 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.068** | −.072** | −.179** | −.068** | .115** | .010 |
| N cases | −.082** | −.050** | .011 | −.018 | −.034* | −.047** |
| C imbalance | .115** | .228** | .260** | .517** | .156** | .197** |
| R missing | .050** | .085** | .043 | .084** | .095** | .066** |
| SE HS | .230** | .268** | .178** | .273** | .300** | .248** |
| SE VS | −.008 | −.012 | −.006 | −.013 | −.013 | −.010 |
| Spread | −.296** | −.439** | −.264** | −.443** | −.476** | −.382** |
| P missing dum1 | −.043 | −.032 | −.034 | −.035 | −.035 | −.041 |
| P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; the number of deleted records therefore increases with the ratio of missing data, which explains this method's low performance.
The differences in performance between imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
In Figure 4, the results suggest the following rules.

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.

(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT-DECK method.

(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.

(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.

(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the $k$-NN method.
5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing
Figure 4: Comparison of the classifiers in terms of classification performance (RMSE against missing data ratio 0.01, 0.05, 0.10; one curve per imputation method): (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) regression, (e) logistic, (f) IBk ($k$-nearest neighbor classifier).
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.076** | −.076** | −.178** | −.063** | .123** | .016 |
| N cases | −.084** | −.049** | .012 | −.017 | −.034* | −.047** |
| C imbalance | .117** | .242** | .263** | .523** | .153** | .198** |
| R missing | .050* | .079** | .043 | .085** | .080** | .068** |
| SE HS | .223** | .279** | .182** | .268** | .322** | .242** |
| SE VS | −.008 | −.013 | −.006 | −.013 | −.015 | −.009 |
| Spread | −.328** | −.432** | −.262** | −.434** | −.465** | −.361** |
| P missing dum1 | −.042 | −.035 | −.034 | −.028 | −.044 | −.036 |
| P missing dum2 | .008 | .012 | .004 | .018 | .007 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot-deck.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.080** | −.073** | −.176** | −.071** | .115** | .007 |
| N cases | −.081** | −.049** | .012 | −.018 | −.034* | −.047** |
| C imbalance | .135** | .237** | .261** | .524** | .133** | .211** |
| R missing | .062** | .083** | .044 | .084** | .075** | .070** |
| SE HS | .225** | .275** | .183** | .271** | .313** | .254** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.010 |
| Spread | −.365** | −.428** | −.265** | −.427** | −.441** | −.361** |
| P missing dum1 | −.035 | −.037 | −.034 | −.033 | −.048 | −.038 |
| P missing dum2 | .012 | .015 | .004 | .012 | −.004 | .009 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): $k$-NN.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.085** | −.079** | −.181** | −.068** | .122** | .006 |
| N cases | −.083** | −.049** | .011 | −.018 | −.034* | −.047** |
| C imbalance | .143** | .249** | .260** | .521** | .152** | .211** |
| R missing | .054* | .078** | .041 | .085** | .075** | .071** |
| SE HS | .234** | .290** | .182** | .269** | .328** | .255** |
| SE VS | −.010 | −.013 | −.006 | −.013 | −.014 | −.011 |
| Spread | −.332** | −.427** | −.264** | −.431** | −.450** | −.369** |
| P missing dum1 | −.038 | −.041 | −.035 | −.029 | −.057 | −.035 |
| P missing dum2 | .003 | .008 | .005 | .017 | .000 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): $k$-means clustering.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
|---|---|---|---|---|---|---|
| N attributes | −.080** | −.078** | −.181** | −.068** | .117** | .009 |
| N cases | −.079** | −.049** | .012 | −.017 | −.033 | −.047** |
| C imbalance | .136** | .240** | .263** | .524** | .145** | .206** |
| R missing | .057* | .079** | .041 | .084** | .079** | .057* |
| SE HS | .236** | .289** | .183** | .271** | .315** | .264** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.011 |
| Spread | −.362** | −.439** | −.262** | −.440** | −.474** | −.363** |
| P missing dum1 | −.037 | −.042 | −.036 | −.032 | −.038 | −.046 |
| P missing dum2 | .002 | .013 | .001 | .014 | .009 | .004 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05, **P < .01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing data patterns were treated as two dummy variables (P missing dum1/dum2). Tables 6-11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected.
(i) IF N attributes increases, THEN use SMO.

(ii) IF N cases increases, THEN use trees.J48.

(iii) IF C imbalance increases, THEN use trees.J48.

(iv) IF R missing increases, THEN use SMO.

(v) IF SE HS increases, THEN use SMO.

(vi) IF Spread increases, THEN use Logistic.
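The rules above can be encoded as a simple lookup; the function and key names below are a hypothetical sketch, mapping the dataset characteristic that increases most to the suggested classification algorithm:

```python
# Hypothetical encoding of rules (i)-(vi): each dominant dataset
# characteristic maps to the classifier suggested by the regression analysis.
RULES = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def suggest_classifier(dominant_characteristic: str) -> str:
    """Return the classifier suggested for the dominant characteristic,
    or 'no rule' if no rule applies."""
    return RULES.get(dominant_characteristic, "no rule")

choice = suggest_classifier("Spread")  # "Logistic"
```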
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method, with the dataset characteristics on the $x$-axis and the regression coefficients for each imputation method on the $y$-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appear similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; the logistic algorithm thus exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For horizontal scatteredness (SE HS), SMO had the lowest performance, and for missing data spread, the logistic classifier did.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scattered standard error (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To isolate the impact of the classifiers, further tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE based on the ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other. In total, 226,800 datasets (3
12 Mathematical Problems in Engineering
Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). (x-axis: attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2; y-axis: regression coefficients from -0.4 to 0.3; one line per imputation method: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck imputation, k-NN, and k-means clustering.)
Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). (x-axis: attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread; y-axis: regression coefficients from -0.6 to 0.8; one line per imputation method: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck imputation, k-NN, and k-means clustering.)
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and an imputation algorithm.
Figure 7: RMSE by ratio of missing data. (x-axis: ratio of missing data from 0.05 to 0.50; y-axis: RMSE from 0.30 to 0.49; one line per imputation method: listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering.)
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B         Data characteristic   B
(constant)            0.60**    M_imputation_dum1     0.12**
R_missing             0.83**    M_imputation_dum2     -0.01*
SE_HS                 -0.05**   M_imputation_dum4     0.00
SE_VS                 0.00**    M_imputation_dum5     0.00
Spread                0.17**    M_imputation_dum6     0.01**
N_attributes          -0.08**   M_imputation_dum7     -0.01*
C_imbalance           -0.03**   P_missing_dum1        -0.06**
N_cases               0.02**    P_missing_dum3        0.00

Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
6 Conclusion
So far, prior research does not fully inform us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides classification/recommender system developers in selecting the best classification algorithm based
on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of classification algorithms. In particular, SMO was second to none in predicting SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) to choose the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of a proposed method, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
(ii) Mean Imputation. The missing values from each attribute (column or feature) are replaced with the mean of all known values of that attribute. That is, let $X_i^j$ be the $j$th missing attribute of the $i$th instance, which is imputed by

$$X_i^j = \frac{\sum_{k \in I(\mathrm{complete})} X_k^j}{n_{|I(\mathrm{complete})|}}, \tag{1}$$

where $I(\mathrm{complete})$ is the set of indices that are not missing in $X^j$ and $n_{|I(\mathrm{complete})|}$ is the total number of instances for which the $j$th attribute is not missing.
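Equation (1) amounts to a one-line column operation. A minimal sketch, assuming missing cells are encoded as None (the function name is ours):

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed values (Eq. (1))."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```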
(iii) Group Mean Imputation. The process for this method is the same as that for mean imputation. However, the missing values are replaced with the group (or class) mean of all known values of that attribute, where each group represents a target class among the instances that have missing values. Let $X_{mi}^j$ be the $j$th missing attribute of the $i$th instance of the $m$th class, which is imputed by

$$X_{mi}^j = \frac{\sum_{k \in I(m\text{th class, incomplete})} X_{mk}^j}{n_{|I(m\text{th class, incomplete})|}}, \tag{2}$$

where $I(m\text{th class, incomplete})$ is the set of indices that are not missing in $X_m^j$ and $n_{|I(m\text{th class, incomplete})|}$ is the total number of instances for which the $j$th attribute of the $m$th class is not missing.
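A sketch of Equation (2), again with None marking missing cells; the class label of each instance selects which group mean is substituted (illustrative code, not the paper's implementation):

```python
def group_mean_impute(values, classes):
    """Replace None with the mean of observed values in the same class (Eq. (2))."""
    groups = {}
    for v, c in zip(values, classes):
        if v is not None:
            groups.setdefault(c, []).append(v)
    means = {c: sum(vs) / len(vs) for c, vs in groups.items()}
    return [means[c] if v is None else v for v, c in zip(values, classes)]

print(group_mean_impute([1.0, None, 5.0, None], ["a", "a", "b", "b"]))
# [1.0, 1.0, 5.0, 5.0]
```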
(iv) Predictive Mean Imputation. In this method, the functional relationship between multiple input variables and single or multiple target variables of the given data is represented in the form of a linear equation. This method sets the attributes that have missing values as dependent variables and the other attributes as independent variables, so that missing values can be predicted from a regression model built on those variables. For a regression target $y_i$, the MLR equation with $d$ predictors and $n$ training instances can be written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id}, \quad i = 1, \ldots, n. \tag{3}$$

This can be rewritten in matrix form as $y = X\beta$, and the coefficient vector $\beta$ can be obtained explicitly by taking the derivative of the squared error function:

$$\min E(\beta) = \tfrac{1}{2}(y - X\beta)^T (y - X\beta),$$
$$\frac{\partial E(\beta)}{\partial \beta} = X^T X \beta - X^T y = 0,$$
$$\beta = (X^T X)^{-1} X^T y. \tag{4}$$
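For illustration, in the single-predictor case ($d = 1$) the normal equations of (4) reduce to the familiar closed form. The sketch below fits the regression on complete cases and fills in the missing targets; it is a simplification of the paper's multivariate setup, with function names of our own:

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x, the d = 1 case of Eqs. (3)-(4)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

def predictive_mean_impute(x, y):
    """Fill missing y values (None) with the regression prediction from x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1 = fit_simple_ols([p[0] for p in pairs], [p[1] for p in pairs])
    return [b0 + b1 * xi if yi is None else yi for xi, yi in zip(x, y)]

print(predictive_mean_impute([1, 2, 3, 4], [2.0, 4.0, None, 8.0]))
```
Here the observed pairs lie exactly on y = 2x, so the missing value at x = 3 is imputed as (approximately) 6.0.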
(v) Hot-Deck. This method is the same in principle as case-based reasoning. For attributes that contain missing values, values are taken from the most similar instance with nonmissing values and used to replace the missing values. Each missing value is therefore replaced with the value of the corresponding attribute of the most similar instance, as follows:

$$X_i^j = X_k^j, \quad k = \arg\min_P \sqrt{\sum_{j \in I(\mathrm{complete})} \mathrm{Std}_j \left(X_i^j - X_P^j\right)^2}, \tag{5}$$

where $\mathrm{Std}_j$ is the standard deviation of the $j$th attribute, which is not missing.
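A minimal sketch of Equation (5), assuming the donor pool `records` contains only complete instances and missing cells are None (our encoding, not the paper's):

```python
from statistics import pstdev

def hot_deck_impute(records, target):
    """Fill missing cells of `target` from its most similar complete record,
    weighting squared differences by each attribute's std, as in Eq. (5)."""
    obs = [j for j, v in enumerate(target) if v is not None]
    stds = {j: pstdev([r[j] for r in records]) for j in obs}
    def dist(r):
        return sum(stds[j] * (target[j] - r[j]) ** 2 for j in obs) ** 0.5
    donor = min(records, key=dist)  # the nearest complete record
    return [donor[j] if v is None else v for j, v in enumerate(target)]

records = [[1.0, 10.0], [2.0, 20.0], [8.0, 80.0]]
print(hot_deck_impute(records, [2.1, None]))  # [2.1, 20.0]
```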
(vi) k-NN. Attributes are found via a search among nonmissing attributes using the 3-NN method. Missing values are imputed based on the values of the attributes of the $k$ most similar instances, as follows:

$$X_i^j = \sum_{P \in k\text{-NN}(X_i)} k\left(X_i^{I(\mathrm{complete})}, X_P^{I(\mathrm{complete})}\right) \cdot X_P^j, \tag{6}$$

where $k\text{-NN}(X_i)$ is the index set of the $k$ nearest neighbors of $X_i$ based on the nonmissing attributes and $k(X_i, X_j)$ is a kernel function that is proportional to the similarity between the two instances $X_i$ and $X_j$ ($k = 4$).
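Equation (6) can be sketched as a similarity-weighted average over the nearest complete records. The paper does not specify the kernel, so the $1/(1+d)$ kernel below is an illustrative assumption, as is the function name:

```python
def knn_impute(records, target, k=3):
    """Impute missing cells of `target` as a similarity-weighted average over
    its k nearest complete records (Eq. (6)); the 1/(1+d) kernel is an
    assumption, since the source does not specify one."""
    obs = [j for j, v in enumerate(target) if v is not None]
    def dist(r):  # Euclidean distance on the observed attributes
        return sum((target[j] - r[j]) ** 2 for j in obs) ** 0.5
    neighbors = sorted(records, key=dist)[:k]
    weights = [1.0 / (1.0 + dist(r)) for r in neighbors]
    return [sum(w * r[j] for w, r in zip(weights, neighbors)) / sum(weights)
            if v is None else v
            for j, v in enumerate(target)]

records = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [9.0, 90.0]]
print(knn_impute(records, [2.0, None]))  # [2.0, 20.0]
```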
(vii) k-Means Clustering. Attributes are found through formation of $k$ clusters from the nonmissing data, after which missing values are imputed. The entire dataset is partitioned into $k$ clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, as follows:

$$\arg\min_{C^{I(\mathrm{complete})}} \sum_{i=1}^{k} \sum_{X_j^{I(\mathrm{complete})} \in C_i^{I(\mathrm{complete})}} \left\| X_j^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|^2, \tag{7}$$

where $C_i^{I(\mathrm{complete})}$ is the centroid of the $i$th cluster and $C^{I(\mathrm{complete})}$ is the union of all clusters ($C^{I(\mathrm{complete})} = C_1^{I(\mathrm{complete})} \cup \cdots \cup C_k^{I(\mathrm{complete})}$). For a missing value $X_i^j$, the mean value of that attribute over the instances in the same cluster as $X_i^{I(\mathrm{complete})}$ is imputed, as follows:

$$X_i^j = \frac{1}{\left|C_k^{I(\mathrm{complete})}\right|} \sum_{X_P^{I(\mathrm{complete})} \in C_k^{I(\mathrm{complete})}} X_P^j, \quad \text{s.t. } k = \arg\min_i \left\| X_i^{I(\mathrm{complete})} - C_i^{I(\mathrm{complete})} \right\|. \tag{8}$$
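Equations (7)-(8) can be sketched with a naive Lloyd iteration; for simplicity the imputed value is read from the nearest centroid rather than recomputed over cluster members, and the deterministic first-k initialization is an assumption of ours:

```python
def kmeans_impute(records, target, k=2, iters=10):
    """Sketch of Eqs. (7)-(8): cluster the complete records on the attributes
    observed in `target`, then fill each missing cell from the nearest
    cluster's centroid."""
    obs = [j for j, v in enumerate(target) if v is not None]
    n_cols = len(records[0])

    def d(a, b):  # squared distance on the observed attributes only
        return sum((a[j] - b[j]) ** 2 for j in obs)

    centroids = [list(r) for r in records[:k]]  # naive deterministic init
    for _ in range(iters):  # Lloyd's iterations: assign, then recompute means
        clusters = [[] for _ in range(k)]
        for r in records:
            clusters[min(range(k), key=lambda i: d(r, centroids[i]))].append(r)
        centroids = [
            [sum(r[j] for r in c) / len(c) for j in range(n_cols)]
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    nearest = min(range(k), key=lambda i: d(target, centroids[i]))
    return [centroids[nearest][j] if v is None else v
            for j, v in enumerate(target)]

print(kmeans_impute([[0.0, 1.0], [1.0, 2.0], [10.0, 20.0], [11.0, 22.0]],
                    [10.5, None]))  # [10.5, 21.0]
```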
3 Model
In this paper, we hypothesize an association between the performance of classification algorithms and the characteristics of missing data and datasets. Moreover, we assume that the chosen imputation method moderates the causality between these factors. Figure 1 illustrates the posited relationships.
3.1 Missing Data Characteristics. Table 1 describes the characteristics of missing data and how to calculate them. The pattern of missing data may be univariate, monotone, or arbitrary [11]. A univariate pattern of missing data occurs when missing values are observed for a single variable only; all other data are complete for all variables.
Table 1: The characteristics of missing data.

Missing data ratio: the number of missing values in the entire dataset as compared to the number of nonmissing values. Calculation: the number of empty data cells / total cells.
Patterns of missing data (univariate, monotone, arbitrary): ratio of missing to complete values for an existing feature compared to the values for all features.
Horizontal scatteredness: distribution of missing values within each data record. Calculation: determine the number of missing cells in each record and calculate the standard deviation.
Vertical scatteredness: distribution of missing values for each attribute. Calculation: determine the number of missing cells in each feature and calculate the standard deviation.
Missing data spread: larger standard deviations indicate stronger effects of missing data. Calculation: determine the weighted average of the standard deviations of features with missing data (weight: the ratio of missing to complete data for each feature).
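The calculations in Table 1 can be sketched directly. The function below is illustrative (our own encoding: records of equal length with None for missing cells), not the paper's measurement code:

```python
from statistics import pstdev

def missing_profile(data):
    """Compute the Table 1 characteristics of a dataset given as a list of
    equal-length records, with None marking missing cells."""
    n_rows, n_cols = len(data), len(data[0])
    ratio = sum(v is None for row in data for v in row) / (n_rows * n_cols)
    # Horizontal scatteredness: std of per-record missing counts.
    h_scatter = pstdev([sum(v is None for v in row) for row in data])
    # Vertical scatteredness: std of per-feature missing counts.
    v_scatter = pstdev([sum(row[j] is None for row in data)
                        for j in range(n_cols)])
    # Spread: std of observed values per affected feature, weighted by that
    # feature's missing ratio.
    cols = [j for j in range(n_cols) if any(row[j] is None for row in data)]
    def col_std(j):
        obs = [row[j] for row in data if row[j] is not None]
        return pstdev(obs) if len(obs) > 1 else 0.0
    w = [sum(row[j] is None for row in data) / n_rows for j in cols]
    spread = (sum(wi * col_std(j) for wi, j in zip(w, cols)) / sum(w)
              if cols else 0.0)
    return {"ratio": ratio, "h_scatter": h_scatter,
            "v_scatter": v_scatter, "spread": spread}
```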
Figure 1: Research model (missing data characteristics, dataset features, and the imputation method jointly determine classification performance).
A monotone pattern occurs if the variables can be arranged such that $Y_{j+1}, \ldots, Y_k$ are all missing for cases where $Y_j$ is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data have greater influence on the results of the analysis (Figure 2).
3.2 Dataset Features. Table 2 lists the features of the datasets. Based on the research of Kwon and Sim [15], in which characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].
3.3 Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.
Table 2: Dataset features.

Number of cases: number of records in the dataset.
Number of attributes: number of features characteristic of the dataset.
Degree of class imbalance: ratio.
3.4 Classification Algorithms. Many studies have compared classification algorithms in various areas. For example, the decision tree is known as the best algorithm for arrhythmia classification [16]. Table 4 describes six types of representative classification algorithms for supervised learning: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, k-nearest neighbor classifier, and regression.
4 Method
We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorder, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.
Table 5 provides the names of the datasets, the numbers of cases, and descriptions of the features and classes. The numbers in parentheses in the last two columns represent the number of features and classes for the decision attributes. For example, for the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that there are three classes in the decision attribute.
Since the UCI datasets have no missing data, target values in each dataset were randomly omitted [10]. Based on
Figure 2: Missing data patterns. (Univariate pattern: all missing values are in the last feature; monotone pattern: all missing values are in the last few features; arbitrary pattern: missing values are in random features and records.)
Table 3: Imputation methods.

Listwise deletion: perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation: involves replacing missing data with the overall mean for the observed data.
Group mean imputation: a missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation: also called regression imputation; involves imputing a missing value using an ordinary least-squares regression method to estimate missing data.
Hot-deck: the most similar records are imputed to missing values.
k-NN: the attribute value of the k most similar instances from the nonmissing data is imputed.
k-means clustering: k sets are created that are homogeneous on the inside and heterogeneous on the outside.
Table 4: Classification algorithms.

C4.5: estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method until it reaches the end node.
SVM: classifies the unknown class by finding the optimal hyperplane with the maximum margin, which reduces the estimation error.
Bayesian network: a probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier: takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier: a simple instance-based learner that uses the class of the nearest k training instances for the class of the test instances.
Regression: the class is binarized, and one regression model is built for each class value [14].
Table 5: Datasets used in the experiments.

Dataset           Number of cases   Features        Decision attributes
Iris              150               Numeric (4)     Categorical (3)
Wine              178               Numeric (13)    Categorical (3)
Glass             214               Numeric (9)     Categorical (7)
Liver Disorder    345               Numeric (6)     Categorical (2)
Ionosphere        351               Numeric (34)    Categorical (2)
Statlog Shuttle   57,999            Numeric (7)     Categorical (7)
the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 100 times in order to minimize errors and bias; thus, 5,400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
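The omission step above (ratio times pattern) can be sketched as follows. This is an illustrative generator of the three patterns from Figure 2, not the paper's actual Java implementation, and the function name is ours:

```python
import random

def inject_missing(data, ratio, pattern, seed=0):
    """Remove values from a complete dataset at the given ratio, following
    one of three patterns: 'univariate' (last feature only), 'monotone'
    (last features first), or 'arbitrary' (random cells)."""
    rng = random.Random(seed)
    rows, cols = len(data), len(data[0])
    out = [list(r) for r in data]
    n_missing = int(round(ratio * rows * cols))
    if pattern == "univariate":
        cells = [(i, cols - 1) for i in range(rows)]
    elif pattern == "monotone":
        cells = [(i, j) for j in range(cols - 1, -1, -1) for i in range(rows)]
    elif pattern == "arbitrary":
        cells = rng.sample([(i, j) for i in range(rows) for j in range(cols)],
                           n_missing)
    for i, j in cells[:n_missing]:
        out[i][j] = None
    return out
```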
There are various indicators to measure performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we also adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
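For reference, the RMSE used throughout the evaluation is the standard definition; a minimal sketch:

```python
from math import sqrt

def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                / len(observed))

print(rmse([2.0, 4.0], [1.0, 5.0]))  # 1.0
```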
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equivalent to 1 when there are no missing data [10]. The no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. Then, a classification algorithm was run to estimate the results of the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors (the characteristics of the missing data and those of the datasets) affected the performance of the selected classification algorithms:
$$y_p = \sum_{\forall j \in \mathbf{M}} \beta_{pj} x_j + \sum_{\forall k \in \mathbf{D}} \chi_{pk} z_k + \varepsilon_p. \tag{9}$$
In this equation, $x_j$ is the value of a characteristic of the missing data (set $\mathbf{M}$), $z_k$ is the value of a dataset characteristic (set $\mathbf{D}$), and $y_p$ is a performance parameter. Note that $\mathbf{M}$ = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and $\mathbf{D}$ = {number of cases, number of attributes, degree of class imbalance}. In addition, $p = 1$ indicates relative prediction accuracy, $p = 2$ represents
RMSE, and $p = 3$ denotes elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to run the automated experiment repeatedly.
5 Results
In total, 32,400 datasets (3 missing ratios × 3 imputation patterns × 6 imputation methods × 100 trials) were imputed for each of the 6 classifiers. Thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classifier methods). The results were divided by dataset, classification algorithm, and imputation method for comparison of performance.
5.1 Datasets. Figure 3 shows the performance of each imputation method for the six datasets. On the x-axis, three missing ratios represent the characteristics of the missing data, and on the y-axis, performance is indicated using the RMSE. The results for the three variations of the missing data patterns and all tested classification algorithms were merged for each imputation method.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.

For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.

For the Liver Disorders data (Figure 3(c)), k-NN was the least effective, and once again the predictive mean imputation method yielded the best results.

For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.

For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.

For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied based on the missing data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.
Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was superior in all cases for any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. The results are therefore inconsistent, and naming a single best imputation method is impossible. Thus, the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset in terms of missing data and the chosen imputation method.
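To make the comparison concrete, the sketch below computes RMSE for mean imputation versus group mean imputation on an invented toy column with known ground truth (not the paper's pipeline; all values are illustrative):

```python
import math

# Toy data: (class_label, feature_value); None marks a missing cell.
data = [("a", 2.0), ("a", 4.0), ("a", None),
        ("b", 10.0), ("b", 12.0), ("b", None)]
true_values = {2: 3.0, 5: 11.0}  # ground truth for the two missing cells

observed = [v for _, v in data if v is not None]
grand_mean = sum(observed) / len(observed)  # mean imputation value: 7.0

# Group (class-conditional) means, as used by group mean imputation.
groups = {}
for label, v in data:
    if v is not None:
        groups.setdefault(label, []).append(v)
group_mean = {k: sum(vs) / len(vs) for k, vs in groups.items()}  # a: 3.0, b: 11.0

def rmse(imputed):
    """RMSE of imputed values against the known ground truth."""
    errs = [(imputed[i] - true_values[i]) ** 2 for i in true_values]
    return math.sqrt(sum(errs) / len(errs))

rmse_mean = rmse({i: grand_mean for i in true_values})               # 4.0
rmse_group = rmse({i: group_mean[data[i][0]] for i in true_values})  # 0.0
```

Here group mean imputation wins because the feature is strongly class-dependent, which mirrors why no single method dominates across datasets.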
Mathematical Problems in Engineering 7
[Figure 3: Comparison of the performance of imputation methods for each dataset. Panels: (a) Iris; (b) Glass Identification; (c) Liver Disorders; (d) Ionosphere; (e) Wine; (f) Statlog Shuttle. Each panel plots RMSE (y-axis) against the missing ratio (0.01, 0.05, 0.10; x-axis) for hot deck, group mean imputation, mean imputation, predictive mean imputation, k-NN, k-means clustering, and listwise deletion.]
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.076** | −.075** | −.178** | −.072** | .115** | .007 |
| N cases | −.079** | −.049** | .012 | −.017 | −.032 | −.048** |
| C imbalance | .117** | .239** | .264** | .525** | .163** | .198** |
| R missing | .051* | .078** | .040 | .080** | .076** | .068** |
| SE HS | .249** | .285** | .186** | .277** | .335** | .245** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.016 | −.010 |
| Spread | −.382** | −.430** | −.261** | −.436** | −.452** | −.363** |
| P missing dum1 | −.049 | −.038 | −.038 | −.037 | −.045 | −.038 |
| P missing dum2 | −.002 | .014 | .002 | .011 | .001 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.068** | −.072** | −.179** | −.068** | .115** | .010 |
| N cases | −.082** | −.050** | .011 | −.018 | −.034* | −.047** |
| C imbalance | .115** | .228** | .260** | .517** | .156** | .197** |
| R missing | .050** | .085** | .043 | .084** | .095** | .066** |
| SE HS | .230** | .268** | .178** | .273** | .300** | .248** |
| SE VS | −.008 | −.012 | −.006 | −.013 | −.013 | −.010 |
| Spread | −.296** | −.439** | −.264** | −.443** | −.476** | −.382** |
| P missing dum1 | −.043 | −.032 | −.034 | −.035 | −.035 | −.041 |
| P missing dum2 | .002 | .024 | .004 | .016 | .021 | .013 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary depending on the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records increases as the ratio of missing data increases. The low performance of this method can be explained by this fact.
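The record loss described above is easy to demonstrate; a minimal sketch with synthetic records (the sizes, 1,000 records with 10 features, are assumptions for illustration):

```python
import random

random.seed(0)

def listwise_delete(records):
    """Drop every record that contains at least one missing (None) cell."""
    return [r for r in records if all(v is not None for v in r)]

n_records, n_features = 1000, 10
survival = {}
for ratio in (0.01, 0.05, 0.10):
    # Each cell is independently missing with probability `ratio`.
    data = [[None if random.random() < ratio else 1.0
             for _ in range(n_features)] for _ in range(n_records)]
    survival[ratio] = len(listwise_delete(data)) / n_records

# With 10 features, the expected surviving fraction is (1 - ratio) ** 10:
# roughly 0.90 at a 1% missing ratio but only about 0.35 at 10%.
print(survival)
```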
The differences in performance between imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained. Thus, a regression analysis was conducted.
In Figure 4, the results suggest the following rules:
(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
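Rules (i)–(v) above can be encoded as a simple lookup from classifier to recommended imputation method (a sketch; the names mirror the text, and classifiers without a rule return None):

```python
# Rules (i)-(v): preferred imputation method per classifier when the
# missing rate increases, taken directly from the text above.
RECOMMENDED_IMPUTATION = {
    "IBk": "GROUP MEAN IMPUTATION",
    "Logistic": "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet": "GROUP MEAN IMPUTATION",
    "trees.J48": "k-NN",
}

def recommend(classifier, missing_rate_increasing=True):
    """Return the recommended imputation method, or None if no rule applies."""
    if not missing_rate_increasing:
        return None
    return RECOMMENDED_IMPUTATION.get(classifier)
```

For example, `recommend("trees.J48")` returns "k-NN", while `recommend("SMO")` returns None because no rule was stated for SMO.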
5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing
[Figure 4: Comparison of classifiers in terms of classification performance. Panels: (a) decision tree (J48); (b) BayesNet; (c) SMO (SVM); (d) regression; (e) logistic; (f) IBk (k-nearest neighbor classifier). Each panel plots RMSE (y-axis) against the missing ratio (0.01, 0.05, 0.10; x-axis) for hot deck, group mean imputation, mean imputation, predictive mean imputation, k-NN, k-means clustering, and listwise deletion.]
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.076** | −.076** | −.178** | −.063** | .123** | .016 |
| N cases | −.084** | −.049** | .012 | −.017 | −.034* | −.047** |
| C imbalance | .117** | .242** | .263** | .523** | .153** | .198** |
| R missing | .050* | .079** | .043 | .085** | .080** | .068** |
| SE HS | .223** | .279** | .182** | .268** | .322** | .242** |
| SE VS | −.008 | −.013 | −.006 | −.013 | −.015 | −.009 |
| Spread | −.328** | −.432** | −.262** | −.434** | −.465** | −.361** |
| P missing dum1 | −.042 | −.035 | −.034 | −.028 | −.044 | −.036 |
| P missing dum2 | .008 | .012 | .004 | .018 | .007 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot deck.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.080** | −.073** | −.176** | −.071** | .115** | .007 |
| N cases | −.081** | −.049** | .012 | −.018 | −.034* | −.047** |
| C imbalance | .135** | .237** | .261** | .524** | .133** | .211** |
| R missing | .062** | .083** | .044 | .084** | .075** | .070** |
| SE HS | .225** | .275** | .183** | .271** | .313** | .254** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.010 |
| Spread | −.365** | −.428** | −.265** | −.427** | −.441** | −.361** |
| P missing dum1 | −.035 | −.037 | −.034 | −.033 | −.048 | −.038 |
| P missing dum2 | .012 | .015 | .004 | .012 | −.004 | .009 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.085** | −.079** | −.181** | −.068** | .122** | .006 |
| N cases | −.083** | −.049** | .011 | −.018 | −.034* | −.047** |
| C imbalance | .143** | .249** | .260** | .521** | .152** | .211** |
| R missing | .054* | .078** | .041 | .085** | .075** | .071** |
| SE HS | .234** | .290** | .182** | .269** | .328** | .255** |
| SE VS | −.010 | −.013 | −.006 | −.013 | −.014 | −.011 |
| Spread | −.332** | −.427** | −.264** | −.431** | −.450** | −.369** |
| P missing dum1 | −.038 | −.041 | −.035 | −.029 | −.057 | −.035 |
| P missing dum2 | .003 | .008 | .005 | .017 | .000 | .011 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

| Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk |
| --- | --- | --- | --- | --- | --- | --- |
| N attributes | −.080** | −.078** | −.181** | −.068** | .117** | .009 |
| N cases | −.079** | −.049** | .012 | −.017 | −.033 | −.047** |
| C imbalance | .136** | .240** | .263** | .524** | .145** | .206** |
| R missing | .057* | .079** | .041 | .084** | .079** | .057* |
| SE HS | .236** | .289** | .183** | .271** | .315** | .264** |
| SE VS | −.009 | −.013 | −.006 | −.013 | −.014 | −.011 |
| Spread | −.362** | −.439** | −.262** | −.440** | −.474** | −.363** |
| P missing dum1 | −.037 | −.042 | −.036 | −.032 | −.038 | −.046 |
| P missing dum2 | .002 | .013 | .001 | .014 | .009 | .004 |

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing patterns were treated as two dummy variables (P missing dum1, dum2 = 00, 01, or 10). Tables 6–11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected.
(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
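The regression setup described in Section 5.3 — two pattern dummies plus a random 3:7 test/training assignment over the 900 generated datasets — can be sketched as follows (the dummy coding shown follows Note 1 of Tables 6–11; the row count is the 3 × 3 × 100 design):

```python
import random

random.seed(42)

# Dummy-code the three missing-data patterns with two indicator variables,
# following the coding given in the notes to Tables 6-11.
PATTERN_DUMMIES = {
    "univariate": (1, 0),
    "monotone": (0, 1),
    "arbitrary": (1, 1),
}

def split_test_train(rows, test_share=0.3):
    """Randomly assign rows to test/training sets at a 3:7 ratio."""
    shuffled = rows[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * test_share)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(900))  # 3 missing ratios x 3 patterns x 100 trials
test, train = split_test_train(rows)
```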
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns seemed similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient of the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm. Thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness standard error (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. In order to determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other. In total, 226,800 cases (3
[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); the y-axis shows the regression coefficients (roughly −0.4 to 0.3) for mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, and k-NN.]
[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread); the y-axis shows the regression coefficients (roughly −0.6 to 0.8) for each imputation method.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods × 6 datasets) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, as long as the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.
[Figure 7: RMSE by ratio of missing data. The x-axis shows missing ratios from 0.05 to 0.50; the y-axis shows RMSE from about 0.30 to 0.49 for each method of imputation (hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering).]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

| Data characteristic | B | Data characteristic | B |
| --- | --- | --- | --- |
| (constant) | .060** | M imputation dum1 | .012** |
| R missing | .083** | M imputation dum2 | −.001* |
| SE HS | −.005** | M imputation dum4 | .000 |
| SE VS | .000** | M imputation dum5 | .000 |
| Spread | .017** | M imputation dum6 | .001** |
| N attributes | −.008** | M imputation dum7 | −.001* |
| C imbalance | −.003** | P missing dum1 | −.006** |
| N cases | .002** | P missing dum3 | .000 |

Note 1: Dummy variables for imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0); MEAN IMPUTATION (dum2 = 1, others = 0); GROUP MEAN IMPUTATION (dum3 = 1, others = 0); PREDICTIVE MEAN IMPUTATION (dum4 = 1, others = 0); HOT DECK (dum5 = 1, others = 0); k-NN (dum6 = 1, others = 0); k-MEANS CLUSTERING (dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, dum2 = 0, dum3 = 0); monotone (dum1 = 0, dum2 = 1, dum3 = 0); arbitrary (dum1 = 1, dum2 = 1, dum3 = 1). B: standardized beta coefficient.
Note 2: *P < .1; **P < .05.
6. Conclusion
So far, prior research does not fully inform us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification/recommender systems in selecting the best classification algorithm based
on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE HS) affects classification performance more strongly than the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of the classification algorithms. In particular, SMO was second to none when SE HS was high, in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data collected under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate, as well as the TP rate, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
Table 1: The characteristics of missing data

Missing data ratio. Meaning: the number of missing values in the entire dataset compared to the number of nonmissing values. Calculation: the number of empty data cells / total cells.
Patterns of missing data. Meaning: univariate, monotone, or arbitrary. Calculation: the ratio of missing to complete values for an existing feature compared to the values for all features.
Horizontal scatteredness. Meaning: the distribution of missing values within each data record. Calculation: determine the number of missing cells in each record and calculate the standard deviation.
Vertical scatteredness. Meaning: the distribution of missing values for each attribute. Calculation: determine the number of missing cells in each feature and calculate the standard deviation.
Missing data spread. Meaning: larger standard deviations indicate stronger effects of missing data. Calculation: determine the weighted average of the standard deviations of the features with missing data (weight: the ratio of missing to complete data for each feature).
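The four quantitative characteristics in Table 1 can be sketched in a few lines of Python. This is a minimal illustration under our own conventions (the function name and the use of `None` for empty cells are ours; the paper's implementation was written in Java).

```python
import statistics

def missing_characteristics(rows):
    """Compute the Table 1 characteristics for a dataset given as a list of
    records, where None marks an empty cell."""
    n_rows, n_cols = len(rows), len(rows[0])
    n_missing = sum(1 for row in rows for cell in row if cell is None)
    # Missing data ratio: empty cells / total cells.
    missing_ratio = n_missing / (n_rows * n_cols)

    # Horizontal scatteredness: std. dev. of missing-cell counts per record.
    per_row = [sum(1 for cell in row if cell is None) for row in rows]
    hs = statistics.pstdev(per_row)

    # Vertical scatteredness: std. dev. of missing-cell counts per feature.
    per_col = [sum(1 for row in rows if row[j] is None) for j in range(n_cols)]
    vs = statistics.pstdev(per_col)

    # Missing data spread: weighted average of the std. devs. of features
    # containing missing values, weighted by each feature's missing ratio.
    spread, weight_sum = 0.0, 0.0
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        if 1 < len(observed) < n_rows:
            w = (n_rows - len(observed)) / n_rows
            spread += w * statistics.pstdev(observed)
            weight_sum += w
    spread = spread / weight_sum if weight_sum else 0.0
    return missing_ratio, hs, vs, spread
```

For example, a 4 × 2 dataset whose second feature has two empty cells yields a missing ratio of 0.25 and a vertical scatteredness of 1.0.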
Figure 1: Research model (missing data characteristics, dataset features, and the imputation method as inputs to classification performance).
A monotone pattern occurs if the variables can be arranged such that all of Y_(j+1), ..., Y_k are missing for cases where Y_j is missing. Another characteristic, missing data spread, is important because larger standard deviations for missing values within an existing feature indicate that the missing data have greater influence on the results of the analysis (Figure 2).
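A minimal way to reproduce the three patterns when blanking out a complete dataset is sketched below. The function `inject_missing` and its argument names are hypothetical, since the paper does not show its generator.

```python
import random

def inject_missing(rows, ratio, pattern, seed=0):
    """Blank out cells of a complete dataset (list of records) so that the
    resulting missing values follow the requested pattern."""
    n_rows, n_cols = len(rows), len(rows[0])
    n_blank = int(round(ratio * n_rows * n_cols))
    if pattern == "univariate":
        # All missing values fall in the last feature.
        cells = [(i, n_cols - 1) for i in range(n_rows)]
    elif pattern == "monotone":
        # Fill features from the last one backwards, so that whenever Y_j is
        # missing in a record, Y_(j+1), ..., Y_k are missing as well.
        cells = [(i, j) for j in range(n_cols - 1, -1, -1) for i in range(n_rows)]
    elif pattern == "arbitrary":
        # Missing values land in random features and records.
        cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
        random.Random(seed).shuffle(cells)
    else:
        raise ValueError(pattern)
    out = [list(row) for row in rows]
    for i, j in cells[:n_blank]:
        out[i][j] = None
    return out
```

With a 4-record, 3-feature dataset and a 50% ratio, the monotone variant blanks the whole last feature and then part of the second-to-last one, preserving the monotone property.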
3.2. Dataset Features. Table 2 lists the features of datasets. Based on the research of Kwon and Sim [15], in which the characteristics of datasets that influence classification algorithms were identified, we considered the following statistically significant features in this study: missing values, the number of cases, the number of attributes, and the degree of class imbalance. However, the discussion of missing values is omitted here because it has already been analyzed in detail by Kwon and Sim [15].
3.3. Imputation Methods. Table 3 lists the imputation methods used in this study. Since datasets with categorical decision attributes are included, imputation methods that do not accommodate categorical attributes (e.g., regression imputation) are excluded from this paper.
Table 2: Dataset features

Number of cases: the number of records in the dataset.
Number of attributes: the number of features characterizing the dataset.
Degree of class imbalance: a ratio.
3.4. Classification Algorithms. Many studies have compared classification algorithms in various areas. For example, the decision tree is known as the best algorithm for arrhythmia classification [16]. Table 4 describes the six representative supervised-learning classification algorithms considered here: C4.5, SVM (support vector machine), Bayesian network, logistic classifier, k-nearest neighbor classifier, and regression.
4 Method
We conducted a performance evaluation of the imputation methods and classification algorithms described in the previous section using actual datasets taken from the UCI dataset archive. To ensure the accuracy of each method in cases with no missing values, datasets with missing values were not included. Among the selected datasets, six (Iris, Wine, Glass, Liver Disorder, Ionosphere, and Statlog Shuttle) were included for comparison with the results of Kang [10]. These datasets are popular and frequently utilized benchmarks in the literature, which makes them useful for demonstrating the superiority of the proposed idea.
Table 5 provides the names of the datasets, the numbers of cases, and the descriptions of features and classes. The numbers in parentheses in the last two columns represent the number of features and classes for the decision attributes. For example, in the Iris dataset, "Numeric (4)" indicates that there are four numeric attributes, and "Categorical (3)" means that there are three classes in the decision attribute.
Since UCI datasets have no missing data, target values in each dataset were randomly omitted [10]. Based on
Figure 2: Missing data patterns. In the univariate pattern, all missing values are in the last feature; in the monotone pattern, the missing values fill the last features (last, last-1, last-2); in the arbitrary pattern, missing values occur in random features and records.
Table 3: Imputation methods

Listwise deletion: perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation: involves replacing missing data with the overall mean of the observed data.
Group mean imputation: a missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation: also called regression imputation; involves imputing a missing value using an ordinary least-squares regression method to estimate the missing data.
Hot-deck: the most similar records are imputed to missing values.
k-NN: the attribute value of the k most similar instances from the nonmissing data is imputed.
k-means clustering: k sets are created that are homogeneous on the inside and heterogeneous on the outside.
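Three of the methods in Table 3 can be sketched compactly for numeric columns. These are minimal illustrations of the ideas under our own naming, not the Java implementations used in the experiments.

```python
import statistics

def mean_impute(col):
    """Mean imputation: replace each missing value with the column mean."""
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def group_mean_impute(col, groups):
    """Group mean imputation: replace a missing value with the mean of the
    rows sharing the same group label (e.g., the same class)."""
    out = []
    for v, g in zip(col, groups):
        if v is None:
            v = statistics.mean(w for w, h in zip(col, groups)
                                if h == g and w is not None)
        out.append(v)
    return out

def knn_impute(rows, k=1):
    """k-NN imputation: fill a record's missing cells from its k most
    similar complete records (squared distance over observed cells)."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(r, c) if a is not None)
        nearest = sorted(complete, key=dist)[:k]
        out.append([statistics.mean(c[j] for c in nearest) if v is None else v
                    for j, v in enumerate(r)])
    return out
```

For instance, `mean_impute([1.0, None, 3.0])` fills the gap with 2.0, and `knn_impute` copies the missing cell from the closest complete record.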
Table 4: Classification algorithms

C4.5: estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method until it reaches the end node.
SVM: classifies the unknown class by finding the optimal hyperplane with the maximum margin, which reduces the estimation error.
Bayesian network: a probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier: takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier: a simple instance-based learner that uses the class of the nearest k training instances as the class of the test instances.
Regression: the class is binarized, and one regression model is built for each class value [14].
Table 5: Datasets used in the experiments

Dataset          Number of cases   Features       Decision attributes
Iris             150               Numeric (4)    Categorical (3)
Wine             178               Numeric (13)   Categorical (3)
Glass            214               Numeric (9)    Categorical (7)
Liver disorder   345               Numeric (6)    Categorical (2)
Ionosphere       351               Numeric (34)   Categorical (2)
Statlog Shuttle  57,999            Numeric (7)    Categorical (7)
the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 100 times in order to minimize errors and bias; thus, 5,400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
There are various indicators for measuring performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
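The two indicators used in this evaluation can be stated compactly; the function names below are our own.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))
                     / len(observed))

def relative_prediction_accuracy(acc_with_imputation, acc_without_missing):
    """Accuracy on the imputed data relative to the accuracy obtained on
    the original complete data; equal to 1 when nothing is lost."""
    return acc_with_imputation / acc_without_missing
```

Lower RMSE is better, while a relative prediction accuracy of 1 means the imputed data classify as well as the complete data.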
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equal to 1 when there are no missing data [10]; the no-missing-data condition was used as the performance baseline. As the next step, we generated a missing dataset from the original complete dataset and then applied an imputation method to replace the null data. A classification algorithm was then used to estimate the results for the imputed dataset. With all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors (the characteristics of the missing data and those of the datasets) affected the performance of the selected classification algorithms:
y_p = Σ_{∀j∈M} β_{pj} x_j + Σ_{∀k∈D} χ_{pk} z_k + ε_p.    (9)
In this equation, x_j is the value of a characteristic of the missing data (the set M), z_k is the value of a characteristic of the dataset (the set D), and y_p is a performance parameter. Note that M = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and D = {number of cases, number of attributes, degree of class imbalance}. In addition, p = 1 indicates relative prediction accuracy, p = 2 represents RMSE, and p = 3 means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to conduct the automated experiment repeatedly.
5 Results
In total, 32,400 datasets (6 datasets × 3 missing ratios × 3 missing patterns × 6 imputation methods × 100 trials) were imputed for each of the classifiers; thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classifier methods). The results were grouped by dataset, classification algorithm, and imputation method for comparison in terms of performance.
5.1. Datasets. Figure 3 shows the performance of each imputation method for the six datasets. On the x-axis, the three missing ratios represent the characteristics of the missing data, and on the y-axis, performance is indicated using the RMSE. For each imputation method, the results for the three variations of the missing data patterns and all tested classification algorithms were merged.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.
For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.
For the Liver Disorder data (Figure 3(c)), k-NN was the least effective and, once again, the predictive mean imputation method yielded the best results.
For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.
For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.
For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio; however, predictive mean imputation was still the best method overall and hot-deck the worst.
Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was superior in all cases for any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was the lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. The results were therefore inconsistent, and determining the single best imputation method is impossible. Thus, the imputation method alone cannot be used as an accurate predictor of performance; rather, performance must be influenced by other factors, such as the interaction between the missing-data characteristics of the dataset and the chosen imputation method.
Figure 3: Comparison of the performance of the imputation methods for each dataset: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, and (f) Statlog Shuttle. In each panel, the x-axis is the missing data ratio and the y-axis is the RMSE; the curves correspond to listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering.
Table 6 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) mean imputation
Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus076lowastlowast minus075lowastlowast minus178lowastlowast minus072lowastlowast 115lowastlowast 007N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus032 minus048lowastlowast
C imbalance 117lowastlowast 239lowastlowast 264lowastlowast 525lowastlowast 163lowastlowast 198lowastlowast
R missing 051lowast 078lowastlowast 040 080lowastlowast 076lowastlowast 068lowastlowast
SE HS 249lowastlowast 285lowastlowast 186lowastlowast 277lowastlowast 335lowastlowast 245lowastlowast
SE VS minus009 minus013 minus006 minus013 minus016 minus010Spread minus382lowastlowast minus430lowastlowast minus261lowastlowast minus436lowastlowast minus452lowastlowast minus363lowastlowast
P missing dum1 minus049 minus038 minus038 minus037 minus045 minus038P missing dum2 minus002 014 002 011 001 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −.068**     −.072**    −.179**   −.068**      .115**     .010
N cases               −.082**     −.050**    .011      −.018        −.034*     −.047**
C imbalance           .115**      .228**     .260**    .517**       .156**     .197**
R missing             .050**      .085**     .043      .084**       .095**     .066**
SE HS                 .230**      .268**     .178**    .273**       .300**     .248**
SE VS                 −.008       −.012      −.006     −.013        −.013      −.010
Spread                −.296**     −.439**    −.264**   −.443**      −.476**    −.382**
P missing dum1        −.043       −.032      −.034     −.035        −.035      −.041
P missing dum2        .002        .024       .004      .016         .021       .013

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records grows with the ratio of missing data, which explains this method's low performance.
The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
In Figure 4, the results suggest the following rules:
(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
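These five rules reduce to a lookup table keyed by classifier. The sketch below is our own hypothetical packaging of them, using the classifier names as they appear in the tables.

```python
# Rules (i)-(v): when the missing rate increases, pair each classifier
# with the imputation method that held up best for it in Figure 4.
PREFERRED_IMPUTATION = {
    "IBk": "GROUP MEAN IMPUTATION",
    "Logistic": "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet": "GROUP MEAN IMPUTATION",
    "trees.J48": "k-NN",
}

def recommend_imputation(classifier):
    """Return the imputation method suggested for a classifier as the
    missing rate grows (None if the rules are silent about it)."""
    return PREFERRED_IMPUTATION.get(classifier)
```

For example, `recommend_imputation("Logistic")` returns `"HOT DECK"`; classifiers not covered by the rules (e.g., SMO) return `None`.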
5.3. Regression. The results of the regression analysis are presented in Tables 6-11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing
Figure 4: Comparison of classifiers in terms of classification performance: (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) regression, (e) logistic, and (f) IBk (k-nearest neighbor classifier). In each panel, the x-axis is the missing data ratio and the y-axis is the RMSE; the curves correspond to listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering.
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −.076**     −.076**    −.178**   −.063**      .123**     .016
N cases               −.084**     −.049**    .012      −.017        −.034*     −.047**
C imbalance           .117**      .242**     .263**    .523**       .153**     .198**
R missing             .050*       .079**     .043      .085**       .080**     .068**
SE HS                 .223**      .279**     .182**    .268**       .322**     .242**
SE VS                 −.008       −.013      −.006     −.013        −.015      −.009
Spread                −.328**     −.432**    −.262**   −.434**      −.465**    −.361**
P missing dum1        −.042       −.035      −.034     −.028        −.044      −.036
P missing dum2        .008        .012       .004      .018         .007       .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot-deck

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −.080**     −.073**    −.176**   −.071**      .115**     .007
N cases               −.081**     −.049**    .012      −.018        −.034*     −.047**
C imbalance           .135**      .237**     .261**    .524**       .133**     .211**
R missing             .062**      .083**     .044      .084**       .075**     .070**
SE HS                 .225**      .275**     .183**    .271**       .313**     .254**
SE VS                 −.009       −.013      −.006     −.013        −.014      −.010
Spread                −.365**     −.428**    −.265**   −.427**      −.441**    −.361**
P missing dum1        −.035       −.037      −.034     −.033        −.048      −.038
P missing dum2        .012        .015       .004      .012         −.004      .009

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −.085**     −.079**    −.181**   −.068**      .122**     .006
N cases               −.083**     −.049**    .011      −.018        −.034*     −.047**
C imbalance           .143**      .249**     .260**    .521**       .152**     .211**
R missing             .054*       .078**     .041      .085**       .075**     .071**
SE HS                 .234**      .290**     .182**    .269**       .328**     .255**
SE VS                 −.010       −.013      −.006     −.013        −.014      −.011
Spread                −.332**     −.427**    −.264**   −.431**      −.450**    −.369**
P missing dum1        −.038       −.041      −.035     −.029        −.057      −.035
P missing dum2        .003        .008       .005      .017         .000       .011

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −.080**     −.078**    −.181**   −.068**      .117**     .009
N cases               −.079**     −.049**    .012      −.017        −.033      −.047**
C imbalance           .136**      .240**     .263**    .524**       .145**     .206**
R missing             .057*       .079**     .041      .084**       .079**     .057*
SE HS                 .236**      .289**     .183**    .271**       .315**     .264**
SE VS                 −.009       −.013      −.006     −.013        −.014      −.011
Spread                −.362**     −.439**    −.262**   −.440**      −.474**    −.363**
P missing dum1        −.037       −.042      −.036     −.032        −.038      −.046
P missing dum2        .002        .013       .001      .014         .009       .004

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < .05; **P < .01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test and training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing patterns were treated as two dummy variables (P missing dum1, dum2: 00, 01, 10). Tables 6-11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:
(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
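Like the imputation rules of Section 5.2, these six classifier-selection rules reduce to a lookup keyed on whichever dataset characteristic is growing. This packaging, including the key names, is our own sketch.

```python
# Rules (i)-(vi): the classifier that degraded least as each dataset
# characteristic increased, regardless of the imputation method used.
PREFERRED_CLASSIFIER = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def recommend_classifier(growing_characteristic):
    """Return the classifier suggested when the given characteristic grows."""
    return PREFERRED_CLASSIFIER[growing_characteristic]
```

For example, a dataset dominated by missing data spread would be routed to the logistic classifier, while one dominated by class imbalance would be routed to trees.J48.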
Figure 5 displays the coefficient patterns of the decision tree classifier for each imputation method. The dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE HS), SMO had the lowest performance, and for missing data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness standard error (SE HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant; to determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases), which is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between the imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms; furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification algorithm in terms of RMSE, on the other. In total, 226,800 datasets (3
Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis shows the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread, and the missing-pattern dummies p1 and p2), and the y-axis shows the regression coefficients for each imputation method (listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck imputation, k-NN, and k-means clustering).
Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The x-axis shows the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, and spread), and the y-axis shows the regression coefficients for each imputation method.
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods × 6 datasets) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided the data characteristics can be obtained. Second, we can establish general rules for the selection of the optimal combination of a classification algorithm and imputation algorithm.
Figure 7: RMSE by ratio of missing data (x-axis: missing ratio, from 0.05 to 0.50; y-axis: RMSE) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot-deck, k-NN, and k-means clustering.
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms

Data characteristic   B          Data characteristic    B
(constant)            .060**     M imputation dum1      .012**
R missing             .083**     M imputation dum2      −.001*
SE HS                 −.005**    M imputation dum4      .000
SE VS                 .000**     M imputation dum5      .000
Spread                .017**     M imputation dum6      .001**
N attributes          −.008**    M imputation dum7      −.001*
C imbalance           −.003**    P missing dum1         −.006**
N cases               .002**     P missing dum3         .000

Note 1: Dummy variables for the imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < .1; **P < .05.
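As an illustration of the first implication, the linear model of Table 12 can be evaluated for a new dataset once its characteristics are known. The sketch below hard-codes the non-dummy coefficients from the table, read under our assumption that the extracted figures are standardized betas with the decimal point dropped (e.g., "083" read as .083), and omits the dummy terms for brevity.

```python
# Illustrative only: coefficient values follow Table 12 under the
# assumption stated above; dummy variables for the imputation method
# and missing pattern are omitted.
COEF = {
    "R_missing": .083, "SE_HS": -.005, "SE_VS": .000, "Spread": .017,
    "N_attributes": -.008, "C_imbalance": -.003, "N_cases": .002,
}
CONST = .060

def predict_rmse(x):
    """Predict classifier RMSE from a dict mapping characteristic names
    to (standardized) values; unspecified characteristics default to 0."""
    return CONST + sum(COEF[name] * x.get(name, 0.0) for name in COEF)
```

A dataset with all characteristics at their baseline evaluates to the constant term, and raising the missing ratio raises the predicted RMSE, consistent with Figure 7.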
6 Conclusion
So far, prior research has not fully informed us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides developers of classification and recommender systems in selecting the best classification algorithm based
on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed across multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm must be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most significantly decreased the predicted performance of classification algorithms. In particular, SMO was second to none in predicting SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived from the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow a temporally optimal combination of classification algorithm and imputation method to be chosen, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules derived in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. For example, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study that includes these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
Figure 2: Missing data patterns. Univariate pattern: all missing values are in the last feature. Monotone pattern: all missing values are in the last three features (last, last-1, and last-2). Arbitrary pattern: missing values occur in random features and records.
Table 3: Imputation methods.

Listwise deletion: Perhaps the most basic traditional technique for dealing with missing data. Cases with missing values are discarded, restricting the analyses to cases for which complete data are available.
Mean imputation: Involves replacing missing data with the overall mean for the observed data.
Group mean imputation: A missing value is replaced by the mean of a subset of the data, based on other observed variable(s) in the data.
Predictive mean imputation: Also called regression imputation. Involves imputing a missing value using an ordinary least-squares regression method to estimate missing data.
Hot-deck: The most similar records are imputed to missing values.
k-NN: The attribute value of the k most similar instances from the nonmissing data is imputed.
k-means clustering: k sets are created that are homogeneous on the inside and heterogeneous on the outside.
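As a concrete illustration, the simpler entries of Table 3 can be sketched in a few lines of NumPy. These helpers are our own minimal interpretations of the methods, not the paper's Java implementation, and the sample data are invented:

```python
import numpy as np

def mean_impute(X):
    """Mean imputation (Table 3): replace NaNs with the column's observed mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def group_mean_impute(X, groups):
    """Group mean imputation: replace NaNs with the mean of the record's group."""
    X = X.copy()
    for g in np.unique(groups):
        mask = groups == g
        sub = X[mask]
        rows, cols = np.where(np.isnan(sub))
        sub[rows, cols] = np.nanmean(sub, axis=0)[cols]
        X[mask] = sub
    return X

def knn_impute(X, k=2):
    """k-NN imputation: fill each missing cell from the k most similar complete records."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if miss.any():
            # Euclidean distance computed on the observed features only.
            d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
            nearest = complete[np.argsort(d)[:k]]
            X[i, miss] = nearest[:, miss].mean(axis=0)
    return X

data = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 6.0]])
print(mean_impute(data)[1, 0])  # (1 + 3) / 2 = 2.0
```

Listwise deletion, by contrast, is simply `X[~np.isnan(X).any(axis=1)]`: the row with the missing cell is dropped rather than filled.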
Table 4: Classification algorithms.

C4.5: Estimates the known data using learning rules. C4.5 gradually expands the conditions of the algorithm, splitting the upper node into subnodes using a divide-and-conquer method until it reaches the end node.
SVM: Classifies the unknown class by finding the optimal hyperplane with the maximum margin that reduces the estimation error.
Bayesian network: A probability network with a high posterior probability given the instances. Such a network can provide insight into probabilistic dependencies among the variables in the training dataset.
Logistic classifier: Takes the functional form of the logistic CDF (cumulative distribution function). This function relates the probability of some event to attribute variables through regression coefficients and alpha and beta parameters, which are estimated from training data [13].
k-nearest neighbor classifier: A simple instance-based learner that uses the class of the nearest k training instances as the class of the test instances.
Regression: The class is binarized, and one regression model is built for each class value [14].
Table 5: Datasets used in the experiments.

Dataset | Number of cases | Features | Decision attributes
Iris | 150 | Numeric (4) | Categorical (3)
Wine | 178 | Numeric (13) | Categorical (3)
Glass | 214 | Numeric (9) | Categorical (7)
Liver disorder | 345 | Numeric (6) | Categorical (2)
Ionosphere | 351 | Numeric (34) | Categorical (2)
Statlog Shuttle | 57999 | Numeric (7) | Categorical (7)
Following the list of missing data characteristics, three datasets with three different missing data ratios (5%, 10%, and 15%) and three sets representing each of the missing data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 1000 times in order to minimize errors and bias; thus, 5400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
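A minimal sketch of how such missing-data variants could be generated from a complete dataset follows. The three patterns mirror Figure 2, but the function name, seeding, and cell-selection details are our own assumptions, not the paper's Java implementation:

```python
import numpy as np

def make_missing(X, ratio, pattern, rng=None):
    """Return a copy of X with NaNs inserted at the given ratio under one of the
    three patterns of Figure 2: univariate (last feature only), monotone (last
    three features), or arbitrary (any cell)."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    n, m = X.shape
    n_missing = int(round(ratio * n * m))
    if pattern == "univariate":
        cols = [m - 1]
    elif pattern == "monotone":
        cols = [m - 3, m - 2, m - 1]
    elif pattern == "arbitrary":
        cols = list(range(m))
    else:
        raise ValueError(pattern)
    # Candidate cells, restricted to the columns allowed by the pattern.
    cells = [(i, j) for i in range(n) for j in cols]
    chosen = rng.choice(len(cells), size=min(n_missing, len(cells)), replace=False)
    for c in chosen:
        i, j = cells[c]
        X[i, j] = np.nan
    return X

X = np.arange(40, dtype=float).reshape(10, 4)
Xm = make_missing(X, ratio=0.10, pattern="univariate", rng=0)
print(int(np.isnan(Xm).sum()))  # 10% of 40 cells = 4, all in the last column
```

Looping such a generator over the ratios and patterns above yields the nine variations per dataset described in the text.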
There are various indicators to measure performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
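The metric itself is simple; a minimal sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))  # sqrt(4/3) = 1.1547
```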
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equivalent to 1 when there are no missing data [10]; the no-missing-data condition was used as a baseline of performance. As the next step, we generated a missing dataset from the original complete dataset and then applied an imputation method to replace the null data. Then a classification algorithm was run to estimate the results on the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors, namely, the characteristics of the missing data and those of the datasets, affected the performance of the selected classification algorithms:
$$y_{p} = \sum_{\forall j \in M} \beta_{pj} x_{j} + \sum_{\forall k \in D} \chi_{pk} z_{k} + \varepsilon_{p}. \quad (9)$$
In this equation, x is the value of a characteristic of the missing data (M), z is the value of a dataset characteristic in the set of datasets (D), and y is a performance parameter. Note that M = {missing data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing data spread} and D = {number of cases, number of attributes, degree of class imbalance}. In addition, p = 1 indicates relative prediction accuracy, p = 2 represents RMSE, and p = 3 means elapsed time. We performed the experiment using the Weka library source software (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka library-based performance evaluation program in order to conduct the automated experiment repeatedly.
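Model (9) is an ordinary multiple regression, so it can be fitted by least squares. The following synthetic sketch uses invented data, sizes, and coefficient values (stand-ins, not the study's 900 datasets) to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))   # x_j: e.g., missing ratio, SE_HS, spread (illustrative)
Z = rng.normal(size=(n, 2))   # z_k: e.g., number of cases, class imbalance (illustrative)
beta_true = np.array([0.5, 2.0, -4.0])
chi_true = np.array([-0.8, 1.2])
y = X @ beta_true + Z @ chi_true + rng.normal(scale=0.1, size=n)  # y_p with noise eps_p

# Stack the two predictor groups with an intercept column and solve
# the least-squares problem in one shot.
D = np.column_stack([np.ones(n), X, Z])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print(coef[1:4])  # estimated beta_pj, close to [0.5, 2.0, -4.0]
```

The standardized beta coefficients reported in Tables 6 through 12 are the analogous estimates after standardizing each predictor and the response.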
5. Results
In total, 32,400 datasets (3 missing ratios × 3 missing patterns × 6 imputation methods × 100 trials × 6 datasets) were imputed. Thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classification methods). The results were broken down by dataset, classification algorithm, and imputation method for comparison in terms of performance.
5.1. Datasets. Figure 3 shows the performance of each imputation method for the six different datasets. The x-axis shows the three missing data ratios, representing the characteristics of the missing data, and the y-axis shows performance, measured by RMSE. For each imputation method, the results for the three variations of missing data patterns and all tested classification algorithms were merged.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.
For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.
For the Liver Disorders data (Figure 3(c)), k-NN was the least effective and, once again, predictive mean imputation yielded the best results.
For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.
For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.
For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing data ratio. However, predictive mean imputation was still the best method overall and hot-deck the worst.
Figure 3 illustrates that predictive mean imputation yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases for any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was the lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. Therefore, the results were inconsistent, and determining the single best imputation method is impossible. Thus, the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset's missing data and the chosen imputation method.
Figure 3: Comparison of the performances of the imputation methods for each dataset: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, and (f) Statlog Shuttle. Each panel plots RMSE (y-axis) against the missing data ratio (0.01, 0.05, and 0.10; x-axis) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.76** | -0.75** | -1.78** | -0.72** | 1.15** | 0.07
N_cases | -0.79** | -0.49** | 0.12 | -0.17 | -0.32 | -0.48**
C_imbalance | 1.17** | 2.39** | 2.64** | 5.25** | 1.63** | 1.98**
R_missing | 0.51* | 0.78** | 0.40 | 0.80** | 0.76** | 0.68**
SE_HS | 2.49** | 2.85** | 1.86** | 2.77** | 3.35** | 2.45**
SE_VS | -0.09 | -0.13 | -0.06 | -0.13 | -0.16 | -0.10
Spread | -3.82** | -4.30** | -2.61** | -4.36** | -4.52** | -3.63**
P_missing_dum1 | -0.49 | -0.38 | -0.38 | -0.37 | -0.45 | -0.38
P_missing_dum2 | -0.02 | 0.14 | 0.02 | 0.11 | 0.01 | 0.11
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.68** | -0.72** | -1.79** | -0.68** | 1.15** | 0.10
N_cases | -0.82** | -0.50** | 0.11 | -0.18 | -0.34* | -0.47**
C_imbalance | 1.15** | 2.28** | 2.60** | 5.17** | 1.56** | 1.97**
R_missing | 0.50** | 0.85** | 0.43 | 0.84** | 0.95** | 0.66**
SE_HS | 2.30** | 2.68** | 1.78** | 2.73** | 3.00** | 2.48**
SE_VS | -0.08 | -0.12 | -0.06 | -0.13 | -0.13 | -0.10
Spread | -2.96** | -4.39** | -2.64** | -4.43** | -4.76** | -3.82**
P_missing_dum1 | -0.43 | -0.32 | -0.34 | -0.35 | -0.35 | -0.41
P_missing_dum2 | 0.02 | 0.24 | 0.04 | 0.16 | 0.21 | 0.13
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion, whose performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; the number of deleted records therefore grows with the missing data ratio, which explains this method's low performance.
The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
The results in Figure 4 suggest the following rules:
(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
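These five rules amount to a small lookup table. A sketch in Python follows; the function name and the fallback are our additions, with the fallback reflecting that predictive mean imputation performed best overall in Figure 3:

```python
# Figure 4 rules: for a rising missing rate, map the chosen classifier
# to the recommended imputation method.
FIGURE4_RULES = {
    "IBk": "GROUP MEAN IMPUTATION",
    "Logistic": "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet": "GROUP MEAN IMPUTATION",
    "trees.J48": "k-NN",
}

def recommend_imputation(classifier):
    """Imputation method suggested when the missing rate increases."""
    # Fallback: predictive mean imputation was the best method overall (Figure 3).
    return FIGURE4_RULES.get(classifier, "PREDICTIVE MEAN IMPUTATION")

print(recommend_imputation("Logistic"))  # HOT DECK
```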
5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing
Figure 4: Comparison of the classifiers in terms of classification performance: (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) regression, (e) logistic, and (f) IBk (k-nearest neighbor classifier). Each panel plots RMSE (y-axis) against the missing data ratio (0.01, 0.05, and 0.10; x-axis) for listwise deletion, mean imputation, group mean imputation, predictive mean imputation, hot deck, k-NN, and k-means clustering.
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.76** | -0.76** | -1.78** | -0.63** | 1.23** | 0.16
N_cases | -0.84** | -0.49** | 0.12 | -0.17 | -0.34* | -0.47**
C_imbalance | 1.17** | 2.42** | 2.63** | 5.23** | 1.53** | 1.98**
R_missing | 0.50* | 0.79** | 0.43 | 0.85** | 0.80** | 0.68**
SE_HS | 2.23** | 2.79** | 1.82** | 2.68** | 3.22** | 2.42**
SE_VS | -0.08 | -0.13 | -0.06 | -0.13 | -0.15 | -0.09
Spread | -3.28** | -4.32** | -2.62** | -4.34** | -4.65** | -3.61**
P_missing_dum1 | -0.42 | -0.35 | -0.34 | -0.28 | -0.44 | -0.36
P_missing_dum2 | 0.08 | 0.12 | 0.04 | 0.18 | 0.07 | 0.11
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot deck.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.80** | -0.73** | -1.76** | -0.71** | 1.15** | 0.07
N_cases | -0.81** | -0.49** | 0.12 | -0.18 | -0.34* | -0.47**
C_imbalance | 1.35** | 2.37** | 2.61** | 5.24** | 1.33** | 2.11**
R_missing | 0.62** | 0.83** | 0.44 | 0.84** | 0.75** | 0.70**
SE_HS | 2.25** | 2.75** | 1.83** | 2.71** | 3.13** | 2.54**
SE_VS | -0.09 | -0.13 | -0.06 | -0.13 | -0.14 | -0.10
Spread | -3.65** | -4.28** | -2.65** | -4.27** | -4.41** | -3.61**
P_missing_dum1 | -0.35 | -0.37 | -0.34 | -0.33 | -0.48 | -0.38
P_missing_dum2 | 0.12 | 0.15 | 0.04 | 0.12 | -0.04 | 0.09
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.85** | -0.79** | -1.81** | -0.68** | 1.22** | 0.06
N_cases | -0.83** | -0.49** | 0.11 | -0.18 | -0.34* | -0.47**
C_imbalance | 1.43** | 2.49** | 2.60** | 5.21** | 1.52** | 2.11**
R_missing | 0.54* | 0.78** | 0.41 | 0.85** | 0.75** | 0.71**
SE_HS | 2.34** | 2.90** | 1.82** | 2.69** | 3.28** | 2.55**
SE_VS | -0.10 | -0.13 | -0.06 | -0.13 | -0.14 | -0.11
Spread | -3.32** | -4.27** | -2.64** | -4.31** | -4.50** | -3.69**
P_missing_dum1 | -0.38 | -0.41 | -0.35 | -0.29 | -0.57 | -0.35
P_missing_dum2 | 0.03 | 0.08 | 0.05 | 0.17 | 0.00 | 0.11
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

Data characteristic | trees.J48 | BayesNet | SMO | Regression | Logistic | IBk
N_attributes | -0.80** | -0.78** | -1.81** | -0.68** | 1.17** | 0.09
N_cases | -0.79** | -0.49** | 0.12 | -0.17 | -0.33 | -0.47**
C_imbalance | 1.36** | 2.40** | 2.63** | 5.24** | 1.45** | 2.06**
R_missing | 0.57* | 0.79** | 0.41 | 0.84** | 0.79** | 0.57*
SE_HS | 2.36** | 2.89** | 1.83** | 2.71** | 3.15** | 2.64**
SE_VS | -0.09 | -0.13 | -0.06 | -0.13 | -0.14 | -0.11
Spread | -3.62** | -4.39** | -2.62** | -4.40** | -4.74** | -3.63**
P_missing_dum1 | -0.37 | -0.42 | -0.36 | -0.32 | -0.38 | -0.46
P_missing_dum2 | 0.02 | 0.13 | 0.01 | 0.14 | 0.09 | 0.04
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test and training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and the imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing data patterns were treated as two dummy variables (P_missing_dum1/dum2: 00, 01, 10). Tables 6–11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:
(i) IF N_attributes increases, THEN use SMO.
(ii) IF N_cases increases, THEN use trees.J48.
(iii) IF C_imbalance increases, THEN use trees.J48.
(iv) IF R_missing increases, THEN use SMO.
(v) IF SE_HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
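These six dataset-level rules can likewise be encoded as a dispatch table: given the dataset characteristic that is growing, return the classifier the rules favor. This is a sketch; the function name and key spellings are ours:

```python
# Rules (i)-(vi) from the regression analysis of Tables 6-11.
TABLE_RULES = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def recommend_classifier(growing_characteristic):
    """Classifier suggested when the named dataset characteristic increases."""
    return TABLE_RULES[growing_characteristic]

print(recommend_classifier("C_imbalance"))  # trees.J48
```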
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns seemed similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient of the number of attributes (N_attributes) was observed for the logistic algorithm than for any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing data ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scattered standard error (SE_HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, further tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE by ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and the missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification in terms of RMSE, on the other. In total, 226,800 datasets (3
12 Mathematical Problems in Engineering
[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread, missing patterns 1 and 2); the y-axis shows regression coefficients from about -0.4 to 0.3. Lines: MEAN IMPUTATION, PREDICTIVE MEAN IMPUTATION, LISTWISE DELETION, GROUP MEAN IMPUTATION, HOT DECK IMPUTATION, k-MEANS CLUSTERING, k-NN.]
[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). The x-axis lists the dataset characteristics (attributes, instances, data imbalance, missing ratio, horizontal scatteredness, vertical scatteredness, spread); the y-axis shows regression coefficients from about -0.6 to 0.8. Lines: MEAN IMPUTATION, PREDICTIVE MEAN IMPUTATION, LISTWISE DELETION, GROUP MEAN IMPUTATION, HOT DECK IMPUTATION, k-NN, k-MEANS CLUSTERING.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of classification algorithm and imputation algorithm.
[Figure 7: RMSE by ratio of missing data, for each method of imputation. The x-axis shows the missing ratio from 0.05 to 0.50; the y-axis shows RMSE from 0.300 to 0.490. Lines: HOT DECK, GROUP MEAN IMPUTATION, MEAN IMPUTATION, PREDICTIVE MEAN IMPUTATION, LISTWISE DELETION, k-NN, k-MEANS CLUSTERING.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic    B        Data characteristic    B
(constant)             0.60**   M_imputation_dum1      0.12**
R_missing              0.83**   M_imputation_dum2     -0.01*
SE_HS                 -0.05**   M_imputation_dum4      0.00
SE_VS                  0.00**   M_imputation_dum5      0.00
Spread                 0.17**   M_imputation_dum6      0.01**
N_attributes          -0.08**   M_imputation_dum7     -0.01*
C_imbalance           -0.03**   P_missing_dum1        -0.06**
N_cases                0.02**   P_missing_dum3         0.00

Note 1: Dummy variables for imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1, **P < 0.05.
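The dummy coding described in Note 1 can be generated mechanically from a categorical record of each trial. A sketch using pandas (the column and category labels are illustrative, not the paper's):

```python
import pandas as pd

# Each trial records which imputation method and missing pattern it used;
# get_dummies expands these into 0/1 indicator columns like those of Table 12.
trials = pd.DataFrame({
    "imputation": ["LISTWISE_DELETION", "MEAN_IMPUTATION", "k-NN"],
    "pattern": ["univariate", "monotone", "arbitrary"],
})
dummies = pd.get_dummies(trials, columns=["imputation", "pattern"], dtype=int)
# e.g. imputation_LISTWISE_DELETION is 1 only for the first trial
```

The indicator columns can then be fed directly into the regression as control variables.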
6. Conclusion
So far, prior research has not fully informed us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides classification/recommender system developers in selecting the best classification algorithm based
on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing-data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, whereas the other algorithms are not affected in this way.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of classification algorithms. In particular, SMO was second to none under high SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
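The linearity claim above follows directly from the model form: logistic regression makes the log-odds an affine function of the predictors, so the set of points where the two classes are equally likely is necessarily a hyperplane.

```latex
\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \beta_0 + \beta^{\top} x
\quad\Longrightarrow\quad
\text{decision boundary: } \beta_0 + \beta^{\top} x = 0 .
```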
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data arising from limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values. This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated by a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1-18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435-460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519-533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1-37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29-35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36-40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1-17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15-21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633-650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65-78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147-177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225-245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699-705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63-76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847-1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31-40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463-484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597-604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63-77, 2006.
Table 5: Datasets used in the experiments.

Dataset           Number of cases   Features       Decision attributes
Iris              150               Numeric (4)    Categorical (3)
Wine              178               Numeric (13)   Categorical (3)
Glass             214               Numeric (9)    Categorical (7)
Liver disorder    345               Numeric (6)    Categorical (2)
Ionosphere        351               Numeric (34)   Categorical (2)
Statlog Shuttle   57,999            Numeric (7)    Categorical (7)
the list of missing-data characteristics, three datasets with three different missing-data ratios (5%, 10%, and 15%) and three sets representing each of the missing-data patterns (univariate, monotone, and arbitrary) were created, for a total of nine variations of each dataset. In total, 54 datasets were imputed for each imputation method, as 6 datasets were available. We repeated the experiment for each dataset 100 times in order to minimize errors and bias. Thus, 5,400 datasets were imputed in total for our experiment. All imputation methods were implemented using packages written in Java. In order to measure the performance of each imputation method, we applied the imputed datasets to the six classification algorithms listed in Table 4.
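The paper does not publish its dataset generator, but the variants described above can be produced along the following lines. This is a hedged sketch under our own assumptions: univariate missingness confines the missing cells to one variable, monotone missingness blanks a row from some column onward, and arbitrary missingness scatters cells uniformly.

```python
import numpy as np

def inject_missing(X, ratio, pattern, seed=0):
    """Copy X and set about `ratio` of its cells to NaN, following a
    univariate, monotone, or arbitrary missing-data pattern (sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    n, m = X.shape
    n_missing = int(round(ratio * n * m))
    if pattern == "univariate":
        # all missing cells fall in a single randomly chosen column
        col = rng.integers(m)
        rows = rng.choice(n, size=min(n_missing, n), replace=False)
        X[rows, col] = np.nan
    elif pattern == "monotone":
        # once a row is missing from column j, it is missing from all
        # later columns too (cell budget is approximate)
        budget = n_missing
        while budget > 0:
            row, start = rng.integers(n), rng.integers(1, m)
            X[row, start:] = np.nan
            budget -= m - start
    elif pattern == "arbitrary":
        # missing cells scattered uniformly over the whole matrix
        flat = rng.choice(n * m, size=n_missing, replace=False)
        X[np.unravel_index(flat, (n, m))] = np.nan
    return X
```

Calling this once per (ratio, pattern) combination yields the nine variations per dataset described in the text.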
There are various indicators of performance, such as accuracy, relative accuracy, MAE (mean absolute error), and RMSE (root mean square error). However, RMSE is one of the most representative and widely used performance indicators in imputation research; therefore, we also adopted RMSE as the performance indicator in this study. The performance of the selected classification algorithms was evaluated using SPSS 17.0.
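For reference, RMSE is the square root of the mean squared difference between observed and predicted values; a minimal sketch:

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observed/predicted values."""
    assert len(observed) == len(predicted)
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )
```

Because the errors are squared before averaging, RMSE penalizes large deviations more heavily than MAE does.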
RMSE measures the difference between predicted and observed values. The term "relative prediction accuracy" refers to the relative ratio of accuracy, which is equal to 1 when there are no missing data [10]. The no-missing-data condition was used as a performance baseline. As the next step, we generated a missing dataset from the original no-missing dataset and then applied an imputation method to replace the null data. Then, a classification algorithm was run to estimate the results on the imputed dataset. For all combinations of imputation methods and classification algorithms, a multiple regression analysis was conducted using the following equation to understand how the input factors, namely, the characteristics of the missing data and those of the datasets, affected the performance of the selected classification algorithms:
$$ y_p = \sum_{\forall j \in M} \beta_{pj} x_j + \sum_{\forall k \in D} \chi_{pk} z_k + \varepsilon_p . \quad (9) $$
In this equation, x_j is the value of a characteristic of the missing data (set M), z_k is the value of a characteristic of the dataset (set D), and y_p is a performance parameter. Note that M = {missing-data ratio, patterns of missing data, horizontal scatteredness, vertical scatteredness, missing-data spread} and D = {number of cases, number of attributes, degree of class imbalance}. In addition, p = 1 indicates relative prediction accuracy, p = 2 represents
RMSE, and p = 3 denotes elapsed time. We performed the experiment using the Weka open-source library (release 3.6) to ensure the reliability of the implementation of the algorithms [17]. We did not use the Weka GUI tool but developed a Weka-library-based performance evaluation program in order to run the automated experiment repeatedly.
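The fit behind equation (9) can be sketched as ordinary least squares on z-scored variables, which yields standardized beta coefficients of the kind reported in Tables 6-12 (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def standardized_betas(X, y):
    """Least-squares fit of equation (9): z-score the predictor matrix and
    the response, then solve for the coefficients. The result is the vector
    of standardized betas."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta
```

Here the columns of `X` would hold the characteristics in M and D for each trial, and `y` one of the three performance parameters.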
5. Results
In total, 32,400 datasets (3 missing ratios × 3 missing patterns × 6 imputation methods × 100 trials × 6 datasets) were imputed. Thus, in total, we tested 226,800 datasets (32,400 imputed datasets × 7 classifier methods). The results were broken down by dataset, classification algorithm, and imputation method for performance comparison.
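The quoted totals follow from the design factors; a quick arithmetic check (the factor of 6 benchmark datasets comes from Table 5 and is implied rather than always spelled out in the parentheticals):

```python
# Design factors quoted in the text.
per_dataset = 3 * 3 * 6 * 100    # ratios x patterns x imputations x trials
imputed_total = per_dataset * 6  # six benchmark datasets -> 32,400
tested_total = imputed_total * 7  # seven classifiers -> 226,800
print(per_dataset, imputed_total, tested_total)
```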
5.1. Datasets. Figure 3 shows the performance of each imputation method for the six different datasets. The x-axis shows the three missing ratios, representing the characteristics of the missing data, and the y-axis shows performance in terms of RMSE. The results for the three variations of missing-data patterns and all tested classification algorithms were merged for each imputation method.
For the Iris data (Figure 3(a)), the mean imputation method yielded the worst results and the group mean imputation method the best.
For the Glass Identification data (Figure 3(b)), hot-deck imputation was the least effective method and predictive mean imputation the best.
For the Liver Disorders data (Figure 3(c)), k-NN was the least effective, and once again the predictive mean imputation method yielded the best results.
For the Ionosphere data (Figure 3(d)), hot-deck was the worst and k-NN the best.
For the Wine data (Figure 3(e)), hot-deck was once again the least effective method and predictive mean imputation the best.
For the Statlog data (Figure 3(f)), unlike the other datasets, the results varied with the missing-data ratio. However, predictive mean imputation was still the best method overall, and hot-deck the worst.
Figure 3 illustrates that the predictive mean imputation method yielded the best results overall and hot-deck imputation the worst. However, no imputation method was generally superior in all cases for any given dataset. For example, the k-NN method yielded the best performance for the Ionosphere dataset, but for the Liver Disorders dataset its performance was lowest. In another example, the group mean imputation method performed best for the Iris and Wine datasets, but its performance was only average for the other datasets. Therefore, the results were inconsistent, and determining a single best imputation method is impossible. Thus, the imputation method alone cannot be used as an accurate predictor of performance. Rather, performance must be influenced by other factors, such as the interaction between the characteristics of the dataset, in terms of missing data, and the chosen imputation method.
[Figure 3: Comparison of performances of imputation methods for each dataset. Panels: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, (f) Statlog Shuttle. Each panel plots RMSE against missing ratios of 0.01, 0.05, and 0.10 for HOT DECK, GROUP MEAN IMPUTATION, MEAN IMPUTATION, PREDICTIVE MEAN IMPUTATION, LISTWISE DELETION, k-NN, and k-MEANS CLUSTERING.]
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         -0.76**    -0.75**   -1.78**  -0.72**      1.15**    0.07
N_cases              -0.79**    -0.49**    0.12    -0.17       -0.32     -0.48**
C_imbalance           1.17**     2.39**    2.64**   5.25**      1.63**    1.98**
R_missing             0.51*      0.78**    0.40     0.80**      0.76**    0.68**
SE_HS                 2.49**     2.85**    1.86**   2.77**      3.35**    2.45**
SE_VS                -0.09      -0.13     -0.06    -0.13       -0.16     -0.10
Spread               -3.82**    -4.30**   -2.61**  -4.36**     -4.52**   -3.63**
P_missing_dum1       -0.49      -0.38     -0.38    -0.37       -0.45     -0.38
P_missing_dum2       -0.02       0.14      0.02     0.11        0.01      0.11

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing-data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing-data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): group mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         -0.68**    -0.72**   -1.79**  -0.68**      1.15**    0.10
N_cases              -0.82**    -0.50**    0.11    -0.18       -0.34*    -0.47**
C_imbalance           1.15**     2.28**    2.60**   5.17**      1.56**    1.97**
R_missing             0.50**     0.85**    0.43     0.84**      0.95**    0.66**
SE_HS                 2.30**     2.68**    1.78**   2.73**      3.00**    2.48**
SE_VS                -0.08      -0.12     -0.06    -0.13       -0.13     -0.10
Spread               -2.96**    -4.39**   -2.64**  -4.43**     -4.76**   -3.82**
P_missing_dum1       -0.43      -0.32     -0.34    -0.35       -0.35     -0.41
P_missing_dum2        0.02       0.24      0.04     0.16        0.21      0.13

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing-data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing-data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary with the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records increases with the missing-data ratio, which explains this method's low performance.
The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifier methods significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
In Figure 4, the results suggest the following rules:
(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier method is used, THEN use the HOT DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
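These five rules likewise reduce to a lookup from the classifier in use to the imputation method recommended under a rising missing rate. A minimal sketch (the function and key names are ours; the method labels follow the paper):

```python
# Rules (i)-(v): for a rising missing rate, which imputation method to
# pair with each classifier.
IMPUTATION_FOR_CLASSIFIER = {
    "IBk": "GROUP MEAN IMPUTATION",        # (i)
    "Logistic": "HOT DECK",                # (ii)
    "Regression": "GROUP MEAN IMPUTATION", # (iii)
    "BayesNet": "GROUP MEAN IMPUTATION",   # (iv)
    "trees.J48": "k-NN",                   # (v)
}

def suggest_imputation(classifier: str) -> str:
    """Return the imputation method recommended for a rising missing rate."""
    return IMPUTATION_FOR_CLASSIFIER[classifier]
```

Together with the characteristic-to-classifier rules given earlier, such lookups could drive an automatic pipeline configuration in a ubiquitous computing application.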
5.3. Regression. The results of the regression analysis are presented in Tables 6-11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials).
[Figure 4: Comparison of classifiers in terms of classification performance. Panels: (a) Decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) Regression, (e) Logistic, (f) IBk (k-nearest-neighbor classifier). Each panel plots RMSE against missing ratios of 0.01, 0.05, and 0.10 for HOT DECK, GROUP MEAN IMPUTATION, MEAN IMPUTATION, PREDICTIVE MEAN IMPUTATION, LISTWISE DELETION, k-NN, and k-MEANS CLUSTERING.]
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         -0.76**    -0.76**   -1.78**  -0.63**      1.23**    0.16
N_cases              -0.84**    -0.49**    0.12    -0.17       -0.34*    -0.47**
C_imbalance           1.17**     2.42**    2.63**   5.23**      1.53**    1.98**
R_missing             0.50*      0.79**    0.43     0.85**      0.80**    0.68**
SE_HS                 2.23**     2.79**    1.82**   2.68**      3.22**    2.42**
SE_VS                -0.08      -0.13     -0.06    -0.13       -0.15     -0.09
Spread               -3.28**    -4.32**   -2.62**  -4.34**     -4.65**   -3.61**
P_missing_dum1       -0.42      -0.35     -0.34    -0.28       -0.44     -0.36
P_missing_dum2        0.08       0.12      0.04     0.18        0.07      0.11

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing-data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing-data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot deck.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         -0.80**    -0.73**   -1.76**  -0.71**      1.15**    0.07
N_cases              -0.81**    -0.49**    0.12    -0.18       -0.34*    -0.47**
C_imbalance           1.35**     2.37**    2.61**   5.24**      1.33**    2.11**
R_missing             0.62**     0.83**    0.44     0.84**      0.75**    0.70**
SE_HS                 2.25**     2.75**    1.83**   2.71**      3.13**    2.54**
SE_VS                -0.09      -0.13     -0.06    -0.13       -0.14     -0.10
Spread               -3.65**    -4.28**   -2.65**  -4.27**     -4.41**   -3.61**
P_missing_dum1       -0.35      -0.37     -0.34    -0.33       -0.48     -0.38
P_missing_dum2        0.12       0.15      0.04     0.12       -0.04      0.09

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing-data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing-data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         -0.85**    -0.79**   -1.81**  -0.68**      1.22**    0.06
N_cases              -0.83**    -0.49**    0.11    -0.18       -0.34*    -0.47**
C_imbalance           1.43**     2.49**    2.60**   5.21**      1.52**    2.11**
R_missing             0.54*      0.78**    0.41     0.85**      0.75**    0.71**
SE_HS                 2.34**     2.90**    1.82**   2.69**      3.28**    2.55**
SE_VS                -0.10      -0.13     -0.06    -0.13       -0.14     -0.11
Spread               -3.32**    -4.27**   -2.64**  -4.31**     -4.50**   -3.69**
P_missing_dum1       -0.38      -0.41     -0.35    -0.29       -0.57     -0.35
P_missing_dum2        0.03       0.08      0.05     0.17        0.00      0.11

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing-data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing-data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Mathematical Problems in Engineering 11
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): k-MEANS CLUSTERING.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         −0.80**    −0.78**   −1.81**  −0.68**      1.17**    0.09
N_cases              −0.79**    −0.49**    0.12    −0.17       −0.33     −0.47**
C_imbalance           1.36**     2.40**    2.63**   5.24**      1.45**    2.06**
R_missing             0.57*      0.79**    0.41     0.84**      0.79**    0.57*
SE_HS                 2.36**     2.89**    1.83**   2.71**      3.15**    2.64**
SE_VS                −0.09      −0.13     −0.06    −0.13       −0.14     −0.11
Spread               −3.62**    −4.39**   −2.62**  −4.40**     −4.74**   −3.63**
P_missing_dum1       −0.37      −0.42     −0.36    −0.32       −0.38     −0.46
P_missing_dum2        0.02       0.13      0.01     0.14        0.09      0.04
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
The regression analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three missing patterns were treated as two dummy variables (P_missing_dum1 and P_missing_dum2; see the table notes for the coding). Tables 6–11 illustrate the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:
(i) IF N_attributes increases, THEN use SMO.
(ii) IF N_cases increases, THEN use trees.J48.
(iii) IF C_imbalance increases, THEN use trees.J48.
(iv) IF R_missing increases, THEN use SMO.
(v) IF SE_HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
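Rules such as these amount to a lookup from the fastest-growing dataset characteristic to a recommended classifier. The following is a minimal sketch; the dictionary, function name, and Weka-style classifier labels are illustrative glue code, not the paper's implementation:

```python
# Illustrative encoding of the classifier-selection rules: each key is the
# dataset characteristic that increases; the value is the suggested classifier.
RULES = {
    "N_attributes": "SMO",       # more attributes          -> SMO
    "N_cases": "trees.J48",      # more cases               -> trees.J48
    "C_imbalance": "trees.J48",  # stronger class imbalance -> trees.J48
    "R_missing": "SMO",          # higher missing ratio     -> SMO
    "SE_HS": "SMO",              # more horizontal scatter  -> SMO
    "Spread": "Logistic",        # wider missing-data spread -> Logistic
}

def recommend_classifier(dominant_factor: str) -> str:
    """Return the classifier suggested for the dataset characteristic
    that increases the most, per the rule set above."""
    return RULES[dominant_factor]

print(recommend_classifier("C_imbalance"))  # prints: trees.J48
```

A recommender system could evaluate these characteristics on an incoming dataset and dispatch to the matching classifier before training.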
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N_attributes) was observed for the logistic algorithm than for any other algorithm. Thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE_HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE by ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
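The setup behind this comparison — tracking error across missing ratios — can be sketched with NumPy. Note the hedge: this toy measures the reconstruction RMSE of mean imputation on synthetic data, not the classification RMSE of Weka classifiers reported in the paper; the dataset, ratios, and function name are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))  # synthetic complete data

def mean_impute_rmse(missing_ratio: float) -> float:
    """Mask cells at the given ratio, fill them with column means, and
    return the RMSE between imputed and true values on the masked cells."""
    X = X_true.copy()
    mask = rng.random(X.shape) < missing_ratio  # True marks a missing cell
    X[mask] = np.nan
    col_means = np.nanmean(X, axis=0)           # means from observed cells only
    X_filled = np.where(mask, col_means[np.newaxis, :], X)
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

for ratio in (0.05, 0.25, 0.50):
    print(f"missing ratio {ratio:.2f}: imputation RMSE {mean_impute_rmse(ratio):.3f}")
```

The same loop structure, with the imputer and metric swapped for a trained classifier's RMSE, reproduces the kind of curve plotted in Figure 7.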
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification algorithm in terms of RMSE, on the other. In total, 226,800 datasets (3 missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.

[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); y-axis: regression coefficients. Legend: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, k-NN.]

[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). Same axes and legend as Figure 5, without the missing-pattern dummies.]
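The regression in Table 12 mixes continuous data characteristics with 0/1 dummy variables for the imputation method and missing pattern. A miniature of that design can be fit by ordinary least squares; the data and coefficients below are synthetic, and only the dummy-coding scheme mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Continuous data characteristics (standardized, synthetic).
r_missing = rng.normal(size=n)
spread = rng.normal(size=n)

# Dummy variable, e.g. 1 if listwise deletion was the imputation method.
m_imputation_dum1 = rng.integers(0, 2, size=n).astype(float)

# Synthetic RMSE response built from made-up coefficients plus noise.
rmse = (0.60 + 0.083 * r_missing + 0.017 * spread
        + 0.12 * m_imputation_dum1 + rng.normal(scale=0.01, size=n))

# Ordinary least squares: solve for [intercept, b_missing, b_spread, b_dum1].
X = np.column_stack([np.ones(n), r_missing, spread, m_imputation_dum1])
beta, *_ = np.linalg.lstsq(X, rmse, rcond=None)
print(np.round(beta, 3))  # close to the planted values [0.60, 0.083, 0.017, 0.12]
```

With one dummy per imputation method (minus a reference category) and per missing pattern, the fitted betas play the role of the B column in Table 12.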
[Figure 7: RMSE by ratio of missing data. x-axis: missing ratio (0.05–0.50); y-axis: RMSE (0.300–0.490). Legend: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B         Data characteristic   B
(constant)            0.60**    M_imputation_dum1     0.12**
R_missing             0.83**    M_imputation_dum2    −0.01*
SE_HS                −0.05**    M_imputation_dum4     0.00
SE_VS                 0.00**    M_imputation_dum5     0.00
Spread                0.17**    M_imputation_dum6     0.01**
N_attributes         −0.08**    M_imputation_dum7    −0.01*
C_imbalance          −0.03**    P_missing_dum1       −0.06**
N_cases               0.02**    P_missing_dum3        0.00
Note 1: Dummy variables related to imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standard beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
6. Conclusion
So far, prior research has not fully informed us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides classification/recommender system developers in selecting the best classification algorithm based on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is examined along multiple dimensions (datasets, imputation data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, and as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the most significant data feature in decreasing the predicted performance of classification algorithms. In particular, SMO was second to none in predicting SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments involve a variety of sensor data collected under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be applied in each situation, depending on the characteristics of the datasets and their missing values. This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) who must choose the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate, as well as the TP rate, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
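To make the metric discussion concrete, RMSE, TP rate, and FP rate can all be computed from predicted positive-class probabilities and true labels. The toy values below are invented for illustration; they are not results from the paper's experiments.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # ground-truth binary labels
p_pred = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.6])  # predicted P(class = 1)
y_pred = (p_pred >= 0.5).astype(int)          # threshold at 0.5

# RMSE over predicted probabilities (the single metric used in the paper).
rmse = np.sqrt(np.mean((p_pred - y_true) ** 2))

# TP rate (recall on the positive class) and FP rate, the metrics raised
# in connection with class imbalance.
tp_rate = np.mean(y_pred[y_true == 1] == 1)
fp_rate = np.mean(y_pred[y_true == 0] == 1)

print(f"RMSE={rmse:.3f}, TP rate={tp_rate:.2f}, FP rate={fp_rate:.2f}")
# prints: RMSE=0.387, TP rate=0.75, FP rate=0.25
```

On an imbalanced dataset, RMSE can stay flat while the TP/FP trade-off shifts sharply, which is why reporting all three gives a fuller picture.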
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[Figure 3: Comparison of performances of imputation methods for each dataset. Panels: (a) Iris, (b) Glass Identification, (c) Liver Disorders, (d) Ionosphere, (e) Wine, (f) Statlog Shuttle. x-axis: missing ratio (0.01, 0.05, 0.10). Legend: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 6: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         −0.76**    −0.75**   −1.78**  −0.72**      1.15**    0.07
N_cases              −0.79**    −0.49**    0.12    −0.17       −0.32     −0.48**
C_imbalance           1.17**     2.39**    2.64**   5.25**      1.63**    1.98**
R_missing             0.51*      0.78**    0.40     0.80**      0.76**    0.68**
SE_HS                 2.49**     2.85**    1.86**   2.77**      3.35**    2.45**
SE_VS                −0.09      −0.13     −0.06    −0.13       −0.16     −0.10
Spread               −3.82**    −4.30**   −2.61**  −4.36**     −4.52**   −3.63**
P_missing_dum1       −0.49      −0.38     −0.38    −0.37       −0.45     −0.38
P_missing_dum2       −0.02       0.14      0.02     0.11        0.01      0.11
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 7: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): group mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         −0.68**    −0.72**   −1.79**  −0.68**      1.15**    0.10
N_cases              −0.82**    −0.50**    0.11    −0.18       −0.34*    −0.47**
C_imbalance           1.15**     2.28**    2.60**   5.17**      1.56**    1.97**
R_missing             0.50**     0.85**    0.43     0.84**      0.95**    0.66**
SE_HS                 2.30**     2.68**    1.78**   2.73**      3.00**    2.48**
SE_VS                −0.08      −0.12     −0.06    −0.13       −0.13     −0.10
Spread               −2.96**    −4.39**   −2.64**  −4.43**     −4.76**   −3.82**
P_missing_dum1       −0.43      −0.32     −0.34    −0.35       −0.35     −0.41
P_missing_dum2        0.02       0.24      0.04     0.16        0.21      0.13
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
5.2. Classification Algorithm. Figure 4 shows the performance of the classification algorithms by imputation method and ratio of missing data. As shown in the figure, the performance of each imputation method was similar and did not vary depending on the ratio of missing data, except for listwise deletion. For listwise deletion, performance deteriorated as the ratio of missing to complete data increased. In the listwise deletion method, all records that contain missing data are deleted; therefore, the number of deleted records increases as the ratio of missing data increases. The low performance of this method can be explained by this fact.

The differences in performance between the imputation methods were minor. The figure displays these differences by classification algorithm. Using the Bayesian network and logistic classifiers significantly improved performance compared to the other classifiers. However, the relationships among missing data, imputation methods, and classifiers remained to be explained; thus, a regression analysis was conducted.
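The shrinking-sample explanation for listwise deletion can be made concrete with a short simulation on synthetic data: with d features and an independent cell-wise missing probability p, roughly (1 − p)^d of the records survive deletion. The function name and parameters below are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(42)

def surviving_fraction(n_rows: int, n_cols: int, p_missing: float) -> float:
    """Fraction of records with no missing cell, i.e. records that
    listwise deletion would keep."""
    mask = rng.random((n_rows, n_cols)) < p_missing  # True marks a missing cell
    complete = ~mask.any(axis=1)                     # rows with zero missing cells
    return float(complete.mean())

for p in (0.01, 0.05, 0.10):
    frac = surviving_fraction(10_000, 10, p)
    print(f"missing ratio {p:.2f}: {frac:.1%} of records survive "
          f"(theory: {(1 - p) ** 10:.1%})")
```

At a 10% cell-wise missing ratio and 10 features, only about a third of the records remain, which is consistent with the performance drop attributed to listwise deletion above.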
In Figure 4, the results suggest the following rules:

(i) IF the missing rate increases AND IBk is used, THEN use the GROUP MEAN IMPUTATION method.
(ii) IF the missing rate increases AND the logistic classifier is used, THEN use the HOT DECK method.
(iii) IF the missing rate increases AND the regression method is used, THEN use the GROUP MEAN IMPUTATION method.
(iv) IF the missing rate increases AND the BayesNet method is used, THEN use the GROUP MEAN IMPUTATION method.
(v) IF the missing rate increases AND the trees.J48 method is used, THEN use the k-NN method.
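These classifier-conditioned rules likewise reduce to a lookup keyed by classifier. As before, the encoding below is an illustrative sketch, not the paper's code:

```python
# Suggested imputation method per classifier when the missing rate grows,
# per the five rules above (illustrative encoding only).
IMPUTATION_FOR_CLASSIFIER = {
    "IBk": "GROUP MEAN IMPUTATION",
    "Logistic": "HOT DECK",
    "Regression": "GROUP MEAN IMPUTATION",
    "BayesNet": "GROUP MEAN IMPUTATION",
    "trees.J48": "k-NN",
}

def recommend_imputation(classifier: str) -> str:
    """Suggested imputation method for a classifier under an
    increasing missing rate."""
    return IMPUTATION_FOR_CLASSIFIER[classifier]

print(recommend_imputation("Logistic"))  # prints: HOT DECK
```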
5.3. Regression. The results of the regression analysis are presented in Tables 6, 7, 8, 9, 10, and 11. The analysis was conducted using 900 datasets (3 missing ratios × 3 missing patterns × 100 trials).
[Figure 4: Comparison of classifiers in terms of classification performance. Panels: (a) Decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) Regression, (e) Logistic, (f) IBk (k-nearest neighbor classifier). x-axis: missing ratio (0.01, 0.05, 0.10). Legend: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient): predictive mean imputation.

Data characteristic  trees.J48  BayesNet  SMO      Regression  Logistic  IBk
N_attributes         −0.76**    −0.76**   −1.78**  −0.63**      1.23**    0.16
N_cases              −0.84**    −0.49**    0.12    −0.17       −0.34*    −0.47**
C_imbalance           1.17**     2.42**    2.63**   5.23**      1.53**    1.98**
R_missing             0.50*      0.79**    0.43     0.85**      0.80**    0.68**
SE_HS                 2.23**     2.79**    1.82**   2.68**      3.22**    2.42**
SE_VS                −0.08      −0.13     −0.06    −0.13       −0.15     −0.09
Spread               −3.28**    −4.32**   −2.62**  −4.34**     −4.65**   −3.61**
P_missing_dum1       −0.42      −0.35     −0.34    −0.28       −0.44     −0.36
P_missing_dum2        0.08       0.12      0.04     0.18        0.07      0.11
Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 10 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-NN
Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus085lowastlowast minus079lowastlowast minus181lowastlowast minus068lowastlowast 122lowastlowast 006N cases minus083lowastlowast minus049lowastlowast 011 minus018 minus034lowast minus047lowastlowast
C imbalance 143lowastlowast 249lowastlowast 260lowastlowast 521lowastlowast 152lowastlowast 211lowastlowast
R missing 054lowast 078lowastlowast 041 085lowastlowast 075lowastlowast 071lowastlowast
SE HS 234lowastlowast 290lowastlowast 182lowastlowast 269lowastlowast 328lowastlowast 255lowastlowast
SE VS minus010 minus013 minus006 minus013 minus014 minus011Spread minus332lowastlowast minus427lowastlowast minus264lowastlowast minus431lowastlowast minus450lowastlowast minus369lowastlowast
P missing dum1 minus038 minus041 minus035 minus029 minus057 minus035P missing dum2 003 008 005 017 000 011Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001
Mathematical Problems in Engineering 11
Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING
Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast
C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast
R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast
SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast
SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast
P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001
patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected
(i) IF N_attributes increases THEN use SMO.
(ii) IF N_cases increases THEN use trees.J48.
(iii) IF C_imbalance increases THEN use trees.J48.
(iv) IF R_missing increases THEN use SMO.
(v) IF SE_HS increases THEN use SMO.
(vi) IF Spread increases THEN use Logistic.
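Encoded as a lookup, the rules above become machine-checkable. The sketch below is illustrative only (the mapping follows the rules verbatim, but the constant and function names are ours, not the paper's):

```python
# Rules (i)-(vi): for the dataset characteristic that grows, pick the
# classifier whose RMSE the regression found least sensitive to it.
CLASSIFIER_RULES = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def recommend_classifier(increasing_characteristic: str) -> str:
    """Return the classifier recommended when the given dataset
    characteristic is the dominant (increasing) one."""
    return CLASSIFIER_RULES[increasing_characteristic]
```

A recommender system could consult such a table at preprocessing time, before committing to a classifier.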
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient of the number of attributes (N_attributes) was observed for the logistic algorithm than for any other algorithm. Thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scatteredness standard error (SE_HS), SMO had the smallest coefficient and was therefore the least affected, consistent with rule (v). For missing data spread, the logistic classifier had the most strongly negative coefficient, that is, the lowest RMSE, consistent with rule (vi).
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness standard error (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
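For reference, the two simplest treatments compared here are purely mechanical. A minimal sketch of listwise deletion and column-mean imputation, assuming `None` marks a missing cell (the helper names are ours):

```python
import statistics

def listwise_delete(rows):
    # Drop every record that contains at least one missing cell.
    return [r for r in rows if all(v is not None for v in r)]

def mean_impute(rows):
    # Replace each missing cell with the mean of its column's observed values.
    cols = list(zip(*rows))
    means = [statistics.mean([v for v in c if v is not None]) for c in cols]
    return [[v if v is not None else means[j] for j, v in enumerate(r)]
            for r in rows]
```

Group mean imputation differs only in computing the column means per class rather than over the whole column.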
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE based on the ratio of missing data for each imputation method. As the ratio increases, the performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
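RMSE, the error metric used throughout, is the square root of the mean squared deviation between predicted and actual values. A minimal reference implementation:

```python
import math

def rmse(actual, predicted):
    # Root-mean-square error; lower is better, as in the tables above.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))
```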
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other. In total, 226,800 datasets (3
Mathematical Problems in Engineering

[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); y-axis: regression coefficients (approximately -0.4 to 0.3). Legend: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, k-NN.]
[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). x-axis: dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread); y-axis: regression coefficients (approximately -0.6 to 0.8). Legend: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-NN, k-means clustering.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and imputation algorithm.
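The dummy coding used in these regressions can be made explicit. The pattern coding below is taken from the table notes (univariate → (1, 0), monotone → (0, 1), arbitrary → (1, 1)); the helper function itself is our own sketch, not from the paper:

```python
# Missing-pattern dummy coding as given in the notes to Tables 6-11.
PATTERN_DUMMIES = {
    "univariate": (1, 0),
    "monotone": (0, 1),
    "arbitrary": (1, 1),
}

def pattern_dummies(pattern: str) -> tuple:
    """Return (P_missing_dum1, P_missing_dum2) for a missing-data pattern."""
    return PATTERN_DUMMIES[pattern]
```

With characteristics, dummies, and control variables assembled per dataset, the design matrix for the multiple regression follows directly.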
[Figure 7: RMSE by ratio of missing data. x-axis: missing ratio (0.05–0.50); y-axis: RMSE (0.300–0.490), by method of imputation. Legend: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B         Data characteristic    B
(constant)            0.60**    M_imputation_dum1      0.12**
R_missing             0.83**    M_imputation_dum2      -0.01*
SE_HS                 -0.05**   M_imputation_dum4      0.00
SE_VS                 0.00**    M_imputation_dum5      0.00
Spread                0.17**    M_imputation_dum6      0.01**
N_attributes          -0.08**   M_imputation_dum7      -0.01*
C_imbalance           -0.03**   P_missing_dum1         -0.06**
N_cases               0.02**    P_missing_dum3         0.00
Note 1: Dummy variables for imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
6. Conclusion
So far, prior research has not fully informed us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides developers of classification and recommender systems in selecting the best classification algorithm based
on the dataset and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.
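The SE_HS / SE_VS distinction can be made concrete: horizontal scatteredness looks at missing cells per record, vertical scatteredness at missing cells per feature. The paper does not spell out the exact statistic, so the per-row and per-column standard deviations of missing-cell counts below are an illustrative proxy only:

```python
import statistics

def scatteredness(rows, n_cols):
    # Spread of missing-cell counts per record (horizontal) and per
    # feature (vertical); None marks a missing cell. Illustrative proxy.
    row_counts = [sum(v is None for v in r) for r in rows]
    col_counts = [sum(r[j] is None for r in rows) for j in range(n_cols)]
    return statistics.pstdev(row_counts), statistics.pstdev(col_counts)
```

In the example below, the missing cells pile up unevenly across records but evenly across features, so the horizontal measure exceeds the vertical one.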
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of classification algorithms. In particular, SMO was second to none in handling SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values) and the choice of classification algorithm improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING
Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast
C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast
R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast
SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast
SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast
P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001
patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected
(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic
Figure 5 displays the coefficient pattern of the decisiontree classifier for each imputation method Dataset char-acteristics are illustrated on the 119909-axis and the regressioncoefficients for each imputationmethod on the 119910-axis For allimputation methods except listwise deletion the classifiersrsquocoefficient patterns seemed similar However significantdifferences were found in the coefficient patterns using otheralgorithms For example for all imputationmethods a higherbeta coefficient of the number of attributes (N attributes)was observed for the logistics algorithm than for any otheralgorithm Thus the logistics algorithm exhibited the lowestperformance (highest RMSE) in terms of the number ofattributes In terms of the number of cases (N cases) SMOperformed the worst When the data were imbalanced theregressionmethod was the least effective one For themissingratio the regression method showed the lowest performance
except in comparison to listwise deletion and mean impu-tation For the horizontal scattered standard error (SE HS)SMO had the lowest performance For missing data spreadthe logistic classifier method had the lowest performance
Moreover for each single factor (eg spread) even ifthe results for two algorithms were the same their perfor-mance differed depending on which imputation method wasapplied For example for the decision tree (J48) algorithmthe mean imputation method had the most negative effect onclassification performance for horizontal scattered standarderror (SE HS) and spread while the listwise deletion andgroupmean imputationmethods had the least negative effect
The similar coefficient patterns shown in Figure 5 indicatethat the differences in impact of each imputation method onperformance were insignificant In order to determine theimpact of the classifiers more tests were needed Figure 6illustrates the coefficient patterns when the ratio of missingto complete data is 90 Under these circumstances thedistinction between imputationmethods according to datasetcharacteristics is significant For example very high or verylow beta coefficients may be observed for most datasetcharacteristics except the number of instances and classimbalance
Figure 7 shows the RMSE based on the ratio of missingdata for each imputation method As the ratio increases theperformance drops (RMSE increases) this is not an unex-pected result However as the ratio of missing to completedata increases the differences in performance between impu-tation methods become significant These results imply thatthe characteristics of the dataset andmissing values affect theperformance of the classifier algorithms Furthermore thepatterns of these effects differ depending on the imputationmethods and classifiers used
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classification algorithm in terms of RMSE, on the other. In total, 226,800 datasets (3
12 Mathematical Problems in Engineering
[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); y-axis: beta coefficient (−0.4 to 0.3); legend: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, k-NN.]
[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). x-axis: dataset characteristics (attributes, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread); y-axis: beta coefficient (−0.6 to 0.8); legend: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-NN, k-means clustering.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, as long as the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of a classification algorithm and an imputation algorithm.
[Figure 7: RMSE by ratio of missing data. x-axis: missing ratio (0.05 to 0.50); y-axis: RMSE (0.300 to 0.490); legend (method of imputation): hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic    B        Data characteristic    B
(constant)             0.60**   M imputation dum1      0.12**
R missing              0.83**   M imputation dum2     −0.01*
SE HS                 −0.05**   M imputation dum4      0.00
SE VS                  0.00**   M imputation dum5      0.00
Spread                 0.17**   M imputation dum6      0.01**
N attributes          −0.08**   M imputation dum7     −0.01*
C imbalance           −0.03**   P missing dum1        −0.06**
N cases                0.02**   P missing dum3         0.00

Note 1: Dummy variables for the imputation methods: LISTWISE DELETION (M imputation dum1 = 1, others = 0), MEAN IMPUTATION (M imputation dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M imputation dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M imputation dum4 = 1, others = 0), HOT DECK (M imputation dum5 = 1, others = 0), k-NN (M imputation dum6 = 1, others = 0), and k-MEANS CLUSTERING (M imputation dum7 = 1, others = 0). Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0, P missing dum3 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1, P missing dum3 = 0), and arbitrary (P missing dum1 = 1, P missing dum2 = 1, P missing dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1, **P < 0.05.
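Note 1's coding scheme can be made concrete with a small sketch (the function name and dict layout are our own, hypothetical). Each observation receives exactly one active imputation-method indicator; GROUP MEAN IMPUTATION (dum3) does not appear among the reported coefficients, which suggests it serves as the reference category:

```python
# One-hot (dummy) coding of the imputation method, following Table 12's Note 1.
METHODS = [
    "LISTWISE DELETION",           # M imputation dum1
    "MEAN IMPUTATION",             # M imputation dum2
    "GROUP MEAN IMPUTATION",       # M imputation dum3 (apparent baseline)
    "PREDICTIVE MEAN IMPUTATION",  # M imputation dum4
    "HOT DECK",                    # M imputation dum5
    "k-NN",                        # M imputation dum6
    "k-MEANS CLUSTERING",          # M imputation dum7
]

def method_dummies(method):
    """Return the seven 0/1 indicators for one observation."""
    pos = METHODS.index(method) + 1
    return {f"M_imputation_dum{i}": int(i == pos) for i in range(1, 8)}

d = method_dummies("HOT DECK")
print(d["M_imputation_dum5"], sum(d.values()))  # 1 1: exactly one indicator set
```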
6 Conclusion
Thus far, prior research has not fully informed us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides classification/recommender system developers in selecting the best classification algorithm based
on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that classification performance was more sensitive to the number of missing cells in each record (SE HS) than to the number of missing cells in each feature (SE VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, whereas it does not for the other algorithms.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of classification algorithms. In particular, SMO was second to none with respect to SE HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
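The linear-boundary limitation can be seen in a self-contained example (ours, not from the paper): no linear rule separates XOR-labeled points, while a two-level, tree-style nested rule classifies them exactly.

```python
from itertools import product

# XOR: the classic case where no linear decision boundary exists, which is
# exactly the limitation of logistic regression noted above.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linearly_separates(w1, w2, b):
    """True if the half-plane w1*x1 + w2*x2 + b > 0 matches every label."""
    return all((w1 * x1 + w2 * x2 + b > 0) == bool(y) for (x1, x2), y in points)

# Brute-force search over a coarse grid of linear boundaries: none works
# (this is provable for any real weights, not just the grid).
grid = [i / 2 for i in range(-8, 9)]
found = any(linearly_separates(w1, w2, b)
            for w1, w2, b in product(grid, repeat=3))
print(found)  # False

# A two-level decision "tree" (nested if/else) classifies XOR perfectly.
def tiny_tree(x1, x2):
    return (x2 == 0) if x1 == 1 else (x2 == 1)

assert all(tiny_tree(*p) == bool(y) for p, y in points)
```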
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values) and the choice of classification algorithm improves the accuracy of ubiquitous computing applications. Also, a set of optimal combinations may be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data gathered under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated by a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659) and funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[Figure 4: Comparison of classifiers in terms of classification performance. Panels: (a) decision tree (J48), (b) BayesNet, (c) SMO (SVM), (d) Regression, (e) Logistic, (f) IBk (k-nearest neighbor classifier). x-axis: missing ratio (0.01, 0.05, 0.10); y-axis: RMSE (approx. 0.28 to 0.50); legend (method of imputation): hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): Predictive Mean Imputation.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.76**     −0.76**    −1.78**   −0.63**       1.23**     0.16
N cases               −0.84**     −0.49**     0.12     −0.17        −0.34*     −0.47**
C imbalance            1.17**      2.42**     2.63**    5.23**       1.53**     1.98**
R missing              0.50*       0.79**     0.43      0.85**       0.80**     0.68**
SE HS                  2.23**      2.79**     1.82**    2.68**       3.22**     2.42**
SE VS                 −0.08       −0.13      −0.06     −0.13        −0.15      −0.09
Spread                −3.28**     −4.32**    −2.62**   −4.34**      −4.65**    −3.61**
P missing dum1        −0.42       −0.35      −0.34     −0.28        −0.44      −0.36
P missing dum2         0.08        0.12       0.04      0.18         0.07       0.11

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): Hot deck.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.80**     −0.73**    −1.76**   −0.71**       1.15**     0.07
N cases               −0.81**     −0.49**     0.12     −0.18        −0.34*     −0.47**
C imbalance            1.35**      2.37**     2.61**    5.24**       1.33**     2.11**
R missing              0.62**      0.83**     0.44      0.84**       0.75**     0.70**
SE HS                  2.25**      2.75**     1.83**    2.71**       3.13**     2.54**
SE VS                 −0.09       −0.13      −0.06     −0.13        −0.14      −0.10
Spread                −3.65**     −4.28**    −2.65**   −4.27**      −4.41**    −3.61**
P missing dum1        −0.35       −0.37      −0.34     −0.33        −0.48      −0.38
P missing dum2         0.12        0.15       0.04      0.12        −0.04       0.09

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.85**     −0.79**    −1.81**   −0.68**       1.22**     0.06
N cases               −0.83**     −0.49**     0.11     −0.18        −0.34*     −0.47**
C imbalance            1.43**      2.49**     2.60**    5.21**       1.52**     2.11**
R missing              0.54*       0.78**     0.41      0.85**       0.75**     0.71**
SE HS                  2.34**      2.90**     1.82**    2.69**       3.28**     2.55**
SE VS                 −0.10       −0.13      −0.06     −0.13        −0.14      −0.11
Spread                −3.32**     −4.27**    −2.64**   −4.31**      −4.50**    −3.69**
P missing dum1        −0.38       −0.41      −0.35     −0.29        −0.57      −0.35
P missing dum2         0.03        0.08       0.05      0.17         0.00       0.11

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-MEANS CLUSTERING.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N attributes          −0.80**     −0.78**    −1.81**   −0.68**       1.17**     0.09
N cases               −0.79**     −0.49**     0.12     −0.17        −0.33      −0.47**
C imbalance            1.36**      2.40**     2.63**    5.24**       1.45**     2.06**
R missing              0.57*       0.79**     0.41      0.84**       0.79**     0.57*
SE HS                  2.36**      2.89**     1.83**    2.71**       3.15**     2.64**
SE VS                 −0.09       −0.13      −0.06     −0.13        −0.14      −0.11
Spread                −3.62**     −4.39**    −2.62**   −4.40**      −4.74**    −3.63**
P missing dum1        −0.37       −0.42      −0.36     −0.32        −0.38      −0.46
P missing dum2         0.02        0.13       0.01      0.14         0.09       0.04

Note 1: N attributes: number of attributes; N cases: number of cases; C imbalance: degree of class imbalance; R missing: missing data ratio; SE HS: horizontal scatteredness; SE VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P missing dum1 = 1, P missing dum2 = 0), monotone (P missing dum1 = 0, P missing dum2 = 1), and arbitrary (P missing dum1 = 1, P missing dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05, **P < 0.01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three missing patterns were treated as two dummy variables (P missing dum1/dum2: 00, 01, 10). Tables 6–11 present the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:
(i) IF N attributes increases, THEN use SMO.
(ii) IF N cases increases, THEN use trees.J48.
(iii) IF C imbalance increases, THEN use trees.J48.
(iv) IF R missing increases, THEN use SMO.
(v) IF SE HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
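These six rules could be operationalized as a simple selector; the sketch below is our own illustration (the threshold scheme, reference profile, and all numeric values are assumptions, not from the paper):

```python
# Rule set (i)-(vi): each rule maps a dataset characteristic to the classifier
# that the regression analysis suggests when that characteristic is high.
RULES = [
    ("N_attributes", "SMO"),
    ("N_cases", "trees.J48"),
    ("C_imbalance", "trees.J48"),
    ("R_missing", "SMO"),
    ("SE_HS", "SMO"),
    ("Spread", "Logistic"),
]

def recommend(profile, reference):
    """Return the classifiers suggested by every rule whose characteristic
    exceeds its value in a caller-supplied reference profile; resolving
    conflicting suggestions is left to the caller."""
    return [clf for feat, clf in RULES if profile[feat] > reference[feat]]

# Hypothetical dataset profile vs. a hypothetical baseline profile.
example  = {"N_attributes": 40, "N_cases": 500,  "C_imbalance": 0.1,
            "R_missing": 0.30, "SE_HS": 0.8, "Spread": 0.2}
baseline = {"N_attributes": 20, "N_cases": 1000, "C_imbalance": 0.2,
            "R_missing": 0.10, "SE_HS": 0.5, "Spread": 0.4}
print(recommend(example, baseline))  # ['SMO', 'SMO', 'SMO']
```

Here only the attribute-count, missing-ratio, and SE HS rules fire, so every suggestion points to SMO; a real deployment would need calibrated thresholds for "increases."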
[11] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002
[12] H Finch ldquoEstimation of item response theory parameters in thepresence of missing datardquo Journal of Educational Measurementvol 45 no 3 pp 225ndash245 2008
[13] S J Press and S Wilson ldquoChoosing between logistic regressionand discriminant analysisrdquo Journal of the American StatisticalAssociation vol 73 no 364 pp 699ndash705 1978
[14] E Frank YWang S Inglis G Holmes and I HWitten ldquoUsingmodel trees for classificationrdquo Machine Learning vol 32 no 1pp 63ndash76 1998
[15] O Kwon and J M Sim ldquoEffects of data set features on theperformances of classification algorithmsrdquo Expert Systems withApplications vol 40 no 5 pp 1847ndash1857 2013
[16] E Namsrai T Munkhdalai M Li J-H Shin O-E Namsraiand K H Ryu ldquoA feature selection-based ensemble methodfor arrhythmia classificationrdquo Journal of Information ProcessingSystems vol 9 no 1 pp 31ndash40 2013
[17] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann San Fran-cisco Calif USA 2nd edition 2005
[18] M Galar A Fernandez E Barrenechea H Bustince and FHerrera ldquoA review on ensembles for the class imbalance prob-lem bagging- boosting- and hybrid-based approachesrdquo IEEETransactions on Systems Man and Cybernetics C Applicationsand Reviews vol 42 no 4 pp 463ndash484 2012
[19] Q Yang and X Wu ldquo10 challenging problems in data miningresearchrdquo International Journal of Information Technology ampDecision Making vol 5 no 4 pp 597ndash604 2006
[20] Z-H Zhou and X-Y Liu ldquoTraining cost-sensitive neural net-works with methods addressing the class imbalance problemrdquoIEEE Transactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006
10 Mathematical Problems in Engineering
Table 8: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): predictive mean imputation.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N_attributes          -.076**     -.076**    -.178**   -.063**       .123**     .016
N_cases               -.084**     -.049**     .012     -.017        -.034*     -.047**
C_imbalance            .117**      .242**     .263**    .523**       .153**     .198**
R_missing              .050*       .079**     .043      .085**       .080**     .068**
SE_HS                  .223**      .279**     .182**    .268**       .322**     .242**
SE_VS                 -.008       -.013      -.006     -.013        -.015      -.009
Spread                -.328**     -.432**    -.262**   -.434**      -.465**    -.361**
P_missing_dum1        -.042       -.035      -.034     -.028        -.044      -.036
P_missing_dum2         .008        .012       .004      .018         .007       .011

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 9: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): hot deck.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N_attributes          -.080**     -.073**    -.176**   -.071**       .115**     .007
N_cases               -.081**     -.049**     .012     -.018        -.034*     -.047**
C_imbalance            .135**      .237**     .261**    .524**       .133**     .211**
R_missing              .062**      .083**     .044      .084**       .075**     .070**
SE_HS                  .225**      .275**     .183**    .271**       .313**     .254**
SE_VS                 -.009       -.013      -.006     -.013        -.014      -.010
Spread                -.365**     -.428**    -.265**   -.427**      -.441**    -.361**
P_missing_dum1        -.035       -.037      -.034     -.033        -.048      -.038
P_missing_dum2         .012        .015       .004      .012        -.004       .009

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 10: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-NN.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N_attributes          -.085**     -.079**    -.181**   -.068**       .122**     .006
N_cases               -.083**     -.049**     .011     -.018        -.034*     -.047**
C_imbalance            .143**      .249**     .260**    .521**       .152**     .211**
R_missing              .054*       .078**     .041      .085**       .075**     .071**
SE_HS                  .234**      .290**     .182**    .269**       .328**     .255**
SE_VS                 -.010       -.013      -.006     -.013        -.014      -.011
Spread                -.332**     -.427**    -.264**   -.431**      -.450**    -.369**
P_missing_dum1        -.038       -.041      -.035     -.029        -.057      -.035
P_missing_dum2         .003        .008       .005      .017         .000       .011

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
Table 11: Factors influencing accuracy (RMSE) for each algorithm (standardized beta coefficients): k-means clustering.

Data characteristic   trees.J48   BayesNet   SMO       Regression   Logistic   IBk
N_attributes          -.080**     -.078**    -.181**   -.068**       .117**     .009
N_cases               -.079**     -.049**     .012     -.017        -.033      -.047**
C_imbalance            .136**      .240**     .263**    .524**       .145**     .206**
R_missing              .057*       .079**     .041      .084**       .079**     .057*
SE_HS                  .236**      .289**     .183**    .271**       .315**     .264**
SE_VS                 -.009       -.013      -.006     -.013        -.014      -.011
Spread                -.362**     -.439**    -.262**   -.440**      -.474**    -.363**
P_missing_dum1        -.037       -.042      -.036     -.032        -.038      -.046
P_missing_dum2         .002        .013       .001      .014         .009       .004

Note 1: N_attributes: number of attributes; N_cases: number of cases; C_imbalance: degree of class imbalance; R_missing: missing data ratio; SE_HS: horizontal scatteredness; SE_VS: vertical scatteredness; Spread: missing data spread. Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1).
Note 2: RMSE indicates error; therefore, lower values are better.
Note 3: *P < 0.05; **P < 0.01.
patterns × 100 trials). Each dataset was generated randomly to meet the preconditions. We conducted the performance evaluation by randomly assigning each dataset to test/training sets at a 3:7 ratio. The regression analysis included the characteristics of the datasets and the patterns of the missing values as independent variables. Control variables, such as the type of classifier and imputation method, were also included. The effects of the various characteristics of the data and missing values on classifier performance (RMSE) were analyzed. The three types of missing patterns were treated as two dummy variables (P_missing_dum1 and P_missing_dum2). Tables 6–11 illustrate the results of the regression analysis for the various imputation methods. The results suggest the following rules, regardless of which imputation method is selected:
(i) IF N_attributes increases, THEN use SMO.
(ii) IF N_cases increases, THEN use trees.J48.
(iii) IF C_imbalance increases, THEN use trees.J48.
(iv) IF R_missing increases, THEN use SMO.
(v) IF SE_HS increases, THEN use SMO.
(vi) IF Spread increases, THEN use Logistic.
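The six rules above amount to a lookup from the dominant (increasing) dataset characteristic to the favoured classifier. A minimal sketch, with names of our own choosing (the study does not provide code):

```python
# Map each dataset characteristic that increases to the classification
# algorithm the regression results favour (Tables 6-11).
RULES = {
    "N_attributes": "SMO",
    "N_cases": "trees.J48",
    "C_imbalance": "trees.J48",
    "R_missing": "SMO",
    "SE_HS": "SMO",
    "Spread": "Logistic",
}

def recommend_classifier(characteristic: str) -> str:
    """Return the recommended classifier when the given characteristic increases."""
    return RULES[characteristic]

print(recommend_classifier("C_imbalance"))  # trees.J48
```

In practice such a table would be consulted with whichever characteristic dominates the dataset at hand.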
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics are shown on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, a higher beta coefficient for the number of attributes (N_attributes) was observed for the logistic algorithm than for any other algorithm. Thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. With respect to the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For horizontal scatteredness (SE_HS), SMO had the lowest performance. For missing data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scatteredness (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE by the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
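For reference, the RMSE reported in Figure 7 (and throughout the study) is the root of the mean squared deviation between predicted and actual values; a minimal implementation:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error; lower values indicate better performance."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Two wrong predictions out of four binary labels:
print(rmse([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.7071... (sqrt of 2/4)
```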
Lastly, we estimate the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation and the performance of each classification algorithm in terms of RMSE. In total, 226,800 datasets (3
[Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). x-axis: dataset characteristics (attribute, instances, data imbalance, missing ratio, H-scatteredness, V-scatteredness, spread, missing p1, missing p2); y-axis: regression coefficient (roughly -0.4 to 0.3); one line per imputation method: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, and k-NN.]
[Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). Same layout and imputation-method series as Figure 5, without the missing-pattern dummies; coefficients span roughly -0.6 to 0.8.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of classification algorithm and imputation algorithm.
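The first implication can be sketched as follows: given the standardized characteristics of a new dataset, a linear predictor built from the Table 12 coefficients estimates the expected RMSE. Treating the standardized betas as a plain linear model (and omitting the imputation/pattern dummies) is our own simplification for illustration, not the paper's procedure:

```python
# Standardized beta coefficients transcribed from Table 12
# (dataset characteristics only).
BETAS = {
    "R_missing": 0.083, "SE_HS": -0.005, "SE_VS": 0.000, "Spread": 0.017,
    "N_attributes": -0.008, "C_imbalance": -0.003, "N_cases": 0.002,
}
CONSTANT = 0.060

def predict_rmse(z_features: dict) -> float:
    """Estimate RMSE from standardized dataset characteristics."""
    return CONSTANT + sum(BETAS[k] * v for k, v in z_features.items())

baseline = predict_rmse({k: 0.0 for k in BETAS})
print(round(baseline, 3))  # 0.06 at the mean of every characteristic
```

A higher missing ratio or spread raises the estimate, matching the signs reported in Table 12.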
[Figure 7: RMSE by ratio of missing data. x-axis: missing ratio (0.05 to 0.50); y-axis: RMSE (0.300 to 0.490); one line per imputation method: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, and k-means clustering.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B         Data characteristic    B
(constant)             .060**   M_imputation_dum1       .012**
R_missing              .083**   M_imputation_dum2      -.001*
SE_HS                 -.005**   M_imputation_dum4       .000
SE_VS                  .000**   M_imputation_dum5       .000
Spread                 .017**   M_imputation_dum6       .001**
N_attributes          -.008**   M_imputation_dum7      -.001*
C_imbalance           -.003**   P_missing_dum1         -.006**
N_cases                .002**   P_missing_dum3          .000

Note 1: Dummy variables for imputation methods: listwise deletion (M_imputation_dum1 = 1, others = 0), mean imputation (M_imputation_dum2 = 1, others = 0), group mean imputation (M_imputation_dum3 = 1, others = 0), predictive mean imputation (M_imputation_dum4 = 1, others = 0), hot deck (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-means clustering (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
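The dummy coding in Note 1 of Table 12 is a one-hot scheme over the seven imputation methods. A sketch (function and list names are ours; the ordering follows the dummy indices in the note):

```python
# One indicator per imputation method, M_imputation_dum1..dum7.
IMPUTATION_METHODS = [
    "listwise deletion", "mean imputation", "group mean imputation",
    "predictive mean imputation", "hot deck", "k-NN", "k-means clustering",
]

def imputation_dummies(method: str) -> list:
    """Return [M_imputation_dum1, ..., M_imputation_dum7] for a method."""
    if method not in IMPUTATION_METHODS:
        raise ValueError("unknown imputation method: " + method)
    return [int(method == m) for m in IMPUTATION_METHODS]

print(imputation_dummies("hot deck"))  # [0, 0, 0, 0, 1, 0, 0]
```

In the reported regression, group mean imputation (dum3) serves as the omitted reference category, which is why it does not appear in the table.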
6 Conclusion
So far, prior research does not fully inform us of the fitness among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set which guides classification/recommender system developers in selecting the best classification algorithm based on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is examined across multiple dimensions (datasets, missing data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that the number of missing cells in each record (SE_HS) affected classification performance more strongly than the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while other factors do not.
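To make the SE_HS/SE_VS distinction concrete, here is our own reconstruction of what the two measures capture: the dispersion of missing-cell counts per record (horizontal) versus per feature (vertical). The paper's exact formula is not given, so this sketch, using the population standard deviation, is an assumption for illustration only:

```python
import statistics

def scatteredness(mask):
    """mask[i][j] is True where cell (i, j) is missing.

    Returns (SE_HS, SE_VS): the standard deviation of missing-cell
    counts per record (horizontal) and per feature (vertical).
    """
    per_record = [sum(row) for row in mask]
    per_feature = [sum(col) for col in zip(*mask)]
    return statistics.pstdev(per_record), statistics.pstdev(per_feature)

# All missing cells concentrated in one record: horizontally scattered,
# vertically uniform.
mask = [[True, True, True],
        [False, False, False]]
print(scatteredness(mask))  # (1.5, 0.0)
```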
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds ratio results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of the classification algorithms. In particular, SMO was second to none under high SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
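The linearity referred to above can be seen directly: a logistic model's log-odds are an affine function of the inputs, so its decision boundary w·x + b = 0 is always a hyperplane, however the classes actually curve. A minimal sketch with toy weights (not fitted values from the study):

```python
import math

def logistic_predict(x, weights, bias):
    """P(class = 1 | x); the log-odds, sum(w * x) + b, are linear in x."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Points on opposite sides of the hyperplane x1 - x2 = 0 receive
# probabilities on opposite sides of 0.5.
print(logistic_predict([2.0, 0.0], weights=[1.0, -1.0], bias=0.0) > 0.5)  # True
print(logistic_predict([0.0, 2.0], weights=[1.0, -1.0], bias=0.0) < 0.5)  # True
```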
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining various kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of the datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although in prior research on classification algorithms multiple benchmark datasets from the UCI repository have been used to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. For example, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
Mathematical Problems in Engineering 11
Table 11 Factors influencing accuracy (RMSE) for each algorithm (standard beta coefficient) 119896-MEANS CLUSTERING
Data characteristic treesJ48 BayesNet SMO Regression Logistic IBkN attributes minus080lowastlowast minus078lowastlowast minus181lowastlowast minus068lowastlowast 117lowastlowast 009N cases minus079lowastlowast minus049lowastlowast 012 minus017 minus033 minus047lowastlowast
C imbalance 136lowastlowast 240lowastlowast 263lowastlowast 524lowastlowast 145lowastlowast 206lowastlowast
R missing 057lowast 079lowastlowast 041 084lowastlowast 079lowastlowast 057lowast
SE HS 236lowastlowast 289lowastlowast 183lowastlowast 271lowastlowast 315lowastlowast 264lowastlowast
SE VS minus009 minus013 minus006 minus013 minus014 minus011Spread minus362lowastlowast minus439lowastlowast minus262lowastlowast minus440lowastlowast minus474lowastlowast minus363lowastlowast
P missing dum1 minus037 minus042 minus036 minus032 minus038 minus046P missing dum2 002 013 001 014 009 004Note 1 N attributes number of attributes N cases number of cases C imbalance degree of class imbalance R missing missing data ratio SE HS horizontalscatteredness SE VS vertical scatteredness spread missing data spread and missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0)monotone (P missing dum1 = 0 P missing dum2 = 1) and arbitrary (P missing dum1 = 1 P missing dum2 = 1)Note 2 RMSE indicates error therefore lower values are betterNote 3 lowast119875 lt 005 lowastlowast119875 lt 001
patternstimes 100 trials) Each dataset was generated randomly tomeet the preconditionsWe conducted the performance eval-uation by randomly assigning each dataset to testtrainingsets at a 3 7 ratio The regression analysis included thecharacteristics of the datasets and the patterns of the missingvalues as independent variables Control variables such asthe type of classifier and imputation method were alsoincludedThe effects of the various characteristics of the dataand missing values on classifier performance (RMSE) wereanalyzed Three types of missing ratios were treated as twodummy variables (P missing dum1 2 00 01 10) Tables 6ndash11illustrate the results of the regression analysis of the variousimputation methods The results suggest the following rulesregardless of which imputation method is selected
(i) IF N attributes increases THEN use SMO(ii) IF N cases increases THEN use treesJ48(iii) IF C imbalance increases THEN use treesJ48(iv) IF R missing increases THEN use SMO(v) IF SE HS increases THEN use SMO(vi) IF Spread increases THEN use Logistic
Figure 5 displays the coefficient pattern of the decision tree classifier for each imputation method. Dataset characteristics appear on the x-axis and the regression coefficients for each imputation method on the y-axis. For all imputation methods except listwise deletion, the classifiers' coefficient patterns appeared similar. However, significant differences were found in the coefficient patterns of the other algorithms. For example, for all imputation methods, the logistic algorithm showed a higher beta coefficient for the number of attributes (N_attributes) than any other algorithm; thus, the logistic algorithm exhibited the lowest performance (highest RMSE) with respect to the number of attributes. In terms of the number of cases (N_cases), SMO performed the worst. When the data were imbalanced, the regression method was the least effective. For the missing ratio, the regression method showed the lowest performance, except in comparison to listwise deletion and mean imputation. For the horizontal scattered standard error (SE_HS), SMO had the lowest performance. For missing-data spread, the logistic classifier had the lowest performance.
Moreover, for each single factor (e.g., spread), even if the results for two algorithms were the same, their performance differed depending on which imputation method was applied. For example, for the decision tree (J48) algorithm, the mean imputation method had the most negative effect on classification performance for horizontal scattered standard error (SE_HS) and spread, while the listwise deletion and group mean imputation methods had the least negative effect.
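To make the contrast between two of the compared imputation methods concrete, here is a minimal numpy sketch of mean imputation versus group (class-conditional) mean imputation on a toy feature; the data values are invented for illustration:

```python
import numpy as np

# Toy feature with missing cells (NaN) and a class label per row.
x = np.array([1.0, 2.0, np.nan, 10.0, np.nan, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

def mean_impute(x):
    """Fill every missing cell with the overall column mean."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(x)
    return out

def group_mean_impute(x, y):
    """Fill each missing cell with the mean of its own class."""
    out = x.copy()
    for c in np.unique(y):
        mask = (y == c) & np.isnan(x)
        out[mask] = np.nanmean(x[y == c])
    return out

print(mean_impute(x))        # missing cells -> 6.25 (overall mean)
print(group_mean_impute(x, y))  # -> 1.5 for class 0, 11.0 for class 1
```

Group mean imputation preserves the class-wise structure that the overall mean destroys, which is one reason the two methods interact differently with downstream classifiers.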
The similar coefficient patterns shown in Figure 5 indicate that the differences in the impact of each imputation method on performance were insignificant. To determine the impact of the classifiers, more tests were needed. Figure 6 illustrates the coefficient patterns when the ratio of missing to complete data is 90%. Under these circumstances, the distinction between imputation methods according to dataset characteristics is significant. For example, very high or very low beta coefficients may be observed for most dataset characteristics, except the number of instances and class imbalance.
Figure 7 shows the RMSE as a function of the ratio of missing data for each imputation method. As the ratio increases, performance drops (RMSE increases); this is not an unexpected result. However, as the ratio of missing to complete data increases, the differences in performance between imputation methods become significant. These results imply that the characteristics of the dataset and missing values affect the performance of the classifier algorithms. Furthermore, the patterns of these effects differ depending on the imputation methods and classifiers used.
Lastly, we estimated the accuracy (RMSE) of each method by conducting a multiple regression analysis. As shown in Table 12, the results confirmed a significant association between the characteristics of the missing data and the method of imputation, on the one hand, and the performance of each classifier in terms of RMSE, on the other. In total, 226,800 datasets (3
12 Mathematical Problems in Engineering
Figure 5: Coefficient pattern of the decision tree algorithm (RMSE). [Chart: dataset characteristics (Attribute, Instances, Data imbalance, Missing ratio, H-scatteredness, V-scatteredness, Spread, Missing p1, Missing p2) on the x-axis; regression coefficients (−0.4 to 0.3) on the y-axis; one line per imputation method: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-means clustering, k-NN.]
Figure 6: Coefficient pattern of the decision tree algorithm based on a 90% missing ratio (RMSE). [Chart: dataset characteristics (Attribute, Instances, Data imbalance, Missing ratio, H-scatteredness, V-scatteredness, Spread) on the x-axis; regression coefficients (−0.6 to 0.8) on the y-axis; one line per imputation method: mean imputation, predictive mean imputation, listwise deletion, group mean imputation, hot deck imputation, k-NN, k-means clustering.]
missing ratios × 3 missing patterns × 100 trials × 6 imputation methods × 7 classification methods) were analyzed. The results have at least two implications. First, we can predict the classification accuracy for an unknown dataset with missing data, provided that the data characteristics can be obtained. Second, we can establish general rules for selecting the optimal combination of classification algorithm and imputation algorithm.
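The multiple regression described above can be sketched in a few lines. The following is a synthetic illustration only (invented coefficients, noise, and variable subset; numpy's least-squares solver stands in for the statistical package used in the study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # number of simulated dataset/trial combinations

# Hypothetical dataset-level predictors (names mirror the paper's variables).
R_missing = rng.uniform(0.0, 0.5, n)   # ratio of missing data
spread = rng.uniform(0.0, 1.0, n)      # spread of missing values
# Three missing-ratio levels coded as two dummy variables, as in the paper.
level = rng.integers(0, 3, n)
dum1 = (level == 1).astype(float)
dum2 = (level == 2).astype(float)

# Simulated response: classifier RMSE worsens with missing ratio and spread.
rmse = (0.30 + 0.80 * R_missing + 0.15 * spread
        + 0.05 * dum1 + 0.02 * dum2 + rng.normal(0, 0.01, n))

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), R_missing, spread, dum1, dum2])
beta, *_ = np.linalg.lstsq(X, rmse, rcond=None)
print(np.round(beta, 2))  # recovers roughly [0.30, 0.80, 0.15, 0.05, 0.02]
```

A positive coefficient on a predictor means higher RMSE (worse classification), which is how the signs in Tables 6–12 should be read.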
Figure 7: RMSE by ratio of missing data. [Chart: missing-data ratio (0.05–0.50) on the x-axis; RMSE (0.300–0.490) on the y-axis; one line per method of imputation: hot deck, group mean imputation, mean imputation, predictive mean imputation, listwise deletion, k-NN, k-means clustering.]
Table 12: Factors influencing accuracy (RMSE) of classifier algorithms.

Data characteristic   B        Data characteristic   B
(constant)            0.60**   M_imputation_dum1     0.12**
R_missing             0.83**   M_imputation_dum2    −0.01*
SE_HS                −0.05**   M_imputation_dum4     0.00
SE_VS                 0.00**   M_imputation_dum5     0.00
Spread                0.17**   M_imputation_dum6     0.01**
N_attributes         −0.08**   M_imputation_dum7    −0.01*
C_imbalance          −0.03**   P_missing_dum1       −0.06**
N_cases               0.02**   P_missing_dum3        0.00

Note 1: Dummy variables for imputation methods: LISTWISE DELETION (M_imputation_dum1 = 1, others = 0), MEAN IMPUTATION (M_imputation_dum2 = 1, others = 0), GROUP MEAN IMPUTATION (M_imputation_dum3 = 1, others = 0), PREDICTIVE MEAN IMPUTATION (M_imputation_dum4 = 1, others = 0), HOT DECK (M_imputation_dum5 = 1, others = 0), k-NN (M_imputation_dum6 = 1, others = 0), and k-MEANS CLUSTERING (M_imputation_dum7 = 1, others = 0). Missing patterns: univariate (P_missing_dum1 = 1, P_missing_dum2 = 0, P_missing_dum3 = 0), monotone (P_missing_dum1 = 0, P_missing_dum2 = 1, P_missing_dum3 = 0), and arbitrary (P_missing_dum1 = 1, P_missing_dum2 = 1, P_missing_dum3 = 1). B: standardized beta coefficient.
Note 2: *P < 0.1; **P < 0.05.
6 Conclusion
Thus far, prior research has not fully informed us of the fit among datasets, imputation methods, and classification algorithms. Therefore, this study ultimately aims to establish a rule set that guides classification/recommender system developers in selecting the best classification algorithm based
on the datasets and imputation method. To the best of our knowledge, ours is the first study in which the performance of classification algorithms is discussed along multiple dimensions (datasets, imputed data, and imputation methods); prior research examines only one dimension [15]. In addition, as shown in Figure 3, since the performance of each method differs according to the dataset, the results of prior studies on imputation methods or classification algorithms depend on the datasets on which they are based.
In this paper, the factors affecting the performance of classification algorithms were identified as follows: characteristics of missing values, dataset features, and imputation methods. Using benchmark data and thousands of variations, we found that several factors were significantly associated with the performance of classification algorithms. First, as expected, the results show that the missing-data ratio and spread are negatively associated with the performance of the classification algorithms. Second, as a new finding to the best of our knowledge, we observed that classification performance was more sensitive to the number of missing cells in each record (SE_HS) than to the number of missing cells in each feature (SE_VS). Further, we found it interesting that the number of features negatively affects the performance of the logistic algorithm, while the other factors do not.
A disadvantage of logistic regression is its lack of flexibility. The assumption of a linear dependency between the predictor variables and the log-odds results in a linear decision boundary in the instance space, which is not valid in many applications. Hence, in the case of data imputation, the logistic algorithm should be avoided. Next, in response to concerns about class imbalance, which has been discussed in data mining research [18, 19], we found that the degree of class imbalance was the data feature that most decreased the predicted performance of classification algorithms. In particular, SMO was second to none with respect to SE_HS in any imputation situation; that is, if a dataset has a high number of records in which the number of missing cells is large, then SMO is the best classification algorithm to apply.
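The linear-boundary limitation can be demonstrated on the classic XOR configuration, where no line separates the classes, so plain logistic regression (here a hand-rolled gradient descent on the log-loss, as a minimal sketch) can never exceed 3/4 accuracy:

```python
import numpy as np

# XOR: the four corners of the unit square with alternating labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
Xb = np.column_stack([np.ones(4), X])  # prepend an intercept column

w = np.zeros(3)
for _ in range(5000):                   # batch gradient descent
    p = 1.0 / (1.0 + np.exp(-Xb @ w))   # sigmoid of the linear score
    w -= 0.1 * Xb.T @ (p - y)           # gradient of the log-loss

pred = (1.0 / (1.0 + np.exp(-Xb @ w)) >= 0.5).astype(float)
acc = (pred == y).mean()
print(acc)  # at most 0.75 for any linear boundary on XOR
```

A decision tree or an SVM with a nonlinear kernel classifies the same four points perfectly, which illustrates why classifier choice interacts with data characteristics.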
The results of this study suggest that optimal selection of the imputation method according to the characteristics of the dataset (especially the patterns of missing values and the choice of classification algorithm) improves the accuracy of ubiquitous computing applications. A set of optimal combinations may also be derived using the estimated results. Moreover, we established a set of general rules based on the results of this study. These rules allow us to choose a temporally optimal combination of classification algorithm and imputation method, thus increasing the agility of ubiquitous computing applications.
Ubiquitous environments include a variety of forms of sensor data under limited service conditions, such as location, time, and status, combining many different kinds of sensors. Using the rules deduced in this study, it is possible to select the optimal combination of imputation method and classification algorithm for environments in which data change dynamically. For practitioners, these rules for selecting the optimal pair of imputation method and classification algorithm may be developed for each situation, depending on the characteristics of datasets and their missing values.
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) in choosing the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of the various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. The FP rate as well as the TP rate, for example, is crucial when investigating the effect of class imbalance, which this paper treats as an independent variable. Although the performance results would be very similar with other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
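For reference, the metrics discussed above reduce to a few lines; this sketch assumes binary 0/1 labels and invented example predictions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two label/score vectors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def tp_fp_rates(y_true, y_pred):
    """TP rate (sensitivity) and FP rate from binary predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(rmse(y_true, y_pred))         # 0.5
print(tp_fp_rates(y_true, y_pred))  # TP rate 2/3, FP rate 0.2
```

Under class imbalance, RMSE alone can look good while the TP rate on the minority class is poor, which is the concern raised in the paragraph above.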
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659), funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
12 Mathematical Problems in Engineering
0
01
02
03At
trib
ute
Inst
ance
s
Dat
a im
bala
nce
Miss
ing
ratio
Hsc
atte
redn
ess
Vsc
atte
redn
ess
Spre
ad
Miss
ing
p1
Miss
ing
p2
minus04
minus03
minus02
minus01
MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION
LISTWISE DELETION
GROUP MEAN IMPUTATIONHOT DECK IMPUTATIONk-MEANS CLUSTERINGk-NN
Figure 5 Coefficient pattern of the decision tree algorithm (RMSE)
0
02
04
06
08
Attr
ibut
e
Inst
ance
s
Dat
a im
bala
nce
Miss
ing
ratio
Hsc
atte
redn
ess
Vsc
atte
redn
ess
Spre
ad
minus06
minus04
minus02
MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION
LISTWISE DELETION
GROUP MEAN IMPUTATIONHOT DECK IMPUTATION
k-NNk-MEANS CLUSTERING
Figure 6 Coefficient pattern of the decision tree algorithm basedon a 90 missing ratio (RMSE)
missing ratiostimes 3missing patternstimes 100 trialstimes 6 imputationmethods times 7 classification methods) were analyzed Theresults have at least two implications First we can predict theclassification accuracy for an unknown dataset with missingdata only if the data characteristics can be obtained Secondwe can establish general rules for selection of the optimalcombination of a classification algorithm and imputationalgorithm
Method of imputation
0490
0480
0470
0460
0450
0440
0430
0420
0410
0400
0390
0380
0370
0360
0350
0340
0330
0320
0310
0300
005 010 015 020 025 030 035 040 045 050
HOT DECKGROUP MEAN IMPUTATION
MEAN IMPUTATIONPREDICTIVE MEAN IMPUTATION
LISTWISE DELETIONk-NNk-MEANS CLUSTERING
Figure 7 RMSE by ratio of missing data
Table 12 Factors influencing accuracy (RMSE) of classifier algo-rithms
Data characteristic 119861 Data characteristic 119861
(constant) 060lowastlowast M imputation dum1 012lowastlowast
R missing 083lowastlowast M imputation dum2 minus001lowast
SE HS minus005lowastlowast M imputation dum4 000SE VS 000lowastlowast M imputation dum5 000Spread 017lowastlowast M imputation dum6 001lowastlowast
N attributes minus008lowastlowast M imputation dum7 minus001lowast
C imbalance minus003lowastlowast P missing dum1 minus006lowastlowast
N cases 002lowastlowast P missing dum3 000Note 1 Dummy variables related to imputation methods LIST-WISE DELETION (M imputation dum1 = 1 others = 0) MEAN IMPUTA-TION (M imputation dum2 = 1 others = 0) GROUP MEAN IMPUTA-TION (M imputation dum3 = 1 others = 0) PREDICTIVE MEAN IMPU-TATION (M imputation dum4 = 1 others = 0) HOT DECK (M imputa-tion dum5 = 1 others = 0) 119896-NN (M imputation dum6 = 1 others =0) and 119896-MEANS CLUSTERING (M imputation dum7 = 1 others = 0)Missing patterns univariate (P missing dum1 = 1 P missing dum2 = 0P missing dum3 = 0) monotone (P missing dum1 = 0 P missing dum2 = 1P missing dum3 = 0) and arbitrary (P missing dum1 = 1 P missing dum2= 1 P missing dum3 = 1) 119861 standard beta coefficientNote 2 lowast119875 lt 01 lowastlowast119875 lt 005
6 Conclusion
So far the prior research does not fully inform us of the fit-ness among datasets imputation methods and classificationalgorithmsTherefore this study ultimately aims to establish arule set which guides the classificationrecommender systemdevelopers to select the best classification algorithm based
Mathematical Problems in Engineering 13
on the datasets and imputation method To the best of ourknowledge ours is the first study inwhich the performance ofclassification algorithms with multiple dimensions (datasetsimputation data and imputationmethods) is discussed Priorresearch examines only one dimension [15] In addition asshown in Figure 3 since the performance of each methoddiffers according to the dataset the results of prior studies onimputation methods or classification algorithms depend onthe datasets on which they are based
In this paper factors affecting the performance of classi-fication algorithms were identified as follows characteristicsof missing values dataset features and imputation methodsUsing benchmark data and thousands of variations we foundthat several factors were significantly associated with theperformance of classification algorithms First as expectedthe results show that the missing data ratio and spread arenegatively associated with the performance of the classifica-tion algorithms Second and as a new finding to our bestknowledge we observed that the number of missing cellsin each record (SE HS) was more sensitive in affecting theclassification performance than the number of missing cellsin each feature (SE VS) Further we found it interesting thatthe number of features negatively affects the performance ofthe logistic algorithm while other factors do not
A disadvantage of logistic regression is its lack of flexibil-ityThe assumption of a linear dependency between predictorvariables and the log-odds ratio results in a linear decisionboundary in the instance space which is not valid in manyapplications Hence in the case of data imputation thelogistic algorithm must be avoided Next in response toconcerns about class imbalance which has been discussed indatamining research [18 19] we found that the degree of classimbalance was the most significant data feature to decreasethe predicted performance of classification algorithms Inparticular SMO was second to none in predicting SE HSin any imputation situation that is if a dataset has a highnumber of records in which the number of missing cells islarge then SMO is the best classification algorithm to apply
The results of this study suggest that optimal selectionof the imputation method according to the characteristicsof the dataset (especially the patterns of missing values andchoice of classification algorithm) improves the accuracy ofubiquitous computing applications Also a set of optimalcombinations may be derived using the estimated resultsMoreover we established a set of general rules based on theresults of this study These rules allow us to choose a tem-porally optimal combination of classification algorithm andimputation method thus increasing the agility of ubiquitouscomputing applications
Ubiquitous environments include a variety of forms ofsensor data from limited service conditions such as locationtime and status combining various different kinds of sensorsUsing the rules deducted in this study it is possible to selectthe optimal combination of imputation method and classi-fication algorithm for environments in which data changesdynamically For practitioners these rules for selection ofthe optimal pair of imputation method and classificationalgorithm may be developed for each situation dependingon the characteristics of datasets and their missing values
This set of rules will be useful for users and developersof intelligent systems (recommenders mobile applicationsagent systems etc) to choose the imputation method andclassification algorithm according to context while maintain-ing high prediction performance
In future studies the predicted performance of variousmethods can be testedwith actual datasets Although in priorresearch on classification algorithms multiple benchmarkdatasets from the UCI laboratory have been used to demon-strate the generality of the proposed method performanceevaluations in real settings would strengthen the significanceof the results Further for brevity we used a single perfor-mance metric RMSE in this study For example FP rate aswell as TP rate is very crucial when it comes to investigatingthe effect of class imbalance which is considered in thispaper as an independent variable Although the performanceresults would be very similar when using other metrics suchasmisclassification cost and total number of errors [20]morevaluable findings may be generated from a study includingthese other metrics
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgments
This work was supported by the National Strategic RampDProgram for Industrial Technology (10041659) and funded bythe Ministry of Trade Industry and Energy (MOTIE)
References
[1] J Augusto V Callaghan D Cook A Kameas and I SatohldquoIntelligent environments a manifestordquo Human-Centric Com-puting and Information Sciences vol 3 no 12 pp 1ndash18 2013
[2] R Y Toledo Y C Mota andM G Borroto ldquoA regularity-basedpreprocessingmethod for collaborative recommender systemsrdquoJournal of Information Processing Systems vol 9 no 3 pp 435ndash460 2013
[3] G Batista and M Monard ldquoAn analysis of four missing datatreatment methods for supervised learningrdquo Applied ArtificialIntelligence vol 17 no 5-6 pp 519ndash533 2003
[4] R Shtykh and Q Jin ldquoA human-centric integrated approach toweb information search and sharingrdquoHuman-Centric Comput-ing and Information Sciences vol 1 no 1 pp 1ndash37 2011
[5] H Ihm ldquoMining consumer attitude and behaviorrdquo Journal ofConvergence vol 4 no 2 pp 29ndash35 2013
[6] Y Cho and S Moon ldquoWeighted mining frequent patternbased customers RFM score for personalized u-commercerecommendation systemrdquo Journal of Convergence vol 4 no 4pp 36ndash40 2013
[7] N Howard and E Cambria ldquoIntention awareness improvingupon situation awareness in human-centric environmentsrdquoHuman-Centric Computing and Information Sciences vol 3 no9 pp 1ndash17 2013
[8] L Liew B Lee Y Wang and W Cheah ldquoAerial images rectifi-cation using non-parametric approachrdquo Journal of Convergencevol 4 no 2 pp 15ndash21 2013
14 Mathematical Problems in Engineering
[9] K J Nishanth and V Ravi ldquoA computational intelligence basedonline data imputation method an application for bankingrdquoJournal of Information Processing Systems vol 9 no 4 pp 633ndash650 2013
[10] P Kang ldquoLocally linear reconstruction based missing valueimputation for supervised learningrdquo Neurocomputing vol 118pp 65ndash78 2013
[11] J L Schafer and J W Graham ldquoMissing data our view of thestate of the artrdquo Psychological Methods vol 7 no 2 pp 147ndash1772002
[12] H Finch ldquoEstimation of item response theory parameters in thepresence of missing datardquo Journal of Educational Measurementvol 45 no 3 pp 225ndash245 2008
[13] S J Press and S Wilson ldquoChoosing between logistic regressionand discriminant analysisrdquo Journal of the American StatisticalAssociation vol 73 no 364 pp 699ndash705 1978
[14] E Frank YWang S Inglis G Holmes and I HWitten ldquoUsingmodel trees for classificationrdquo Machine Learning vol 32 no 1pp 63ndash76 1998
[15] O Kwon and J M Sim ldquoEffects of data set features on theperformances of classification algorithmsrdquo Expert Systems withApplications vol 40 no 5 pp 1847ndash1857 2013
[16] E Namsrai T Munkhdalai M Li J-H Shin O-E Namsraiand K H Ryu ldquoA feature selection-based ensemble methodfor arrhythmia classificationrdquo Journal of Information ProcessingSystems vol 9 no 1 pp 31ndash40 2013
[17] I H Witten and E Frank Data Mining Practical MachineLearning Tools and Techniques Morgan Kaufmann San Fran-cisco Calif USA 2nd edition 2005
[18] M Galar A Fernandez E Barrenechea H Bustince and FHerrera ldquoA review on ensembles for the class imbalance prob-lem bagging- boosting- and hybrid-based approachesrdquo IEEETransactions on Systems Man and Cybernetics C Applicationsand Reviews vol 42 no 4 pp 463ndash484 2012
[19] Q Yang and X Wu ldquo10 challenging problems in data miningresearchrdquo International Journal of Information Technology ampDecision Making vol 5 no 4 pp 597ndash604 2006
[20] Z-H Zhou and X-Y Liu ldquoTraining cost-sensitive neural net-works with methods addressing the class imbalance problemrdquoIEEE Transactions on Knowledge and Data Engineering vol 18no 1 pp 63ndash77 2006
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 13
on the datasets and imputation method To the best of ourknowledge ours is the first study inwhich the performance ofclassification algorithms with multiple dimensions (datasetsimputation data and imputationmethods) is discussed Priorresearch examines only one dimension [15] In addition asshown in Figure 3 since the performance of each methoddiffers according to the dataset the results of prior studies onimputation methods or classification algorithms depend onthe datasets on which they are based
In this paper factors affecting the performance of classi-fication algorithms were identified as follows characteristicsof missing values dataset features and imputation methodsUsing benchmark data and thousands of variations we foundthat several factors were significantly associated with theperformance of classification algorithms First as expectedthe results show that the missing data ratio and spread arenegatively associated with the performance of the classifica-tion algorithms Second and as a new finding to our bestknowledge we observed that the number of missing cellsin each record (SE HS) was more sensitive in affecting theclassification performance than the number of missing cellsin each feature (SE VS) Further we found it interesting thatthe number of features negatively affects the performance ofthe logistic algorithm while other factors do not
A disadvantage of logistic regression is its lack of flexibil-ityThe assumption of a linear dependency between predictorvariables and the log-odds ratio results in a linear decisionboundary in the instance space which is not valid in manyapplications Hence in the case of data imputation thelogistic algorithm must be avoided Next in response toconcerns about class imbalance which has been discussed indatamining research [18 19] we found that the degree of classimbalance was the most significant data feature to decreasethe predicted performance of classification algorithms Inparticular SMO was second to none in predicting SE HSin any imputation situation that is if a dataset has a highnumber of records in which the number of missing cells islarge then SMO is the best classification algorithm to apply
The results of this study suggest that optimal selectionof the imputation method according to the characteristicsof the dataset (especially the patterns of missing values andchoice of classification algorithm) improves the accuracy ofubiquitous computing applications Also a set of optimalcombinations may be derived using the estimated resultsMoreover we established a set of general rules based on theresults of this study These rules allow us to choose a tem-porally optimal combination of classification algorithm andimputation method thus increasing the agility of ubiquitouscomputing applications
Ubiquitous environments include a variety of forms ofsensor data from limited service conditions such as locationtime and status combining various different kinds of sensorsUsing the rules deducted in this study it is possible to selectthe optimal combination of imputation method and classi-fication algorithm for environments in which data changesdynamically For practitioners these rules for selection ofthe optimal pair of imputation method and classificationalgorithm may be developed for each situation dependingon the characteristics of datasets and their missing values
This set of rules will be useful for users and developers of intelligent systems (recommenders, mobile applications, agent systems, etc.) to choose the imputation method and classification algorithm according to context while maintaining high prediction performance.
In future studies, the predicted performance of various methods can be tested with actual datasets. Although prior research on classification algorithms has used multiple benchmark datasets from the UCI repository to demonstrate the generality of the proposed methods, performance evaluations in real settings would strengthen the significance of the results. Further, for brevity, we used a single performance metric, RMSE, in this study. For example, the FP rate as well as the TP rate is crucial when investigating the effect of class imbalance, which is considered in this paper as an independent variable. Although the performance results would be very similar when using other metrics, such as misclassification cost and total number of errors [20], more valuable findings may be generated from a study including these other metrics.
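For reference, the metrics named above are straightforward to compute; the sketch below shows RMSE for numeric predictions and TP/FP rates for a binary task (labels encoded as 1/0). These are the standard textbook definitions, not code from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired numeric predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def tp_fp_rates(y_true, y_pred):
    """TP rate (recall) and FP rate for binary labels coded as 1/0."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

print(rmse([1.0, 2.0], [1.0, 4.0]))             # ~1.414
print(tp_fp_rates([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)
```

Under heavy class imbalance a classifier can score well on RMSE or accuracy while its FP rate on the minority class is poor, which is why reporting both rates matters.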
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Strategic R&D Program for Industrial Technology (10041659) and funded by the Ministry of Trade, Industry and Energy (MOTIE).
References
[1] J. Augusto, V. Callaghan, D. Cook, A. Kameas, and I. Satoh, "Intelligent environments: a manifesto," Human-Centric Computing and Information Sciences, vol. 3, no. 12, pp. 1–18, 2013.
[2] R. Y. Toledo, Y. C. Mota, and M. G. Borroto, "A regularity-based preprocessing method for collaborative recommender systems," Journal of Information Processing Systems, vol. 9, no. 3, pp. 435–460, 2013.
[3] G. Batista and M. Monard, "An analysis of four missing data treatment methods for supervised learning," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 519–533, 2003.
[4] R. Shtykh and Q. Jin, "A human-centric integrated approach to web information search and sharing," Human-Centric Computing and Information Sciences, vol. 1, no. 1, pp. 1–37, 2011.
[5] H. Ihm, "Mining consumer attitude and behavior," Journal of Convergence, vol. 4, no. 2, pp. 29–35, 2013.
[6] Y. Cho and S. Moon, "Weighted mining frequent pattern based customers' RFM score for personalized u-commerce recommendation system," Journal of Convergence, vol. 4, no. 4, pp. 36–40, 2013.
[7] N. Howard and E. Cambria, "Intention awareness: improving upon situation awareness in human-centric environments," Human-Centric Computing and Information Sciences, vol. 3, no. 9, pp. 1–17, 2013.
[8] L. Liew, B. Lee, Y. Wang, and W. Cheah, "Aerial images rectification using non-parametric approach," Journal of Convergence, vol. 4, no. 2, pp. 15–21, 2013.
14 Mathematical Problems in Engineering
[9] K. J. Nishanth and V. Ravi, "A computational intelligence based online data imputation method: an application for banking," Journal of Information Processing Systems, vol. 9, no. 4, pp. 633–650, 2013.
[10] P. Kang, "Locally linear reconstruction based missing value imputation for supervised learning," Neurocomputing, vol. 118, pp. 65–78, 2013.
[11] J. L. Schafer and J. W. Graham, "Missing data: our view of the state of the art," Psychological Methods, vol. 7, no. 2, pp. 147–177, 2002.
[12] H. Finch, "Estimation of item response theory parameters in the presence of missing data," Journal of Educational Measurement, vol. 45, no. 3, pp. 225–245, 2008.
[13] S. J. Press and S. Wilson, "Choosing between logistic regression and discriminant analysis," Journal of the American Statistical Association, vol. 73, no. 364, pp. 699–705, 1978.
[14] E. Frank, Y. Wang, S. Inglis, G. Holmes, and I. H. Witten, "Using model trees for classification," Machine Learning, vol. 32, no. 1, pp. 63–76, 1998.
[15] O. Kwon and J. M. Sim, "Effects of data set features on the performances of classification algorithms," Expert Systems with Applications, vol. 40, no. 5, pp. 1847–1857, 2013.
[16] E. Namsrai, T. Munkhdalai, M. Li, J.-H. Shin, O.-E. Namsrai, and K. H. Ryu, "A feature selection-based ensemble method for arrhythmia classification," Journal of Information Processing Systems, vol. 9, no. 1, pp. 31–40, 2013.
[17] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[18] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 4, pp. 463–484, 2012.
[19] Q. Yang and X. Wu, "10 challenging problems in data mining research," International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597–604, 2006.
[20] Z.-H. Zhou and X.-Y. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.