Data Mining Diabetic Databases Are Rough Sets a Useful Addition? Joseph L. Breault, MD, MS, MPH...

Data Mining Diabetic Databases: Are Rough Sets a Useful Addition?

Joseph L. Breault, MD, MS, MPH

[email protected]@tulanealumni.net

Tulane University (ScD student)

Department of Health Systems Management

&

Alton Ochsner Medical Foundation

Department of Family Practice

Diabetic Databases

Diabetic databases have been used:

• To query for diabetes,

• As a comprehensive management tool to improve diabetic care and communications among professionals,

• To provide continuous quality improvement in diabetes care.

The Veterans Administration (VA) developed their diabetic registry from an outpatient pharmacy database and matched social security numbers to add VA hospital admission data to it. They identified 139,646 veterans with diabetes.

The Belgian Diabetes Registry was created by required reporting of all incident cases of type 1 diabetes and their first degree relatives younger than 40. This has facilitated epidemiologic and genetic studies.

One British hospital linked their 7,000-patient database to their National Health Services Central Registry to identify mortality data and found that diabetes was recorded in only 36% of death certificates, so analysis of death certificates alone gives poor information about mortality in diabetes.

Diabetes is a particularly opportune disease for data mining technology for a number of reasons.

1. The mountain of data is there.

2. Diabetes is a common disease that costs a great deal of money, and so has attracted managers and payers in the never-ending quest for saving money and cost efficiency.

3. Diabetes is a disease that can produce terrible complications of blindness, kidney failure, amputation, and premature cardiovascular death, so physicians and regulators would like to know how to improve outcomes as much as possible.

Data mining might prove an ideal match in these circumstances.

THE PIMA INDIAN DIABETIC DATABASE

The Pima Indians may be genetically predisposed to diabetes, and it was noted that their diabetic rate was 19 times that of a typical town in Minnesota.

The National Institute of Diabetes and Digestive and Kidney Diseases of the NIH originally owned the Pima Indian Diabetes Database (PIDD).

In 1990 it was received by the UC-Irvine Machine Learning Repository.

The database has n=768 patients, each with 9 numeric variables:

1. Number of pregnancies

2. 2-hour OGTT glucose

3. Diastolic blood pressure

4. Skin fold thickness

5. 2-hour serum insulin

6. BMI

7. Diabetes pedigree

8. Age

9. Diabetes onset within 5 years

The goal is to predict #9. There are 500 non-diabetic patients and 268 diabetic ones, for an incidence rate of 34.9%. Thus if you guess that all are non-diabetic, your accuracy rate is 65.1%. We expect a useful data mining or prediction tool to do much better than this.
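The baseline arithmetic can be checked in a few lines. This is only an illustrative sketch; the function name `majority_baseline` is made up for this example.

```python
def majority_baseline(n_negative, n_positive):
    """Return (positive rate, accuracy of always guessing the larger class)."""
    total = n_negative + n_positive
    positive_rate = n_positive / total
    baseline_accuracy = max(n_negative, n_positive) / total
    return positive_rate, baseline_accuracy


# Class counts quoted on the slide: 500 non-diabetic, 268 diabetic.
positive_rate, baseline_accuracy = majority_baseline(500, 268)
print(f"positive rate: {positive_rate:.1%}")
print(f"guess-all-negative accuracy: {baseline_accuracy:.1%}")
```

Any classifier reported on the PIDD should be read against this figure rather than against 50%.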

PIDD Errors

5 had glucose = 0,

11 more had BMI = 0,

28 others had diastolic blood pressure = 0,

192 others had skinfold thickness readings = 0,

140 others had serum insulin levels = 0.

None of these are physically possible.

That leaves 392 cases with no missing values.

Studies that did not realize the previous zeros were in fact missing values essentially used a rule of substituting zero for the missing values.

Ages range from 21 to 81, and all patients are female.
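A minimal sketch of the complete-case filter, assuming each patient is a dictionary. The field names and the two toy records are hypothetical, not taken from the database file. The number of pregnancies is left out of the check because zero is a legitimate value there.

```python
# Fields where a zero is physically impossible, so it is read as "missing".
# The field names are hypothetical labels chosen for this sketch.
ZERO_MEANS_MISSING = ("glucose", "bmi", "diastolic_bp", "skinfold", "insulin")


def is_complete(record):
    """True when none of the zero-means-missing fields is zero."""
    return all(record[field] != 0 for field in ZERO_MEANS_MISSING)


def complete_cases(records):
    """Keep only the records with no missing values."""
    return [record for record in records if is_complete(record)]


# Exclusion counts quoted on the slide, applied one after another.
excluded = {"glucose": 5, "bmi": 11, "diastolic_bp": 28, "skinfold": 192, "insulin": 140}
remaining = 768 - sum(excluded.values())
print(remaining)

# Two made-up records, only to show the filter running.
toy = [
    {"glucose": 120, "bmi": 31.0, "diastolic_bp": 70, "skinfold": 25, "insulin": 90},
    {"glucose": 110, "bmi": 28.5, "diastolic_bp": 72, "skinfold": 0, "insulin": 0},
]
print(len(complete_cases(toy)))
```

The sequential counts on the slide add up to 376 exclusions, which matches the 392 complete cases.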

STUDIES ON THE PIDD

• The dependent or target variable is diabetes status within 5 years, represented by the 9th variable (0,1).

• Although articles use somewhat different subgroups of the PIDD, accuracy for predicting diabetic status ranges from 66% to 81%.

ROUGH SETS IN MEDICAL DATA ANALYSIS

• Rough sets investigate structural relationships in the data rather than probability distributions, and produce decision tables rather than trees.

• This method forms equivalence classes within the training data and approximates each concept with a lower approximation (a class below it) and an upper approximation (a class above it).
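The idea of a class below and a class above can be sketched as follows. The five toy patients and their discretized values are invented for illustration, and this is not ROSETTA code. Patients with identical attribute values fall into one equivalence class. The lower approximation collects classes lying wholly inside the target set, and the upper approximation collects classes that overlap it.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Group object names that share identical values on the given attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[attr] for attr in attributes)
        classes[key].add(name)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for cls in equivalence_classes(objects, attributes):
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper


# Invented, already-discretized toy data.
patients = {
    "p1": {"glucose": "high", "bmi": "high"},
    "p2": {"glucose": "high", "bmi": "high"},
    "p3": {"glucose": "high", "bmi": "low"},
    "p4": {"glucose": "low", "bmi": "low"},
    "p5": {"glucose": "low", "bmi": "high"},
}
diabetic = {"p1", "p3"}

lower, upper = approximations(patients, ["glucose", "bmi"], diabetic)
print(sorted(lower))          # certainly diabetic given these attributes
print(sorted(upper))          # possibly diabetic
print(sorted(upper - lower))  # boundary region
```

Here p1 and p2 are indistinguishable but have different outcomes, so they end up in the boundary region.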

A variety of algorithms can be used to define the classification boundaries.

Rough sets do feature reduction. Finding minimal subsets (reducts) of attributes that are efficient for rule making is a central part of the process.
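One common way to approximate a reduct is a greedy heuristic in the style of Johnson's algorithm: repeatedly pick the attribute that distinguishes the most still-unseparated pairs of cases with different outcomes. The sketch below is an assumption-laden illustration with invented toy rows. It is not ROSETTA's implementation, and a greedy result is not guaranteed to be a minimal reduct.

```python
from itertools import combinations


def discernibility_entries(rows, attributes, decision):
    """For each pair of rows with different decisions, the attributes that differ."""
    entries = []
    for a, b in combinations(rows, 2):
        if a[decision] != b[decision]:
            differing = frozenset(attr for attr in attributes if a[attr] != b[attr])
            if differing:  # skip inconsistent pairs that no attribute can separate
                entries.append(differing)
    return entries


def greedy_reduct(rows, attributes, decision):
    """Greedily choose attributes until every discernible pair is separated."""
    remaining = discernibility_entries(rows, attributes, decision)
    chosen = []
    while remaining:
        # Attribute that appears in the most remaining entries.
        best = max(attributes, key=lambda attr: sum(attr in e for e in remaining))
        chosen.append(best)
        remaining = [e for e in remaining if best not in e]
    return chosen


# Invented toy decision table.
rows = [
    {"glucose": "high", "bmi": "high", "age": "old", "diabetes": 1},
    {"glucose": "high", "bmi": "low", "age": "young", "diabetes": 1},
    {"glucose": "low", "bmi": "high", "age": "young", "diabetes": 0},
    {"glucose": "low", "bmi": "low", "age": "old", "diabetes": 0},
]
reduct = greedy_reduct(rows, ["glucose", "bmi", "age"], "diabetes")
print(reduct)
```

In this toy table glucose alone separates every diabetic row from every non-diabetic row, so the other two attributes are dropped.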

Rough sets have been applied to peritoneal lavage in pancreatitis, toxicity predictions, development of medical expert system rules, prediction of death in pneumonia, identification of patients with chest pain who do not need expensive additional cardiac testing, diagnosing congenital malformations, prediction of relapse in childhood leukemia, and prediction of ambulation in people with spinal cord injury.

There are extensive reviews of their use in medicine. To our knowledge, there are no publications about their application to the PIDD.

Rough Sets in Diabetes

A recent study used a dataset of 107 children with diabetes from a Polish medical school.

Rough set techniques were applied and decision rules generated to predict microalbuminuria.

The best predictor was age < 7 predicting no microalbuminuria 83.3% of the time, followed by age 7-12 with disease duration 6-10 predicting microalbuminuria 80.8% of the time.

ROUGH SETS & THE PIDD

We randomly divided the 392 complete cases in the PIDD into a training set (n=300) and a test set (n=92). The ROSETTA software was downloaded from www.idi.ntnu.no/~aleks/rosetta/.

ROSETTA’s Steps

1. Deal with missing values in one of 5 ways, but we had removed these.

2. Discretization, where each variable is divided into a limited number of value groups. There are 9 ways to do this and we chose the equal frequency binning criterion with k=5 bins.

3. Create reducts, which are subset vectors of attributes that facilitate rule generation with minimal subsets. This can be done by 8 methods; we chose the Johnson reducer algorithm. Rules are then generated.

4. Apply a classification method. We chose the batch classifier with the standard/tuned voting method. When the generated training rules are applied to the test set of 92 cases, the predictive accuracy is 82.6%, which is better than the 66% to 81% reported for the previous machine learning algorithms.
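The discretization in step 2 can be sketched as below: sort the observed values, place k-1 cut points so that each bin holds roughly the same number of cases, then map each value to its bin index. The cut placement rule here is an assumption chosen for simplicity, so ROSETTA's exact cuts may differ, and the ten input values are invented.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, k):
    """Return k-1 cut points that split the sorted values into k similar-sized groups."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(value, cuts):
    """Map a value to a bin index in 0..len(cuts); values equal to a cut go to the upper bin."""
    return bisect_right(cuts, value)


# Invented toy measurements, ten distinct values.
readings = [7, 3, 9, 1, 5, 10, 2, 8, 4, 6]
cuts = equal_frequency_cuts(readings, k=5)
bins = [discretize(value, cuts) for value in readings]
print(cuts)
print(bins)
```

With heavily tied values the bins can no longer be equal in size, which is one reason different tools place the cuts differently.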

ROSETTA’s Confusion Matrix

(1=diabetes, 0=no diabetes)

Domain Knowledge Unhelpful

When the discretization step was tweaked using domain knowledge (selecting 5 intervals for each variable judged most clinically meaningful), results looked slightly improved on the training set (91.7% vs. 91.0%), but were much worse on the test set (75.0% vs. 82.6%).

Discretization Method Choices

For the Johnson algorithm with tuned voting, training-set accuracies were: Boolean 96%, entropy 78%, binning (k=5) 91%, naïve 100%, semi-naïve 99%, and BooleanRSES 90%.

We suspected that the ones in the high 90s were overfitted and would not do as well on the test set, so binning might be a good choice.

Test results were Boolean 66%, entropy 62%, binning (k=5) 83%, naïve 67%, semi-naïve 78%, and BooleanRSES 74%.
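The overfitting suspicion can be made concrete by putting the two sets of figures side by side and computing the training-minus-test gap for each method. The numbers are the rounded percentages quoted above; the variable names are illustrative only.

```python
# Rounded accuracies (%) quoted on the slide, Johnson reducer with tuned voting.
train = {"Boolean": 96, "entropy": 78, "binning (k=5)": 91,
         "naive": 100, "semi-naive": 99, "BooleanRSES": 90}
test = {"Boolean": 66, "entropy": 62, "binning (k=5)": 83,
        "naive": 67, "semi-naive": 78, "BooleanRSES": 74}

# A large gap between training and test accuracy suggests overfitting.
gaps = {method: train[method] - test[method] for method in train}

best_on_test = max(test, key=test.get)
smallest_gap = min(gaps, key=gaps.get)
largest_gap = max(gaps, key=gaps.get)

for method in train:
    print(f"{method:14s} train {train[method]:3d}  test {test[method]:3d}  gap {gaps[method]:3d}")
print("best on test:", best_on_test)
```

Binning with k=5 is both the most accurate on the test set and the method that loses the least going from training to test.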

Binning Number Choices

What binning number works best? On the training set, using k=2, 3, 4, 5, 6 and 7 gives the following accuracies using the Johnson reduct with tuned voting: 81.3%, 90.3%, 87.3%, 91.0%, 91.3%, and 95%.

We suspect the highest binning numbers are heading toward overfitting. When the various binning numbers are used on the test set, we get accuracies of 76.1%, 79.3%, 81.5%, 82.6%, 78.3%, and 81.5%, indicating k=5 works best.

Obtaining a Mean & 95% CI

The 82.6% accuracy rate is surprisingly good, and exceeds the previously used machine learning algorithms that ranged from 66-81%.

Is this a quirk of the particular random sample that we obtained? 9 additional random samples were used, all with a training set of 300 and a test set of 92.

Mean accuracy = 73.2% with a 95% CI of (69.2%, 77.2%)
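The slides do not say how the interval was computed. A minimal sketch, assuming a Student's t interval over the 10 test-set accuracies, is shown below. The accuracy values in the example are placeholders invented to make the code run, not the study's results.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def mean_ci(accuracies, t_crit):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    m = mean(accuracies)
    half_width = t_crit * stdev(accuracies) / sqrt(len(accuracies))
    return m, m - half_width, m + half_width


# PLACEHOLDER accuracies (%), made up for illustration only.
placeholder_runs = [70.0, 72.0, 74.0, 76.0, 78.0, 70.0, 72.0, 74.0, 76.0, 78.0]
m, lower_bound, upper_bound = mean_ci(placeholder_runs, T_CRIT_95_DF9)
print(f"mean {m:.1f}%, 95% CI ({lower_bound:.1f}%, {upper_bound:.1f}%)")
```

Reporting the mean over repeated random splits, rather than the single best split, is what brings the headline figure down from 82.6%.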

Other Methods in ROSETTA

Using binning with k=5 and reducts from the exhaustive calculation (RSES), we generated rules on the 10 training sets.

Then we classified the respective test sets using the standard/tuned voting (RSES) with its defaults. The 10 accuracies ranged from 68.5% to 79.3%, with a mean of 73.9% and a 95% CI of (71.5%, 76.3%).

CONCLUSIONS

• Rough sets and the ROSETTA software are useful additions to the analysis of diabetic databases.

• If time permits, ROSETTA demo

• Questions? Discussion?

