DATA MINING THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY By Johnathan P. Durbin B.S., University of Louisville, 1995 A Thesis Submitted to the Faculty of the Graduate School of the University of Louisville in Partial Fulfillment of the Requirements for the Degree of Master of Arts Department of Mathematics University of Louisville Louisville, Kentucky August 2001
DATA MINING THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY

By

Johnathan P. Durbin
B.S., University of Louisville, 1995

A Thesis
Submitted to the Faculty of the
Graduate School of the University of Louisville
in Partial Fulfillment of the Requirements
for the Degree of

Master of Arts

Department of Mathematics
University of Louisville
Louisville, Kentucky

August 2001

A PRACTICE IN DATA MINING USING
THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY

By

Johnathan P. Durbin
B.S., University of Louisville, 1995

A thesis Approved on

_________July 12, 2001________

by the following Reading Committee:

__________________________________Thesis Director

__________________________________

__________________________________

ABSTRACT

Data mining is a technique with a number of methods used to explore large

datasets from a variety of angles with a wide spectrum of analytical tools. There are

techniques for finding data, cleaning data, and validating results. For years new data have

been collected by educational, research, commercial, and governmental entities for future

analysis. The 1997 National Ambulatory Medical Care Survey dataset (NAMCS) is such

a dataset available in the public domain at the C.D.C. [Centers for Disease Control and

Prevention] for public consumption. Once this dataset was found, imported, and cleaned,

it was analyzed. Although statistical packages have become extremely sophisticated,

commercial statistical packages do not do everything needed for data mining. For this

reason, a program was written (DFEPP) to analyze the data to display the results in a

different manner using visualization techniques to present the significant results in an

easily digested yet informative manner.

TABLE OF CONTENTS

ABSTRACT

CHAPTER

I. Introduction

II. Acquiring and Importing Data

2.1 Acquisition of Data
2.2 Importing Data

III. Data Visualization using the Difference From Expected Percentage Plot (DFEPP): Program Design and Use

3.1 The Graph Design
3.2 The Design of the DFEPP Program
3.3 The Use of the DFEPP Program

IV. Data Mining of the 1997 National Ambulatory Medical Care Survey (NAMCS) Dataset

4.1 Analysis of Payee Type by Practice Type
4.1.1 Workers Compensation
4.1.2 Medicare
4.1.3 Medicaid
4.1.4 Self-Pay
4.1.5 Privately Insured
4.1.6 All Other
4.2 HMOs
4.3 Modeling
4.3.1 Age Group Models
4.3.2 Modeling Classification of Pregnant

V. Conclusions

REFERENCES

APPENDIX – A (Variable list)

VITA

LIST OF IMAGES

IMAGE 1. A sample run of a web search tool (Copernic2000)

IMAGE 2. A working example of the output for data visualization

IMAGE 3. An output example

IMAGE 4. A working example of the input for data visualization

IMAGE 5. A working example of the output for data visualization

IMAGE 6. Clementine Code

IMAGE 7. Neural Network for Age Group Model Output

IMAGE 8. Refined Neural Network for Age Group Model Output

IMAGE 9. Neural Network for Age Group Model

IMAGE 10. C5 Model for Pregnant Output

IMAGE 11. Refined C5 Model for Pregnant Output

IMAGE 12. Refined Rule Set for Pregnant Model Rule Set

IMAGE 13. Refined C5 Model (2) for Pregnant Output

IMAGE 14. Refined Rule Set (2) for Pregnant Model Rule Set

LIST OF PLOTS

PLOT 1. Workers compensation by Physician Specialty

PLOT 2. Adjusted plot after removal

PLOT 3. Medicare by Physician Specialty

PLOT 4. Medicaid by Physician Specialty

PLOT 5. Distribution of Medicaid population

PLOT 6. Age of Pediatric Patients

PLOT 7. Modified Medicaid by Physician Specialty

PLOT 8. Age of Dermatology Patients

PLOT 9. Self Pay by Physician Specialty

PLOT 10. Privately Insured by Physician Specialty

PLOT 11. All Other Payees by Physician Specialty

PLOT 12. HMO Membership Percent by Age

PLOT 13. HMO Membership by Age Group

PLOT 14. HMO Membership by Payee Type

PLOT 15. HMO Membership by Physician Specialty

PLOT 16. HMO Membership by Race

PLOT 17. Distribution of Asian/Pacific Islander Age

LIST OF TABLES

TABLE 1. Payee Types

TABLE 2. Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits

TABLE 3. Workers Compensation ICD-9 Codes for Orthopedic Visits

TABLE 4. Workers Compensation ICD-9 Grouped Codes for Neurology Visits

TABLE 5. Workers Compensation ICD-9 Codes for Neurology Visits

TABLE 6. New Proportions after WC Orthopedic Surgeon Visits are removed

TABLE 7. Workers’ Comp “Other” Physician Visits

TABLE 8. Age Statistics by Physician Type

TABLE 9. ICD-9 Codes tabled by Medicare Use

TABLE 10. Age Statistics by Payee Type

TABLE 11. HMO Membership by Physician Specialty

TABLE 12. Has Insurance by Physician Specialty

TABLE 13. Has Insurance by Physician Specialty

TABLE 14. Has Insurance by ICD-9 Codes/Pediatric

TABLE 15. All Pay Methods by Insurance/Pediatrics

TABLE 16. Has Insurance by ICD-9 Codes/Neurology

TABLE 17. All Pay Methods by Insurance/Neurology

TABLE 18. HMO Membership

TABLE 19. Adjusted HMO Membership

TABLE 20. Age Statistics by Race

CHAPTER I

INTRODUCTION

The purpose of this paper is to describe the process of data mining through an

example. The primary purpose of data mining is to generate hypotheses to be examined

for validity either with fresh data or by withholding a portion of the initial dataset for

investigation. Data mining is a technique with a number of methods used to explore large

datasets from a variety of angles with a wide spectrum of analytical tools. There are

techniques for finding data, cleaning data, and validating results. For many years new

data have been collected by educational, research, commercial, and governmental entities

so that data mining can be used to find trends and patterns. Much of this available data

have been stored in data warehouses (collections of datasets) or put away by an

organization possibly to be examined in the future. The 1997 National Ambulatory

Medical Care Survey dataset (NAMCS) analyzed in this paper is available in the public

domain at the C.D.C. [Centers for Disease Control and Prevention] for public

consumption along with several other medical datasets. Chapter II covers how to find

medical datasets and import them into various statistical packages. Although statistical

packages have become extremely sophisticated, commercial statistical packages do not

do everything needed for data mining. For this reason, a program was written to analyze

the data (Chapter III) to display the results in a different manner using visualization

techniques to present the significant results in an easily digested yet informative manner.

The NAMCS dataset is analyzed in Chapter IV using various statistical packages and the

program developed in Chapter III. The NAMCS dataset consists of 24,715 patient visit

records each containing 224 variables. The data were about personal physical attributes,

physician’s practice and location, reasons and diagnoses for visits, medication given,

insurance types, tests given, types of medical personnel seen, and other visit data (see

Appendix – A for a full variable list). These data can be analyzed a variety of ways:

differences between patient types in common practices or pay methods; examining

whether certain practice types favor using staff over physicians; what practices or pay

methods favor using screenings or tests; or simple analyses of various physical attributes

of the different patient types. In this thesis, the ways in which different payee types

visited the different practices are analyzed. Different payee types disproportionately

visited certain practice types. Some of these disproportions are expected and others are

less explainable. HMO membership and its distribution through age groups, practice

types, pay methods, and races are also analyzed. Older patients and the practices that

serve them had a lower rate of HMO membership but privately insured, “All Other”

payees, and Asian/Pacific Islanders all had higher rates of membership. Another analysis

was done using modeling techniques to determine patient AGE GROUP and another to

determine if the patient was pregnant. A model was found with ~90% accuracy in

determining whether someone was pregnant using age, reason for visit (Non-Illness care),

and sex of patient. A model to determine which age group the patient was in was much

less accurate (~50%).

In this thesis techniques to find, import, clean, and analyze data are discussed.

Some of the techniques are used with the NAMCS dataset while other techniques are

only discussed. The author of this thesis also wrote a program, following the visualization

guidelines discussed in Chapter III, to analyze the NAMCS dataset.

CHAPTER II

ACQUIRING AND IMPORTING DATA

The first step in a data mining process is to collect the data. A collection

mechanism can be set up to obtain data or the data may already exist in a dataset from an

outside source. After the necessary data have been acquired, they must be put into a

format that can be imported into any statistical packages that will be used to analyze

them. Once the data have been imported, they need to be cleaned for analysis.

2.1 Acquisition of Data

The first step in the data mining process is to acquire the data. Depending on

what is studied, a collection mechanism for data may have to be set up or the data can

come from an outside source. Collecting data can be very expensive and time consuming,

but necessary. When collecting the data during the study, the validity of the collection

mechanism and the data are known. The necessary data may already exist. Studies on

many topics have been done over time and the data for these studies may still be

available. With the invention, and now widespread use, of the computer, much of the data

for these studies are on magnetic media, easily copied, and transferable for fellow

researchers to use. Governmental agencies, such as the Census Bureau (www.census.gov)

and C.D.C. (www.cdc.gov), have collected data for years and have large datasets in the

public domain online for downloading. The Freedom of Information Act (FOIA

http://www.usdoj.gov/foia/) gives access to governmental data with some restrictions.

These data may or may not be in an easily usable format and the restrictions may not

allow all of the desired data to be made available due to privacy or security issues. Data

from other countries are less restricted and are available in a variety of formats. Data can

be bought from outside sources. Some companies can be contracted to collect data or the

data may have already been collected and are available for sale to researchers. When the

data come from an outside source, the validity of the data should be considered. There are

pros and cons to both ways of acquiring data but it is up to the researcher to find the data

and to discuss its validity. For the purpose of this data mining project, a public domain

database was used; one that was closely related to an aspect of health care.

Much of the public domain data are already available on the Internet and the

various search engines make it easy to find relevant datasets. Many of the search engines

will point to Internet sites that give or sell data. The Lycos search engine

(www.lycos.com) was developed at Carnegie Mellon University and tends to point to

more research oriented web sites than other search engines. Other search engines

providing pointers to data include Excite (www.excite.com), Alta-Vista

(www.altavista.com), and MSN (www.msn.com).

A new generation of web tools has been developed to make searching easier and

more thorough. Copernic2000 is one of these web tools; it searches many different search

engine databases for whatever topic is being queried. These web search tools are highly

configurable and can be modified to the individual preferences of users. Web users tend

to prefer certain search engines and web search tools allow the user to focus on the search

engines of their choice. The level of search in the databases can also be defined by

choosing how many hits from each search engine database are allowed. These web search

tools can also search other types of Internet sites such as news groups, email databases,

online businesses, news, and many other focused sites.

A sample run of a web search tool (Copernic2000) (Image 1)

Whether a web search engine or web search tool is used, there are certain

guidelines that should be followed. First, use a keyword such as “dataset” and avoid

words such as “data” or “database”. Keywords “data” and “database” will point to

results, database programs, or databases of articles but the keyword “dataset” will focus

on collections of data. Use the option of searching for all words in a query and if that

does not work, use a search on any words in a query. When a URL is found, consider the

source of the site and its possible biases. There is no optimal way to find data on the

Internet but with the development and refinement of web search tools, locating data is

becoming an easier task.

2.2 Importing Data

Once a dataset has been found, the dataset needs to be imported into statistical

programs for analysis. The data mining process used to investigate the data relies on

standard statistical packages such as SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS

Clementine 5.2® (SPSS Inc.). In order to make the investigations, the statistical packages

must be able to read the data. Data are not always in a format that the different statistical

packages can automatically import. Many sites, such as the C.D.C., put their public data

in an ASCII (text) format with rules of how to import the file correctly. Otherwise, the

data are released in a database format or another standard type file. The dataset analyzed

in this paper was in a self-extracting ZIP file that contained 12 ASCII files, one being a

file that explained how the data file was arranged.

There are many different file formats used to save data and to import data. There

are pros and cons to each type of file format. ASCII files are generally either character

delimited files or columnar fixed width files. Character delimited files use a special

character such as a comma or tab to separate variable columns. When importing these

types of files, errors can occur when a special character is included in a text field, or the

spacing may be shifted enough to confuse tab-delimited imports. Fixed width columnar

ASCII files are not as easy to import, but the import allows the user to work with each

variable and to define variable names, labels, and text related to each variable. The user

can format and label the data to individual preference. The user should become very

familiar with the data variables in the dataset.
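The two ASCII layouts described above can be illustrated with a minimal Python sketch. It is not part of the original import, which was done in SPSS 10. The three-variable layout, the sample records, and the function names are hypothetical; the real NAMCS column positions are given in the documentation file distributed with the data.

```python
import csv
import io

# Hypothetical layout (name, start, end); NOT the real NAMCS column positions.
FIXED_WIDTH_LAYOUT = [
    ("age", 0, 3),        # characters 1-3
    ("sex", 3, 4),        # character 4
    ("specialty", 4, 7),  # characters 5-7
]


def parse_fixed_width(line, layout=FIXED_WIDTH_LAYOUT):
    """Slice one fixed-width record into a dict of stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


def parse_delimited(text, delimiter=","):
    """Parse delimited text; quoted fields may contain the delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Fixed width: every variable is defined by position, so nothing can shift.
record = parse_fixed_width(" 34F CD")

# Delimited: a comma inside a text field splits the field when the import
# is naive, but survives when the field is quoted and parsed properly.
fields_naive = "Smith, John,34".split(",")
rows_quoted = parse_delimited('"Smith, John",34')
```

The naive split yields three fields where two were intended, which is the delimiter error described above. The fixed-width parser requires the layout to be spelled out in advance, which is the extra work, and the extra control, that columnar files demand.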

There are many standard file types that can be imported, including spreadsheet,

database, and portable files. Spreadsheets are the easiest to import but they sometimes

have record number limitations. The variable names can be included in the first row for

ease of importation. In this study, the dataset used was imported into SPSS 10 from a

columnar ASCII file. An attempt to write the 24,610 records to an Excel spreadsheet file

failed and only wrote 16,383 of the records. This may be an issue with SPSS 10 and older

restrictions on spreadsheet files. Database files are another type of file that can be

imported. Flat file databases (all data contained in one table) and well designed relational

databases (multiple tables related by keys) are not a problem to import but some

relational databases are not always structured well and create importing problems.

Different relational database tables within the same database may contain the identical

table variable names that are not meant to be linked but the import features in some

statistical programs try to link them anyway. Other table links may need to be defined in

a certain way such as one to one, one to many, or many to many and these links do not

always import the data correctly. Outside of having the data in the statistical packages’

file format, portable files are the best choice for importing data. The data with their

variable names are stored in this portable file type for ease of import but the only failing

of this portable file type is that it does not include variable labels or text related to

nominal data. Usually researchers do not have much say in what format the data will be

found, but if possible, they should request data in a portable format or the native format

of their statistical package.

Once the data are imported, they may need to be cleaned. Unless the data were

formatted during the import, the variable labels and text related to nominal data have not

been defined. It is not necessary to define them but the labels and nominal data text make

the analysis easier to comprehend. Some data records may contain missing or invalid

information and the records need to be either corrected or removed. Some variables may

not be necessary and can also be removed. The dataset used in this paper initially

contained 224 variables that were reduced to 33 variables as the analysis was refined. For

a full list of variables, see Appendix A. Many variables contained information about the

“marked” status of another variable and could be removed. Some removed variables were

lengthy text entries that were rarely used. Other removed variables contained medicine

codes. Many of the variables were removed after initial analyses showed little promise

for them. Some categorical data can also be refined to be a more manageable size. One of

the variables in the dataset contained more than 300 different categories that could have

easily been refined to a more manageable 9 categories. Some data may also need some

editing to fix errors such as missed decimal placements, text in numeric fields forcing

numeric variables to import as text, and converting variable types to correct types of data.
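The cleaning steps listed above (dropping unneeded variables, removing records with missing values, collapsing categories, and catching text in numeric fields) can be sketched in Python as follows. This is an illustration only; the cleaning in this thesis was done inside the statistical packages. The variable names, the codes in the recode table, and the sample records are all hypothetical.

```python
def to_number(value):
    """Return a float, or None when the field holds text or is blank."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical recode table: many detailed codes collapse into a few groups.
RECODE = {"101": "Group A", "102": "Group A", "250": "Group B"}


def recode(code, mapping=RECODE, default="Other"):
    """Map a detailed category code to its broader group."""
    return mapping.get(str(code).strip(), default)


def clean(records, keep):
    """Keep only the listed variables; drop records missing any of them."""
    cleaned = []
    for record in records:
        row = {name: record.get(name) for name in keep}
        if all(value not in (None, "") for value in row.values()):
            cleaned.append(row)
    return cleaned


raw = [
    {"age": "34", "code": "101", "note": "free text"},
    {"age": "", "code": "250", "note": ""},     # missing age: removed
    {"age": "4O", "code": "999", "note": ""},   # letter O typed for zero
]
kept = clean(raw, keep=["age", "code"])
ages = [to_number(row["age"]) for row in kept]
groups = [recode(row["code"]) for row in kept]
```

Returning None for an unreadable number, rather than raising an error, lets the analyst list the offending records and decide whether to correct or remove them.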

Data mining tools examine data from a variety of angles with a number of

different statistical methods. Not all of these statistical tools or programs can read or

write to common file types without loss of some formatting. Therefore trading data

between programs can sometimes become a problem. SAS programs cannot read native

SPSS 10 SAV files and SPSS programs cannot read native SAS files. Both programs can

read and write to common file types but the difficulties described previously can still

occur. Saving data in an ASCII file from one program, then importing the data into

another program can give delimiting problems, or if the columnar format is used, the

variables have to be redefined. Transferring data from one statistical program to another

using spreadsheet format will work better but the constraint on sheet size may limit the

number of records transferred. Portable and database files are the best options currently

available but these formats do not save the variable labels or the text related to nominal

data. An ideal situation would be a format that all statistical packages could export to and

import from without the loss of variable labels and text related to nominal data.

Unfortunately, this ideal currently does not exist.

CHAPTER III

DATA VISUALIZATION USING THE DIFFERENCE FROM EXPECTED PERCENTAGE PLOT (DFEPP):

PROGRAM DESIGN AND USE

One very important aspect of data mining is visualization, usually in graphical

form. There are many different statistical programs that analyze data and have a number

of graphical formats but these programs may not analyze the data in the desired way or

present results in the best manner. Presenting information in a useful and digestible form

is very important in the data mining process. Most papers are written for audiences with

varying degrees of statistical knowledge and should be written to accommodate most, if

not all, of the audience. Visual representation of information is the simplest way to digest

results for the general population and technical detail can be added to validate

information for those with greater statistical knowledge. The statistical packages used

give effective analyses and reporting but they do not always present significant results in

a manner desired by the investigator. For this reason, a program was written and

designed, by the author of this thesis, in Visual Basic 6 using some guidelines in

visualization. In this chapter the design and use of the Difference From Expected

Percentage Plot (DFEPP) program will be covered.

3.1 The Graph Design

Presenting results from a data analysis in a format that is easily read is a necessity

when analyzing and reporting on data. Analysis results should be presented in layers of

detail from the most general to the most in-depth. Graphs and plots are easily understood

and are used for a quick, less detailed, analysis of data. Tables and associated numeric

information can also be used in the presentation of data for greater detail but are

generally less easy to understand. A mix of the two types of presentations is an ideal way

to present data analysis results to a general audience with varying degrees of statistical

knowledge.

There have been very few publications on data presentation and graphic design but the

few publications written provide some basic guidelines (Tufte, 1997; White, 1984).

According to Tufte’s “The Visual Display of Quantitative Information” (Tufte, 1997):

Excellence in statistical graphics consists of complex ideas communicated with

clarity, precision, and efficiency. Graphical displays should:

• Show the data.

• Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else.

• Avoid distorting what the data have to say.

• Present many numbers in a small space.

• Make large data sets coherent.

• Encourage the eye to compare different pieces of data.

• Reveal the data at several levels of detail, from a broad overview to the fine structure.

• Serve a reasonably clear purpose: description, exploration, tabulation, or decoration.

• Be closely integrated with the statistical and verbal descriptions of a data set.

Jan V. White’s “Using Charts and Graphs” (White, 1984) suggested some other concepts

to include:

• Sort from most to least significant.

• Make sure plot segments are connected to associated text.

• Make significances stand out.

This graph (a plot from a program discussed later in this chapter) uses many of the

concepts included in the books by White and Tufte.

A working example of the output for data visualization: (Image 2)

The above graph shows the data and is simple enough that the design of the graph is not a

distraction from the data presentation. It reduces a large dataset to a simple plot, stratum

information, count, chi square, and associated p-value to provide much information in a

small space and makes a large data set coherent. It encourages the eye to compare

different pieces of data through the use of color and by listing the categories by

significance. It serves a reasonably clear purpose: description, exploration, and tabulation

and it is closely integrated with the statistical and verbal descriptions of a data set. For

ease of readability, the text for each category (actual %, category name, count, chi square

value, and associated p-value) is connected by a line to the associated bar plot.

3.2 The Design of the DFEPP Program

There are a variety of factors to consider when writing any program: who will use

the program, what operating systems will be used, what type of data will be used, the

intended purpose, and the intended output. Some specialized programs can be written

cryptically but they are usually for a very limited audience that is generally familiar with

its use. Graphical User Interface (GUI) programs are much less cryptic and the easiest

type of program to use for a novice. Older programs were developed where the user

interacted via a command line interface that would intimidate some users, but most GUI

based programs use standardized graphic and menu controls familiar to most computer

users. GUI makes the programs extremely easy to use. Any program that might be used

by the general public should be GUI based.

Many different programming languages were considered in the development of

this DFEPP program. ANSI (American National Standards Institute) C and C++ are very

powerful programming languages and can be compiled to run on many different

operating systems, but they lack some of the features needed to copy a generated graph

into a clipboard for pasting into other applications. The Visual C and C++ packages have

a better user interface with the ability to copy graphs onto a clipboard but these languages

are not ANSI compliant and will only run on a few types of operating systems. Java,

developed by Sun Microsystems (www.sun.com), was another language considered for

its portability but it is fairly limited with respect to pasting results into a Windows

clipboard. Microsoft Visual Basic 6 ® (VB6) was used to write the DFEPP program.

VB6 programs are easy to write, and easy to use because they rely on Windows based

controls and interface. Anyone who is somewhat familiar with Windows can use a VB6

coded program. In this program, the interface is familiar and the cutting and pasting of

the generated graph into another program is a simple matter due to the tools included in

VB6. The only downside of VB6 is that it only works on a limited number of operating

systems (MS Windows based), but those few operating systems are on 90%+ of all PCs.

Creating visualization programs requires consideration of how the data are entered,

processed, and used. A majority of programming languages can read and write to a

variety of file types and structured files. Input from sources such as keyboards and

scanners, and output to devices such as monitors and printers can be easily accomplished

by most programming languages. The DFEPP program merely required some simple

input into text boxes and a mouse click to plot the graph. VB6 gives an easy input method

for the user as well as easy access to clipboard controls. The graphical output of the

DFEPP program needed to be pasted into other Windows based programs (such as Word

and Excel) and the tools in VB6 programming environment allowed for easy copying and

pasting of a graph. Other languages would also do all of the necessary processing but the

input and output needed would not be as user-friendly.

3.3 The Use of the DFEPP Program

The DFEPP program was written to show significant differences between

expected and actual values of one stratum of a categorical variable across all strata of

another categorical variable. The dataset to be analyzed in Chapter IV was reduced to 33

categorical variables containing data on patient demographics, types of physician

practices, payment for services, and other information on ambulatory visits. An example

of the use of this program is to look at how the different payee types disproportionately

go to different practices. For instance, assuming that payee types visit practice types at

the same rate as their overall percent of population, the privately insured should be 51%

of each type of physician practice.
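The expected percentage and the actual percentages that the program takes as input can be derived from the visit records with a short tabulation. The Python sketch below is an illustration under stated assumptions, not the DFEPP code, which was written in VB6: the function names are hypothetical and the six sample visits are invented to keep the arithmetic visible.

```python
from collections import Counter


def expected_percent(visits, payee):
    """Overall share of visits with the given payee type (the baseline)."""
    matching = sum(1 for p, _ in visits if p == payee)
    return 100.0 * matching / len(visits)


def actual_percent_by_practice(visits, payee):
    """Map each practice type to (percent of its visits with payee, visit count)."""
    totals = Counter(practice for _, practice in visits)
    hits = Counter(practice for p, practice in visits if p == payee)
    return {
        practice: (100.0 * hits[practice] / count, count)
        for practice, count in totals.items()
    }


# Invented (payee type, practice type) pairs, one per visit.
visits = [
    ("Private", "Cardiology"),
    ("Medicare", "Cardiology"),
    ("Medicare", "Cardiology"),
    ("Private", "Pediatrics"),
    ("Private", "Pediatrics"),
    ("Medicaid", "Pediatrics"),
]
baseline = expected_percent(visits, "Private")
actual = actual_percent_by_practice(visits, "Private")
```

In this toy sample the baseline is 50%, while privately insured visits make up one third of cardiology visits and two thirds of pediatric visits. Those departures from the baseline are what the plot is designed to display.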

An output example (Image 3)

The program sorts the categories (practice types (J)) from greatest to least by the

percentage difference between the actual percentage (the percent of privately insured

patients in a practice type) and the expected percentage (the overall percent of privately

insured patients, 51% (H)). It then generates a difference from expected percentage plot,

using the expected percent value (51%) as a baseline and the actual percent values to plot

a bar graph. The user

defines the major and minor percentage differences to their preference (L). The program

highlights the major percentage differences in red and minor percentage differences in

blue within the bar plot (I). Chi Square values are also derived using the number of

elements in each stratum (practice types), and the actual percentages and expected

percentages of the isolated strata (payee type (K)). For example, the privately insured

were 51% of all patients but only 32% of 1418 cardiology patients, giving a Chi Square

value of 204.84.

\[
\chi^2 = \frac{\bigl((51\% - 32\%) \cdot 1418\bigr)^2}{51\% \cdot 1418} + \frac{\bigl((49\% - 68\%) \cdot 1418\bigr)^2}{49\% \cdot 1418} = 204.84
\]
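The calculation above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the VB6 routine used in the DFEPP program, and the function name is hypothetical. It compares the observed and expected counts inside and outside the isolated stratum for one category.

```python
def dfepp_chi_square(expected_pct, actual_pct, n):
    """Chi Square for one category: in-stratum and out-of-stratum cells.

    expected_pct -- overall percentage of the isolated stratum (e.g. 51)
    actual_pct   -- percentage of the stratum within this category (e.g. 32)
    n            -- number of records in the category (e.g. 1418)
    """
    expected_in = expected_pct / 100.0 * n
    expected_out = (100.0 - expected_pct) / 100.0 * n
    observed_in = actual_pct / 100.0 * n
    observed_out = n - observed_in
    return ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)


# Privately insured: 51% of all patients, 32% of 1418 cardiology patients.
chi_square = dfepp_chi_square(51, 32, 1418)
```

The function assumes an expected percentage strictly between 0 and 100; at either extreme one expected count is zero and the statistic is undefined.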

The Chi Square values are also highlighted by color for significance. In this paper an

alpha of 0.01 is considered the cut-off point for major significance and the Chi Square

values greater than or equal to 6.635 are highlighted red. Chi Square values between

3.841 and 6.635 are associated with an alpha of 0.05 and have lesser significance but are

highlighted blue in case the user chooses to point out those significances with the less

stringent alpha. The p-values that are associated with the Chi Square values with one degree of

freedom are also given and highlighted to associated significance. If there is no

significance then “No Sig” is displayed in the p-value column.
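The cut-off values and p-values described above can be reproduced in Python. The sketch below is illustrative and uses hypothetical function names. It relies on the standard identity that, with one degree of freedom, the upper-tail probability of a Chi Square value x equals erfc(sqrt(x / 2)). Treating a value of exactly 3.841 as significant at the 0.05 level is an assumption, since the text only says "between".

```python
import math

MAJOR_CUTOFF = 6.635  # alpha = 0.01, one degree of freedom
MINOR_CUTOFF = 3.841  # alpha = 0.05, one degree of freedom


def p_value_1df(chi_square):
    """Upper-tail probability of a Chi Square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


def highlight(chi_square):
    """Return the display colour, or "No Sig" when neither cut-off is met."""
    if chi_square >= MAJOR_CUTOFF:
        return "red"
    if chi_square >= MINOR_CUTOFF:  # assumed inclusive at the lower bound
        return "blue"
    return "No Sig"


labels = [highlight(value) for value in (204.84, 5.0, 1.2)]
```

Evaluating `p_value_1df` at the two cut-offs returns values very close to 0.01 and 0.05, which confirms that the cut-offs correspond to the stated alphas.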

A working example of the input for data visualization: (Image 4)

Box A takes the title, B gives the baseline percentage, and E takes the

categories. The major and minor percentage differences are entered in C. Column D

contains the actual percentage of A in each of the associated categories in column E.

Column F is the actual count of each of the associated categories in column E. Column G

contains a series of check boxes that select the categories in the associated column E to

be analyzed. Once the user has provided all of the necessary information, the plot option

is chosen in the menu bar to give the following hanging plot:

A working example of the output for data visualization: (Image 5)

H gives the strata analyzed with their expected percentage values. I is the difference

from expected percentage plot using the expected percentage value as a baseline and the

actual percentage values contained in J. J contains each category, the percentage of

strata H in each category, and the number of total members per category. K contains the

Chi Square values and associated p-values for the corresponding categories in J,

highlighting the values with some significance by color. L is a legend for graph I

explaining the major and minor significance lines. If the user is satisfied with the graph

then the copy option may be chosen in the menu bar to copy the graph into the clipboard

to paste into another program. Otherwise the user closes the graph window to modify the

initial data entry window, adjusts the graph options, and then plots the updated graph.

CHAPTER IV

DATA MINING OF THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY (NAMCS) DATASET

The 1997 National Ambulatory Medical Care Survey (NAMCS) is a national

probability sample survey conducted by the Division of Health Care Statistics of the National

Center for Health Statistics (NCHS), Centers for Disease Control and Prevention

(CDC). The survey consists of 24,715 patient records from visits to 1,247 physicians in

the year 1997. Initially, each patient visit record consisted of 224 variables, including

demographic information, diagnoses, drugs prescribed, types of visits, types of medical

professionals seen, medical tests and screenings done, and location of physician office.

During the data cleanup phase of the project, many of these variables were removed as

the focus of the analysis narrowed, leaving 33 variables with information about pay

method, race, age group, practice types, and other categorical information. Other

variables were reduced from 500+ different categories, using the SPSS

Transform/Compute feature, to far fewer categories. Once the data were cleaned, they

were analyzed using a variety of statistics packages and methods. In section 4.1, the

relationship of patient payee types to practice types was analyzed to look for

disproportionate relationships. SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS

Clementine 5.2® (SPSS Inc.) were all used to analyze the dataset but the DFEPP program

was used to do much of the visualization.

4.1 Analysis of Payee Type by Practice Type

In this section, the ways the different payee types visited the various practice

types were analyzed. Initially there were fourteen practice types (Cardiologists,

Dermatologists, General/Family Practice, General Surgery, Internal Medicine,

Neurology, OB/GYN, Ophthalmology, Orthopedic Surgery, Otolaryngology, Pediatrics,

Psychiatry, Urology, and Other) and nine types of payees (privately insured, Medicare,

Medicaid, workers compensation, self-pay, no charge, other, unknown, and blank). Since

payee types identified as no charge, other, unknown, and blank all had a limited number

of records, they were collected into one “all other” payee type yielding the following

distribution of payee types:

Payee Types (Table 1)

Payee Type           Count    Col %
Private Insurance    12562    51.0%
Medicare              5395    21.9%
Medicaid              1945     7.9%
Worker's Comp.         503     2.0%
Self-Pay              2176     8.8%
All Other             2029     8.2%
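The column percentages in Table 1 serve as the expected (baseline) percentages for the rest of this section. As a minimal sketch, they can be recomputed from the counts as follows; the dictionary and function names are hypothetical.

```python
# Visit counts per payee type, taken from Table 1.
PAYEE_COUNTS = {
    "Private Insurance": 12562,
    "Medicare": 5395,
    "Medicaid": 1945,
    "Worker's Comp.": 503,
    "Self-Pay": 2176,
    "All Other": 2029,
}


def baseline_percentages(counts):
    """Return each payee type's share of all visits, in percent, to one decimal."""
    total = sum(counts.values())
    return {payee: round(100.0 * n / total, 1) for payee, n in counts.items()}


if __name__ == "__main__":
    for payee, percent in baseline_percentages(PAYEE_COUNTS).items():
        print(f"{payee:<18} {percent:5.1f}%")
```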

Each payee type is given as a percentage of the overall study population. If payee type and practice type were unrelated, each payee type would be

near the same percentage of each practice type’s patient load, but this is not always the

case. Many of the payee types were associated with particular practice types, but some of these

preferences are expected and some are not. Although statistical methods can be used to

investigate specific hypotheses, the primary purpose of data mining is to generate

hypotheses to be examined for validity either with fresh data or by withholding a portion

of the initial dataset for investigation. The first investigation examined all of the cases

and the relationships between payee type and the number of visits to a particular practice

type.

The DFEPP plot below gives an indication of this relationship:

Workers Compensation by Physician Specialty (Plot 1)

4.1.1 Workers Compensation

Workers compensation payees were 2% of all payee types and, if there were no

relationship, would be expected to be near 2% of patient visits to each type of practice.

The workers compensation payees go to orthopedic surgeons at a much higher rate than

the expected 2% of visits to orthopedic surgeons. They were 18.1% of the 1222

orthopedic surgeons’ patients and the probability that the null hypothesis is true (actual

number of visits was the expected 2% of visits to orthopedic surgeons) is less than 0.0005

(χ2=1616.1, p<0.0005) showing that these types of payees go to orthopedic surgeons at a

significantly higher rate. This is not completely unexpected since people go to orthopedic

surgeons for breaks and bruises and these are the main types of injuries that occur at

work. By using the filter and general tables/frequency features in SPSS 10, the reasons

that workers compensation payees visited orthopedic surgeons can be determined.
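The chi-square values reported in this section are consistent with a one-degree-of-freedom comparison of an observed share against the baseline share, χ2 = n(p − p0)²/(p0(1 − p0)), evaluated with the rounded percentages. The Python sketch below is an assumption about that computation, not the DFEPP source, and the function name is hypothetical. It uses the orthopedic surgery visit count of 1222 from Table 8.

```python
def chi_square_vs_baseline(n_visits, observed_share, expected_share):
    """One-df chi-square for an observed share of n_visits against a baseline share."""
    difference = observed_share - expected_share
    return n_visits * difference ** 2 / (expected_share * (1.0 - expected_share))


if __name__ == "__main__":
    # Workers compensation share of orthopedic surgery visits:
    # 18.1% observed against the 2% baseline.
    print(round(chi_square_vs_baseline(1222, 0.181, 0.02), 1))
```

With these inputs the result agrees with the reported χ2 of 1616.1 to one decimal place.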

Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits (Table 2)

Workers Comp. to Orthopedic Surgeons

Count       %   ICD-9 Code Category
    1     .5%   140-239 Neoplasms
    1     .5%   240-279 Endocrine, nutritional and metabolic diseases, and immunity disorders
   18    8.1%   320-389 Diseases of the nervous system and sense organs
    3    1.4%   680-709 Diseases of the skin and subcutaneous tissue
   70   31.7%   710-739 Diseases of the musculoskeletal system and connective tissue
    1     .5%   780-799 Symptoms, signs, and ill-defined conditions
  109   49.3%   800-999 Injury and poisoning
   18    8.1%   V - Supplementary classification of factors influencing health status and contact with health services

The preceding table is somewhat vague. Using a less broad categorical variable for the ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) codes gives a better understanding of why the workers compensation payees went to orthopedic surgeons, as the following table shows.

Workers Compensation ICD-9 Codes for Orthopedic visits (Table 3)

ICD-9-CM Codes for Workers Comp. to Orthopedic Surgeons

Count       %   ICD-9-CM Code
    1     .5%   00 intestinal infectious diseases
    1     .5%   215.3 benign neoplasms, lower limb, including hip
    1     .5%   278.0 Obesity
    1     .5%   337.21 Reflex sympathetic dystrophy of the upper limb
   17    7.7%   35x.xx Carpal tunnel syndrome (13), Lesion of ulnar nerve (2), Lesion of ulnar nerve (1), & Mononeuritis (1)
    2     .9%   68x.xx Diseases of the skin and subcutaneous tissue
    1     .5%   70x.xx
   26   11.8%   71x.xx Diseases of the musculoskeletal system and connective tissue
   43   19.5%   72x.xx
    1     .5%   73x.xx
    1     .5%   79x.xx ill-defined and unknown causes of morbidity and mortality
    1     .5%   80x.xx fractures
   13    5.9%   81x.xx
   16    7.2%   82x.xx
   12    5.4%   83x.xx dislocations
   50   22.6%   84x.xx sprains and strains of joints and adjacent muscles
    1     .5%   87x.xx open wound
    5    2.3%   88x.xx
    1     .5%   905.9 Late effect of traumatic amputation
    5    2.3%   92x.xx contusion with intact skin surface or crushing injury
    4    1.8%   95x.xx injury to nerves and spinal cord
    1     .5%   996.6 Infection and inflammatory reaction due to internal prosthetic device, implant, and graft
    5    2.3%   V1 persons with potential health hazards related to personal and family history
    6    2.7%   V4 persons with a condition influencing their health status
    1     .5%   V5 persons encountering health services for specific procedures and aftercare
    4    1.8%   V6 persons encountering health services in other circumstances
    1     .5%   V9 missing

Initially this table was created using the first 2 characters in the ICD-9-CM codes but

categories that had just a few visits could be better described by extracting the full

ICD-9-CM code from a complete non-abbreviated table of workers compensation visits

to orthopedic surgeons. The table shows that 73.3% of the visits were for sprains, strains,

breaks, and bruises while 9.5% of the visits were for nerve damage (7.7% carpal tunnel,

1.8% nerve/spinal cord damage), 4.1% were for cuts, and 13.1% for all other.

The workers compensation payees also go to neurologists at a higher rate than the

expected 2% of visits. They comprised 4.1% of the 703 neurology patients and the

probability that the null hypothesis (actual % = 2% expected) is true is less than 0.0005

(χ2=15.82, p<0.0005). Therefore the null hypothesis is rejected in favor of the alternative (workers

compensation patients go to neurologists at a significantly higher rate than expected). By

filtering the data and then tabling it, the reason the workers compensation payees visited

this practice type can be determined. The following frequency table of workers

compensation payees going to neurology visits shows why they went:

Workers Compensation ICD-9 Grouped Codes for Neurology Visits (Table 4)

Workers Comp. to Neurologists

Count       %   ICD-9 Code Category
    2    6.9%   290-319 Mental disorders
    3   10.3%   320-389 Diseases of the nervous system and sense organs
    9   31.0%   710-739 Diseases of the musculoskeletal system and connective tissue
    5   17.2%   780-799 Symptoms, signs, and ill-defined conditions
   10   34.5%   800-999 Injury and poisoning

Again, for such a small number of cases, a more in-depth analysis can be done by

comparing the complete ICD-9-CM code of each patient’s visit.

Workers Compensation ICD-9 Codes for Neurology Visits (Table 5)

Physician's diagnoses for Workers' Comp. Neurology visits

Count       %   ICD-9-CM Code
    2    6.9%   3102- Concussion
    1    3.4%   3530- Nerve Dmg
    2    6.9%   3540- Carpal Tunnel
    1    3.4%   72210 Back Injuries
    1    3.4%   72280
    1    3.4%   7231-
    2    6.9%   7242-
    1    3.4%   7244-
    1    3.4%   7245-
    1    3.4%   7292- Soft Tissue Dmg
    1    3.4%   7299-
    1    3.4%   7803- Convulsions
    4   13.8%   7820- Disturbance of skin sensation
    4   13.8%   8471- Sprains and strains of joints and adjacent muscles
    3   10.3%   8472-
    1    3.4%   8479-
    1    3.4%   8489-
    1    3.4%   8840- Upper Limb Wound

The table shows that 55.1% of the visits were for sprains, strains, breaks, and bruises.

24.1% of the visits were for nerve damage (6.9% carpal tunnel, 17.2% nerve/spinal cord

damage), 10.3% were for cuts, and 10.3% for all other. The workers compensation

payees go to orthopedic surgeons for many of the same reasons but at somewhat different

proportions.

There were many other practice types that had significantly lower percentages of

workers compensation visits but this is not unexpected. Children are not generally

involved with work and therefore would not use workers compensation to pay for

pediatric visits. OB/GYN and urology visits would also rarely be paid for by workers

compensation. Excluding the visits to these practices and to the practices with a

disproportionately higher percentage of visits will give a better representation of how the

other types of practices are visited by workers compensation payees. Within this subset of

visits to the remaining practice types, the payee types are distributed as

follows:

New Proportions after WC Orthopedic Surgeon Visits are removed (Table 6)

Payee Type           Count        %
Private Insurance     7962    47.0%
Medicare              4483    26.5%
Medicaid              1098     6.5%
Worker's Comp.         251     1.5%
Self-Pay              1753    10.3%
All Other             1393     8.2%
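The totals in Table 6 are consistent with dropping the pediatrics, OB/GYN, urology, orthopedic surgery, and neurology visits (visit counts from Table 8) and recomputing the workers compensation share of what remains. The following Python check is a sketch under that reading; the names are hypothetical.

```python
TOTAL_VISITS = 24610  # all visits with a payee type (Table 1)

# Visit counts (N) for the excluded practice types, from Table 8.
EXCLUDED_VISITS = {
    "Pediatrics": 2651,
    "Obstetrics and gynecology": 2022,
    "Urology": 1072,
    "Orthopedic surgery": 1222,
    "Neurology": 703,
}

WC_REMAINING = 251  # workers compensation visits left in Table 6


def adjusted_baseline(total, excluded, payee_remaining):
    """Return the remaining visit total and the payee's share of it in percent."""
    remaining = total - sum(excluded.values())
    return remaining, round(100.0 * payee_remaining / remaining, 1)


if __name__ == "__main__":
    print(adjusted_baseline(TOTAL_VISITS, EXCLUDED_VISITS, WC_REMAINING))
```

The remaining total of 16940 equals the sum of the counts in Table 6, and the recomputed share rounds to the 1.5% used as the new expected value.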

The workers compensation payees are reduced to 1.5% of the patient population. When

the 1.5% value is used as the expected value to analyze the data with the DFEPP

program, the following plot is generated:

Adjusted plot after removal (Plot 2)

The plot shows that there is not much deviation from the expected percentage but there

are significant differences when the chi square values are considered. Visits to ‘Other’

physicians show the greatest deviation from the expected value with a significantly

higher than expected number of visits. ‘Other’ physicians treated workers compensation

payees for a variety of reasons but mainly for the same reasons as the orthopedic surgeon

visits: sprains, strains, breaks, cuts, and bruises (see table 7).

Workers’ Comp “Other” Physician Visits (Table 7)

ICD-9-CM Code   Count   Col %      ICD-9-CM Code   Count   Col %
1119-               1    1.2%      81600               2    2.5%
25000               1    1.2%      8360-               1    1.2%
33720               1    1.2%      8404-               1    1.2%
33722               1    1.2%      8409-               2    2.5%
3540-               2    2.5%      8449-               1    1.2%
37205               1    1.2%      8460-               2    2.5%
49390               1    1.2%      8469-               1    1.2%
515--               1    1.2%      8470-               3    3.7%
55092               1    1.2%      8471-               2    2.5%
71885               1    1.2%      8472-               3    3.7%
71943               1    1.2%      8489-               3    3.7%
7210-               1    1.2%      8793-               1    1.2%
7217-               1    1.2%      8820-               1    1.2%
72210               4    4.9%      8830-               2    2.5%
72252               1    1.2%      8860-               1    1.2%
72280               1    1.2%      9064-               1    1.2%
7234-               1    1.2%      9069-               1    1.2%
72400               1    1.2%      9248-               3    3.7%
7242-               2    2.5%      9300-               1    1.2%
7244-               1    1.2%      9404-               1    1.2%
7245-               4    4.9%      94420               1    1.2%
7246-               1    1.2%      9556-               1    1.2%
7248-               1    1.2%      9594-               1    1.2%
72632               2    2.5%      9595-               1    1.2%
7294-               1    1.2%      V135-               1    1.2%
75612               1    1.2%      V155-               1    1.2%
7804-               1    1.2%      V583-               1    1.2%
7809-               1    1.2%      V6759               1    1.2%
7820-               1    1.2%      V703-               1    1.2%
                                   V990-               1    1.2%

Row-group labels in the original table: Disease Related; Breaks, Bruises, Strains, Sprains, Cuts; Personal History; Follow-up; Blank (V990-).

Psychiatry and general surgery practices were also visited at a significantly higher

rate than the expected 1.5% visit rate for the workers compensation payees. The main

reason for psychiatric visits was depression (74%). General surgery visits were mainly for

cuts, burns, and other wounds (33.3%) and for breaks, bruises, strains,

and sprains (43%). Notably there are significantly lower numbers of visits to ophthalmology

(0.3% actual vs. 1.5% expected) and otolaryngology (0.1% actual vs. 1.5% expected)

practices. This may show that the OSHA (Occupational Safety & Health Administration,

http://www.osha.gov/) rules guarding against vision and hearing loss work effectively to reduce

such injuries. The dermatology visits were also significantly lower (0.1% actual vs. 1.5%

expected) but many burns and other skin problems were treated by general surgery

practices. This would explain, in part, the significantly higher number of general surgery

visits and the correspondingly lower number of dermatology visits.

With the workers compensation payees being 18.1% of the orthopedic surgery

visits and only 2.0% of the total population, the other payee types’ visits to orthopedic

surgeons will tend to show fewer visits. Therefore a lower number of visits will be

correspondingly less significant than shown in the DFEPP plots. Although there were

other significant disproportions, the workers compensation payee visits were too few to

create any major disproportion in other payee types’ visits to the various practices.

4.1.2 Medicare

Medicare by Physician Specialty (Plot 3)

Medicare payees were 21.9% of all payee types and if they showed no preference,

would be expected to be near 21.9% of patient visits to each type of practice. This is not

the case. Medicare patients went to cardiologists, urologists, ophthalmologists, and

internal medicine practices at significantly higher rates. The probability that they were the

expected 21.9% of each of the visit loads for each practice is less than 0.0005. They were

53.4% of 1418 visits to cardiologists (χ2=822.63, p<0.0005), 38.9% of 1072 to urologists

(χ2=181.13, p<0.0005), 38.6% of 1437 to ophthalmologists (χ2=234.31, p<0.0005), and

33.1% of the 2358 visits for internal medicine (χ2=172.94, p<0.0005). All but the internal

medicine visits are expected. The Medicare population consists of retired or disabled

individuals. The average age of the Medicare population is 71.6 years with a standard

deviation of 13.01 years. These types of practices treat heart problems, eyesight, and

urinary problems and these are the problems occurring in an older population.

Age statistics by Physician Type (Table 8)

AGE

Physician Specialty             Mean        N   Std. Deviation
General and family practice    42.73     3834   23.33
Internal medicine              55.09     2358   20.06
Pediatrics                      5.34     2651    7.33
General surgery                49.67     1270   20.47
Obstetrics and gynecology      35.82     2022   13.91
Orthopedic surgery             45.55     1222   21.91
Cardiovascular disease         65.41     1418   15.26
Dermatology                    46.38     1409   22.49
Urology                        57.25     1072   20.16
Psychiatry                     43.29     1461   16.96
Neurology                      46.23      703   21.83
Ophthalmology                  58.65     1437   22.51
Otolaryngology                 39.84     1175   24.81
All other                      52.26     2578   19.62
Total                          43.89    24610   24.81

The average age of the entire population is 43.89 years with a standard deviation of

24.81 years. The Medicare payees are significantly older; therefore they will

disproportionately visit those practices. The significantly higher number of internal

medicine visits by this population is harder to explain. Table 9 shows why Medicare and

non-Medicare payees visited internal medicine practices:

ICD-9 Codes tabled by Medicare Use (Table 9)

Uses Medicare: False    Uses Medicare: True
Count       %           Count       %       ICD-9 Code Category
   47    3.0%               8    1.0%       Infectious and parasitic diseases
   17    1.1%              10    1.3%       Neoplasms
  159   10.1%              86   11.0%       Endocrine, nutritional and metabolic diseases, and immunity
   12     .8%               5     .6%       Diseases of the blood and blood-forming organs
   47    3.0%              13    1.7%       Mental disorders
   68    4.3%              26    3.3%       Diseases of the nervous system and sense organs
  208   13.2%             247   31.7%       Diseases of the circulatory system
  246   15.6%              76    9.7%       Diseases of the respiratory system
   55    3.5%              22    2.8%       Diseases of the digestive system
   62    3.9%              27    3.5%       Diseases of the genitourinary system
    1     .1%               1     .1%       Complications of pregnancy, childbirth, and the puerperium
   54    3.4%              12    1.5%       Diseases of the skin and subcutaneous tissue
  142    9.0%              63    8.1%       Diseases of the musculoskeletal system and connective tissue
    1     .1%   (single cell; column not recoverable)   Congenital anomalies
  158   10.0%              85   10.9%       Symptoms, signs, and ill-defined conditions
   85    5.4%              21    2.7%       Injury and poisoning
  216   13.7%              78   10.0%       Supplementary classification of factors influencing health s

Medicare payees went to internal medicine practices for diseases of the circulatory

system at a very disproportionate rate. A total of 13.2% of the internal medicine visits by non-Medicare

payees were for diseases of the circulatory system but

31.7% of the internal medicine visits by Medicare payees were for the same

diseases. The other categorical reasons for the visits to this practice by Medicare and non-

Medicare payees were not that different. By reducing the number of visits for diseases of

the circulatory system of the Medicare population to the non-Medicare percentage rate,

the rate of Medicare payees going to internal medicine visits becomes less significant at

28.7%. By reducing the visits, a new χ2 value of 59.8 (p<0.0005) was computed showing

that there was still a significantly higher number of visits to this practice type by

Medicare payees. Other practices were visited at significantly lower rates: Medicare payees made up 0.8% of the 2651

visits to pediatricians (χ2=690.05, p<0.0005) and 4.7% of the 2022 visits to OB/GYNs (χ2=349.74,

p<0.0005). The Medicare visits to pediatricians are probably due to recording errors. The

low rate of visits to OB/GYNs for Medicare payees is not an unexpected result. The

average age of OB/GYN patients is 35.82 years with a standard deviation of 13.91 years.

The average age for Medicare patients is 71.6 years and this is over two standard

deviations from the average OB/GYN patients’ age.
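One way to read the internal medicine adjustment described above is to remove enough circulatory-system visits from the Medicare visits to bring their share down from 31.7% to the non-Medicare 13.2%, and then recompute the Medicare share and the chi-square value. The Python sketch below follows that reading. Both the adjustment procedure and the chi-square formula are assumptions, the function names are hypothetical, and the results only approximately match the reported 28.7% and 59.8 because of rounding.

```python
def chi_square_vs_baseline(n_visits, observed_share, expected_share):
    """One-df chi-square for an observed share of n_visits against a baseline share."""
    difference = observed_share - expected_share
    return n_visits * difference ** 2 / (expected_share * (1.0 - expected_share))


def adjust_medicare_internal_medicine():
    """Return (visits removed, adjusted Medicare share, adjusted chi-square)."""
    total_visits = 2358                            # internal medicine visits
    medicare_visits = round(0.331 * total_visits)  # 33.1% of those visits
    # Circulatory visits removed: the gap between 31.7% and 13.2% of Medicare visits.
    removed = round((0.317 - 0.132) * medicare_visits)
    new_medicare = medicare_visits - removed
    new_total = total_visits - removed
    share = new_medicare / new_total
    chi_square = chi_square_vs_baseline(new_total, round(share, 3), 0.219)
    return removed, share, chi_square


if __name__ == "__main__":
    removed, share, chi_square = adjust_medicare_internal_medicine()
    print(removed, round(100 * share, 1), round(chi_square, 1))
```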

With the Medicare payees responsible for 53.4% of the visits to cardiologists and

only 21.9% of the total population, the other payee types’ visits to cardiologists will tend

to show fewer visits and a correspondingly lower number of visits will be less significant

than shown in the DFEPP plots. Urology and ophthalmology visits were also at

significantly higher rates, though to a lesser degree, and will also skew downward the rates of other

payee types’ visits to these practices.

4.1.3 Medicaid

Medicaid payees accounted for 7.9% of all payee types and if they showed no

preference, would be expected to be near 7.9% of patient visits to each type of practice.

The Medicaid payees go to pediatricians at a much higher rate than the expected 7.9% of

visits to pediatricians. They were responsible for 20.0% of the 2651 pediatric patients and

the probability that the null hypothesis is true (actual number of visits was the expected

7.9% of visits to pediatricians) is less than 0.0005 (χ2=533.45, p<0.0005) showing that

these types of payees go to pediatricians at a significantly higher rate.

Medicaid by Physician Specialty (Plot 4)

The higher rate of Medicaid payees to pediatricians is not unexpected. The average age

for Medicaid payees is 27.47 years with a standard deviation of 24.43 years. The

distribution for this population is not normal and plot 5 shows this.

Distribution of Medicaid population (Plot 5)

[Histogram of AGE for the Medicaid population: Std. Dev = 24.43, Mean = 27.5, N = 1945]

The distribution of Medicaid payees is skewed towards the younger ages and it is the

youngest of all payee types.

Age Statistics by Payee Type (Table 10)

AGE

Payee Type            Mean        N   Std. Deviation
Private Insurance    36.22    12562   21.24
Medicare             71.61     5395   13.01
Medicaid             27.47     1945   24.43
Worker's Comp.       41.23      503   12.91
Self-Pay             36.88     2176   19.34
All Other            41.59     2029   22.02
Total                43.89    24610   24.81

95% of the pediatric visits were by patients 20 years or younger (plot 6) and since the

Medicaid population is the youngest, it would carry a disproportionately higher rate of

visits.

Age of Pediatric Patients (Plot 6)

[Histogram of AGE of pediatric patients: Std. Dev = 7.33, Mean = 5.3, N = 2651]

There were other practices that the Medicaid payees visited at lower than expected rates.

By removing the pediatric visits, it can be determined how the Medicaid population

visited the other practices. A new expected percentage of 6.4% of Medicaid payees is

used to re-evaluate the data with the DFEPP plot.

Modified Medicaid by Physician Specialty (Plot 7)

Urology and orthopedic surgery practices were visited at lower than expected rates but

this is merely a reflection of the disproportionately higher visits by the Medicare and

workers compensation payees to these practice types respectively. The OB/GYN visits

are significantly higher than expected but this population contains a greater percentage of

women of childbearing age and, with the significantly lower number of Medicare patients

attending this practice, a higher than expected share of OB/GYN visits would result. Surprisingly

the visits to dermatologists by the Medicaid payees are significantly lower than expected.

Many people believe that dermatology patients are mainly children with acne problems.

Plot 8 shows how the dermatology visits are distributed by age:

Age of Dermatology Patients (Plot 8)

[Histogram of AGE of all dermatologists' patients: Std. Dev = 22.49, Mean = 46.4, N = 1409]

The average age of dermatology patients is 46.4 years with a standard deviation of 22.49

years. The population of dermatology patients is far older than Medicaid payees and

would therefore have fewer Medicaid payees.

Although there was a disproportionately higher number of pediatric visits in the

Medicaid population, the lack of visits in the Medicare population will offset the higher

rate in this population giving the remaining payee types the potential to have near their

expected distribution for pediatric visits. The other practices of the Medicaid population

showed preferences that are merely a reflection of other payee types disproportionately

visiting those practices.

4.1.4 Self-Pay

Self-Pay payees were 8.8% of all payee types and if they showed no preference,

would be expected to be near 8.8% of patient visits to each type of practice. The Self-Pay

payees go to psychiatrists at a much higher rate than expected.

Self Pay by Physician Specialty (Plot 9)

They represented 26.0% of the 1461 psychiatric patients and the probability that the null

hypothesis is true (actual number of visits was the expected 8.8% of visits to

psychiatrists) is less than 0.0005 (χ2=538.55, p<0.0005) showing that these types of

payees go to psychiatrists at a significantly higher rate. This presents some possibilities:

that the uninsured have more problems that require psychiatric visits or that insurance

will not pay for psychiatric visits. The first possibility is hard to explore but the second

can be explored indirectly. There was no variable to determine if someone was insured

but there was a variable to determine if the patient was a member of an HMO. Overall

25.1% of patients were HMO members but only 4.4% of Self-Pay payees were members

of an HMO. When isolating the self-pay psychiatric visits, 8.4% of the patients were

members of an HMO. This shows that the self-pay patients going to psychiatrists were

more apt to be HMO members when compared to the entire self-pay population and

therefore were more apt to have insurance. Another consideration is that people did not

pay for these visits with insurance and therefore would not mark down whether or not

they belonged to an HMO. Regardless, self-pay payees do visit psychiatrists at a

significantly higher rate and at least 8.4% of the visits were by insured patients. Self-pay

payees also visited dermatologists at a significantly higher rate. They represented 16.9%

of the 1409 dermatology patients and the probability that the null hypothesis is true

(actual number of visits was the expected 8.8% of visits to dermatologists) is less than

0.0005 (χ2=115.19, p<0.0005) showing that self-pay payees go to dermatologists at a

significantly higher rate. Only 5.0% of the dermatology self-pay payees were members of

an HMO and this is not significantly higher than the 4.4%; however, there is another way

to evaluate the HMO data. There were four different responses to HMO insured: Yes, No,

Unknown, and Blank.

HMO Membership by Physician Specialty (Table 11)

Does the patient belong to an HMO? (Row %)

Physician Specialty              Yes       No   Unknown   Blank
General and family practice     2.2%    77.1%     19.6%    1.1%
Internal medicine               7.0%    84.2%      8.8%
Pediatrics                      3.0%    89.1%      6.7%    1.2%
General surgery                 1.6%    90.5%      7.9%
Obstetrics and gynecology       1.5%    92.6%      5.9%
Orthopedic surgery              5.0%    82.5%      2.5%   10.0%
Cardiovascular disease                  95.7%      4.3%
Dermatology                     5.0%    72.3%     22.3%     .4%
Urology                                 96.4%      3.6%
Psychiatry                      8.4%    54.7%     35.5%    1.3%
Neurology                               89.1%     10.9%
Ophthalmology                   3.9%    76.6%     18.0%    1.6%
Otolaryngology                  6.1%    89.0%      3.7%    1.2%
All other                       5.4%    59.5%     35.1%

The “Unknown” responses are more than likely insured patients who do not know whether they have an HMO plan, rather than uninsured patients who are unsure whether they have an HMO (insurance) plan. Collecting the “Yes” and “Unknown” responses into an “Insured” response and the “No” responses into a “Possible but No HMO” response separates the patients who are likely insured from those who are only possibly insured, yielding the following distribution (Table 12):

Has Insurance by Physician Specialty (Table 12)

Has Insurance (Row %)

Physician Specialty            Insured   Possible but No HMO   Blank
General and family practice      21.8%                 77.1%    1.1%
Internal medicine                15.8%                 84.2%
Pediatrics                        9.7%                 89.1%    1.2%
General surgery                   9.5%                 90.5%
Obstetrics and gynecology         7.4%                 92.6%
Orthopedic surgery                7.5%                 82.5%   10.0%
Cardiovascular disease            4.3%                 95.7%
Dermatology                      27.3%                 72.3%     .4%
Urology                           3.6%                 96.4%
Psychiatry                       43.9%                 54.7%    1.3%
Neurology                        10.9%                 89.1%
Ophthalmology                    21.9%                 76.6%    1.6%
Otolaryngology                    9.8%                 89.0%    1.2%
All other                        40.5%                 59.5%
Total                            24.3%                 74.8%     .9%
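The recoding from the four HMO responses to the three “Has Insurance” categories can be sketched in Python as follows. The function names are hypothetical; the original recoding was done in the statistics packages, not with this code.

```python
def has_insurance(hmo_response):
    """Map an HMO response to the 'Has Insurance' category used in Table 12."""
    mapping = {
        "yes": "Insured",
        "unknown": "Insured",
        "no": "Possible but No HMO",
        "blank": "Blank",
    }
    return mapping[hmo_response.strip().lower()]


def collapse_row(row_percentages):
    """Collapse one row of Table 11 percentages into the Table 12 categories."""
    collapsed = {}
    for response, percent in row_percentages.items():
        category = has_insurance(response)
        collapsed[category] = round(collapsed.get(category, 0.0) + percent, 1)
    return collapsed


if __name__ == "__main__":
    # Psychiatry row of Table 11 (self-pay payees).
    print(collapse_row({"yes": 8.4, "no": 54.7, "unknown": 35.5, "blank": 1.3}))
```

Applied to the psychiatry row of Table 11, this gives the 43.9% insured figure shown for psychiatry in Table 12.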

The modified distribution gives a better idea of which patients went to the various

practices with insurance. On average, at least 24.3% of self-pay payees had insurance but

visits to psychiatrists had a much higher rate of insured self-pay patients (43.9%)

showing possibly that insurance companies tend not to cover psychiatric services and the

patients have to pick up the cost. The dermatology patients also show a

disproportionately higher number of self-pay payees but their insured percentage is not

that different from the rest of the self-payees.

4.1.5 Privately Insured

Privately Insured by Physician Specialty (Plot 10)

The privately insured went to many practice types at highly disproportionate rates.

They were the second youngest population within this study, and would be expected to

favor certain practices. They were not older and would not tend to see cardiologists for

heart disease or ophthalmologists for failing eyesight but they were young enough to be

of childbearing age and would tend to see OB/GYNs and pediatricians. The

disproportionate visits to cardiologists, OB/GYN, ophthalmologists, and pediatricians

within the privately insured population are nearly in inverse proportion to the Medicare

population’s visits for these four practices. Otolaryngology is the only

practice visited at a disproportionately higher rate by the privately insured

population without an obvious explanation. The privately insured go to otolaryngologists at a much higher rate than the

expected 51.0% of otolaryngology visits. They were responsible for 63.3% of the 1175

otolaryngology patients and the probability that the null hypothesis is true (actual number

of visits was the expected 51.0% of visits to otolaryngologists) is less than 0.0005

(χ2=71.13, p<0.0005) showing that these types of payees go to otolaryngologists at a

significantly higher rate. There were other statistically significant differences outside of the five mentioned

practices but none of the other practice types deviated from the

expected percentage by more than 10%, so they are not analyzed.

4.1.6 All Other

The last payee type analyzed is the “All Other” payee. The all other payee type

consists of the no charge, other, unknown, and blank payee types. There were too few

visits in each of the subcategories (no charge, other, unknown, and blank payee types) to

effectively analyze but the combined “All Other” payee type had a sufficient number of

visits to analyze. The all other payees were 8.2% of the population and would be

expected to be near 8.2% of patient visits to each practice type.

All Other Payees by Physician Specialty (Plot 11)

The visits by the all other payee types to each of the practice types were all within 10% of

their expected percentage of 8.2%. Only the neurology visits approached the 10%

difference threshold used as a cutoff point. By using the HMO variable, the patients with

insurance can be extracted.

Has Insurance by Physician Specialty (Table 13)

Has Insurance (Row %)

Physician Specialty            Insured   Possible but No HMO   Blank
General and family practice      59.2%                 35.0%    5.8%
Internal medicine                61.9%                 31.9%    6.2%
Pediatrics                       78.0%                 18.1%    4.0%
General surgery                  25.3%                 50.0%   24.7%
Obstetrics and gynecology        60.4%                 37.9%    1.6%
Orthopedic surgery               52.0%                 40.0%    8.0%
Cardiovascular disease           73.3%                 19.8%    7.0%
Dermatology                      44.7%                 48.5%    6.8%
Urology                          51.2%                 41.5%    7.3%
Psychiatry                       49.1%                 47.3%    3.6%
Neurology                        73.3%                 25.0%    1.7%
Ophthalmology                    63.8%                 28.9%    7.2%
Otolaryngology                   28.6%                 63.3%    8.2%
All other                        58.0%                 40.6%    1.3%
Total                            57.8%                 35.8%    6.4%

Note that 57.8% of the all other payee type may have had some insurance. The variable

“Has Insurance” was previously defined in section 4.1.4. Similarly, 73.3% of the

neurology patients may have had insurance. This is significantly higher than the overall

57.8% average for this payee type, suggesting that insurance tends not to cover neurology

visits as well as it covers visits to the other practice types. Pediatric visits by this payee type also had a

significantly higher number of insured visitors. By reviewing the ICD-9 codes, the

reasons patients went to the different practices can be determined.

Has Insurance by ICD-9 Codes/Pediatric (Table 14)

Physician Specality Pediatrics

4.5% 1.7%

.6% .6%

.6%

.6% .6%

10.7% 2.8%

16.9% 5.1% .6%

2.8% .6%

.6%

2.8% .6% .6%

.6%

.6%

2.8%

2.8% .6%

32.2% 5.1% 2.3%

ICD-9 Code Category (rows):
Infectious and parasitic diseases
Neoplasms
Endocrine, nutritional and metabolic diseases, and immunity
Diseases of the blood and blood-forming organs
Mental disorders
Diseases of the nervous system and sense organs
Diseases of the circulatory system
Diseases of the respiratory system
Diseases of the digestive system
Diseases of the genitourinary system
Complications of pregnancy, childbirth, and the puerperium
Diseases of the skin and subcutaneous tissue
Diseases of the musculoskeletal system and connective tissue
Congenital anomalies
Symptoms, signs, and ill-defined conditions
Injury and poisoning
Supplementary classification of factors influencing health s

Has Insurance (columns, Layer %): Insured, Possible but No HMO, Blank

As shown, 26.6% of the pediatric visits in the insured all other payee type were for

diagnosis V20.2 (routine infant or child health check, a subset of the 32.2% in “Supplementary

classification of factors influencing health”). This group also went for diseases of

the nervous system/sense organs (10.7%) (hearing loss/ear infections) and of the

respiratory system (16.9%) (sore throats/tonsillitis/colds).

All Pay Methods by Insurance/Pediatrics (Table 15)

Physician Specialty: Pediatrics

Primary expected source                      Has Insurance
of payment for the visit    Insured           Possible but No HMO   Blank
                            Count   Row %     Count   Row %         Count   Row %
Private Insurance             924   52.6%       822   46.8%            11     .6%
Medicare                        6   28.6%        15   71.4%
Medicaid                      137   25.8%       393   74.0%             1     .2%
Worker's Compensation
Self-pay                       16    9.7%       147   89.1%             2    1.2%
No charge                       1   25.0%         3   75.0%
Other                         122   81.9%        25   16.8%             2    1.3%
Unknown                         8   72.7%         3   27.3%
Blank                           7   53.8%         1    7.7%             5   38.5%

When looking at the expanded list of pay methods, most pediatric visits in this payee type were paid for by “Other”

means, and 81.9% of those visits were by patients who may have had insurance. This could merely be families using local government funded

health clinics for pediatric visits.

Has Insurance by ICD-9 Codes/Neurology (Table 16)

Physician Specialty: Neurology

.8% .8%

1.7%

2.5% .8%

30.8% 5.0% 1.7%

.8% 1.7%

5.8% 5.0%

.8% .8%

17.5% 3.3%

.8% 3.3%

11.7% 4.2%

ICD-9 Code Category (rows):
Infectious and parasitic diseases
Neoplasms
Endocrine, nutritional and metabolic diseases, and immunity
Diseases of the blood and blood-forming organs
Mental disorders
Diseases of the nervous system and sense organs
Diseases of the circulatory system
Diseases of the respiratory system
Diseases of the digestive system
Diseases of the genitourinary system
Complications of pregnancy, childbirth, and the puerperium
Diseases of the skin and subcutaneous tissue
Diseases of the musculoskeletal system and connective tissue
Congenital anomalies
Symptoms, signs, and ill-defined conditions
Injury and poisoning
Supplementary classification of factors influencing health s

Has Insurance (columns, Layer %): Insured, Possible but No HMO, Blank

The 30.8% of the all other visits to neurologists for diseases of the nervous system and

sense organs were not covered by insurance even though the patient probably had

insurance; 17.5% went for Symptoms, signs, and ill-defined conditions

(apnea/convulsions/nervous system injury) and 11.7% of the visits were for follow ups

and paper work.

All Pay Methods by Insurance/Neurology (Table 17)

Physician Specialty: Neurology

Primary expected source                      Has Insurance
of payment for the visit    Insured           Possible but No HMO   Blank
                            Count   Row %     Count   Row %         Count   Row %
Private Insurance             134   42.7%       180   57.3%
Medicare                       16   12.7%       109   86.5%             1     .8%
Medicaid                        3    5.1%        56   94.9%
Worker's Compensation          11   37.9%        18   62.1%
Self-pay                        6   10.9%        49   89.1%
No charge                                         2  100.0%
Other                          16   39.0%        25   61.0%
Unknown                        71   97.3%         2    2.7%
Blank                           1   25.0%         1   25.0%             2   50.0%

A majority of the neurology visits for this payee type had an unknown source of payment.

This may show a tendency for insurance not to cover neurological disorders,

leaving patients to pay for these problems themselves.

Although visits to neurologists by the all other payee type do show significance,

this payee type has the least significant difference of all types.

Each different payee type had significant disproportions in the way patients visit the

different practices. Many were expected but a few were not easily explained. The

workers compensation payees went for bumps and bruises; the Medicare population went

to practices that serve ailments in older patients. The Medicaid population is very young

and sees practices that serve children and adults of child bearing age. The privately

insured visited many practices disproportionately but most of the differences could be

attributed to the other payee types’ disproportions. Self-Pay tended to pay for psychiatry

and dermatology visits at a significantly higher rate showing that these practice visits are

not covered as well as the other practices by insurance. The “All Other” payees went to

neurologists at a significantly higher rate leaving a majority of them with an unknown

way of paying for these services.

4.2 HMOs

What do HMOs pay for? Who are members of HMOs? Are the practices visited

significantly different when compared to the non-HMO population? These are all

questions that can be answered by an analysis of this dataset.

Initially there were four different types of responses to the question of whether or

not the patient was a member of an HMO (yes, no, unknown, and left blank).

HMO Membership (Table 18)

Does the patient belong
to an HMO?                  Count    Col %
yes                          6187    25.1%
no                          15853    64.4%
unknown                      2242     9.1%
blank                         328     1.3%

By making an assumption that the unknown and blank responses are proportionately

distributed through the yes and no responses and removing them, the real proportion of

HMO membership may be determined.

Adjusted HMO Membership (Table 19)

Does the patient belong
to an HMO?                  Count    Col %
yes                          6187    28.1%
no                          15853    71.9%

By using the adjusted figures, 28.1% of this population is aware that they are members

and 71.9% is aware that they are not.
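The adjustment can be checked with a few lines of arithmetic. The sketch below is only an illustration of the step described above: it uses the counts from Table 18, and the function name adjusted_shares is a hypothetical helper, not part of the original analysis.

```python
# Counts taken from Table 18 (HMO Membership).
counts = {"yes": 6187, "no": 15853, "unknown": 2242, "blank": 328}


def adjusted_shares(counts, keep=("yes", "no")):
    """Drop the unknown/blank responses and re-percentage the remaining ones."""
    kept_total = sum(counts[k] for k in keep)
    return {k: round(100 * counts[k] / kept_total, 1) for k in keep}


adjusted = adjusted_shares(counts)
print(adjusted)  # {'yes': 28.1, 'no': 71.9}
```

Dropping the two categories is equivalent to spreading them proportionately over the yes and no responses, which is why the percentages change while the counts do not.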

Who are members of HMOs? The younger patients (under 65 years) were more apt to

be members of HMOs than patients in the oldest two age groups (65-74 years and 75+).

HMO Membership Percent by Age (Plot 12)

[Plot: % of HMO Membership (scale 0.00 to 45.00) by age group: Under 15 years, 15-24 years, 25-44 years, 45-64 years, 65-74 years, 75 years and over]

In all, 28.1% of patients were members of an HMO but three of the age groups

significantly deviated from the expected proportion.

HMO Membership by Age Group (Plot 13)

The membership in the two older age groups is significantly lower than for other groups

(patients 75 years and over: 13.7% actual vs. 28.1% expected, χ2=281.11, p<0.0005)

(patients 65-74 years: 17.5% actual vs. 28.1% expected, χ2=167.34, p<0.0005). The

youngest age group had a significantly higher rate of membership in HMOs (38.9%

actual vs. 28.1% expected, χ2=214.07, p<0.0005). A majority of the older two age groups

are eligible for Medicare.
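The thesis does not spell out how each χ2 value was computed. One plausible reading, shown here as a minimal sketch, is a one-sample chi-square test with one degree of freedom that compares the observed member and non-member counts in a group against the counts expected at the 28.1% baseline. The function name and the example counts (137 members in 1,000 visits) are hypothetical and are not taken from the dataset.

```python
import math


def chi_square_vs_expected(members, visits, expected_rate):
    """One-sample chi-square test (1 degree of freedom) of an observed
    membership count against an expected membership rate."""
    expected_members = visits * expected_rate
    expected_non = visits * (1 - expected_rate)
    observed_non = visits - members
    chi2 = ((members - expected_members) ** 2 / expected_members
            + (observed_non - expected_non) ** 2 / expected_non)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical group: 137 members among 1,000 visits, tested against 28.1%.
chi2, p = chi_square_vs_expected(members=137, visits=1000, expected_rate=0.281)
print(f"chi2 = {chi2:.2f}, p < 0.0005: {p < 0.0005}")
```

Because the statistic grows with the number of visits, the same percentage gap produces a much larger χ2 in a large group than in a small one.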

HMO Membership by Payee Type (Plot 14)

Surprisingly, the Medicare population does not have the lowest rate of HMO membership

(6.7% actual vs. 28.1% expected, χ2=1133.35, p<0.0005). People who had to self pay had

the lowest membership rate (5.5% actual vs. 28.1% expected, χ2=435.58, p<0.0005). This

could be for a variety of reasons: uninsured self pay patients do not have insurance and

would not have an HMO membership, or if the patient was a member and the visit was

not covered, they may not have marked being an HMO member. The significantly lower

rate of workers compensation (11.7% actual vs. 28.1% expected, χ2=44.46, p<0.0005)

visits may be due to workers compensation paying for the visit and not the patients’

private insurance. Therefore the patients may have not marked HMO coverage even if

they were members. The Medicaid population is the youngest population and should

follow the younger age groups’ higher level of HMO membership but their membership

rate is significantly lower than expected (14.6% actual vs. 28.1% expected, χ2=164.62,

p<0.0005). A reason for the low rate could be that Medicaid programs are state run and

only some of the states have HMO options. Also, these data were collected in 1997 when

the concept of Medicaid HMOs was not widely implemented. The significantly higher

rate of privately insured (39.9% actual vs. 28.1% expected, χ2=7692.22, p<0.0005) is

partially explained by the lack of HMO coverage in the state run programs, reducing

the expected average. The higher rate does show that the privately insured are much more

likely to have been members of an HMO than any other insured type. The all other

payees show the greatest deviation from the expected member rate. They have a

significantly higher rate of HMO membership (52.9% actual vs. 28.1% expected,

χ2=469.71, p<0.0005). Many of the reasons why people were in this group were explored

in the previous section (public funded family clinics, neurology visits uncovered).

What do HMOs pay for?

HMO Membership by Physician Specialty (Plot 15)

Plot 15 shows the rates of HMO membership for each physician type. 43.6% of visits to

pediatricians are by HMO members. This is significantly higher than the expected rate for

pediatric visits (43.6% actual vs. 28.1% expected, χ2=297.16, p<0.0005). The pediatric

patients are young and would be expected to follow the higher rate of membership of the

younger age groups but the higher rate cannot be completely explained by this. Many of

the lower than expected rates can be attributed to age group preferences such as the older

age groups’ preferred practice types (urology, cardiology, and ophthalmology) with

their lower rate of membership. OB/GYN visits are mainly for a younger population and

that rate would be expected to be higher. There are other differences but most of them

can be correlated to age group preferences.

Does any race favor HMOs? When each race is compared to the 28.1% baseline, the

Asian/Pacific Islander population has a significantly higher rate of HMO membership.

HMO Membership by Race (Plot 16)

The Asian/Pacific Islander age is not significantly different from the other races so age

cannot explain the higher HMO rate. Another factor such as location or culture may play

a role.

Age Statistics by Race (Table 20)

AGE

RACE                       Mean       N    Std. Deviation
White                     44.42   19186        25.08
Black                     39.58    2154        24.42
Asian/Pacific Islander    41.62     700        23.91
Total                     43.86   22040        25.03

Distribution of Asian/Pacific Islander Age (Plot 17)

[Plot: distribution of AGE; age scale 0.0 to 90.0 in steps of 5.0; frequency scale 0 to 60]

The age distribution of the Asian/Pacific Islander population is not skewed toward the younger ages.

Membership in an HMO differed greatly when looking at the six age groups and

three races. The older the patient, the less likely they were to be a member of an HMO.

Most of this is due to the Medicare population’s lack of HMO membership. The youngest

age group was most likely to be a member of an HMO. If the patient is an Asian/Pacific

Islander, they are more likely to be an HMO member than a member of another race.

Privately Insured and “All Other” payees had the greatest membership and Medicare,

Medicaid, self-pay, and workers compensation had significantly lower than average

membership rates. Different practices were disproportionately visited by HMO members

at significant rates. Much of this is due to the type of practice and patients’ ages. Practices

that see predominately older patients will have a lower rate of HMO members.

Conversely, practices that see predominately younger patients will have a higher rate of

HMO members.

4.3 Modeling

Data mining can be used to model the data using a variety of

techniques: Neural Networks, Genetic Algorithms, Decision Trees, Regression Analysis,

and Factor Analysis. Each of these models works best with particular data types for input and

output. Regression analysis and genetic algorithms use numeric data input variables to

create a function that can optimize model generation of a numeric data output variable.

Supervised neural networks create a model from a variety of data types to predict an

output variable. Unsupervised neural networks (Kohonen) have no output variable but

generate dynamic data clusters through means of node competition. Factor analyses

generate either variable or data clusters. Decision trees and rule sets are used to predict

categorical data from a variety of input types. When any of these models are created, they

should be created with a subset of the data and validated with a second subset of the data.

Models created with an entire dataset tend to be more complicated and, without external

data, can never be proven valid.
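The split into a modeling subset and a validation subset can be sketched as follows. This is an illustration only: the 70/30 proportion, the seed, and the name holdout_split are assumptions, and the list of record numbers stands in for the actual visit records.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into two disjoint subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the visit records: record numbers only.
visits = list(range(1000))
training, validation = holdout_split(visits)
print(len(training), len(validation))  # 700 300
```

A model is then built on the first subset only, and its accuracy is reported on the second subset, which it has never seen.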

Each of these models can be used independently or in conjunction with one

another. For instance, supervised neural networks generate a black box that accepts inputs

and generates output but does not give any function or rules to understand how the

processing occurred. However, neural networks do give a sensitivity analysis of the

variables in the network. Variables that had little effect can be removed leaving variables

with greater effect on the outcome. The variables with greater effect can be used as a

refined starting point in other models. Factor analysis yields clusters of associated

variables. Variables that are clustered together are interrelated and would tend to be the

best inputs to predict variables within the same group. In this section a few different

techniques will be used to create models for AGE GROUP (categories of age) and

PREGNANCY (yes or no).

4.3.1 Age Group Models

SPSS Clementine 5.2 is the primary package used in this analysis. It provides all

of the previously mentioned modeling techniques. The dataset initially had 224 variables

but was reduced to 33 variables (31 categorical and 2 numeric). Genetic algorithms and

regression analysis will not work well with categorical data and will not be used.

Supervised neural networks can help with a sensitivity analysis of the variables used in

the network to find which variables have the greatest impact on AGE GROUP. The

process using neural networks first requires that the data be filtered. The next steps are to define the

input/output variables, sample from the data, and generate a neural network with a

sensitivity analysis. The variables with the greatest influence on AGE GROUP can then be

determined.
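How Clementine computes its sensitivity analysis is not described here, so the following sketch uses a different, commonly used stand-in: permutation sensitivity, where each input column is shuffled in turn and the loss of accuracy is recorded. The toy model, the data, and all names are hypothetical; the point is only to show how an uninformative input ends up with a score of zero.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_sensitivity(model, rows, labels, seed=0):
    """Accuracy lost when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for col in range(len(rows[0])):
        shuffled_col = [row[col] for row in rows]
        rng.shuffle(shuffled_col)
        permuted = [row[:col] + (value,) + row[col + 1:]
                    for row, value in zip(rows, shuffled_col)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return drops


# Hypothetical toy data: the label depends on column 0 only.
data_rng = random.Random(1)
rows = [(data_rng.randint(0, 1), data_rng.randint(0, 1)) for _ in range(200)]
labels = [row[0] for row in rows]
toy_model = lambda row: row[0]  # stands in for a trained network

drops = permutation_sensitivity(toy_model, rows, labels)
print(drops)  # column 0 shows a drop in accuracy; column 1 shows none
```

Inputs whose shuffling costs little or no accuracy are candidates for removal before the model is refined.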

Clementine Code (Image 6)

Neural Network for Age Group Model Output (Image 7)

Image 7 is a screen dump of a neural network sensitivity analysis. This shows the

predicted accuracy, some of the structure of the network, and a score of the relative

importance of a variable. The relative importance is a score from 0.0 (low importance) to

1.0 (high importance) of a variable’s importance in the network. AGE, PHYSICIAN

SPECIALTY, PRIMARY PAYMENT, MAJOR REASON, PAYEE TYPE, and DAY all

had a relative importance score of 0.10 or more. Refining the network to use only these

variables should improve the model.

Refined Neural Network for Age Group Model Output (Image 8)

The model’s accuracy did improve but this is not unexpected. The AGE GROUP variable

is based on age and AGE is the variable used to determine which age group they are in. If

all variables except age are removed, the network is 100% accurate. It is not possible to

find an accurate rule using the remaining variables excluding AGE. As shown in Image

9, the model was very inaccurate.

Neural Network for Age Group Model Output (Image 9)

When using only the C5 modeling (a rule set model) and only the higher level variables

in the structure, no model over 52% correct was found. Even when seemingly age-related

variables were included (such as pregnant, payee type, and practice type) no accurate

models were found.

4.3.2 Modeling Classification of Pregnant

A more accurate model to determine if someone is pregnant can be generated by

using a C5 model with SEX, PAY METHOD, REASON FOR VISIT, TIME SPENT

WITH PHYSICIAN, and AGE GROUP.

C5 Model for Pregnant Output (Image 10)

This model was only 89.87% accurate. Other variables could be added to make the model

more accurate but this adds complexity to the model. A balance between complexity and

accuracy is determined by the researcher but adding complexity for limited results may

not be justifiable. The following is a rule set generated by a C5 model to predict

pregnancy.

Rule set for Pregnant:

Default: -> No
if Male: -> Blank
if Female:
    Rules for Unknown:
        if major reason for visit == blank/unknown
        if major reason for visit == Non-illness care and Age Group == Under 15 years
            and time spent with physician =< 2
    Rules for Yes:
        major reason for visit == Non-illness care and:
            if Age Group == 15-24 years and time spent with physician =< 14
                or (time spent with physician > 14 and Primary expected source of payment for the visit == Self-pay)
            if Age Group == 25-44 years and Primary expected source of payment for the visit == Blank
                or Primary expected source of payment for the visit == Medicaid
                or Primary expected source of payment for the visit == [Medicare Worker's Compensation]
                or (Primary expected source of payment for the visit == Other and time spent with physician =< 25)
                or (Primary expected source of payment for the visit == Private Insurance and time spent with physician > 2 and time spent with physician =< 12)
                or (Primary expected source of payment for the visit == Private Insurance and time spent with physician > 50)

If the patient was female, of child bearing age and visiting for Non-Illness care then she

was probably pregnant. The different payee types show up deep in the structure and

when removed, only slightly lessen the accuracy of the model (Image 11,12).
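The plain-language reading of the rule set can be written as a small function. This is a deliberately simplified sketch, not the C5 output: it ignores the Unknown rules and the payment and time conditions, and it assumes that "child bearing age" means the 15-24 and 25-44 age groups that appear in the rules for Yes. The function and constant names are hypothetical.

```python
# Assumption: the two age groups named in the "Rules for Yes" above.
CHILD_BEARING_AGE_GROUPS = {"15-24 years", "25-44 years"}


def predict_pregnant(sex, age_group, major_reason):
    """Simplified reading of the rule set: returns 'Blank', 'Yes', or 'No'."""
    if sex == "Male":
        return "Blank"
    if major_reason == "Non-illness care" and age_group in CHILD_BEARING_AGE_GROUPS:
        return "Yes"
    return "No"


print(predict_pregnant("Female", "25-44 years", "Non-illness care"))  # Yes
print(predict_pregnant("Male", "25-44 years", "Acute problem"))       # Blank
```

The function covers only the part of the structure that does not depend on payee type or time spent with the physician.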

Refined C5 Model for Pregnant Output (Image 11)

Refined Rule Set for Pregnant Model Rule Set (Image 12)

This model defined above also shows if the patient was female, of child bearing age and

visiting for Non-Illness care then she was probably pregnant. Time spent with a

physician also had some influence on the model. Removing the TIME SPENT WITH

PHYSICIAN also slightly degrades the model’s accuracy but simplifies the model

(Image 13,14).

Refined C5 Model (2) for Pregnant Rule Set (Image 13)

Refined Rule Set (2) for Pregnant Model Rule Set (Image 14)

Variables that add little to the outcomes should be discarded. Additional variables will

always add marginally to the results.

There are a variety of types of modeling techniques. Some techniques require

certain types of data input and others are less constraining. Some techniques will generate

functions or rule sets as output where others create unreadable neural networks. Other

techniques generate variable or data clusters. These techniques can be used in

conjunction with each other to refine models. Only a portion of data should be used to

create the model, leaving the remainder of the dataset to validate the model. Using these techniques

to predict whether someone was pregnant or what age group they were in yielded mixed

results. The model to determine if a patient was pregnant had a ~90% accuracy but the

model to determine which age group the patient was in, once the AGE variable was

removed, never had an accuracy over 52%. The AGE GROUP variable had six strata that

spread into two or more strata (per age group) in other variables so no other categorical

variable could substantively improve the model. The PREGNANCY variable only had

four strata. The male population always left the answer blank so three strata remained.

Assigning BLANK to male responses automatically gave the model ~50% accuracy.

Pregnancy is considered a non-illness and usually is within a certain patient age range.

By using variables associated with sex, age, and non-illness, the model generated became

highly accurate. If such associated variables are not apparent, a neural network sensitivity

analysis can score the relative importance of each variable, or variable

clustering techniques can find associated variables. This may be the best course of action

to find an initial set of inputs for a model.

CHAPTER V

CONCLUSIONS

Data mining is a process that, until recent times, was not feasible. Analyzing large

datasets was too time consuming and too apt to have some human computational error in

the analysis. With the creation of the modern computer and mass storage, the analysis of

large datasets has become a less tedious task with less chance for computational error.

What may have taken months to compute before now only takes a few minutes. Once the

data are imported into a statistical package, a variety of analyses can be done in a limited

amount of time. Researchers can test and refine the focus of their analyses in minutes.

The modern computer also allowed for much data to be stored in digital formats which

are easily distributed. Although there are a lot of data on paper, most modern data are

stored in some digital format. Government institutions have stored data on a variety of

topics for years. Corporations have stored internal and customer data. Educational and

research facilities also have stored data. Most of these data are available to the public but

corporations and other entities tend to keep their data to themselves. Much publicly

available data can be found on the Internet at numerous data warehouses.

Data mining is not easily defined. It is a process of acquiring, importing, cleaning,

and analyzing large datasets. Acquiring data can be from a researcher’s own collection

mechanism or data from an outside source. Importing data is a process of transforming

the data into a format accepted by a statistical package. Cleaning data is a process of

removing bad data or correcting existing data. The analyses of datasets depend upon the

types of data. Certain techniques require nominal data, some require numeric data, and

other techniques can use a mixture of data types. These techniques can be used alone or

in conjunction with one another. In chapter 4, a sensitivity analysis from a neural network

was used to refine the variables for a C5 model. A sensitivity analysis from a neural

network could also be used to refine a variable list for a regression analysis. Variable

clusters can be used in a similar manner. SPSS and SAS both have statistical packages

that have many techniques to analyze and present results in an informative manner. Some

packages are more industry specific. There are specialized packages designed for

analyzing web practices and designs, customer patterns, patterns in the stock market, and

other industry specific data. These packages may not always present data in a manner the

researcher desires and a program may need to be written to analyze the data in a different

manner. White and Tufte’s books give guidelines on effective presentation of information

(data visualization) that any statistical program should adhere to. Data visualization gives

statistical information on large datasets with a mixture of graphic and text elements in a

coherent manner. No matter what statistical package or program is used, presenting

results in a manner that is easily digestible is a must.

In this thesis data mining was used in many ways. Various techniques were used

to acquire the dataset. SPSS 10 was used in importing and cleaning the data. SPSS 10,

SPSS Clementine, and SAS 8 were all used in exploring the data. A program was

developed using visualization techniques to further analyze and present results in a

different manner. The developed program used an expected percentage of a stratum

across actual values of strata of another variable to show significant deviations from the

expected values. This program was used to show how different payee types

disproportionately went to different practices and also examined HMO membership.

Other analyses were done using modeling techniques. A model was developed to predict

which age group a patient belonged to with little success. Initially a neural net was used

to find the input variables for a C5 model. The initial variables selected yielded very

good results but when the AGE variable was removed from the equation, the model

degraded. A C5 model was also created to determine pregnancy with better results. The

model for pregnancy was refined by using the results from previous C5 models to

generate an accurate and simple rule set (~90% accurate).

Data mining is a recent concept in data analysis. As more and different types of

data are collected, newer forms of data analysis techniques and associated programs will

be created or refined. As computers on the Internet are used in parallel processing,

statistical programs can do even more complicated analysis (such as SETI@HOME,

FOLDING@HOME). As computers and parallel processing develop, data mining will

also develop in parallel, becoming a more effective way of analysis in the future.

References:

Gehan, Edmund A., Ph.D., and Lemak, Noreen A., M.D. Statistics in Medical Research, Developments in Clinical Trials. Plenum Publishing Corporation, c1994.

Knowledge Discovery Nuggets. http://www.kdnuggets.com/

Microsoft Corp. One Microsoft Way, Redmond, WA 98052-6399. http://www.microsoft.com/

National Ambulatory Medical Care Survey. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics, Division of Data Services, Hyattsville, MD 20782-2003. (301) 458-4636. http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm

SAS Institute Inc. SAS Campus Drive, Cary, NC 27513-2414. http://www.sas.com/

SPSS Inc. 233 S. Wacker Drive, 11th floor, Chicago, Illinois 60606. http://www.spss.com/

Sun Microsystems, Inc. 901 San Antonio Road, Palo Alto, CA 94303 USA. http://www.sun.com/

Tufte, Edward R., 1942-. Visual explanations: images and quantities, evidence and narrative. Cheshire, Conn.: Graphics Press, c1997.

Westphal, Christopher, and Blaxton, Theresa. Data Mining Solutions: Methods and Tools for Solving Real-World Problems. John Wiley & Sons, Inc., c1998.

White, Jan V., 1928-. Using charts and graphs. R. R. Bowker Company, c1984.

Appendix – A (Variable List from NAMCS Files)

This section consists of a detailed breakdown of each data record. For each item on the record, the user is provided with a sequential item number, field length, file location, and brief description of the item, along with valid codes. Unless otherwise stated in the "item description" column, the data are derived from the Patient Record form. The American Medical Association (AMA), the American Osteopathic Association (AOA) and the induction interview (reference 3) are alternate sources of data, while the computer generates other items by recoding selected data items.

ITEM   FIELD    FILE
NO.    LENGTH   LOCATION   ITEM DESCRIPTION AND CODES
-----------------------------------------------------------------------------
1                          DATE OF VISIT

1.1    2        1-2        MONTH OF VISIT
                           01-12: January-December

1.2    4        3-6        YEAR OF VISIT
                           1996 or 1997*

1.3    1        7          DAY OF WEEK OF VISIT
                           1=Sunday  2=Monday  3=Tuesday  4=Wednesday
                           5=Thursday  6=Friday  7=Saturday

2      3        8-10       PATIENT AGE (IN YEARS; DERIVED FROM DATE OF BIRTH)
                           000-999
                           100 = 100 years and over

3      1        11         SEX
                           1 = Female
                           2 = Male

4      1        12         IS PATIENT PREGNANT?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank/Not applicable

* Survey dates for the 1997 NAMCS were Dec. 30, 1996 through Dec. 28, 1997.
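Because each item has a fixed file location, a record can be read by slicing character positions. The sketch below covers only items 1.1 through 4 as listed above; the field names, the function name, and the sample record are hypothetical, and the real file has many more columns.

```python
# (name, first column, last column), 1-based and inclusive, as listed above.
FIELDS = [
    ("month", 1, 2),
    ("year", 3, 6),
    ("day_of_week", 7, 7),
    ("age", 8, 10),
    ("sex", 11, 11),
    ("pregnant", 12, 12),
]


def parse_record(line):
    """Slice the leading columns of one fixed-width record into integers."""
    return {name: int(line[start - 1:end]) for name, start, end in FIELDS}


# Hypothetical record: June 1997, Monday, age 34, female, not pregnant.
record = parse_record("061997203412")
print(record)
```

The same pattern extends to the remaining items by adding their names and file locations to the FIELDS list.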

5      1        13         RACE
                           1 = White
                           2 = Black
                           3 = Asian/Pacific Islander
                           4 = American Indian/Eskimo/Aleut

6      1        14         ETHNICITY
                           1 = Hispanic origin
                           2 = Not Hispanic
                           3 = Blank

7      1        15         WAS PATIENT REFERRED BY ANOTHER PHYSICIAN?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank

8      1        16         WAS AUTHORIZATION REQUIRED FOR CARE?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank

9      1        17         ARE YOU THE PATIENT'S PRIMARY CARE PHYSICIAN?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank

10     1        18         PRIMARY EXPECTED SOURCE OF PAYMENT FOR THIS VISIT
                           1 = Private Insurance
                           2 = Medicare
                           3 = Medicaid
                           4 = Worker's Compensation
                           5 = Self-pay
                           6 = No charge
                           7 = Other
                           8 = Unknown
                           9 = Blank

11     1        19         DOES THIS PATIENT BELONG TO AN HMO?
                           (Health Maintenance Organization)
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank

12     1        20         IS THIS A CAPITATED VISIT?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Blank

13     1        21         HAVE YOU OR ANYONE IN YOUR PRACTICE/DEPARTMENT
                           SEEN PATIENT BEFORE?
                           1 = Yes, established patient
                           2 = No, new patient
                           3 = Blank

14                         PATIENT'S REASON(S) FOR VISIT (See page 9 in
                           "Description of the NAMCS" and Reason for Visit
                           Classification)

14.1   5        22-26      REASON #1
                           10050-89990 = 1005.0-8999.0
                           90000 = Blank

14.2   5        27-31      REASON #2
                           10050-89990 = 1005.0-8999.0
                           90000 = Blank

14.3   5        32-36      REASON #3
                           10050-89990 = 1005.0-8999.0
                           90000 = Blank

15     1        37         MAJOR REASON FOR THIS VISIT
                           1 = Acute problem
                           2 = Chronic problem, routine
                           3 = Chronic problem, flareup
                           4 = Pre- or post surgery/injury follow up
                           5 = Non-illness care (e.g. routine prenatal,
                               general exam., well baby)
                           6 = Blank or unknown

16     1        38         IS THIS VISIT RELATED TO INJURY OR POISONING?
                           0 = No
                           1 = Yes

17     1        39         PLACE OF OCCURRENCE OF INJURY
                           1 = Residence
                           2 = Recreation/Sports Area
                           3 = Street/Highway
                           4 = School
                           5 = Other public building
                           6 = Industrial places
                           7 = Other *
                           8 = Unknown
                           9 = Not applicable (not an injury visit)

18     1        40         IS THIS INJURY INTENTIONAL?
                           1 = Yes (self-inflicted)
                           2 = Yes (assault)
                           3 = No, unintentional
                           4 = Unknown
                           5 = Not applicable (not an injury visit)

19     1        41         IS THIS INJURY WORK RELATED?
                           1 = Yes
                           2 = No
                           3 = Unknown
                           4 = Not applicable (not an injury visit)

20                         CAUSE OF INJURY (See p. 9 in "Description of the
                           National Ambulatory Medical Care Survey" for
                           explanation of codes.)

20.1   4        42-45      CAUSE #1 (ICD-9-CM, E-Codes)
                           There is an implied decimal between the third and
                           fourth digits; for inapplicable fourth digits, a
                           dash is inserted. A prefix 'E' is implied.
                           8000-999[-] = E800.0-E999
                           0000 = Not applicable/Blank

20.2   4        46-49      CAUSE #2 (ICD-9-CM, E-Codes)
                           There is an implied decimal between the third and
                           fourth digits; for inapplicable fourth digits, a
                           dash is inserted. A prefix 'E' is implied.
                           8000-999[-] = E800.0-E999
                           0000 = Not applicable/Blank

* Due to a data processing problem, responses of "other" place of occurrence of injury were changed to "unknown" for the 1997 NAMCS.

20.3   4        50-53      CAUSE #3 (ICD-9-CM, E-Codes)
                           There is an implied decimal between the third and
                           fourth digits; for inapplicable fourth digits, a
                           dash is inserted. A prefix 'E' is implied.
                           8000-999[-] = E800.0-E999
                           0000 = Not applicable/Blank

21     100      54-153     CAUSE OF INJURY - VERBATIM TEXT
                           Description of events that preceded the injury.
                           Some entries contain the acronym 'MVA.'
                           MVA=motor vehicle accident.

NOTES ON USING THE CAUSE OF INJURY VERBATIM TEXT DATA

In previous survey years, the cause of injury was converted to an external cause of injury code (E-code) by NCHS medical coders. In 1997, the actual verbatim text has been included on the public use file in addition to the E-code. The inclusion of the verbatim text is meant to assist data users in two major ways. First, the verbatim text can be used by researchers to assign records to injury classification schemes other than the "Supplementary Classification of External Causes of Injury and Poisoning" found in the ICD-9-CM, if so desired. Second, users can search for key text words (for example, swimming pool) to identify diverse causes of injury. It should be noted that, in an effort to preserve confidentiality, all geographic names, personal names, commercial names, and specific dates of injury have been stripped from the verbatim text.

It is important to remember, however, that because of their very specific nature, exact verbatim text strings will not translate into national estimates and should not be used as such. In general, we consider any estimate based on fewer than 30 occurrences in the data to be unreliable. Therefore, a single record showing the specific cause of injury of "tripped over a student's backpack in her classroom and fell on left knee" should not be weighted to produce a national estimate. If, however, a researcher is able to identify 30 or more records where the verbatim text involves a "backpack"-related injury, it might then be possible to sum the patient visit weights for these records to generate a national estimate related to a broader group of visits for backpack-related injuries. The reliability of such an estimate would still depend upon the associated relative standard error.
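The 30-record guideline above can be sketched in Python as follows. The input is assumed to be a list of (verbatim text, patient visit weight) pairs already extracted from the file; the names `keyword_estimate` and `MIN_RECORDS` are hypothetical. The sketch only applies the record-count rule and does not compute the relative standard error, which would still have to be checked separately.

```python
MIN_RECORDS = 30  # fewer matching records than this is considered unreliable


def keyword_estimate(records, keyword):
    """Sum visit weights over records whose verbatim text mentions keyword.

    records: iterable of (verbatim_text, visit_weight) pairs.
    Returns None when fewer than MIN_RECORDS records match.
    """
    key = keyword.lower()
    weights = [w for text, w in records if key in text.lower()]
    if len(weights) < MIN_RECORDS:
        return None
    return sum(weights)
```

For example, `keyword_estimate(records, "backpack")` would return a summed weight only if at least 30 sample records mention a backpack.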

22 PHYSICIAN'S DIAGNOSES (See page 9 in "Description of the National Ambulatory Medical Care Survey" for explanation of coding.)

22.1 5 154-158 DIAGNOSIS #1 (ICD-9-CM) There is an implied decimal between the third and fourth digits; for inapplicable fourth or fifth digits, a dash is inserted. 0010[-] - V829[-] = 001.0[0] - V82.9[0] V990- = Noncodable, insufficient information for coding, illegible V991- = Left before being seen; patient walked out; not seen by doctor; left against medical advice V992- = Transferred to another facility; sent to see specialist V997- = Entry of "none," "no diagnosis," "no disease," or "healthy" 00000 = Blank

22.2 5 159-163 DIAGNOSIS #2 (ICD-9-CM) There is an implied decimal between the third and fourth digits; for inapplicable fourth or fifth digits, a dash is inserted. 0010[-] - V829[-] = 001.0[0] - V82.9[0] V990- = Noncodable, insufficient information for coding, illegible V991- = Left before being seen; patient walked out; not seen by doctor; left against medical advice V992- = Transferred to another facility; sent to see specialist V997- = Entry of "none," "no diagnosis," "no disease," or "healthy" 00000 = Blank

22.3 5 164-168 DIAGNOSIS #3 (ICD-9-CM) There is an implied decimal between the third and fourth digits; for inapplicable fourth or fifth digits, a dash is inserted. 0010[-] - V829[-] = 001.0[0] - V82.9[0] V990- = Noncodable, insufficient information for coding, illegible V991- = Left before being seen; patient walked out; not seen by doctor; left against medical advice V992- = Transferred to another facility; sent to see specialist V997- = Entry of "none," "no diagnosis," "no disease," or "healthy" 00000 = Blank

23 PROBABLE, QUESTIONABLE, AND RULEOUT DIAGNOSES

23.1 1 169 IS DIAGNOSIS #1 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable

23.2 1 170 IS DIAGNOSIS #2 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable

23.3 1 171 IS DIAGNOSIS #3 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable
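The diagnosis fields in items 22.1 through 22.3 can be decoded along the following lines. This is a minimal Python sketch that assumes each field has been read as a 5-character string; `decode_diagnosis` and the shorthand labels in `SPECIAL` are hypothetical names chosen for illustration.

```python
# Shorthand labels for the administrative codes listed in items 22.1-22.3.
SPECIAL = {
    "V990-": "noncodable",
    "V991-": "left before being seen",
    "V992-": "transferred",
    "V997-": "no diagnosis",
}


def decode_diagnosis(raw):
    """Convert a 5-character diagnosis field to a printable ICD-9-CM code."""
    if raw == "00000":              # 00000 = Blank
        return None
    if raw in SPECIAL:
        return SPECIAL[raw]
    digits = raw.rstrip("-")        # drop dashes for inapplicable digits
    if len(digits) <= 3:            # three-character category, no decimal
        return digits
    return digits[:3] + "." + digits[3:]


print(decode_diagnosis("0010-"))    # 001.0
print(decode_diagnosis("V829-"))    # V82.9
print(decode_diagnosis("25000"))    # 250.00
```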

24 DIAGNOSTIC/SCREENING SERVICES

24.1 1 172 Were any diagnostic/screening services ordered or provided at this visit? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)

EXAMINATIONS

0 = No, 1 = Yes

24.2 1 173 Breast
24.3 1 174 Pelvic
24.4 1 175 Rectal
24.5 1 176 Skin
24.6 1 177 Visual acuity
24.7 1 178 Glaucoma
24.8 1 179 Hearing

TESTS

0 = No, 1 = Yes

24.9 1 180 Blood pressure
24.10 1 181 Strep test
24.11 1 182 Pap test
24.12 1 183 Urinalysis
24.13 1 184 Pregnancy test
24.14 1 185 PSA
24.15 1 186 Blood lead level
24.16 1 187 Cholesterol measure
24.17 1 188 HIV serology
24.18 1 189 Other STD test
24.19 1 190 Hematocrit/hemoglobin
24.20 1 191 Other blood test
24.21 1 192 EKG

IMAGING 0 = No, 1 = Yes

24.22 1 193 X-Ray
24.23 1 194 CT Scan/MRI
24.24 1 195 Mammography
24.25 1 196 Ultrasound

24.26 1 197 ALL OTHER DIAGNOSTIC/SCREENING SERVICES 0 = No, 1 = Yes

24.27 4 198-201 OTHER DIAGNOSTIC/SCREENING SERVICE #1 (ICD-9-CM, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0010 = Item 17, box 26 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank

24.28 4 202-205 OTHER DIAGNOSTIC/SCREENING SERVICE #2 (ICD-9-CM, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0010 = Item 17, box 26 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank

24.29 2 206-207 TOTAL NUMBER OF CHECKBOX AND WRITE-IN DIAGNOSTIC/SCREENING SERVICES ORDERED OR PROVIDED 00-26 99 = All boxes blank, including 'None.'

25 THERAPEUTIC AND PREVENTIVE SERVICES

25.1 1 208 Were therapeutic or preventive services ordered or provided? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)

COUNSELING/EDUCATION

0 = No, 1 = Yes

25.2 1 209 Diet/nutrition
25.3 1 210 Exercise
25.4 1 211 HIV/STD transmission
25.5 1 212 Family planning/contraception
25.6 1 213 Prenatal instructions
25.7 1 214 Breast self-exam
25.8 1 215 Tobacco use/exposure
25.9 1 216 Growth/development
25.10 1 217 Mental Health
25.11 1 218 Stress management
25.12 1 219 Skin cancer prevention
25.13 1 220 Injury prevention

OTHER THERAPY

0 = No, 1 = Yes

25.14 1 221 Psychotherapy
25.15 1 222 Psychopharmacotherapy
25.16 1 223 Physiotherapy
25.17 1 224 All other therapeutic and preventive services

25.18 4 225-228 OTHER THERAPEUTIC/PREVENTIVE SERVICE #1 (ICD-9-CM Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0010 = Item 18, box 17 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank

25.19 4 229-232 OTHER THERAPEUTIC/PREVENTIVE SERVICE #2 (ICD-9-CM Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0010 = Item 18, box 17 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank

25.20 2 233-234 Total number of checkbox and write-in therapeutic or preventive services ordered or provided 00-17 99 = All boxes blank, including 'None.'

26 AMBULATORY SURGICAL PROCEDURES

26.1 1 235 Were any ambulatory surgical procedures performed at this visit?

0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box.)

NOTE: Because some survey respondents reported ambulatory surgical procedures in the open-ended response categories of the diagnostic and screening services item (item 17) and the therapeutic and preventive services item (item 18) (and vice versa), it is recommended that any analysis of procedures take into account all of the open-ended response categories from all of these items.

26.2 4 236-239 AMBULATORY SURGICAL PROCEDURE #1 (ICD-9-CM, Vol. 3, Procedures, see page 10 in "Description of the National Ambulatory Medical Care Survey" for explanation of codes.)

A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0000 = Not applicable/Blank

26.3 4 240-243 AMBULATORY SURGICAL PROCEDURE #2 (ICD-9-CM, Vol. 3, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.

0101-9999 = 01.01-99.99 0000 = Not applicable/Blank
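The recommendation in the NOTE under item 26.1, that an analysis of procedures draw on all of the open-ended fields, might be followed with a sketch like the one below. It assumes each record is a single string laid out as described in this appendix; the dictionary keys and function names are hypothetical. The six locations are those given for items 24.27, 24.28, 25.18, 25.19, 26.2, and 26.3.

```python
# 1-based, inclusive file locations of the six open-ended procedure fields.
PROCEDURE_FIELDS = {
    "diag_other_1": (198, 201),
    "diag_other_2": (202, 205),
    "ther_other_1": (225, 228),
    "ther_other_2": (229, 232),
    "surgery_1": (236, 239),
    "surgery_2": (240, 243),
}
NOT_A_CODE = {"0000", "0010"}  # blank, or box checked with no write-in entry


def decode_procedure(raw):
    """Insert the implied decimal after the first two digits."""
    if raw in NOT_A_CODE:
        return None
    code = raw.rstrip("-")      # drop the dash for an inapplicable digit
    if len(code) <= 2:
        return code
    return code[:2] + "." + code[2:]


def all_procedures(record):
    """Collect every decodable procedure code from one record string."""
    found = []
    for name, (start, end) in PROCEDURE_FIELDS.items():
        code = decode_procedure(record[start - 1:end])
        if code is not None:
            found.append((name, code))
    return found
```

The subtraction of 1 from `start` converts the 1-based file locations in the layout to Python's 0-based string indices.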

26.4 1 244 TOTAL NUMBER OF AMBULATORY SURGICAL PROCEDURES PERFORMED AT THIS VISIT 0-2 9 = No procedures recorded and 'None' box blank

27 MEDICATIONS (See page 12 in "Description of the National Ambulatory Medical Care Survey" for more information.)

27.1 1 245 WERE MEDICATIONS ORDERED OR PROVIDED AT THIS VISIT? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)

27.2 5 246-250 MEDICATION #1 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown Entry; Other 99999 = Illegible Entry

27.3 5 251-255 MEDICATION #2 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry

27.4 5 256-260 MEDICATION #3 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry

27.5 5 261-265 MEDICATION #4 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry

27.6 5 266-270 MEDICATION #5 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry

27.7 5 271-275 MEDICATION #6 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry

27.8 1 276 NUMBER OF MEDICATIONS CODED 0-6

28 FORMULARY LIST

28.1 1 277 WERE ANY DRUGS FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)

28.2 1 278 WAS DRUG #1 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)

28.3 1 279 WAS DRUG #2 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)

28.4 1 280 WAS DRUG #3 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)

28.5 1 281 WAS DRUG #4 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at this visit)

28.6 1 282 WAS DRUG #5 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at this visit)

28.7 1 283 WAS DRUG #6 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)

28.8 1 284 NUMBER OF DRUGS FROM FORMULARY LIST 0-6 = 0 - 6 drugs 7 = Not applicable 8 = Unknown

29 PROVIDERS SEEN AT THIS VISIT

0 = No, 1 = Yes

29.1 1 285 No answer (all categories blank)
29.2 1 286 Physician
29.3 1 287 Physician assistant
29.4 1 288 Nurse practitioner
29.5 1 289 Nurse midwife
29.6 1 290 R.N.
29.7 1 291 L.P.N.
29.8 1 292 Medical/nursing assistant
29.9 1 293 Other

30 3 294-296 TIME SPENT WITH PHYSICIAN (in minutes) 000-240

31 6 297-302 PATIENT VISIT WEIGHT A right-justified integer developed by the NAMCS staff for the purpose of producing national estimates from sample data.

32 1 303 GEOGRAPHIC REGION (Based on actual location of physician's practice.) 1 = Northeast 2 = Midwest 3 = South 4 = West

33 1 304 METROPOLITAN/NON METROPOLITAN (Based on actual location in conjunction with the definition of the Bureau of the Census and the U.S. Office of Management and Budget.)

1 = MSA (Metropolitan Statistical Area) 2 = Non-MSA

34 3 305-307 PHYSICIAN SPECIALTY COLLECTED FROM INDUCTION INTERVIEW (REFERENCE 3) (See "Physician Specialty List.")

35 1 308 TYPE OF DOCTOR 1 = M.D.- Doctor of Medicine 2 = D.O.- Doctor of Osteopathy

36 4 309-312 PHYSICIAN CODE - A unique code assigned to all records from a particular physician

37 3 313-315 PATIENT CODE- A number assigned to identify each individual record from a particular physician
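Because every item in this layout is identified by a fixed file location, a record can be split into named fields by simple slicing. The following Python sketch covers only a handful of items; the field names and the function `parse_record` are hypothetical, while the locations are taken from the layout above.

```python
# 1-based, inclusive file locations for a few items from the record layout.
LAYOUT = {
    "major_reason": (37, 37),      # item 15
    "visit_weight": (297, 302),    # item 31
    "region": (303, 303),          # item 32
    "msa": (304, 304),             # item 33
    "specialty": (305, 307),       # item 34
    "doctor_type": (308, 308),     # item 35
    "physician_code": (309, 312),  # item 36
    "patient_code": (313, 315),    # item 37
}


def parse_record(line):
    """Slice one fixed-width record into a dict of raw string fields."""
    return {name: line[start - 1:end] for name, (start, end) in LAYOUT.items()}
```

Since the patient visit weight is described as a right-justified integer, `int(parse_record(line)["visit_weight"])` should convert it whether the field is padded with blanks or with zeros.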

****THE FOLLOWING FIELDS SHOW WHETHER DATA WERE IMPUTED TO REPLACE BLANKS****

38 IMPUTED FIELDS 0 = Not Imputed 1 = Imputed

38.1 1 316 Visit date
38.2 1 317 Birth year
38.3 1 318 Sex
38.4 1 319 Race
38.5 1 320 Time spent with physician

******************* END OF IMPUTED DATA FIELDS ********************

39 DRUG-RELATED INFO FOR MEDICATION #1

39.1 5 321-325 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

39.2 1 326 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

39.3 1 327 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

39.4 1 328 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

39.5 4 329-332 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class

39.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

39.6a 5 333-337 INGREDIENT #1 CODE - 50001-92503, or 50000
39.6b 5 338-342 INGREDIENT #2 CODE - 50001-92503, or 50000
39.6c 5 343-347 INGREDIENT #3 CODE - 50001-92503, or 50000
39.6d 5 348-352 INGREDIENT #4 CODE - 50001-92503, or 50000
39.6e 5 353-357 INGREDIENT #5 CODE - 50001-92503, or 50000

40 DRUG-RELATED INFO FOR MEDICATION #2

40.1 5 358-362 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

40.2 1 363 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

40.3 1 364 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

40.4 1 365 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

40.5 4 366-369 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class

40.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

40.6a 5 370-374 INGREDIENT #1 CODE - 50001-92503, or 50000
40.6b 5 375-379 INGREDIENT #2 CODE - 50001-92503, or 50000
40.6c 5 380-384 INGREDIENT #3 CODE - 50001-92503, or 50000
40.6d 5 385-389 INGREDIENT #4 CODE - 50001-92503, or 50000
40.6e 5 390-394 INGREDIENT #5 CODE - 50001-92503, or 50000

41 DRUG-RELATED INFO FOR MEDICATION #3

41.1 5 395-399 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

41.2 1 400 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

41.3 1 401 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

41.4 1 402 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

41.5 4 403-406 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class

41.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

41.6a 5 407-411 INGREDIENT #1 CODE - 50001-92503, or 50000
41.6b 5 412-416 INGREDIENT #2 CODE - 50001-92503, or 50000
41.6c 5 417-421 INGREDIENT #3 CODE - 50001-92503, or 50000
41.6d 5 422-426 INGREDIENT #4 CODE - 50001-92503, or 50000
41.6e 5 427-431 INGREDIENT #5 CODE - 50001-92503, or 50000

42 DRUG-RELATED INFO FOR MEDICATION #4

42.1 5 432-436 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

42.2 1 437 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

42.3 1 438 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

42.4 1 439 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

42.5 4 440-443 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class

42.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

42.6a 5 444-448 INGREDIENT #1 CODE - 50001-92503, or 50000
42.6b 5 449-453 INGREDIENT #2 CODE - 50001-92503, or 50000
42.6c 5 454-458 INGREDIENT #3 CODE - 50001-92503, or 50000
42.6d 5 459-463 INGREDIENT #4 CODE - 50001-92503, or 50000
42.6e 5 464-468 INGREDIENT #5 CODE - 50001-92503, or 50000

43 DRUG-RELATED INFO FOR MEDICATION #5

43.1 5 469-473 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

43.2 1 474 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

43.3 1 475 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

43.4 1 476 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

43.5 4 477-480 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100 - 2100 = NDC Drug Class

43.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

43.6a 5 481-485 INGREDIENT #1 CODE - 50001-92503, or 50000
43.6b 5 486-490 INGREDIENT #2 CODE - 50001-92503, or 50000
43.6c 5 491-495 INGREDIENT #3 CODE - 50001-92503, or 50000
43.6d 5 496-500 INGREDIENT #4 CODE - 50001-92503, or 50000
43.6e 5 501-505 INGREDIENT #5 CODE - 50001-92503, or 50000

44 DRUG-RELATED INFO FOR MEDICATION #6

44.1 5 506-510 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined

44.2 1 511 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined

44.3 1 512 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined

44.4 1 513 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined

44.5 4 514-517 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class

44.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)

44.6a 5 518-522 INGREDIENT #1 CODE - 50001-92503, or 50000
44.6b 5 523-527 INGREDIENT #2 CODE - 50001-92503, or 50000
44.6c 5 528-532 INGREDIENT #3 CODE - 50001-92503, or 50000
44.6d 5 533-537 INGREDIENT #4 CODE - 50001-92503, or 50000
44.6e 5 538-542 INGREDIENT #5 CODE - 50001-92503, or 50000
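Items 39 through 44 repeat the same ten subfields for each of the six medications, each block occupying 37 positions starting at location 321. That regularity allows any drug-related location to be computed rather than looked up. The Python sketch below is one way to do so; the subfield names and the function `drug_field_location` are hypothetical, and the offsets are derived from the locations listed for medication #1.

```python
BLOCK_START = 321   # first position of the medication #1 block (item 39.1)
BLOCK_WIDTH = 37    # positions per medication block (321-357, 358-394, ...)

# Subfield name -> (offset from the start of the block, field length).
SUBFIELDS = {
    "generic_code": (0, 5),
    "rx_status": (5, 1),
    "controlled": (6, 1),
    "composition": (7, 1),
    "ndc_class": (8, 4),
    "ingredient_1": (12, 5),
    "ingredient_2": (17, 5),
    "ingredient_3": (22, 5),
    "ingredient_4": (27, 5),
    "ingredient_5": (32, 5),
}


def drug_field_location(med_number, name):
    """Return the 1-based (start, end) file location of a drug subfield."""
    if not 1 <= med_number <= 6:
        raise ValueError("med_number must be between 1 and 6")
    offset, length = SUBFIELDS[name]
    start = BLOCK_START + BLOCK_WIDTH * (med_number - 1) + offset
    return start, start + length - 1


print(drug_field_location(4, "ndc_class"))      # (440, 443)
print(drug_field_location(6, "ingredient_5"))   # (538, 542)
```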

The items on this page are appearing for the first time on the NAMCS public use file. They were collected using the Physician Induction Interview form at the start of the survey process. All of them pertain to aspects of the physician's practice.

45 1 543 TYPE OF OFFICE SETTING FOR THIS VISIT

1 = Free standing private, solo, or group office 2 = Free standing clinic/urgicenter (not part of hospital emergency department or outpatient department) 3 = Neighborhood health or mental health center 4 = Family planning clinic 5 = Privately operated clinic 6 = Local government clinic (state, county, city) 7 = Health maintenance organization (HMO) or other prepaid practice 8 = Other/unknown

46 1 544 IS THIS A SOLO PRACTICE? 1 = Yes 2 = No

47 1 545 EMPLOYMENT STATUS OF PHYSICIAN 1 = Owner 2 = Employee 3 = Contractor 4 = Blank

48 1 546 WHO OWNS THIS OFFICE? 1 = Hospital 2 = Physician or physician group 3 = Other health care corporation 4 = Health maintenance organization (HMO) 5 = Other 6 = Blank

49 1 547 IS LAB TESTING PERFORMED AT THIS OFFICE? 0 = No 1 = Yes 2 = Blank

*** THE FOLLOWING ITEM WAS ADDED TO FACILITATE CALCULATION OF VISIT RATES BY RACE ***

50 1 548 RACE RECODE 1 = White 2 = Black 3 = Other

*** THE FOLLOWING ITEM WAS ADDED TO ENABLE USERS TO CREATE TABLES USING THE PHYSICIAN SPECIALTY GROUPS IN NAMCS SUMMARY REPORTS. ***

51 2 549-550 PHYSICIAN SPECIALTY RECODE

01 = General and family practice
03 = Internal medicine
04 = Pediatrics
05 = General surgery
06 = Obstetrics and gynecology
07 = Orthopedic surgery
08 = Cardiovascular diseases
09 = Dermatology
10 = Urology
11 = Psychiatry
12 = Neurology
13 = Ophthalmology
14 = Otolaryngology
15 = All other

(Note: For this variable, doctors of osteopathy (stratum 02) have been aggregated with doctors of medicine according to their self-designated practice specialty, and therefore are not differentiated in the variable range. To isolate doctors of osteopathy from medical doctors using the Physician Specialty Recode, it is necessary to crosstabulate it with Type of Doctor located in position 308.)
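The crosstabulation described in the note can be sketched in Python as below. It assumes each record is one string in the layout of this appendix, with the Physician Specialty Recode at locations 549-550 and Type of Doctor at location 308; the function name `specialty_by_doctor_type` is hypothetical. The counts are unweighted numbers of sample records, not national estimates.

```python
from collections import Counter


def specialty_by_doctor_type(lines):
    """Count sample records by (specialty recode, type of doctor)."""
    table = Counter()
    for line in lines:
        specialty = line[548:550]   # locations 549-550, item 51
        doctor_type = line[307]     # location 308, item 35 (1 = M.D., 2 = D.O.)
        table[(specialty, doctor_type)] += 1
    return table
```

Under this sketch, `table[("01", "2")]` would give the number of sample visits to doctors of osteopathy in general and family practice.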

*** THE FOLLOWING ITEM WAS ADDED TO ENABLE USERS TO CREATE SUBSETS OF VISITS BY PATIENTS UNDER ONE YEAR OF AGE ***

52 3 551-553 AGE IN DAYS 001-365 = 001-365 days 999 = More than 365 days

53 1 554 AGE RECODE 1 = Under 15 years 2 = 15-24 years 3 = 25-44 years 4 = 45-64 years 5 = 65-74 years 6 = 75 years and over

NUMERIC CODES FOR CAUSE OF INJURY, DIAGNOSIS, AND PROCEDURES

***The following items were included on the public use file to facilitate analysis of visits using the ICD-9-CM codes. Prior to the 1995 public use file, all ICD-9-CM diagnosis codes on the NAMCS micro-data file were converted from alphanumeric to numeric fields according to the following coding conventions: A prefix of "1" was added to ICD-9-CM codes in the range of 001.0[-] through 999.9[-]. A prefix of "20" was substituted for the letter "V" for codes in the range of V01.0[-] through V82.9[-]. Inapplicable fourth and fifth digits were zerofilled. This conversion was done to facilitate analysis of ICD-9-CM data using the Ambulatory Care Statistics software systems. Similar conversions were made for ICD-9-CM procedure codes and external cause of injury codes. Specific coding conventions are discussed in the public use documentation for each data year.

In 1995, however, the decision was made to use the actual ICD-9-CM codes on the public use data file. Codes were not prefixed, and a dash was inserted for inapplicable fourth and fifth digits. For specific details pertaining to each type of code (diagnosis, procedure, cause of injury), refer to the documentation for the survey year of interest. This had the advantage of preserving actual codes and avoiding possible confusion over the creation of some artificial codes due to zerofilling.

It has come to our attention that some users of NAMCS data find it preferable to use the numeric field recodes rather than the alphanumeric fields in certain data applications. Therefore, for 1997, we have included numeric recodes for cause of injury, diagnosis, and procedure (ambulatory surgical procedure as well as "other" diagnostic/screening service and "other" therapeutic/preventive service) as listed below. These are in addition to the actual codes for these variables which appear earlier on the public use file. Users can make their own choice about which format best suits their needs.

We are interested in hearing from data users as to which format they prefer so that a decision can be made about whether to include both formats in future years. Please contact Susan Schappert, Ambulatory Care Statistics Branch, at 301-436-7132, ext. 172 with any comments or suggestions.******

54 CAUSE OF INJURY RECODE

54.1 4 555-558 CAUSE OF INJURY #1 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank

54.2 4 559-562 CAUSE OF INJURY #2 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank

54.3 4 563-566 CAUSE OF INJURY #3 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank

55 DIAGNOSIS RECODE

55.1 6 567-572 DIAGNOSIS #1 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank

55.2 6 573-578 DIAGNOSIS #2 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank

55.3 6 579-584 DIAGNOSIS #3 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank
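The relationship between the alphanumeric diagnosis fields (items 22.1 through 22.3) and the numeric recodes in item 55 can be illustrated by applying the stated conventions directly: a prefix of "1" for numeric codes, "20" in place of the letter "V", and zero-filling of dashes. The Python sketch below is derived only from those conventions and from the ranges listed above; it is not the conversion program used by NCHS, and the function name `diagnosis_to_numeric` is hypothetical. Cause-of-injury and procedure recodes are not handled.

```python
def diagnosis_to_numeric(raw):
    """Convert a 5-character ICD-9-CM diagnosis field to its 6-digit recode."""
    if raw == "00000":                  # blank is recoded as 900000
        return "900000"
    if raw[0] == "V":
        prefix, rest = "20", raw[1:]    # "20" replaces the letter V
    else:
        prefix, rest = "1", raw         # "1" is added to numeric codes
    return prefix + rest.replace("-", "0")   # zero-fill inapplicable digits


print(diagnosis_to_numeric("0010-"))    # 100100
print(diagnosis_to_numeric("V829-"))    # 208290
print(diagnosis_to_numeric("V997-"))    # 209970
```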

56 OTHER DIAGNOSTIC/SCREENING SERVICES RECODE

56.1 4 585-588 OTHER DIAGNOSTIC/SCREENING SERVICE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 17, box 25 on Patient Record form was checked, but no entry was made in write-in response field

56.2 4 589-592 OTHER DIAGNOSTIC/SCREENING SERVICE #2 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 17, box 25 on Patient Record form was checked, but no entry was made in write-in response field

57 OTHER THERAPEUTIC/PREVENTIVE SERVICES RECODE

57.1 4 593-596 OTHER THERAPEUTIC/PREVENTIVE SERVICE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 18, box 17 on Patient Record form was checked, but no entry was made in write-in response field

57.2 4 597-600 OTHER THERAPEUTIC/PREVENTIVE SERVICE #2 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 18, box 17 on Patient Record form was checked, but no entry was made in write-in response field

58 AMBULATORY SURGICAL PROCEDURE RECODE

58.1 4 601-604 AMBULATORY SURGICAL PROCEDURE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank

58.2 4 605-608 AMBULATORY SURGICAL PROCEDURE #2 (Recode to Numeric Field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank

VITA

The author, Johnathan Paul Durbin, is the son of Norman Paul Durbin and Carol (Nachtsheim) Durbin. He was born June 16, 1964, in Santa Cruz, California.

His elementary education was obtained in various public schools in California and Kentucky. His secondary education was obtained at Western High School, Louisville, Kentucky, where he graduated in 1982.

In September, 1982, he entered the University of Kentucky and worked on a Bachelor of Science in Computer Science but became disabled and was unable to finish it. In 1992 he entered the University of Louisville, and in December, 1995, received a Bachelor of Science with a major in mathematics with a computer focus.
