Data Mining In Excel: Lecture Notes and Cases

Preliminary Draft 2/04

Nitin R. Patel
Peter C. Bruce

(c) Quantlink Corp. 2004

Distributed by:
Resampling Stats, Inc.
612 N. Jackson St.
Arlington, VA 22201
USA

[email protected]
www.xlminer.com

Contents

1 Introduction
  1.1 Who is This Book For?
  1.2 What is Data Mining?
  1.3 Where is Data Mining Used?
  1.4 The Origins of Data Mining
  1.5 Terminology and Notation
  1.6 Organization of Data Sets
  1.7 Factors Responsible for the Rapid Growth of Data Mining

2 Overview of the Data Mining Process
  2.1 Core Ideas in Data Mining
    2.1.1 Classification
    2.1.2 Prediction
    2.1.3 Affinity Analysis
    2.1.4 Data Reduction
    2.1.5 Data Exploration
    2.1.6 Data Visualization
  2.2 Supervised and Unsupervised Learning
  2.3 The Steps In Data Mining
  2.4 SEMMA
  2.5 Preliminary Steps
    2.5.1 Sampling from a Database
    2.5.2 Pre-processing and Cleaning the Data
    2.5.3 Partitioning the Data
  2.6 Building a Model - An Example with Linear Regression
    2.6.1 Can Excel Handle the Job?

3 Supervised Learning - Classification & Prediction
  3.1 Judging Classification Performance
    3.1.1 A Two-class Classifier
    3.1.2 Bayes Rule for Minimum Error
    3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion
    3.1.4 Asymmetric Misclassification Costs and Bayes Risk
    3.1.5 Stratified Sampling and Asymmetric Costs
    3.1.6 Generalization to More than Two Classes
    3.1.7 Lift Charts
    3.1.8 Example: Boston Housing (Two Classes)
    3.1.9 ROC Curve
    3.1.10 Classification Using a Triage Strategy

4 Multiple Linear Regression
  4.1 A Review of Multiple Linear Regression
    4.1.1 Linearity
    4.1.2 Independence
    4.1.3 Unbiasedness
  4.2 Illustration of the Regression Process
  4.3 Subset Selection in Linear Regression
  4.4 Dropping Irrelevant Variables
  4.5 Dropping Independent Variables With Small Coefficient Values
  4.6 Algorithms for Subset Selection
    4.6.1 Forward Selection
    4.6.2 Backward Elimination
    4.6.3 Step-wise Regression (Efroymson's Method)
    4.6.4 All Subsets Regression
  4.7 Identifying Subsets of Variables to Improve Predictions

5 Logistic Regression
  5.1 Example 1: Estimating the Probability of Adopting a New Phone Service
  5.2 Multiple Linear Regression is Inappropriate
  5.3 The Logistic Regression Model
  5.4 Odds Ratios
  5.5 Probabilities
  5.6 Example 2: Financial Conditions of Banks
    5.6.1 A Model with Just One Independent Variable
    5.6.2 Multiplicative Model of Odds Ratios
    5.6.3 Computation of Estimates
  5.7 Appendix A - Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients
    5.7.1 Data
    5.7.2 Likelihood Function
    5.7.3 Loglikelihood Function
    5.7.4 Algorithm
  5.8 Appendix B - The Newton-Raphson Method

6 Neural Nets
  6.1 The Neuron (a Mathematical Model)
  6.2 The Neuron (a Mathematical Model)
    6.2.1 Single Layer Networks
    6.2.2 Multilayer Neural Networks
  6.3 Example 1: Fisher's Iris Data
  6.4 The Backward Propagation Algorithm - Classification
    6.4.1 Forward Pass - Computation of Outputs of All the Neurons in the Network
    6.4.2 Backward Pass: Propagation of Error and Adjustment of Weights
  6.5 Adjustment for Prediction
  6.6 Multiple Local Optima and Epochs
  6.7 Overfitting and the Choice of Training Epochs
  6.8 Adaptive Selection of Architecture
  6.9 Successful Applications

7 Classification and Regression Trees
  7.1 Classification Trees
  7.2 Recursive Partitioning
  7.3 Example 1 - Riding Mowers
  7.4 Pruning
  7.5 Minimum Error Tree
  7.6 Best Pruned Tree
  7.7 Classification Rules from Trees
  7.8 Regression Trees

8 Discriminant Analysis
  8.1 Example 1 - Riding Mowers
  8.2 Fisher's Linear Classification Functions
  8.3 Measuring Distance
  8.4 Classification Error
  8.5 Example 2 - Classification of Flowers
  8.6 Appendix - Mahalanobis Distance

9 Other Supervised Learning Techniques
  9.1 K-Nearest Neighbor
    9.1.1 The K-NN Procedure
    9.1.2 Example 1 - Riding Mowers
    9.1.3 K-Nearest Neighbor Prediction
    9.1.4 Shortcomings of k-NN Algorithms
  9.2 Naive Bayes
    9.2.1 Bayes Theorem
    9.2.2 The Problem with Bayes Theorem
    9.2.3 Simplify - Assume Independence
    9.2.4 Example 1 - Saris

10 Affinity Analysis - Association Rules
  10.1 Discovering Association Rules in Transaction Databases
  10.2 Support and Confidence
  10.3 Example 1 - Electronics Sales
  10.4 The Apriori Algorithm
  10.5 Example 2 - Randomly-generated Data
  10.6 Shortcomings

11 Data Reduction and Exploration
  11.1 Dimensionality Reduction - Principal Components Analysis
  11.2 Example 1 - Head Measurements of First Adult Sons
  11.3 The Principal Components
  11.4 Example 2 - Characteristics of Wine
  11.5 Normalizing the Data
  11.6 Principal Components and Orthogonal Least Squares

12 Cluster Analysis
  12.1 What is Cluster Analysis?
  12.2 Example 1 - Public Utilities Data
  12.3 Hierarchical Methods
    12.3.1 Nearest Neighbor (Single Linkage)
    12.3.2 Farthest Neighbor (Complete Linkage)
    12.3.3 Group Average (Average Linkage)
  12.4 Optimization and the k-means Algorithm
  12.5 Similarity Measures
  12.6 Other Distance Measures

13 Cases
  13.1 Charles Book Club
  13.2 German Credit
  13.3 Textile Cooperatives
  13.4 Tayko Software Cataloger
  13.5 IMRB: Segmenting Consumers of Bath Soap

Chapter 1

    Introduction

    1.1 Who is This Book For?

This book arose out of a data mining course at MIT's Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor's goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher, or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

1. To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining;

    2. To provide a business decision-making context for these methods;

    3. Using real business cases, to illustrate the application and interpretation of these methods.

An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative data sets) are provided in an Excel add-in, XLMiner. The presentation of the cases is structured so that the reader can follow along and implement the algorithms on his or her own with a very low learning curve.

While the genesis for this book lay in the need for a case-oriented guide to teaching data mining, analysts and consultants who are considering applying data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.



    1.2 What is Data Mining?

The field of data mining is still relatively new, and in a state of evolution. The first International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995, and there are a variety of definitions of data mining.

A concise definition that captures the essence of data mining is:

"Extracting useful information from large data sets" (Hand et al., 2001).

A slightly longer version is:

"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Berry and Linoff, 1997 and 2000)

Berry and Linoff later had cause to regret the 1997 reference to "automatic and semi-automatic means," feeling it shortchanged the role of data exploration and analysis.

Another definition comes from the Gartner Group, the information technology research firm (from their web site, Jan. 2004):

"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques."

A summary of the variety of methods encompassed in the term "data mining" follows below (Core Ideas).

1.3 Where is Data Mining Used?

Data mining is used in a variety of fields and applications. The military might use data mining to learn what roles various factors play in the accuracy of bombs. Intelligence agencies might use it to determine which of a huge quantity of intercepted communications are of interest. Security specialists might use these methods to determine whether a packet of network data constitutes a threat. Medical researchers might use them to predict the likelihood of a cancer relapse.

Although data mining methods and tools have general applicability, in this book most examples are chosen from the business world. Some common business questions one might address through data mining methods include:

1. From a large list of prospective customers, which are most likely to respond? We could use classification techniques (logistic regression, classification trees or other methods) to identify those individuals whose demographic and other data most closely match those of our best existing customers. Similarly, we can use prediction techniques to forecast how much individual prospects will spend.

2. Which customers are most likely to commit fraud (or might already have committed it)? We can use classification methods to identify (say) medical reimbursement applications that have a higher probability of involving fraud, and give them greater attention.

3. Which loan applicants are likely to default? We might use classification techniques to identify them (or logistic regression to assign a "probability of default" value).

4. Which customers are more likely to abandon a subscription service (telephone, magazine, etc.)? Again, we might use classification techniques to identify them (or logistic regression to assign a "probability of leaving" value). In this way, discounts or other enticements might be proffered selectively where they are most needed.

    1.4 The Origins of Data Mining

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics - linear regression, logistic regression, discriminant analysis and principal components analysis, for example. But the core tenets of classical statistics - computing is difficult and data are scarce - do not apply in data mining applications where both data and computing power are plentiful.

This gives rise to Daryl Pregibon's description of data mining as "statistics at scale and speed." A useful extension of this is "statistics at scale, speed, and simplicity." Simplicity in this case refers not to simplicity of algorithms, but rather to simplicity in the logic of inference. Due to the scarcity of data in the classical statistical setting, the same sample is used to make an estimate, and also to determine how reliable that estimate might be. As a result, the logic of the confidence intervals and hypothesis tests used for inference is elusive for many, and their limitations are not well appreciated. By contrast, the data mining paradigm of fitting a model with one sample and assessing its performance with another sample is easily understood.

Computer science has brought us "machine learning" techniques, such as trees and neural networks, that rely on computational intensity and are less structured than classical statistical models. In addition, the growing field of database management is also part of the picture.

The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining. In comparison to statistics, data mining deals with large data sets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.

As a result, the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.


    1.5 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or the target variable. To a statistician, it is the dependent variable. Here is a summary of terms used:

Algorithm refers to a specific procedure used to implement a particular data mining technique - classification tree, discriminant analysis, etc.

Attribute is also called a feature, variable, or, from a database perspective, a field.

Case is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called record, pattern or row (each row typically represents a record, each column a variable).

Confidence has a specific meaning in association rules of the type "If A and B are purchased, C is also purchased." Confidence is the conditional probability that C will be purchased, IF A and B are purchased.

Confidence also has a broader meaning in statistics ("confidence interval"), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

Dependent variable is the variable being predicted in supervised learning; also called output variable, target variable or outcome variable.

Estimation means the prediction of the value of a continuous output variable; also called prediction.

Feature is also called an attribute, variable, or, from a database perspective, a field.

Input variable is a variable doing the predicting in supervised learning; also called independent variable or predictor.

Model refers to an algorithm as applied to a data set, complete with its settings (many of the algorithms have parameters which the user can adjust).

Outcome variable is the variable being predicted in supervised learning; also called dependent variable, target variable or output variable.

Output variable is the variable being predicted in supervised learning; also called dependent variable, target variable or outcome variable.

P(A|B) is read as "the probability that A will occur, given that B has occurred."

Pattern is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called record, case or row (each row typically represents a record, each column a variable).

Prediction means the prediction of the value of a continuous output variable; also called estimation.

Record is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called case, pattern or row (each row typically represents a record, each column a variable).

Score refers to a predicted value or class. "Scoring" new data means to use a model developed with training data to predict output values in new data.

Supervised learning refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known; the algorithm "learns" how to predict this value with new records where the output is unknown.

Test data refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.

Training data refers to that portion of the data used to fit a model.

Unsupervised learning refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (whether it falls into clusters, for example).

Validation data refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.

Variable is also called a feature, attribute, or, from a database perspective, a field.

    1.6 Organization of Data Sets

Data sets are nearly always constructed and displayed so that variables are in columns, and records are in rows. In the example below (the Boston Housing data), the values of 14 variables are recorded for a number of census tracts. Each row represents a census tract - the first tract had a per capita crime rate (CRIM) of 0.02729, had none of its residential lots zoned for over 25,000 square feet (ZN), etc. In supervised learning situations, one of these variables will be the outcome variable, typically listed at the end or the beginning (in this case it is median value, MEDV, at the end).
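To make the rows-as-records, columns-as-variables convention concrete, here is a minimal sketch in Python using the pandas library (the book itself works in Excel/XLMiner); the file name BostonHousing.csv is an assumption, standing in for wherever the data set is stored locally.

    import pandas as pd

    # Hypothetical file name; in the book the data live on an XLMiner worksheet.
    df = pd.read_csv("BostonHousing.csv")

    print(df.shape)           # (number of records, number of variables)
    print(df.head(3))         # each printed row is one census tract, each column one variable
    print(df.loc[0, "CRIM"])  # per capita crime rate of the first record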

    1.7 Factors Responsible for the Rapid Growth of Data Mining

Perhaps the most important factor propelling the growth of data mining is the growth of data. The mass retailer Walmart in 2003 captured 20 million transactions per day in a 10-terabyte database. In 1950, the largest companies had only enough data to occupy, in electronic form, several dozen megabytes (a terabyte is 1,000,000 megabytes).

The growth of data themselves is driven not simply by an expanding economy and knowledge base, but by the decreasing cost and increasing availability of automatic data capture mechanisms. Not only are more events being recorded, but more information per event is captured. Scannable bar codes, point of sale (POS) devices, mouse click trails, and global positioning satellite (GPS) data are examples.

The growth of the internet has created a vast new arena for information generation. Many of the same actions that people undertake in retail shopping, exploring a library, or catalog shopping have close analogs on the internet, and all can now be measured in the most minute detail.


In marketing, a shift in focus from products and services to a focus on the customer and his or her needs has created a demand for detailed data on customers.

The operational databases used to record individual transactions in support of routine business activity can handle simple queries, but are not adequate for more complex and aggregate analysis. Data from these operational databases are therefore extracted, transformed and exported to a data warehouse - a large integrated data storage facility that ties together the decision support systems of an enterprise. Smaller data marts devoted to a single subject may also be part of the system. They may include data from external sources (e.g. credit rating data).

Many of the exploratory and analytical techniques used in data mining would not be possible without today's computational power. The constantly declining cost of data storage and retrieval has made it possible to build the facilities required to store and make available vast amounts of data. In short, the rapid and continuing improvement in computing capacity is an essential enabler of the growth of data mining.

Chapter 2

    Overview of the Data Mining Process

    2.1 Core Ideas in Data Mining

2.1.1 Classification

Classification is perhaps the most basic form of data analysis. The recipient of an offer might respond or not respond. An applicant for a loan might repay on time, repay late or declare bankruptcy. A credit card transaction might be normal or fraudulent. A packet of data traveling on a network might be benign or threatening. A bus in a fleet might be available for service or unavailable. The victim of an illness might be recovered, still ill, or deceased.

A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

    2.1.2 Prediction

Prediction is similar to classification, except we are trying to predict the value of a variable (e.g. amount of purchase), rather than a class (e.g. purchaser or nonpurchaser).

Of course, in classification we are trying to predict a class, but the term "prediction" in this book refers to the prediction of the value of a continuous variable. (Sometimes in the data mining literature, the term "estimation" is used to refer to the prediction of the value of a continuous variable, and "prediction" may be used for both continuous and categorical data.)

2.1.3 Affinity Analysis

Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or "what goes with what." Association rules can then be used in a variety of ways. For example, grocery stores might use such information after a customer's purchases have all been scanned to print discount coupons, where the items being discounted are determined by mapping the customer's purchases onto the association rules.
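As a minimal sketch of how such associations are quantified (support and confidence are treated formally in Chapter 10), the following Python fragment scores one candidate rule against a handful of market baskets; the items, baskets and rule are all invented for illustration.

    # Toy transactions; each set is one customer's scanned basket.
    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer"},
        {"milk", "diapers", "beer"},
        {"bread", "milk", "diapers"},
        {"bread", "milk", "beer"},
    ]

    # Candidate rule: "if bread is purchased, milk is also purchased."
    antecedent, consequent = {"bread"}, {"milk"}

    n_ante = sum(antecedent <= b for b in baskets)                 # baskets with bread
    n_both = sum((antecedent | consequent) <= b for b in baskets)  # baskets with both

    support = n_both / len(baskets)  # fraction of all baskets containing both items
    confidence = n_both / n_ante     # conditional probability of milk given bread
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")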



    2.1.4 Data Reduction

Sensible data analysis often requires distillation of complex data into simpler data. Rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups. This process of consolidating a large number of variables (or cases) into a smaller set is termed data reduction.

    2.1.5 Data Exploration

Unless our data project is very narrowly focused on answering a specific question determined in advance (in which case it has drifted more into the realm of statistical analysis than of data mining), an essential part of the job is to review and examine the data to see what messages they hold, much as a detective might survey a crime scene. Here, full understanding of the data may require a reduction in their scale or dimension to let us see the forest without getting lost in the trees. Similar variables (i.e. variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables. Analogously, cluster analysis might be used to aggregate records together into groups of similar records.

    2.1.6 Data Visualization

Another technique for exploring data to see what information they hold is graphical analysis. For example, combining all possible scatter plots of one variable against another on a single page allows us to quickly visualize relationships among variables.

The Boston Housing data is used to illustrate this. In this data set, each row is a city neighborhood (census tract, actually) and each column is a variable (crime rate, pupil/teacher ratio, etc.). The outcome variable of interest is the median value of a housing unit in the neighborhood. Figure 2.1 takes four variables from this data set and plots them against each other in a series of two-way scatterplots. In the lower left, for example, the crime rate (CRIM) is plotted on the x-axis and the median value (MEDV) on the y-axis. In the upper right, the same two variables are plotted on opposite axes. From the plots in the lower right quadrant, we see that, unsurprisingly, the more lower-economic-status residents a neighborhood has, the lower the median house value. From the upper right and lower left corners we see (again, unsurprisingly) that higher crime rates are associated with lower median values. An interesting result can be seen in the upper left quadrant. All the very high crime rates seem to be associated with a specific, mid-range value of INDUS (proportion of non-retail businesses per neighborhood). That a specific, middling level of INDUS is really associated with high crime rates seems dubious. A closer examination of the data reveals that each specific value of INDUS is shared by a number of neighborhoods, indicating that INDUS is measured for a broader area than that of the census tract neighborhood. The high crime rate associated so markedly with a specific value of INDUS indicates that the few neighborhoods with extremely high crime rates fall mainly within one such broader area.


    Figure 2.1 Matrix scatterplot for four variables from the Boston Housing data.
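A matrix scatterplot like Figure 2.1 can be produced in a few lines outside of Excel as well; here is a sketch in Python, assuming the data are available in a hypothetical BostonHousing.csv file.

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.read_csv("BostonHousing.csv")  # hypothetical path to the data set

    # Plot every pairwise combination of the four variables discussed above.
    scatter_matrix(df[["CRIM", "INDUS", "LSTAT", "MEDV"]], figsize=(8, 8))
    plt.show()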

    2.2 Supervised and Unsupervised Learning

A fundamental distinction among data mining techniques is between supervised methods and unsupervised methods.

Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g. purchase or no purchase) is known. These "training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of data (the "validation data") where the outcome is known, to see how well it does, in comparison to other models. If many different models are being tried out, it is prudent to save a third sample of known outcomes (the "test data") to use with the final, selected model to predict how well it will do. The model can then be used to classify or predict the outcome variable of interest in new cases where the outcome is unknown.

Simple linear regression analysis is an example of supervised learning (though rarely called that in the introductory statistics course where you likely first encountered it). The Y variable is the (known) outcome variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
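A minimal sketch of this train-then-predict pattern in Python, with made-up numbers:

    import numpy as np

    # Training data: X values with known Y outcomes (all numbers invented).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit the line that minimizes the sum of squared deviations.
    slope, intercept = np.polyfit(x, y, deg=1)

    # "Score" new X values for which the Y value is unknown.
    x_new = np.array([6.0, 7.0])
    print(slope * x_new + intercept)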

Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify. Hence, there is no "learning" from cases where such an outcome variable is known. Affinity analysis, data reduction methods and clustering techniques are all unsupervised learning methods.

    2.3 The Steps In Data Mining

This book focuses on understanding and using data mining algorithms (steps 4-7 below). However, some of the most serious errors in data analysis result from a poor understanding of the problem - an understanding that must be developed well before we get into the details of algorithms to be used. Here is a list of the steps to be taken in a typical data mining effort:

1. Develop an understanding of the purpose of the data mining project (if it is a one-shot effort to answer a question or questions) or application (if it is an ongoing procedure).

2. Obtain the data set to be used in the analysis. This often involves random sampling from a large database to capture records to be used in an analysis. It may also involve pulling together data from different databases. The databases could be internal (e.g. past purchases made by customers) or external (credit ratings). While data mining deals with very large databases, usually the analysis to be done requires only thousands or tens of thousands of records.

3. Explore, clean, and preprocess the data. This involves verifying that the data are in reasonable condition. How should missing data be handled? Are the values in a reasonable range, given what you would expect for each variable? Are there obvious outliers? The data are reviewed graphically - for example, a matrix of scatterplots showing the relationship of each variable with each other variable. We also need to ensure consistency in the definitions of fields, units of measurement, time periods, etc.

4. Reduce the data, if necessary, and (where supervised training is involved) separate it into training, validation and test data sets. This can involve operations such as eliminating unneeded variables and transforming variables (for example, turning "money spent" into "spent > $100" vs. "spent <= $100").


5. Determine the data mining task (classification, prediction, clustering, etc.). This involves translating the general question or problem of step 1 into a more specific statistical question.

6. Choose the data mining techniques to be used (regression, neural nets, Ward's method of hierarchical clustering, etc.).

7. Use algorithms to perform the task. This is typically an iterative process - trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm). Where appropriate, feedback from the algorithm's performance on validation data is used to refine the settings.

8. Interpret the results of the algorithms. This involves making a choice as to the best algorithm to deploy, and, where possible, testing our final choice on the test data to get an idea how well it will perform. (Recall that each algorithm may also be tested on the validation data for tuning purposes; in this way the validation data become a part of the fitting process and are likely to underestimate the error in the deployment of the model that is finally chosen.)

9. Deploy the model. This involves integrating the model into operational systems and running it on real records to produce decisions or actions. For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > $10."

    We concentrate in this book on steps 3-8.

    2.4 SEMMA

The above steps encompass the steps in SEMMA, a methodology developed by SAS:

Sample: sample from data sets, partition into training, validation and test data sets
Explore: explore data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit predictive models, e.g. regression, tree, collaborative filtering
Assess: compare models using validation data set

SPSS-Clementine also has a similar methodology, termed CRISP-DM (CRoss-Industry Standard Process for Data Mining).

    2.5 Preliminary Steps

    2.5.1 Sampling from a Database

Quite often, we will want to do our data mining analysis on less than the total number of records that are available. Data mining algorithms will have varying limitations on what they can handle in terms of the numbers of records and variables, limitations that may be specific to computing power and capacity as well as software limitations. Even within those limits, many algorithms will execute faster with smaller data sets.


From a statistical perspective, accurate models can be built with as few as several hundred records (see below). Hence, often we will want to sample a subset of records for model building.

If the event we are interested in is rare, however (e.g. customers purchasing a product in response to a mailing), sampling a subset of records may yield so few events (e.g. purchases) that we have little information on them. We would end up with lots of data on non-purchasers, but little on which to base a model that distinguishes purchasers from non-purchasers. In such cases, we would want our sampling procedure to over-weight the purchasers relative to the non-purchasers so that our sample would end up with a healthy complement of purchasers.

For example, if the purchase rate were 1% and we were going to be working with a sample of 1000 records, unweighted sampling would be expected to yield only 10 purchasers. If, on the other hand, a purchaser has a probability of being selected that is 99 times the probability of selecting a non-purchaser, then the proportions selected for the sample will be roughly equal.
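A minimal sketch of such weighted sampling in Python, with an invented population of 100,000 records at a 1% purchase rate:

    import numpy as np

    rng = np.random.default_rng(0)

    # 1 = purchaser (1% of records), 0 = non-purchaser.
    population = np.zeros(100_000, dtype=int)
    population[:1_000] = 1

    # Give each purchaser 99 times a non-purchaser's chance of selection.
    weights = np.where(population == 1, 99.0, 1.0)
    sample = rng.choice(population, size=1_000, replace=False, p=weights / weights.sum())

    print(sample.mean())  # proportion of purchasers in the sample; roughly 0.5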

    2.5.2 Pre-processing and Cleaning the Data

    2.5.2.1 Types of Variables

There are several ways of classifying variables. Variables can be numeric or text (character). They can be continuous (able to assume any real numeric value, usually in a given range), integer (assuming only integer values), or categorical (assuming one of a limited number of values). Categorical variables can be either numeric (1, 2, 3) or text (payments current, payments not current, bankrupt). Categorical variables can also be unordered (North America, Europe, Asia) or ordered (high value, low value, nil value).

    2.5.2.2 Variable Selection

More is not necessarily better when it comes to selecting variables for a model. Other things being equal, parsimony, or compactness, is a desirable feature in a model.

For one thing, the more variables we include, the greater the number of records we will need to assess relationships among the variables. Fifteen records may suffice to give us a rough idea of the relationship between Y and a single independent variable X. If we now want information about the relationship between Y and fifteen independent variables X1 ... X15, fifteen records will not be enough (each estimated relationship would have an average of only one record's worth of information, making the estimate very unreliable).

2.5.2.3 Overfitting

For another thing, the more variables we include, the greater the risk of overfitting the data. What is overfitting?

Consider the following hypothetical data about advertising expenditures in one time period, and sales in a subsequent time period:


Advertising    Sales
239            514
364            789
602            550
644            1386
770            1394
789            1440
911            1354

Figure 2.2: X-Y Scatterplot for Advertising and Sales Data

We could connect these points with a smooth and very complex function, one that explains all these data points perfectly and leaves no error (residuals).


Figure: X-Y scatterplot, smoothed

However, we can see that such a curve is unlikely to be that accurate, or even useful, in predicting future sales on the basis of advertising expenditures.

A basic purpose of building a model is to describe relationships among variables in such a way that this description will do a good job of predicting future outcome (dependent) values on the basis of future predictor (independent) values. Of course, we want the model to do a good job of describing the data we have, but we are more interested in its performance with data to come.

In the above example, a simple straight line might do a better job of predicting future sales on the basis of advertising than the complex function does.

In this example, we devised a complex function that fit the data perfectly, and in doing so over-reached. We certainly ended up "explaining" some variation in the data that was nothing more than chance variation. We have mislabeled the noise in the data as if it were a signal.

Similarly, we can add predictors to a model to sharpen its performance with the data at hand. Consider a database of 100 individuals, half of whom have contributed to a charitable cause. Information about income, family size, and zip code might do a fair job of predicting whether or not someone is a contributor. If we keep adding additional predictors, we can improve the performance of the model with the data at hand and reduce the misclassification error to a negligible level. However, this low error rate is misleading, because it likely includes spurious explanations.

For example, one of the variables might be height. We have no basis in theory to suppose that tall people might contribute more or less to charity, but if there are several tall people in our sample and they just happened to contribute heavily to charity, our model might include a term for height - the taller you are, the more you will contribute. Of course, when the model is applied to additional data, it is likely that this will not turn out to be a good predictor.

If the data set is not much larger than the number of predictor variables, then it is very likely that a spurious relationship like this will creep into the model. Continuing with our charity example, with a small sample just a few of whom are tall, whatever the contribution level of tall people may be, the computer is tempted to attribute it to their being tall. If the data set is very large relative to the number of predictors, this is less likely. In such a case, each predictor must help predict the outcome for a large number of cases, so the job it does is much less dependent on just a few cases, which might be flukes.

Overfitting can also result from the application of many different models, from which the best performing is selected (more about this below).

    2.5.2.4 How Many Variables and How Much Data?

Statisticians could give us procedures to learn with some precision how many records we would need to achieve a given degree of reliability with a given data set and a given model. Data miners' needs are usually not so precise, so we can often get by with rough rules of thumb. A good rule of thumb is to have ten records for every predictor variable. Another, used by Delmater and Hancock for classification procedures (2001, p. 68), is to have at least 6*M*N records, where

M = number of outcome classes, and
N = number of variables.
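These rules of thumb are easy to encode; a minimal sketch in Python:

    def min_records_simple(n_predictors: int) -> int:
        # Ten records for every predictor variable.
        return 10 * n_predictors

    def min_records_delmater_hancock(m_classes: int, n_vars: int) -> int:
        # At least 6 * M * N records for classification procedures.
        return 6 * m_classes * n_vars

    print(min_records_simple(10))               # 100
    print(min_records_delmater_hancock(2, 10))  # 120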

Even when we have an ample supply of data, there are good reasons to pay close attention to the variables that are included in a model. Someone with domain knowledge (i.e. knowledge of the business process and the data) should be consulted - knowledge of what the variables represent can often help build a good model and avoid errors.

For example, "shipping paid" might be an excellent predictor of "amount spent," but it is not a helpful one. It will not give us much information about what distinguishes high-paying from low-paying customers that can be put to use with future prospects.

In general, compactness or parsimony is a desirable feature in a model. A matrix of X-Y plots can be useful in variable selection. In such a matrix, we can see at a glance x-y plots for all variable combinations. A straight line would be an indication that one variable is exactly correlated with another. Typically, we would want to include only one of them in our model. The idea is to weed out irrelevant and redundant variables from our model.

    2.5.2.5 Outliers

The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data entry error, or the like. If the erroneous value is in the same range as the rest of the data, it may be harmless. If it is well outside the range of the rest of the data (a misplaced decimal, for example), it may have a substantial effect on some of the data mining procedures we plan to use.

Values that lie far away from the bulk of the data are called outliers. The term "far away" is deliberately left vague because what is or is not called an outlier is basically an arbitrary decision. Analysts use rules of thumb like "anything over 3 standard deviations away from the mean is an outlier," but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point, it is just a distant data point.

The purpose of identifying outliers is usually to call attention to data that need further review. We might come up with an explanation looking at the data - in the case of a misplaced decimal, this is likely. We might have no explanation, but know that the value is wrong - a temperature of 178 degrees F for a sick person. Or, we might conclude that the value is within the realm of possibility and leave it alone. All these are judgments best made by someone with domain knowledge. (Domain knowledge is knowledge of the particular application being considered - direct mail, mortgage finance, etc. - as opposed to technical knowledge of statistical or data mining procedures.) Statistical procedures can do little beyond identifying the record as something that needs review.

If manual review is feasible, some outliers may be identified and corrected. In any case, if the number of records with outliers is very small, they might be treated as missing data.

How do we inspect for outliers? One technique in Excel is to sort the records by the first column, then review the data for very large or very small values in that column. Then repeat for each successive column. For a more automated approach that considers each record as a unit, clustering techniques could be used to identify clusters of one or a few records that are distant from others. Those records could then be examined.
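A minimal sketch of the 3-standard-deviations screen for one column, with an invented misplaced-decimal value planted in the data:

    import numpy as np

    # Fourteen plausible values plus one suspicious 79.2 (all invented).
    values = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 4.7, 5.4,
                       5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 79.2])

    z = (values - values.mean()) / values.std()
    flagged = values[np.abs(z) > 3]  # candidates for review, not automatically errors
    print(flagged)                   # [79.2]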

    2.5.2.6 Missing Values

Typically, some records will contain missing values. If the number of records with missing values is small, those records might be omitted.

However, if we have a large number of variables, even a small proportion of missing values can affect a lot of records. Even with only 30 variables, if only 5% of the values are missing (spread randomly and independently among cases and variables), then almost 80% of the records would have to be omitted from the analysis. (The chance that a given record would escape having a missing value is 0.95^30 = 0.215.)

An alternative to omitting records with missing values is to replace the missing value with an imputed value, based on the other values for that variable across all records. For example, if, among 30 variables, household income is missing for a particular record, we might substitute instead the mean household income across all records.

Doing so does not, of course, add any information about how household income affects the outcome variable. It merely allows us to proceed with the analysis and not lose the information contained in this record for the other 29 variables. Note that using such a technique will understate the variability in a data set. However, since we can assess variability, and indeed the performance of our data mining technique, using the validation data, this need not present a major problem.
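A minimal sketch in Python of both the arithmetic above and mean imputation; the column name and values are invented for illustration.

    import numpy as np
    import pandas as pd

    # Chance that a 30-variable record escapes a missing value entirely:
    print(0.95 ** 30)  # about 0.215, so roughly 78% of records would be lost

    # Mean imputation for one variable.
    df = pd.DataFrame({"income": [52_000, np.nan, 61_000, 48_000, np.nan]})
    df["income"] = df["income"].fillna(df["income"].mean())
    print(df)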


    2.5.2.7 Normalizing (Standardizing) the Data

Some algorithms require that the data be normalized before the algorithm can be effectively implemented. To normalize the data, we subtract the mean from each value, and divide by the standard deviation of the resulting deviations from the mean. In effect, we are expressing each value as the number of standard deviations away from the mean.

To consider why this might be necessary, consider the case of clustering. Clustering typically involves calculating a distance measure that reflects how far each record is from a cluster center, or from other records. With multiple variables, different units will be used - days, dollars, counts, etc. If the dollars are in the thousands and everything else is in the 10s, the dollar variable will come to dominate the distance measure. Moreover, changing units from (say) days to hours or months could completely alter the outcome.

Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required. It is an option, rather than an automatic feature of such algorithms, because there are situations where we want the different variables to contribute to the distance measure in proportion to their scale.
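A minimal sketch of this normalization in Python, with two invented variables on very different scales:

    import pandas as pd

    # Without normalization, the dollar column would dominate any distance measure.
    df = pd.DataFrame({"dollars": [12_000, 55_000, 31_000],
                       "days":    [12, 45, 31]})

    # Express each value as standard deviations from its column mean.
    normalized = (df - df.mean()) / df.std()
    print(normalized)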

    2.5.3 Partitioning the Data

In supervised learning, a key question presents itself: how well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance among various models, so we can choose the one we think will do the best when it is actually implemented.

At first glance, we might think it best to choose the model that did the best job of classifying or predicting the outcome variable of interest with the data at hand. However, when we use the same data to develop the model and then assess its performance, we introduce bias.

This is because when we pick the model that does best with the data, this model's superior performance comes from two sources:

    A superior model

    Chance aspects of the data that happen to match the chosen model better than other models.

The latter is a particularly serious problem with techniques (such as trees and neural nets) that do not impose linear or other structure on the data, and thus end up overfitting it.

To address this problem, we simply divide (partition) our data and develop our model using only one of the partitions. After we have a model, we try it out on another partition and see how it does. We can measure how it does in several ways. In a classification model, we can count the proportion of held-back records that were misclassified. In a prediction model, we can measure the residuals (errors) between the predicted values and the actual values.

    We will typically deal with two or three partitions.


    2.5.3.1 Training Partition

Typically the largest partition, these are the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

    2.5.3.2 Validation Partition

This partition (sometimes called the "test" partition) is used to assess the performance of each model, so that you can compare models and pick the best one. In some algorithms (e.g. classification and regression trees), the validation partition may be used in automated fashion to tune and improve the model.

    2.5.3.3 Test Partition

This partition (sometimes called the "holdout" or "evaluation" partition) is used if we need to assess the performance of the chosen model with new data.

Why have both a validation and a test partition? When we use the validation data to assess multiple models and then pick the model that does best with the validation data, we again encounter another (lesser) facet of the overfitting problem - chance aspects of the validation data that happen to match the chosen model better than other models.

The random features of the validation data that enhance the apparent performance of the chosen model will not likely be present in new data to which the model is applied. Therefore, we may have overestimated the accuracy of our model. The more models we test, the more likely it is that one of them will be particularly effective in explaining the noise in the validation data. Applying the model to the test data, which it has not seen before, will provide an unbiased estimate of how well it will do with new data.

Sometimes (for example, when we are concerned mainly with finding the best model and less with exactly how well it will do), we might use only training and validation partitions.

The partitioning should be done randomly to avoid getting a biased partition. In XLMiner, the user can supply a variable (column) with a value "t" (training), "v" (validation) or "s" (test) assigned to each case (row). Alternatively, the user can ask XLMiner to do the partitioning randomly.

Note that with nearest neighbor algorithms for supervised learning, each record in the validation set is compared to all the records in the training set to locate its nearest neighbor(s). In a sense, the training partition itself is the model - any application of the model to new data requires the use of the training data. So the use of two partitions is an essential part of the classification or prediction process, not merely a way to improve or assess it. Nonetheless, we can still interpret the error in the validation data in the same way we would interpret error from any other model.

XLMiner has a utility that can divide the data up into training, validation and test sets either randomly according to user-set proportions, or on the basis of a variable that denotes which partition a record is to belong to. It is possible (though cumbersome) to divide the data into more than 3 partitions by successive partitioning - e.g. divide the initial data into 3 partitions, then take one of those partitions and partition it further.
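A minimal sketch of a random three-way partition in Python; the 50/30/20 proportions and the stand-in data set are illustrative, and the "t"/"v"/"s" labels mirror the XLMiner convention mentioned above.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    df = pd.DataFrame({"x": range(100)})  # stand-in for a real data set

    # Assign each record to training ("t"), validation ("v") or test ("s").
    labels = rng.choice(["t", "v", "s"], size=len(df), p=[0.5, 0.3, 0.2])

    train = df[labels == "t"]
    validation = df[labels == "v"]
    test = df[labels == "s"]
    print(len(train), len(validation), len(test))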


    2.6 Building a Model - An Example with Linear Regression

Let's go through the steps typical to many data mining tasks, using a familiar procedure - multiple linear regression. This will help us understand the overall process before we begin tackling new algorithms. We will illustrate the Excel procedure using XLMiner.

1. Purpose. Let's assume that the purpose of our data mining project is to predict the median house value in small Boston area neighborhoods.

2. Obtain the data. We will use the Boston Housing data. The data set in question is small enough that we do not need to sample from it - we can use it in its entirety.

    3. Explore, clean, and preprocess the data.

Let's look first at the description of the variables (crime rate, number of rooms per dwelling, etc.) to be sure we understand them all. These descriptions are available on the "description" tab on the worksheet, as is a web source for the data set. They all seem fairly straightforward, but this is not always the case. Often variable names are cryptic and their descriptions may be unclear or missing.

    This data set has 14 variables and a description of each variable is given in the table below.

CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centers
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT     % Lower status of the population
MEDV      Median value of owner-occupied homes in $1000s


    The data themselves look like this:

It is useful to pause and think about what the variables mean, and whether they should be included in the model. Consider the variable TAX. At first glance, we consider that tax on a home is usually a function of its assessed value, so there is some circularity in the model - we want to predict a home's value using TAX as a predictor, yet TAX itself is determined by a home's value. TAX might be a very good predictor of home value in a numerical sense, but would it be useful if we wanted to apply our model to homes whose assessed value might not be known? Reflect, though, that the TAX variable, like all the variables, pertains to the average in a neighborhood, not to individual homes. While the purpose of our inquiry has not been spelled out, it is possible that at some stage we might want to apply a model to individual homes and, in such a case, the neighborhood TAX value would be a useful predictor. So, we will keep TAX in the analysis for now.

In addition to these variables, the data set also contains an additional variable, CAT.MEDV, which has been created by categorizing median value (MEDV) into two categories - high and low. CAT.MEDV is a categorical variable created from MEDV: if MEDV >= $30,000, CAT.MEDV = 1; if MEDV < $30,000, CAT.MEDV = 0.



We can tell right away that the 79.29 is in error - no neighborhood is going to have houses that have an average of 79 rooms. All other values are between 3 and 9. Probably, the decimal was misplaced and the value should be 7.929. (This hypothetical error is not present in the data set supplied with XLMiner.)

4. Reduce the data and partition it into training, validation and test partitions. Our data set has only 13 predictor variables, so data reduction is not required. If we had many more variables, at this stage we might want to apply a variable reduction technique such as Principal Components Analysis to consolidate multiple similar variables into a smaller number of variables. Our task is to predict the median house value, and then assess how well that prediction does. We will partition the data into a training set to build the model, and a validation set to see how well the model does. This technique is part of the supervised learning process in classification and prediction problems. These are problems in which we know the class or value of the outcome variable for some data, and we want to use that data in developing a model that can then be applied to other data where that value is unknown.

In Excel, select XLMiner > Partition, and the following dialog box appears:



Here we specify which data range is to be partitioned, and which variables are to be included in the partitioned data set.

    The partitioning can be handled in one of two ways:

a) The data set can have a partition variable that governs the division into training and validation partitions (e.g. 1 = training, 2 = validation), or

b) The partitioning can be done randomly. If the partitioning is done randomly, we have the option of specifying a seed for randomization (which has the advantage of letting us duplicate the same random partition later, should we need to).

In this case, we will divide the data into two partitions - training and validation. The training partition is used to build the model; the validation partition is used to see how well the model does when applied to new data. We need to specify the percent of the data used in each partition.

Note: Although we are not using it here, a test partition might also be used.

Typically, a data mining endeavor involves testing multiple models, perhaps with multiple settings on each model. When we train just one model and try it out on the validation data, we can get an unbiased idea of how it might perform on more such data.


However, when we train lots of models and use the validation data to see how each one does, then pick the best performing model, the validation data no longer provide an unbiased estimate of how the model might do with more data. By playing a role in picking the best model, the validation data have become part of the model itself. In fact, several algorithms (classification and regression trees, for example) explicitly factor validation data into the model building algorithm itself (in pruning trees, for example).

Models will almost always perform better with the data they were trained on than with fresh data. Hence, when validation data are used in the model itself, or when they are used to select the best model, the results achieved with the validation data, just as with the training data, will be overly optimistic.

The test data, which should not be used either in the model building or model selection process, can give a better estimate of how well the chosen model will do with fresh data. Thus, once we have selected a final model, we will apply it to the test data to get an estimate of how well it will actually perform.

5. Determine the data mining task. In this case, as noted, the specific task is to predict the value of MEDV using the 13 predictor variables.

6. Choose the technique. In this case, it is multiple linear regression.

7. Having divided the data into training and validation partitions, we can use XLMiner to build a multiple linear regression model with the training data - we want to predict median house price on the basis of all the other values.


8. Use the algorithm to perform the task. In XLMiner, we select Prediction > Multiple Linear Regression:

The variable MEDV is selected as the output (dependent) variable, the variable CAT.MEDV is left unused, and the remaining variables are all selected as input (independent or predictor) variables. We will ask XLMiner to show us the fitted values on the training data, as well as the predicted values (scores) on the validation data.



XLMiner produces standard regression output, but we will defer that for now, as well as the more advanced options displayed above. (See the chapter on multiple linear regression, or the user documentation for XLMiner, for more information.) Rather, we will review the predictions themselves. Here are the predicted values for the first few records in the training data, along with the actual values and the residual (prediction error). Note that these predicted values would often be called the fitted values, since they are for the records that the model was fit to.


    And here are the results for the validation data:

Let's compare the prediction error for the training and validation data:

Prediction error can be measured in several ways. Three measures produced by XLMiner are shown above.

On the right is the average error - simply the average of the residuals (errors). In both cases, it is quite small, indicating that, on balance, predictions average about right - our predictions are unbiased. Of course, this simply means that the positive errors and negative errors balance each other out. It tells us nothing about how large those positive and negative errors are.

The residual sum of squares on the left adds up the squared errors, so whether an error is positive or negative it contributes just the same. However, this sum does not yield information about the size of the typical error.

The RMS error, or root mean squared error, is perhaps the most useful term of all. It takes the square root of the average squared error, and so gives an idea of the typical error (whether positive or negative) in the same scale as the original data.
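As a quick illustration of how the three measures relate, here is a minimal sketch; the actual and predicted values are made-up numbers, not XLMiner output.

```python
import math

def error_measures(actual, predicted):
    """Average error, residual sum of squares, and RMS error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    avg_error = sum(residuals) / n            # signed; near zero means unbiased
    ss_resid = sum(r * r for r in residuals)  # positive and negative errors count alike
    rms_error = math.sqrt(ss_resid / n)       # typical error, in original units
    return avg_error, ss_resid, rms_error

# Hypothetical actual and fitted MEDV values (in $1000s)
actual = [24.0, 21.6, 34.7, 33.4]
fitted = [27.1, 20.2, 32.8, 34.9]
print(error_measures(actual, fitted))   # (-0.325, 17.43, 2.087...)
```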


As we might expect, the RMS error for the validation data ($5,337), which the model is seeing for the first time in making these predictions, is larger than for the training data ($4,518), which were used in training the model.

9. Interpret the results.

At this stage, we would typically try other prediction algorithms (regression trees, for example) and see how they do, error-wise. We might also try different settings on the various models (for example, we could use the "best subsets" option in multiple linear regression to choose a reduced set of variables that might perform better with the validation data). After choosing the best model (typically, the model with the lowest error, while also recognizing that "simpler is better"), we then use that model to predict the output variable in fresh data. These steps will be covered in more detail in the analysis of cases.

10. Deploy the model. After the best model is chosen, it is then applied to new data to predict MEDV for records where this value is unknown. This, of course, was the overall purpose.

    2.6.1 Can Excel Handle the Job?

An important aspect of this process to note is that the heavy duty analysis does not necessarily require huge numbers of records. The data set to be analyzed may have millions of records, of course, but in doing multiple linear regression or applying a classification tree, the use of a sample of (say) 20,000 is likely to yield as accurate an answer as using the whole data set. The principle involved is the same as the principle behind polling - 2,000 voters, if sampled judiciously, can give an estimate of the entire population's opinion within one or two percentage points.

Therefore, in most cases, the number of records required in each partition (training, validation and test) can be accommodated within the rows allowed by Excel.

Of course, we need to get those records into Excel, so the standard version of XLMiner provides an interface for random sampling of records from an external database.

Likewise, we need to apply the results of our analysis to a large database, so the standard version of XLMiner has a facility for scoring the output of the model to an external database. For example, XLMiner would write an additional column (variable) to the database consisting of the predicted purchase amount for each record.


Chapter 3

Supervised Learning - Classification & Prediction

In supervised learning, we are interested in predicting the class (classification) or continuous value (prediction) of an outcome variable. In the previous chapter, we worked through a simple example. Let's now examine the question of how to judge the usefulness of a classifier or predictor and how to compare different ones.

3.1 Judging Classification Performance

Not only do we have a wide choice of different types of classifiers to choose from, but within each type of classifier we have many options, such as how many nearest neighbors to use in a k-nearest neighbors classifier, the minimum number of cases we should require in a leaf node in a tree classifier, which subsets of predictors to use in a logistic regression model, and how many hidden layer neurons to use in a neural net. Before we study these various algorithms in detail and face decisions on how to set these options, we need to know how we will measure success.

3.1.1 A Two-class Classifier

Let us first look at a single classifier for two classes. The two-class situation is certainly the most common and occurs very frequently in practice. We will extend our analysis to more than two classes later.

A natural criterion for judging the performance of a classifier is the probability that it makes a misclassification error. A classifier that makes no errors would be perfect, but we do not expect to be able to construct such classifiers in the real world, due to noise and to not having all the information needed to precisely classify cases. Is there a minimum probability of misclassification we should require of a classifier?

At a minimum, we hope to do better than the crude rule "classify everything as belonging to the most prevalent class." Imagine that, for each case, we know what the probability is that it belongs to one class or the other. Suppose that the two classes are denoted by C0 and C1. Let p(C0) and p(C1) be the apriori probabilities that a case belongs to C0 and C1 respectively. The apriori probability is the probability that a case belongs to a class without any more knowledge about it than that it belongs to a population where the proportion of C0's is p(C0) and the proportion of C1's is p(C1). In this situation we will minimize the chance of a misclassification error by assigning class C1 to the case if p(C1) > p(C0) and to C0 otherwise. The probability of making a misclassification error would be the minimum of p(C0) and p(C1). If we are using the misclassification rate as our criterion, any classifier that uses predictor variables must have an error rate better than this.

What is the best performance we can expect from a classifier? Clearly, the more training data available to a classifier, the more accurate it will be. Suppose we had a huge amount of training data - would we then be able to build a classifier that makes no errors? The answer is no. The accuracy of a classifier depends critically on how separated the classes are with respect to the predictor variables that the classifier uses. We can use the well-known Bayes formula from probability theory to derive the best performance we can expect from a classifier for a given set of predictor variables if we had a very large amount of training data. Bayes formula uses the distributions of the decision variables in the two classes to give us a classifier that will have the minimum error amongst all classifiers that use the same predictor variables. This classifier uses the Minimum Error Bayes Rule.

    3.1.2 Bayes Rule for Minimum Error

Let us take a simple situation where we have just one continuous predictor variable, say X, to use in predicting our two-class outcome variable. Now X is a random variable, since its value depends on the individual case we sample from the population consisting of all possible cases of the class to which the case belongs.

Suppose that we have a very large training data set. Then the relative frequency histogram of the variable X in each class would be almost identical to the probability density function (p.d.f.) of X for that class. Let us assume that we have a huge amount of training data and so we know the p.d.f.s accurately. These p.d.f.s are denoted f0(x) and f1(x) for classes C0 and C1 in Figure 1 below.

    Figure 1


Now suppose we wish to classify an object for which the value of X is x0. Let us use Bayes formula to predict the probability that the object belongs to class 1, conditional on the fact that it has an X value of x0. Applying Bayes formula, the probability, denoted by p(C1|X = x0), is given by:

p(C1|X = x0) = p(X = x0|C1) p(C1) / [p(X = x0|C0) p(C0) + p(X = x0|C1) p(C1)]

    Writing this in terms of the density functions, we get

p(C1|X = x0) = f1(x0) p(C1) / [f0(x0) p(C0) + f1(x0) p(C1)]

Notice that to calculate p(C1|X = x0) we need to know the apriori probabilities p(C0) and p(C1). Since there are only two possible classes, if we know p(C1) we can always compute p(C0), because p(C0) = 1 - p(C1). The apriori probability p(C1) is the probability that an object belongs to C1 without any knowledge of the value of X associated with it. Bayes formula enables us to update this apriori probability to the aposteriori probability, the probability of the object belonging to C1 after knowing that its X value is x0.

When p(C1) = p(C0) = 0.5, the formula shows that p(C1|X = x0) > p(C0|X = x0) if f1(x0) > f0(x0). This means that if x0 is greater than a (Figure 1), and we classify the object as belonging to C1, we will make a smaller misclassification error than if we were to classify it as belonging to C0. Similarly, if x0 is less than a, and we classify the object as belonging to C0, we will make a smaller misclassification error than if we were to classify it as belonging to C1. If x0 is exactly equal to a, we have a 50% chance of making an error for either classification.

    Figure 2

What if the prior class probabilities were not the same (Figure 2)? Suppose C0 is twice as likely apriori as C1. Then the formula says that p(C1|X = x0) > p(C0|X = x0) if f1(x0) > 2 f0(x0).


The new boundary value, b, for classification will be to the right of a, as shown in Figure 2. This is intuitively what we would expect. If a class is more likely, we would expect the cut-off to move in a direction that would increase the range over which it is preferred.

In general, we will minimize the misclassification error rate if we classify a case as belonging to C1 if p(C1) f1(x0) > p(C0) f0(x0), and to C0 otherwise. This rule holds even when X is a vector consisting of several components, each of which is a random variable. In the remainder of this note we shall assume that X is a vector.
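A minimal sketch of this rule for a single predictor, assuming, purely for illustration, that f0 and f1 are known normal densities (in practice we would not know them):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def bayes_classify(x0, p1, f0, f1):
    """Minimum Error Bayes Rule: C1 iff p(C1) f1(x0) > p(C0) f0(x0)."""
    return 1 if p1 * f1(x0) > (1 - p1) * f0(x0) else 0

f0 = lambda x: normal_pdf(x, 0.0, 1.0)   # density of X in class C0
f1 = lambda x: normal_pdf(x, 2.0, 1.0)   # density of X in class C1

print(bayes_classify(1.2, p1=0.5, f0=f0, f1=f1))   # 1: boundary a is at 1.0
print(bayes_classify(1.2, p1=1/3, f0=f0, f1=f1))   # 0: boundary b has moved right
```

With equal priors the boundary sits midway between the two class means (the point a of Figure 1); making C0 twice as likely apriori requires f1(x0) > 2 f0(x0), which pushes the boundary to the right (the point b of Figure 2).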

An important advantage of Bayes Rule is that, as a by-product of classifying a case, we can compute the conditional probability that the case belongs to each class. This has two advantages.

First, we can use this probability as a score for each case that we are classifying. The score enables us to rank cases that we have predicted as belonging to a class in order of confidence that we have made a correct classification. This capability is important in developing a lift curve (explained later) that is important for many practical data mining applications.

Second, it enables us to compute the expected profit or loss for a given case. This gives us a better decision criterion than misclassification error when the loss due to error is different for the two classes.

3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion

In practice, we can estimate p(C1) and p(C0) from the data we are using to build the classifier by simply computing the proportion of cases that belong to each class. Of course, these are estimates and they can be incorrect, but if we have a large enough data set and neither class is very rare, our estimates will be reliable. Sometimes, we may be able to use public data such as census data to estimate these proportions. However, in most practical business settings we will not know f1(x) and f0(x). If we want to apply Bayes Rule we will need to estimate these density functions in some way. Many classification methods can be interpreted as being methods for estimating such density functions¹. In practice X will almost always be a vector. This complicates the task because of the curse of dimensionality - the difficulty and complexity of the calculations increases exponentially, not linearly, as the number of variables increases.
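To make the density-estimation idea concrete, here is a minimal sketch of the simplest such estimate, the relative-frequency histogram mentioned earlier; the sample values and bin settings are hypothetical.

```python
def histogram_density(values, bins, lo, hi):
    """Estimate a p.d.f. from training values: per-bin count / (n * bin width)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    n = len(values)
    def f(x):
        if not (lo <= x < hi):
            return 0.0
        return counts[int((x - lo) / width)] / (n * width)
    return f

# Hypothetical X values observed for class C1 in the training data
f1_hat = histogram_density([2.1, 2.4, 2.5, 3.0, 3.2, 1.8, 2.7, 2.2],
                           bins=4, lo=1.5, hi=3.5)
print(f1_hat(2.6))   # 0.5: estimated density of X near 2.6 within class C1
```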

To obtain an honest estimate of classification error, let us suppose that we have partitioned a data set into training and validation data sets by random selection of cases. Let us assume that we have constructed a classifier using the training data. When we apply it to the validation data, we will classify each case into C0 or C1. These classifications can be displayed in what is known as a confusion table, with rows and columns corresponding to the true and predicted classes respectively. (Although we can summarize our results in a confusion table for training data as well, the resulting confusion table is not useful for getting an honest estimate of the misclassification rate due to the danger of overfitting.)

¹ There are classifiers that focus on simply finding the boundary between the regions to predict each class without being concerned with estimating the density of cases within each region. For example, Support Vector Machine classifiers have this characteristic.

Confusion Table (Validation Cases)

                      Predicted Class
True Class   C0                                  C1
C0           True Negatives (number of           False Positives (number of cases
             correctly classified cases          incorrectly classified as C1
             that belong to C0)                  that belong to C0)
C1           False Negatives (number of cases    True Positives (number of
             incorrectly classified as C0        correctly classified cases
             that belong to C1)                  that belong to C1)

If we denote the number in the cell at row i and column j by Nij, the estimated misclassification rate Err = (N01 + N10)/Nval, where Nval = (N00 + N01 + N10 + N11), the total number of cases in the validation data set. If Nval is reasonably large, our estimate of the misclassification rate is probably reasonably accurate. We can compute a confidence interval using the standard formula for estimating a population proportion from a random sample.
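A minimal sketch of this computation; the confusion-table counts are hypothetical, and z = 2.576 is the usual normal value for 99% confidence.

```python
from math import sqrt

def misclassification_rate(n00, n01, n10, n11, z=2.576):
    """Err = (N01 + N10) / Nval, with a normal-approximation confidence interval."""
    n_val = n00 + n01 + n10 + n11
    err = (n01 + n10) / n_val
    half_width = z * sqrt(err * (1 - err) / n_val)
    return err, (err - half_width, err + half_width)

# Hypothetical validation results: 600 cases, 70 of them misclassified
err, ci = misclassification_rate(n00=480, n01=40, n10=30, n11=50)
print(round(err, 3), [round(c, 3) for c in ci])   # 0.117 [0.083, 0.15]
```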

Note that we are assuming that the cost (or benefit) of making correct classifications is zero. At first glance, this may seem incomplete. After all, the benefit (negative cost) of correctly classifying a buyer as a buyer would seem substantial. And, in other circumstances (e.g. scoring our classification algorithm to fresh data to implement our decisions), it will be appropriate to consider the actual net dollar impact of each possible classification (or misclassification). Here, however, we are attempting to assess the value of a classifier in terms of classification error, so it greatly simplifies matters if we can capture all cost/benefit information in the misclassification cells. So, instead of recording the benefit of correctly classifying a buyer, we record the cost of failing to classify him as a buyer. It amounts to the same thing and our goal becomes the minimization of costs, whether the costs are actual costs or foregone benefits (opportunity costs).

The table below gives an idea of how the accuracy of the estimate varies with Nval. The column headings are values of the misclassification rate, and the rows give the desired accuracy in estimating the misclassification rate, as measured by the half-width of the confidence interval at the 99% confidence level. For example, if we think that the true misclassification rate is likely to be around 0.05 and we want to be 99% confident that Err is within 0.01 of the true misclassification rate, we need to have a validation data set with 3,152 cases.

              Misclassification rate
Half-width    0.01    0.05    0.10    0.15    0.20    0.30    0.40    0.50
0.025          250     504     956   1,354   1,699   2,230   2,548   2,654
0.010          657   3,152   5,972   8,461  10,617  13,935  15,926  16,589
0.005        2,628  12,608  23,889  33,842  42,469  55,741  63,703  66,358
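Each entry follows from solving the half-width formula d = z * sqrt(p(1-p)/n) for n; a minimal sketch reproducing a few of them (assuming z = 2.576 for 99% confidence):

```python
from math import ceil

def n_required(p, half_width, z=2.576):
    """Validation cases needed so the interval around a misclassification
    rate near p has the given half-width: n = (z / half_width)^2 * p * (1 - p)."""
    return ceil((z / half_width) ** 2 * p * (1 - p))

print(n_required(0.05, 0.010))   # 3152, the example in the text
print(n_required(0.50, 0.005))   # 66358, the bottom-right entry
```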


3.1.4 Asymmetric Misclassification Costs and Bayes Risk

Up to this point we have been using the misclassification rate as the criterion for judging the efficacy of a classifier. However, there are circumstances when this measure is not appropriate. Sometimes the error of misclassifying a case belonging to one class is more serious than for the other class. For example, misclassifying a household as unlikely to respond to a sales offer when it belongs to the class that would respond incurs a greater opportunity cost than the converse error. In the former case, you are missing out on a sale worth perhaps tens or hundreds of dollars. In the latter, you are incurring the costs of mailing a letter to someone who will not purchase. In such a scenario, using the misclassification rate as a criterion can be misleading. Consider the situation where the sales offer is accepted by 1% of the households on a list. If a classifier simply classifies every household as a non-responder, it will have an error rate of only 1% but will be useless in practice. A classifier that misclassifies 30% of buying households as non-buyers and 2% of the non-buyers as buyers would have a higher error rate but would be better if the profit from a sale is substantially higher than the cost of sending out an offer. In these situations, if we have estimates of the cost of both types of misclassification, we can use the confusion table to compute the expected cost of misclassification for each case in the validation data. This enables us to compare different classifiers using overall expected costs as the criterion. However, it does not improve the actual classifications themselves. A better method is to change the classification rules (and hence the misclassification rates) to reflect the asymmetric costs. In fact, there is a Bayes classifier for this situation which gives rules that are optimal for minimizing the overall expected loss from misclassification (including both actual and opportunity costs). This classifier is known as the Bayes Risk Classifier, and the corresponding minimum expected cost of misclassification is known as the Bayes Risk. The Bayes Risk Classifier employs the following classification rule:

Classify a case as belonging to C1 if p(C1) f1(x0) C(0|1) > p(C0) f0(x0) C(1|0), and to C0 otherwise. Here C(0|1) is the cost of misclassifying a C1 case as belonging to C0, and C(1|0) is the cost of misclassifying a C0 case as belonging to C1. Note that the opportunity cost of correct classification for either class is zero. Notice also that this rule reduces to the Minimum Error Bayes Rule when C(0|1) = C(1|0).

Again, as we rarely know f1(x0) and f0(x0), we cannot construct this classifier in practice. Nonetheless, it provides us with an ideal that the various classifiers we construct for minimizing expected opportunity cost attempt to emulate.
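Still, the rule itself is easy to state in code. A minimal sketch, again with hypothetical normal densities standing in for the unknown f0 and f1, and with costs chosen to echo the mailing example above:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def bayes_risk_classify(x0, p1, f0, f1, c01, c10):
    """Bayes Risk Classifier: C1 iff p(C1) f1(x0) C(0|1) > p(C0) f0(x0) C(1|0).
    c01 = C(0|1), cost of calling a responder a non-responder (missed sale);
    c10 = C(1|0), cost of calling a non-responder a responder (wasted offer)."""
    return 1 if p1 * f1(x0) * c01 > (1 - p1) * f0(x0) * c10 else 0

f0 = lambda x: normal_pdf(x, 0.0, 1.0)   # non-responders
f1 = lambda x: normal_pdf(x, 2.0, 1.0)   # responders, with p(C1) = 0.01

print(bayes_risk_classify(2.5, 0.01, f0, f1, c01=1, c10=1))     # 0: rare class loses
print(bayes_risk_classify(2.5, 0.01, f0, f1, c01=100, c10=1))   # 1: costly missed sale
```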

3.1.5 Stratified Sampling and Asymmetric Costs

When classes are not present in roughly equal proportions, stratified sampling is often used to oversample the cases from the more rare class and improve the performance of classifiers. If a class occurs only rarely in the training set, the classifier will have little information to use in learning what distinguishes it from the other classes. The most commonly used weighted sampling scheme is to sample an equal number of cases from each class.

It is often the case that the more rare events are the more interesting or important ones - responders to a mailing, those who commit fraud, defaulters on debt, etc. - and hence the more costly to misclassify. Hence, after oversampling and training a model on a biased sample, two adjustments are required:

- Adjusting the responses for the biased sampling (e.g. if a class was over-represented in the training sample by a factor of 2, its predicted outcomes need to be divided by 2)

- Translating the results (in terms of numbers of responses) into expected gains or losses in a way that accounts for asymmetric costs.

    3.1.6 Generalization to More than Two Classes

All the comments made above about two-class classifiers extend readily to classification into more than two classes. Let us suppose we have k classes C0, C1, C2, ..., Ck-1. Then Bayes formula gives us:

p(Cj|X = x0) = fj(x0) p(Cj) / [f0(x0) p(C0) + f1(x0) p(C1) + ... + fk-1(x0) p(Ck-1)]

The Bayes Rule for Minimum Error is to classify a case as belonging to Cj if

p(Cj) fj(x0) >= max{ p(Ci) fi(x0) : i = 0, 1, ..., k-1 }.

The confusion table has k rows and k columns. The misclassification cost associated with the diagonal cells is, of course, always zero. If the costs are asymmetric, the Bayes Risk Classifier follows the rule: classify a case as belonging to Cj if

p(Cj) fj(x0) C(j̄|j) >= max{ p(Ci) fi(x0) C(ī|i) : i ≠ j },

where C(j̄|j) is the cost of misclassifying a case that belongs to Cj to any other class Ci, i ≠ j.
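A minimal sketch of the k-class Minimum Error rule (the three normal densities and their means are illustrative assumptions; the cost-weighted version simply multiplies each score by its C term):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def bayes_classify_k(x0, priors, densities):
    """Minimum Error Bayes Rule for k classes: pick j maximizing p(Cj) fj(x0)."""
    scores = [p * f(x0) for p, f in zip(priors, densities)]
    return max(range(len(scores)), key=scores.__getitem__)

# Three illustrative classes with means 0, 2 and 4 and equal priors
densities = [lambda x, m=m: normal_pdf(x, m, 1.0) for m in (0.0, 2.0, 4.0)]
print(bayes_classify_k(2.8, [1/3, 1/3, 1/3], densities))   # 1, i.e. class C1
```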

    3.1.7 Lift Charts

Often in practice, misclassification costs are not known accurately, and decision makers would like to examine a range of possible costs. In such cases, when the classifier gives a probability of belonging to each class and not just a binary classification to C1 or C0, we can use a very useful device known as the lift curve, also called a gains curve or gains chart. The lift curve is a popular technique in direct marketing, and one useful way to think of a lift curve is to consider a data mining model that attempts to identify the likely responders to a mailing by assigning each case a "probability of responding" score. The lift curve helps us determine how effectively we can "skim the cream" by selecting a relatively small number of cases and getting a relatively large portion of the responders. The input required to construct a lift curve is a validation data set that has been scored by appending to each case the estimated probability that it will belong to a given class.


    3.1.8 Example: Boston Housing (Two classes)

Let us fit a logistic regression model to the Boston Housing data. (We will cover logistic regression in detail later; for now, think of it as like linear regression, except that the outcome variable being predicted is binary.) We fit a logistic regression model to the training data (304 randomly selected cases) with all the 13 variables available in the data set as predictor variables, and with the binary variable CAT.MEDV (1 = median value >= $30,000, 0 = median value < $30,000) as the outcome variable; in the tables below this outcome is labeled HICLASS.


The same 30 cases are shown below, sorted in descending order of the predicted probability of being a HICLASS = 1 case.

Case   Predicted Log-odds   Predicted Prob.   Actual Value
       of Success           of Success        of HICLASS
 22        8.9016              0.9999              1
  5        4.5273              0.9893              1
 14        4.4472              0.9884              1
 16        3.6381              0.9744              1
  1        3.5993              0.9734              1
 15        3.5294              0.9715              1
 30        1.7257              0.8489              1
  3        0.4061              0.6002              0
 23        0.0874              0.5218              0
 18       -0.0402              0.4900              0
  8       -1.1157              0.2468              0
  6       -1.2916              0.2156              0
 25       -1.9183              0.1281              1
 17       -2.6806              0.0641              0
  9       -4.3290              0.0130              0
 24       -6.0590              0.0023              1
  2       -6.5073              0.0015              0
 27       -9.6509              0.0001              0
 19      -10.0750              0.0000              0
 20      -10.2859              0.0000              0
 13      -13.1040              0.0000              0
 26      -13.2349              0.0000              0
 28      -13.4562              0.0000              0
 29      -13.9340              0.0000              0
  4      -14.2910              0.0000              0
 21      -14.6084              0.0000              0
 12      -19.8654              0.0000              0
 11      -21.6854              0.0000              0
 10      -24.5364              0.0000              0
  7      -37.6119              0.0000              0


First, we need to set a cutoff probability value, above which we will consider a case to be a "positive" or 1, and below which we will consider a case to be a "negative" or 0. For any given cutoff level, we can use the sorted table to compute a confusion table. For example, if we use a cutoff probability level of 0.400, we will predict 10 positives (7 true positives and 3 false positives); we will also predict 20 negatives (18 true negatives and 2 false negatives). For each cutoff level, we can calculate the appropriate confusion table. Instead of looking at a large number of confusion tables, it is much more convenient to look at the cumulative lift curve (sometimes called a gains chart), which summarizes all the information in these multiple confusion tables into a graph. The graph is constructed with the cumulative number of cases (in descending order of probability) on the x axis and the cumulative number of true positives on the y axis, as shown below.

Probability   Predicted Prob.   Actual Value   Cumulative Actual
Rank          of Success        of HICLASS     Value
  1              0.9999             1               1
  2              0.9893             1               2
  3              0.9884             1               3
  4              0.9744             1               4
  5              0.9734             1               5
  6              0.9715             1               6
  7              0.8489             1               7
  8              0.6002             0               7
  9              0.5218             0               7
 10              0.4900             0               7
 11              0.2468             0               7
 12              0.2156             0               7
 13              0.1281             1               8
 14              0.0641             0               8
 15              0.0130             0               8
 16              0.0023             1               9
 17              0.0015             0               9
 18              0.0001             0               9
 19              0.0000             0               9
 20              0.0000             0               9
 21              0.0000             0               9
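A minimal sketch of the bookkeeping behind both the cutoff-based confusion counts and the cumulative lift curve, using the scored validation cases tabulated above (only the predicted probability and the actual value are needed):

```python
def confusion_counts(scored, cutoff):
    """Counts (TP, FP, FN, TN) for a given cutoff probability.
    `scored` is a list of (predicted probability, actual class) pairs."""
    tp = sum(1 for p, y in scored if p >= cutoff and y == 1)
    fp = sum(1 for p, y in scored if p >= cutoff and y == 0)
    fn = sum(1 for p, y in scored if p < cutoff and y == 1)
    tn = sum(1 for p, y in scored if p < cutoff and y == 0)
    return tp, fp, fn, tn

def lift_points(scored):
    """(rank, cumulative actual) points for the cumulative lift curve,
    after sorting cases by predicted probability, descending."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, cum = [], 0
    for rank, (_prob, actual) in enumerate(ranked, start=1):
        cum += actual
        points.append((rank, cum))
    return points

# The 30 scored validation cases from the sorted table above
scored = [(0.9999, 1), (0.9893, 1), (0.9884, 1), (0.9744, 1), (0.9734, 1),
          (0.9715, 1), (0.8489, 1), (0.6002, 0), (0.5218, 0), (0.4900, 0),
          (0.2468, 0), (0.2156, 0), (0.1281, 1), (0.0641, 0), (0.0130, 0),
          (0.0023, 1), (0.0015, 0), (0.0001, 0)] + [(0.0, 0)] * 12

print(confusion_counts(scored, cutoff=0.400))   # (7, 3, 2, 18), as in the text
print(lift_points(scored)[:8])   # [(1, 1), (2, 2), ..., (7, 7), (8, 7)]
```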

