CSC550Z: Data Mining for Business Intelligence
Week 1: Introduction to Data Mining
Tuan Tran, Ph.D.
Assistant Professor
College of Information and Computer Technology
Sullivan University
Email: [email protected]
October 27, 2016
Tuan Tran, Ph.D. (Sullivan University) CSC550Z: Data Mining for Business Intelligence October 27, 2016 1 / 37
Objectives
Understand data mining concepts
Understand steps in a data mining project
Identify types of variables and understand data pre-processing
Know how to normalize data
Understand the issue of overfitting and how to address it
Learn the steps in constructing a data mining model using XLMiner
Know how to read and interpret output
Outline
1 Data Mining Concepts
2 The Process of Data Mining
3 Pre-processing Data
4 Data Mining Example
5 Conclusion
Data Mining Concepts
Why Mine Data?
Massive data is collected and warehoused
    Web data, e-commerce
    Purchases at grocery stores
    Bank/credit card transactions
Computing systems are cheaper and more powerful
Competitive pressure is strong
    Provide better customized services
What is Data Mining?
There are many definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Core Ideas in Data Mining
Consisting of several aspects
Classification: Basic form of data analysis (purchase/no purchase)
Prediction: Predict value of a numerical variable (amount of purchase)
Association Rules: What goes with what
Data Reduction: Transform into simpler data
Data Exploration: Review and examine data
Visualization: Graphical analysis
Supervised Learning
Goal: Predict a single “target” or “outcome” variable
    Training data: target value is known
    New data: target value is not known; the model is used to score it
Methods: Classification and Prediction
    Example: Classify emails (spam/no-spam), predict house value, etc.
Source: www.astroml.org
Unsupervised Learning
Goal: Segment data into meaningful segments; detect patterns
    Draw inferences from datasets consisting of input data without labeled responses
    There is no target (outcome) variable to predict or classify
Methods: Association rules, data reduction & exploration, visualization
Example: shopping basket, clustering, etc.
Supervised Learning: Classification
Goal: Predict categorical target (outcome) variable
Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
Each row is a case/instance/record (customer, tax return, applicant)
Each column is a variable (attribute, feature)
Target types: Target variable is often binary (yes/no)
Supervised Learning: Prediction
Goal: Predict numerical target (outcome) variable
Examples: sales, revenue, performance
Each row is a case/instance/record (customer, tax return, applicant)
Each column is a variable (attribute, feature)
Classification and prediction together constitute “predictive analytics”
Unsupervised Learning: Association Rules
Goal: Produce rules that define “what goes with what”
Examples: “If X was purchased, Y was also purchased”
    Rows are transactions
    Each column is a variable (attribute, feature)
Used in recommender systems - “Our records show you bought X, you may also like Y”
Also called “affinity analysis”
Unsupervised Learning: Association Rules
How Can We Quantify It?
Dataset
Transaction    Items
12345          A, B, C
12346          A, C
12347          A, D
12348          B, E, F
Support
Itemset    Support
A          75%
B          50%
C          50%
A, C       50%
Parameters: minimum support (50%) and minimum confidence (e.g., 50%)
Rule: A → C
    support(A, C) = 50%
    confidence(A → C) = support(A, C) / support(A) = 50% / 75% = 66.7%
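The support and confidence figures on this slide can be reproduced with a few lines of Python. This is only a sketch: the function names `support` and `confidence` are chosen here for illustration and are not part of XLMiner.

```python
# Transactions from the example dataset (transaction ID -> set of items).
transactions = {
    12345: {"A", "B", "C"},
    12346: {"A", "C"},
    12347: {"A", "D"},
    12348: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"A"}, transactions))                      # 0.75
print(support({"A", "C"}, transactions))                 # 0.5
print(round(confidence({"A"}, {"C"}, transactions), 3))  # 0.667
```

The rule A → C passes both thresholds: its support (50%) and its confidence (about 66.7%) are each at least 50%.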
Unsupervised Learning: Data Reduction
Goal: Distillation of complex/large data into simpler/smaller data
Reducing the number of variables/columns (e.g., principal components)
Reducing the number of records/rows (e.g., clustering)
Unsupervised Learning: Data Visualization
Goal: Graphs and plots of data for visualization
Histograms, boxplots, bar charts, scatterplots
Especially useful to examine relationships between pairs of variables
Graphical network of my social network links
Unsupervised Learning: Data Exploration
Data sets are typically large, complex & messy (dirty)
Need to review the data to help refine the task
Use techniques of Reduction and Visualization
The Process of Data Mining
Steps in Data Mining
Consisting of several steps
1 Understanding: Define purpose
2 Data Collection: Obtain data (random sampling)
3 Cleaning: Explore, pre-process data
4 Data Reduction: Reduce the data; if supervised DM, partition it
5 Identify Task: Specify task (classification, clustering, etc.)
6 Technique Selection: Choose the techniques (CART, NN, etc.)
7 Parameter Tuning: Iterative implementation and “tuning”
8 Evaluation: Assess & compare models
9 Deployment: Deploy best model
Obtaining Data: Sampling
Why Sampling?
    Massive datasets with correlated instances
    Software limitations, e.g., XLMiner limits the “training” partition to 10,000 records
How to Sample?
    Use the built-in sampling functions of the software
What Is the Size of the Sample?
    Data mining typically deals with huge datasets
    The sample should be large enough to produce statistically valid results
Source: http://labs.geog.uvic.ca/
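Outside XLMiner, a simple random sample can be drawn with Python's standard library. The sketch below is illustrative only: the record list is made up, the function name `simple_random_sample` is hypothetical, and the 10,000-record sample size is taken from the XLMiner limit mentioned on this slide.

```python
import random


def simple_random_sample(records, sample_size, seed=0):
    """Draw `sample_size` records without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-created
    return rng.sample(records, sample_size)


# Hypothetical data: record IDs standing in for the rows of a large table.
records = list(range(100_000))
training_sample = simple_random_sample(records, 10_000)
print(len(training_sample))  # 10000
```

Fixing the seed is a design choice that makes the sample reproducible, so different models can be compared on the same records.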
Rare Event Oversampling
Often the event of interest is rare
    Examples: response to mailing, fraud in taxes, ...
Sampling may yield too few “interesting” cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, the results need to be adjusted for the oversampling
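One way to oversample is to resample the rare cases with replacement until they match the number of common cases. The slide does not prescribe a particular procedure, so this is an assumption made for illustration. The data and the function name `oversample_rare` are hypothetical, and the later adjustment of results is not shown.

```python
import random


def oversample_rare(records, is_rare, seed=0):
    """Return a training set in which rare cases are resampled with
    replacement until they are as numerous as the common cases."""
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra


# Hypothetical data: response = 1 (rare) for 20 of 1000 mailing recipients.
records = [{"id": i, "response": 1 if i % 50 == 0 else 0} for i in range(1000)]
balanced = oversample_rare(records, lambda r: r["response"] == 1)
n_rare = sum(r["response"] for r in balanced)
print(len(balanced), n_rare)  # 1960 980
```

Only the training data should be balanced this way. Predicted rates from a model trained on the balanced set overstate the true frequency of the rare event, which is why the adjustment step is needed.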
Pre-processing Data
Types of Variables
Variable types determine the pre-processing needed and the algorithms that can be used
Main distinction: Categorical vs. numeric
    Numeric
        Continuous
        Integer
    Categorical
        Ordered (low, medium, high) (ordinal)
        Unordered (male, female) (nominal)
Variable Handling
Numeric
    Most algorithms in XLMiner can handle numeric data
    May occasionally need to “bin” into categories
Categorical
    Naive Bayes can use as-is
    In most other algorithms, must create binary dummies (number of dummies = number of categories - 1)
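The dummy-variable rule (number of dummies = number of categories - 1) can be sketched in plain Python. The column values and the function name `make_dummies` are hypothetical. Using the first category in sorted order as the reference level is an arbitrary choice made here.

```python
def make_dummies(values, prefix):
    """Encode a categorical column as (number of categories - 1) binary columns.

    The first category in sorted order is the reference level and gets no
    column of its own: it is represented by all dummies being 0.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the reference category
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical column with three categories -> two dummy columns.
income = ["low", "high", "medium", "low"]
for row in make_dummies(income, "income"):
    print(row)
```

Here "high" is the reference level, so a "high" record has both `income_low` and `income_medium` equal to 0. A third dummy would be redundant, since it is fully determined by the other two.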
Variable Handling
An outlier is an observation that is “extreme”, being distant from the rest of the data (definition of “distant” is deliberately vague)
Outliers can have disproportionate influence on models (a problem if it is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme
In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called “anomaly detection”
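Because “distant” has no fixed definition, any automated check needs a cutoff chosen by the analyst. The sketch below flags values more than two standard deviations from the mean. Both this rule and the threshold of 2 are assumptions made for illustration, and the data and the function name `flag_outliers` are hypothetical.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean. The threshold is a judgment call, not a fixed rule."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing is "distant"
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical measurements with one suspicious value.
measurements = [10, 12, 11, 13, 12, 11, 95]
print(flag_outliers(measurements))  # [95]
```

A flagged value is only a candidate. Domain knowledge is still needed to decide whether it is a data-entry error or a genuinely extreme case.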
Handling Missing Data
Most algorithms will not process records with missing values. Default is to drop those records.
Solution 1: Omission
    If a small number of records have missing values, can omit them
    If many records are missing values on a small set of variables, can drop those variables (or use proxies)
    If many records have missing values, omission is not practical
Solution 2: Imputation
    Replace missing values with reasonable substitutes
    Lets you keep the record and use the rest of its (non-missing) information
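As a small illustration of imputation, the sketch below replaces missing entries with the median of the observed values. The median is only one possible “reasonable substitute” and is an assumption here. The column values and the function name `impute_median` are hypothetical.

```python
import statistics


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical column with two missing values.
income = [42.0, None, 55.0, 38.0, None, 61.0]
print(impute_median(income))  # [42.0, 48.5, 55.0, 38.0, 48.5, 61.0]
```

Every record is kept, so the non-missing values in its other columns remain available to the model.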
Normalizing (Standardizing) Data
Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: subtract mean and divide by standard deviation (used in XLMiner)
    Mean of n samples $x_i$: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
    Standard deviation: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$
    Standardized samples: $x_i' = (x_i - \mu)/\sigma$
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
    Range: $r = x_{\max} - x_{\min}$
    Normalized samples: $x_i' = (x_i - x_{\min})/r$
    Useful when the data contain both dummy and numeric variables
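Both normalizing functions on this slide translate directly into code. The sketch below uses a made-up sample and hypothetical function names, and it assumes the values are not all identical (otherwise the standard deviation and the range are zero).

```python
import math


def standardize(xs):
    """z-scores: subtract the mean, divide by the standard deviation
    (computed with a divisor of n, as in the slide's formula)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]


def min_max_scale(xs):
    """Rescale to the 0-1 interval: subtract the minimum, divide by the range."""
    lo, hi = min(xs), max(xs)
    r = hi - lo
    return [(x - lo) / r for x in xs]


# Hypothetical sample with mean 5 and standard deviation 2.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(xs))    # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(min_max_scale(xs))  # smallest value maps to 0.0, largest to 1.0
```

Min-max scaling keeps 0/1 dummy variables unchanged, which is why it suits data that mix dummies with numeric variables.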
The Problem of Overfitting
Statistical models can produce highly complex explanations of relationships between variables
The “fit” may be excellent
When used with new data, models of great complexity do not do so well
100% fit - not useful for new data
The Problem of Overfitting (cont.)
Causes:
    Too many predictors
    A model with too many parameters
    Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
Source: http://pingax.com
Partitioning the Data
Problem: How well will our model perform with new data?
Solution: Separate data into two parts
    Training partition to develop the model
    Validation partition to implement the model and evaluate its performance on “new” data
Addresses the issue of overfitting
Test Partition
When a model is developed on training data, it can overfit the training data (hence the need to assess it on validation data)
Assessing multiple models on the same validation data can overfit the validation data
Some methods use the validation data to choose a parameter. This too can lead to overfitting the validation data
Solution: the final selected model is applied to a test partition to give an unbiased estimate of its performance on new data
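A three-way partition can be sketched as a shuffle followed by slicing. The 50/30/20 proportions, the record IDs, and the function name `partition` are assumptions chosen for illustration. XLMiner provides its own partitioning dialog, which is not reproduced here.

```python
import random


def partition(records, train=0.5, validation=0.3, seed=0):
    """Shuffle records and split them into training, validation and test
    partitions; the test partition receives whatever remains."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical record IDs.
records = list(range(1000))
train_set, valid_set, test_set = partition(records)
print(len(train_set), len(valid_set), len(test_set))  # 500 300 200
```

The test partition should be used only once, after model selection is finished, so that its error estimate is not influenced by the choices made on the validation data.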
Data Mining Example
Example: MLR for Predicting Boston Housing (XLMiner)
This example shows the data mining steps for predicting house values in the Boston area
    Use the BostonHousing.xls dataset
    Construct a multiple linear regression (MLR) model to predict house values based on house features
    Interpret the output of the model
Sample of the dataset
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Description of variables
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Partitioning the data
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Using XLMiner for Multiple Linear Regression
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Specifying Output
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Prediction of Training Data
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Prediction of Validation Data
Example: MLR for Predicting Boston Housing (XLMiner) (cont.)
Summary of errors
Error = actual - predicted
RMS = Root-mean-squared error (square root of the average squared error)
In this example, the sizes of the training and validation sets differ, so only RMS Error and Average Error are comparable
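The two comparable error measures can be computed as follows. The actual and predicted values are made up for illustration and are not XLMiner output, and the function name `error_summary` is hypothetical.

```python
import math


def error_summary(actual, predicted):
    """Average error and RMS error, with error = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    average_error = sum(errors) / n
    rms_error = math.sqrt(sum(e * e for e in errors) / n)
    return average_error, rms_error


# Hypothetical house values and model predictions.
actual = [24.0, 21.5, 34.5, 33.0]
predicted = [26.0, 23.5, 31.5, 30.0]
avg, rms = error_summary(actual, predicted)
print(avg, round(rms, 3))  # 0.5 2.55
```

Both measures divide by the number of records, which is why they can be compared across partitions of different sizes, whereas a total such as the sum of squared errors grows with the partition size.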
Using Excel and XLMiner for Data Mining
Excel is limited in data capacity
However, the training and validation of DM models can be handled within the modest limits of Excel and XLMiner
Models can then be used to score larger databases
XLMiner has functions for interacting with various databases (taking samples from a database, and scoring a database from a developed model)
Conclusion
Summary
Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
Before algorithms can be applied, data must be characterized and pre-processed
To evaluate performance and to avoid overfitting, data partitioning is used
Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database