Introduction to Machine Learning & Data Mining
Jennifer Neville Purdue University
May 24, 2016
Slides: https://www.cs.purdue.edu/homes/neville/soi-dswksp-neville.pdf
Data: https://www.cs.purdue.edu/homes/neville/iris.dat
Data mining
The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
(Fayyad, Piatetsky-Shapiro & Smyth 1996)
Databases
Artificial Intelligence
Visualization
Statistics
Example
During WWII, statistician Abraham Wald was asked to help the British decide where to add armor to their planes. Returning planes showed bullet holes concentrated in particular areas; Wald's insight was to armor the areas without holes, because planes hit there never made it back (a classic example of survivorship bias in data selection)
The data revolution
The last 35 years of research in ML/DM has resulted in widespread adoption of predictive analytics to automate and improve decision making. As "big data" efforts increase the collection of data, so will the need for new data science methodology. Data today have more volume, velocity, variety, etc. Machine learning research develops statistical tools, models, and algorithms that address these complexities. Data mining research focuses on how to scale to massive data and how to incorporate feedback to improve accuracy while minimizing effort.
[Figure: the KDD process: Data → (Selection) → Target Data → (Preprocessing) → Processed Data → (Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]
The data mining process
Machine Learning
Overview
• Task specification
• Data representation
• Knowledge representation
• Learning technique
• Search + scoring
• Prediction and/or interpretation
Task specification
• Objective of the person who is analyzing the data
• Description of the characteristics of the analysis and desired result
• Examples:
• From a set of labeled examples, devise an understandable model that will accurately predict whether a stockbroker will commit fraud in the near future.
• From a set of unlabeled examples, cluster stockbrokers into a set of homogeneous groups based on their demographic information
Exploratory data analysis
• Goal
• Interact with data without clear objective
• Techniques
• Visualization, ad hoc modeling
Descriptive modeling
• Goal
• Summarize the data or the underlying generative process
• Techniques
• Density estimation, cluster analysis and segmentation
[Figure: relational data schema for the securities domain: entities Branch (Bn), Firm, Disclosure, and Broker (Bk), with attributes such as Region, Area, Size, On Watchlist, Layoffs, Year, Type, Is Problem, Problem In Past, and Has Business; Bk and Bn nodes are linked in a broker-branch graph]
Also known as: unsupervised learning
Predictive modeling
• Goal
• Learn model to predict unknown class label values given observed attribute values
• Techniques
• Classification, regression
[Figure: decision tree over relational broker features: internal nodes test features such as Broker Age > 27, Current CoWorker Count > 8, Current Firm Avg(Size) > 12, Current Branch Mode(Location) = NY, Disclosure Count(Type=CC) > 0; leaves carry class counts]
Also known as: supervised learning
Pattern discovery
• Goal
• Detect patterns and rules that describe sets of examples
• Techniques
• Association rules, graph mining, anomaly detection
[Figure: scatter of + and - examples; a local pattern covers a small subset of the + points]
Model: global summary of a data set. Pattern: local to a subset of the data.
Data representation
• Choice of data structure for representing individual and collections of measurements
• Individual measurements: single observations (e.g., person’s date of birth, product price)
• Collections of measurements: sets of observations that describe an instance (e.g., person, product)
• Choice of representation determines applicability of algorithms and can impact modeling effectiveness
• Additional issues: data sampling, data cleaning, feature construction
Tabular data
Fraud Age Degree StartYr Series7
+ 22 Y 2005 N
- 25 N 2003 Y
- 31 Y 1995 Y
- 27 Y 1999 Y
+ 24 N 2006 N
- 29 N 2003 N
N instances × p attributes
Temporal data
[Figure: time series, 2004-2007, of counts (0-100) for Region 1 and Region 2]
Relational/structured data
659,000 brokers, 171,000 branches, 5,100 firms, 400,000 disclosures
Knowledge representation
• Underlying structure of the model or patterns that we seek from the data
• Specifies the models/patterns that could be returned as the results of the data mining algorithm
• Defines space of possible models/patterns for algorithm to search over
• Examples:
• If-then rule: If short closed car then toxic chemicals
• Conditional probability distribution: P( fraud | age, degree, series7, startYr )
• Decision tree
• Regression model
Linear equation
Generic form is: y = β1·x1 + β2·x2 + ... + β0
An example for the Alzheimer's data would be: CDR = 0.12·MM1 + 0.34·SBScore + ... − 0.34
Rules
• Examples:
IF (cheese) THEN chocolate [Conf=0.10, Supp=0.04]
IF (cheese & Sterno) THEN chocolate [Conf=0.95, Supp=0.01]
• These can be written as probabilities:
P(chocolate | cheese) = 0.10
P(chocolate | cheese, Sterno) = 0.95
• As rules get more specific (by including additional attributes), support decreases, but confidence may increase.
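A minimal sketch of how these two quantities could be computed directly: support = P(antecedent & consequent), confidence = P(consequent | antecedent). The tx data frame and its 0/1 items below are invented for illustration:

# hypothetical 0/1 transaction data: one row per basket (invented for illustration)
tx <- data.frame(cheese    = c(1,1,1,0,1,0,1,0,1,1),
                 sterno    = c(0,1,0,0,1,0,0,0,1,0),
                 chocolate = c(0,1,0,1,1,0,0,0,1,1))

# support = fraction of baskets with antecedent AND consequent
# confidence = P(consequent | antecedent)
ruleStats <- function(d, lhs, rhs) {
  ante <- rowSums(d[, lhs, drop=FALSE]) == length(lhs)  # baskets containing the antecedent
  both <- ante & (d[[rhs]] == 1)                        # ... and the consequent
  c(support = mean(both), confidence = sum(both) / sum(ante))
}

ruleStats(tx, "cheese", "chocolate")              # IF (cheese) THEN chocolate
ruleStats(tx, c("cheese", "sterno"), "chocolate") # IF (cheese & Sterno) THEN chocolate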
Learning technique
• Method to construct model or patterns from data
• Model space
• Choice of knowledge representation defines a set of possible models or patterns
• Scoring function
• Associates a numerical value (score) with each member of the set of models/patterns
• Search technique
• Defines a method for generating members of the set of models/patterns, determining their score, and identifying the ones with the “best” score
Scoring function
• A numeric score assigned to each possible model in a search space, given a reference/input dataset
• Used to judge the quality of a particular model for the domain
• Score functions are statistics: estimates of a population parameter based on a sample of data
• Examples:
• Misclassification
• Squared error
• Likelihood
Example learning problem
Task: Devise a rule to classify items based on the attribute X
[Figure: density of X (density.default, N = 500, bandwidth = 0.6062), with + and - class regions]
Knowledge representation: If-then rules
Example rule: If x > 25 then + Else -
What is the model space?
All possible thresholds
What score function?
Prediction error rate
Score function over model space
[Figure: % prediction errors vs. threshold, computed on the full data, a small sample, a large sample, and a biased sample]
Search procedure?
Try all thresholds, select one with lowest score
Note: learning result depends on data
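A sketch of this procedure on simulated data (the sample below is synthetic, not the workshop data, so the learned threshold will differ from the x > 25 rule above):

# simulate a 1-D classification problem: "-" centered at 15, "+" centered at 30
set.seed(1)
x <- c(rnorm(250, mean=15, sd=4), rnorm(250, mean=30, sd=4))
y <- c(rep("-", 250), rep("+", 250))

# score function: % prediction errors of the rule "If x > t then + else -"
errorRate <- function(t) mean(ifelse(x > t, "+", "-") != y)

# search procedure: try all candidate thresholds, keep the lowest-scoring one
thresholds <- seq(min(x), max(x), length.out=200)
errs <- sapply(thresholds, errorRate)
thresholds[which.min(errs)]  # learned threshold (depends on the sample drawn)
min(errs)                    # training error of the learned rule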
Inference and interpretation
• Prediction technique
• Method to apply learned model to new data for prediction/analysis
• Only applicable for predictive and some descriptive models
• Prediction is often used during learning (i.e., search) to determine value of scoring function
• Interpretation of results
• Objective: significance measures, hypothesis tests
• Subjective: importance, interestingness, novelty
Example application
Real-world example:
659,000 brokers, 171,000 branches, 5,100 firms, 400,000 disclosures
[Figure: decision tree learned from the NASD data: internal nodes test relational broker features (e.g., Broker Age > 27, Disclosure Count(Type=CC) > 0, Past Firm Max(Size) > 1000); leaves carry class counts]
[Figure: brokers identified by the NASD rules only, by the relational models only, by both, or by neither]
Performance of NASD models
"One broker I was highly confident in ranking as 5…
Not only did I have the pleasure of meeting him at a shady warehouse location, I also negotiated his bar from the industry... This person actually used investors' funds
to pay for personal expenses including his trip to attend a NASD compliance conference!
…If the model predicted this person, it would be right on target."
Data mining process
Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Knowledge Discovery in Databases: Process
[Figure: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]
adapted from: U. Fayyad, G. Piatetsky-Shapiro & P. Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Steps in the data mining process (the NASD example):
• Use public data from NASD BrokerCheck
• Extract data about small firms in a few geographic locations
• Create class label, temporal features
• Learn decision trees, output predictions and tree structure
• Evaluate objectively on historical data, subjectively with fraud analysts
Data mining process
1. Application setup:
• Acquire relevant domain knowledge
• Assess user goals
2. Data selection
• Choose data sources
• Identify relevant attributes
• Sample data
3. Data preprocessing
• Remove noise or outliers
• Handle missing values
• Account for time or other changes
4. Data transformation
• Find useful features
• Reduce dimensionality
Data mining process
5. Data mining:
• Choose task (e.g., classification, regression, clustering)
• Choose algorithms for learning and inference
• Set parameters
• Apply algorithms to search for patterns of interest
6. Interpretation/evaluation
• Assess accuracy of model/results
• Interpret model for end-users
• Consolidate knowledge
7. Repeat...
Data selection
# download data from:
# https://www.cs.purdue.edu/homes/neville/iris.dat

# read in data
d <- read.table(file='iris.dat', sep=',', header=TRUE)
summary(d)

# histogram
hist(d[,2], main='Histogram of Sepal Width', xlab='Sepal Width')

# scatterplot
plot(d[,c(1,3)], xlab='Sepal Length', ylab='Petal Length')
plot(d)

# feature construction
SepalArea <- d[,1] * d[,2]
PetalArea <- d[,3] * d[,4]

# add new features to data frame
d2 <- cbind(d, SepalArea, PetalArea)

# boxplot
boxplot(SepalLength ~ Class, data=d2, ylab='Sepal Length')
boxplot(SepalArea ~ Class, data=d2, ylab='Sepal Area')

# correlation
cor(d2$SepalLength, d2$SepalArea)
Data transformation
Feature construction
• Create new attributes that can capture the important information in the data more efficiently than the original attributes
• General methodologies:
• Attribute extraction (domain specific)
• Attribute transformations, i.e., mapping data to new space (e.g., PCA)
• Attribute combinations
Feature selection
• Select the most “relevant” subset of attributes
• May improve performance of algorithms by reducing overfitting
• Improves domain understanding
• Fewer resources needed to collect/use a smaller number of features
• Wrapper approach
• Features are selected as part of the mining algorithms
• Filter approach
• Features are selected before mining
Wrapper approach
• Consider all subsets of features (2^p subsets if there are p features); features are selected according to a particular model score function (e.g., classification accuracy)
• Search over all subsets and find smallest set of features such that “best” model score does not significantly change
• For large p, exhaustive search is intractable so heuristic search is often used
• Examples:
• Forward, greedy selection — start with empty set and add features one at a time until no more improvement
• Backward, greedy removal — start with full set and remove features one at a time until no further improvements
• Interleave forward and backward selection
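A sketch of the forward, greedy variant, using an rpart tree's training accuracy as the (assumed) model score; a held-out set or cross-validation score would be better in practice:

library(rpart)

# forward, greedy wrapper selection over a set of candidate features
forwardSelect <- function(data, target, features) {
  score <- function(feats) {         # model score: training accuracy of a tree
    m <- rpart(reformulate(feats, response=target), data=data, method="class")
    mean(predict(m, data, type="class") == data[[target]])
  }
  selected <- character(0); best <- 0
  repeat {
    cand <- setdiff(features, selected)
    if (length(cand) == 0) break
    scores <- sapply(cand, function(f) score(c(selected, f)))
    if (max(scores) <= best) break   # stop when no addition improves the score
    best <- max(scores)
    selected <- c(selected, cand[which.max(scores)])
  }
  list(features=selected, score=best)
}

# e.g., on the iris data frame built earlier:
# forwardSelect(d2, "Class", c("SepalLength","SepalWidth","PetalLength","PetalWidth"))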
Filter approach
• Select “useful” features before mining, using a score that measures a feature’s utility separate from model learning
• Find features on the basis of statistical properties such as association with the class label
• Example:
• Calculate correlation of features with target class label, order all features by their score, choose features with top k scores
• Other scores: Chi-square, information gain, etc.
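A sketch of a correlation filter, assuming numeric features and a class label that can be recoded numerically:

# filter selection: score each feature by |correlation| with the class label,
# computed before any model is learned, then keep the top k
filterSelect <- function(data, target, features, k) {
  y <- as.numeric(as.factor(data[[target]]))   # recode class label as numeric
  scores <- sapply(features, function(f) abs(cor(data[[f]], y)))
  names(sort(scores, decreasing=TRUE))[1:k]
}

# e.g., filterSelect(d2, "Class", c("SepalLength","SepalWidth","PetalLength","PetalWidth"), k=2)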
Dimensionality reduction
• Identify and describe the “dimensions” that underlie the data
• These may be more fundamental than the directly measured attributes, but hidden from the user
• Reduce dimensionality of modeling problem
• Benefit is simplification: it reduces the number of variables you have to deal with in modeling
• Can identify set of variables with similar behavior
Principal component analysis (PCA)
• High-level approach, given data matrix D with p dimensions:
• Preprocess D so that the mean of each attribute is 0; call this matrix X
• Compute the p×p covariance matrix: Σ = XᵀX
• Compute eigenvectors/eigenvalues of the covariance matrix: A Σ A⁻¹ = Λ, where A is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues
• The eigenvectors are the principal component vectors (a p×p matrix), where each is a p×1 column vector of projection weights; a1 is the 1st principal component (top eigenvector)
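These steps can be written out directly on the iris measurements read in earlier (d); this sketch mirrors princomp up to sign conventions and the divisor used in the covariance:

# PCA by hand on the four iris measurements
X <- scale(as.matrix(d[,1:4]), center=TRUE, scale=FALSE)  # mean-center each attribute
S <- t(X) %*% X                                           # p x p (unnormalized) covariance
e <- eigen(S)                                             # eigendecomposition of the covariance
A <- e$vectors                                            # columns a1..ap = principal components
e$values / sum(e$values)                                  # proportion of variance per component
Xproj <- X %*% A[, 1:2]                                   # project data onto first 2 components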
Applying PCA
• New data vectors are formed by projecting the data onto the first few principal components (i.e., top k eigenvectors)
x = [x1, x2, ..., xp] (original instance)
A = [a1, a2, ..., ap] (principal components)
x'1 = a1·x = Σ_{j=1..p} a1j·xj
...
x'm = am·x = Σ_{j=1..p} amj·xj, for m < p
x' = [x'1, x'2, ..., x'm] (transformed instance)
If m = p, the data is transformed without loss; if m < p, the transformation is lossy and dimensionality is reduced
Applying PCA (cont’)
• Goal: Find a new (smaller) set of dimensions that captures most of the variability of the data
• Use scree plot to choose number of dimensions
• Choose m < p so projected data captures much of the variance of original data
[Figure: scree plot, variance vs. component index]
Example: Eigenfaces
[Figure: the first 40 PCA dimensions (basis images)]
PCA applied to images of human faces. Reduce dimensionality to set of basis images.
All other images are linear combinations of these "eigenpictures".
Used for facial recognition.
PCA example on Iris data
> pcdat <- princomp(d[,1:4])
> summary(pcdat)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0485788 0.49053911 0.27928554 0.153379074
Proportion of Variance 0.9246162 0.05301557 0.01718514 0.005183085
Cumulative Proportion  0.9246162 0.97763178 0.99481691 1.000000000
> plot(pcdat)
> loadings(pcdat)
Loadings:
   Comp.1 Comp.2 Comp.3 Comp.4
V1  0.362 -0.657 -0.581  0.317
V2        -0.730  0.596 -0.324
V3  0.857  0.176        -0.480
V4  0.359         0.549  0.751

               Comp.1 Comp.2 Comp.3 Comp.4
SS loadings      1.00   1.00   1.00   1.00
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00

[Figure: bar plot of component variances from plot(pcdat)]
First component explains 92% of data variance
Choose m=1
PCA example on Iris data
# transform data using 1st component
pcdat$loadings[,1]
         V1          V2          V3          V4
 0.36158968 -0.08226889  0.85657211  0.35884393

pcaF <- as.matrix(d[,1:4]) %*% pcdat$loadings[,1]
d3 <- cbind(d2, pcaF)
m = 1: transform data to one dimension, x'1 = a1·x (projection onto the first principal component)
[Figure: class histograms for Iris-setosa and Iris-virginica, on the original data (1st variable, sepal length 4.5-8.0) and on the transformed data (1st principal component)]
Machine learning
Descriptive vs. predictive modeling
• Descriptive models summarize the data
• Provide insights into the domain
• Focus on modeling joint distribution P(X)
• May be used for classification, but prediction is not the primary goal
• Predictive models predict the value of one variable of interest given known values of other variables
• Focus on modeling the conditional distribution P( Y | X ) or on modeling the decision boundary for Y
Learning predictive models
• Choose a data representation
• Select a knowledge representation (a “model”)
• Defines a space of possible models M={M1, M2, ..., Mk}
• Use search to identify “best” model(s)
• Search the space of models (i.e., with alternative structures and/or parameters)
• Evaluate possible models with scoring function to determine the model which best fits the data
Knowledge representation
• Underlying structure of the model or patterns that we seek from the data
• Defines space of possible models for algorithm to search over
• Model: high-level global description of dataset
• “All models are wrong, some models are useful” G. Box and N. Draper (1987)
• Choice of model family determines space of parameters and structure
• Estimate model parameters and possibly model structure from training data
[Figure: example classification tree over relational broker features (as shown earlier), with class counts at the leaves]
Classification tree
Model space: all possible decision trees
Scoring functions
• Given a model M and dataset D, we would like to “score” model M with respect to D
• Goal is to rank the models in terms of their utility (for capturing D) and choose the “best” model
• Score function can be used to search over parameters and/or model structure
• Score functions can be different for:
• Models vs. patterns
• Predictive vs. descriptive functions
• Models with varying complexity (i.e., number of parameters)
Predictive scoring functions
• Assess the quality of predictions for a set of instances
• Measures difference between the prediction M makes for an instance i and the true class label value of i
S(M) = Σ_{i=1..Ntest} d( f(x(i); M), y(i) )
where f(x(i); M) is the predicted class label for item i, y(i) is the true class label for item i, d measures the distance between predicted and true labels, and the sum is over test examples.
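For 0/1 (misclassification) distance, this score is just a count of the errors; a one-line sketch:

# misclassification score: d(f(x), y) = 1 if predicted != true, else 0
scoreModel <- function(preds, labels) sum(preds != labels)  # or mean(...) for an error rate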
What space are we searching?
[Figure: score surface over a two-parameter model space; learned model ≈ (θ0 = 0.8, θ1 = 0.4)]
source: Alex Holehouse, Notes from Andrew Ng's Machine Learning Class, http://www.holehouse.org/mlclass/01_02_Introduction_regression_analysis_and_gr.html
Searching over models/patterns
• Consider a space of possible models M={M1, M2, ..., Mk} with parameters θ
• Search could be over model structures or parameters, e.g.:
• Parameters: In a linear regression model, find the regression coefficients (β) that minimize squared loss on the training data
• Model structure: In a decision tree, find the tree structure that maximizes accuracy on the training data
Decision trees
Tree models
• Easy to understand knowledge representation
• Can handle mixed variables
• Recursive, divide and conquer learning method
• Efficient inference
Tree learning
• Top-down recursive divide and conquer algorithm
• Start with all examples at root
• Select best attribute/feature
• Partition examples by selected attribute
• Recurse and repeat
• Other issues:
• How to construct features
• When to stop growing
• Pruning irrelevant parts of the tree
Score each attribute split for these instances: Age, Degree, StartYr, Series7

Fraud Age Degree StartYr Series7
+     22  Y      2005    N
-     25  N      2003    Y
-     31  Y      1995    Y
-     27  Y      1999    Y
+     24  N      2006    N
-     29  N      2003    N

choose split on Series7

Series7 = Y:
Fraud Age Degree StartYr Series7
-     25  N      2003    Y
-     31  Y      1995    Y
-     27  Y      1999    Y

Series7 = N:
Fraud Age Degree StartYr Series7
+     22  Y      2005    N
+     24  N      2006    N
-     29  N      2003    N

Score each attribute split for these instances: Age, Degree, StartYr

choose split on Age>28 (within the Series7 = N branch)

Age > 28:
Fraud Age Degree StartYr Series7
-     29  N      2003    N

Age ≤ 28:
Fraud Age Degree StartYr Series7
+     22  Y      2005    N
+     24  N      2006    N
Tree models
• Most well-known systems
• CART: Breiman, Friedman, Olshen and Stone
• ID3, C4.5: Quinlan
• How do they differ?
• Split scoring function
• Stopping criterion
• Pruning mechanism
• Predictions in leaf nodes
Scoring functions: Local split value
Choosing an attribute/feature
• Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible... ideally into pure sets of "all positive" or "all negative"
Information gain
• How much does a feature split decrease the entropy?
Entropy(D) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.9400 (for the buys_computer training data below: 9 yes, 5 no)
Classification and Prediction
• What is classification? What is prediction?
• Issues regarding classification and
prediction
• Bayesian Classification
• Instance Based Methods
• Classification by decision tree induction
• Classification by Neural Networks
• Classification by Support Vector Machines
(SVM)
• Prediction
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Information gain
Split on Income:
  High → {no, no, yes, yes}
  Med  → {yes, no, yes, yes, yes, no}
  Low  → {yes, no, yes, yes}

Entropy(Income=high) = -2/4 log2(2/4) - 2/4 log2(2/4) = 1
Entropy(Income=med)  = -4/6 log2(4/6) - 2/6 log2(2/6) = 0.9183
Entropy(Income=low)  = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.8113

Gain(D, Income) = 0.9400 - (4/14·[1] + 6/14·[0.9183] + 4/14·[0.8113]) = 0.029
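A sketch that reproduces these numbers from the buys_computer table above (entropy in log base 2):

# entropy of a vector of class labels (log base 2)
entropy <- function(y) { p <- table(y) / length(y); -sum(p * log2(p)) }

# information gain of splitting data frame d on attribute a, for class label cl
infoGain <- function(d, a, cl) {
  parts <- split(d[[cl]], d[[a]])
  entropy(d[[cl]]) - sum(sapply(parts, function(s) length(s) / nrow(d) * entropy(s)))
}

# the buys_computer training data from the table above
income <- c("high","high","high","medium","low","low","low","medium","low",
            "medium","medium","medium","high","medium")
buys   <- c("no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no")
D <- data.frame(income, buys)
entropy(D$buys)                # 0.940
infoGain(D, "income", "buys")  # 0.029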
When to stop growing
• Full growth methods
• All samples at a node belong to the same class
• There are no attributes left for further splits
• There are no samples left
• What impact does this have on the quality of the learned trees?
• Trees overfit the data and accuracy decreases
• Pruning is used to avoid overfitting
Pruning
• Postpruning
• Use a separate set of examples to evaluate the utility of pruning nodes from the tree (after tree is fully grown)
• Prepruning
• Apply a statistical test to decide whether to expand a node
• Use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length)
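In rpart (used below), postpruning can be sketched with the cross-validated complexity table the package computes while growing the tree; this uses R's built-in iris data so it is self-contained:

library(rpart)
# grow a deliberately large tree...
fit <- rpart(Species ~ ., data=iris, method="class",
             control=rpart.control(cp=0, minsplit=2))
printcp(fit)   # cross-validated error (xerror) for each subtree size
# ...then postprune back to the subtree with the lowest cross-validated error
bestCp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp=bestCp)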
Algorithm comparison
• CART
• Evaluation criterion: Gini index
• Search algorithm: Simple-to-complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Cross-validation to select Gini threshold
• C4.5
• Evaluation criterion: Information gain
• Search algorithm: Simple-to-complex, hill-climbing search
• Stopping criterion: When leaves are pure
• Pruning mechanism: Reduced-error pruning
Learning CART decision trees in R
# install packages
library(rpart)
install.packages("rpart.plot")
library(rpart.plot)

# learn CART tree with default parameters
dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3)
summary(dTree)

# display learned tree
prp(dTree)
prp(dTree, type=4, extra=1)

# explore settings: information-gain splits
dTree2 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, parms=list(split="information"))

# explore settings: smaller minimum split size
dTree3 <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=d3, control=rpart.control(minsplit=5))
Evaluation
Empirical evaluation
• Given observed accuracy of a model on limited data, how well does this estimate generalize for additional examples?
• Given that one model outperforms another on some sample of data, how likely is it that this model is more accurate in general?
• When data are limited, what is the best way to use the data to both learn and evaluate a model?
Evaluating classifiers
• Goal: Estimate true future error rate
• When data are limited, what is the best way to use the data to both learn and evaluate a model?
• Approach 1
• Reclassify training data to estimate error rate
Approach 1
[Figure: learn model F(X) from the full data set, then reclassify the same data set; score: 83%]
Typically produces a biased estimate of future error rate -- why?
Learning curves
• Goal: See how performance improves with additional training data
• From dataset set S, where |S|=n
• For i=[10, 20, ... ,100]
• Randomly sample i% of S to construct sample S’
• Learn model on S’
• Evaluate model
• Plot training set size vs. accuracy
[Figure: accuracy vs. size of dataset (measured on training data)]
How does performance change when measured on a disjoint test set?
[Figure: accuracy vs. size of dataset (measured on a disjoint test set)]
Approach 2
[Figure: partition the data set into a training set and a disjoint test set; learn model F(X) on the training set, score it on the test set; score: 77%]
Estimate will vary due to size and makeup of test set
Evaluating classifiers (cont)
• Approach 2:
• Partition D0 into two disjoint subsets, learn model on one subset, measure error on the other subset
• Problem: this is a point estimate of the error on one subset
[Figure: error vs. training set size for Algorithm A and Algorithm B]
Overlapping test sets are dependent
• Repeated sampling of test sets leads to overlap (i.e., dependence) among test sets... this will result in underestimation of variance
• Standard errors will be biased if performance is estimated from overlapping test sets (Dietterich '98)
• Recommendation: Use cross-validation to eliminate dependence between test sets
• Use k-fold cross-validation to get k estimates of error for MA and MB
• Set of errors estimated over the test set folds provides empirical estimate of sampling distribution
• Mean is estimate of expected error
Evaluating classification algorithms A and B
[Figure: 6-fold cross-validation: the dataset is partitioned into 6 folds; iteration i trains on Train_i (the other 5 folds) and tests on the held-out fold Test_i, yielding paired accuracy estimates AccA.i and AccB.i for algorithms A and B, i = 1..6]
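A sketch of this paired design using caret's createFolds, with rpart and randomForest as stand-ins for algorithms A and B (assuming Class in d3 is a factor):

library(caret); library(rpart); library(randomForest)

# 6-fold cross-validation: paired accuracy estimates for two algorithms
folds <- createFolds(d3$Class, k=6)
form <- Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth
accA <- accB <- numeric(length(folds))
for (i in seq_along(folds)) {
  test  <- d3[ folds[[i]], ]   # held-out fold
  train <- d3[-folds[[i]], ]   # remaining folds
  accA[i] <- mean(predict(rpart(form, data=train, method="class"),
                          test, type="class") == test$Class)
  accB[i] <- mean(predict(randomForest(form, data=train), test) == test$Class)
}
mean(accA); mean(accB)            # estimates of expected error/accuracy
t.test(accA, accB, paired=TRUE)   # paired comparison across the k folds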
# package with helper methods
install.packages("caret")
library(caret)

# partition data into 80% training and 20% test set
trainIndex <- createDataPartition(d3$SepalLength, times=1, p=0.8, list=FALSE)
dTrain <- d3[ trainIndex,]
dTest  <- d3[-trainIndex,]

# learn model from training, evaluate on test
dtTrain <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=dTrain, control=rpart.control(minsplit=5))
testPreds <- predict(dtTrain, dTest, type="class")

# evaluate
confusionMatrix(testPreds, dTest$Class)

# calculate learning curve
allResultsTr <- matrix(numeric(0), 0, 2)
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest  <- d3[-trainIndex,]
  dTree <- rpart(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method="class", data=dTrain, control=rpart.control(minsplit=5))

  # evaluate on test set
  testPreds <- predict(dTree, dTest, type="class")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))

  # evaluate on training set
  trainPreds <- predict(dTree, dTrain, type="class")
  evalResults2 <- confusionMatrix(trainPreds, dTrain$Class)
  tmpAcc2 <- evalResults2$overall[1]
  allResultsTr <- rbind(allResultsTr, as.vector(c(i, tmpAcc2)))
}
Ensemble methods
Ensemble methods
• Motivation
• Too difficult to construct a single model that optimizes performance (why?)
• Approach
• Construct many models on different versions of the training set and combine them during prediction
• Goal: reduce bias and/or variance
Conventional classification
X: attributes, Y: class label
Ensemble classification
Relative Performance Examples: 5 Algorithms on 6 Datasets (with Stephen Lee, U. Idaho, 1997)
[Figure: error relative to peer techniques (lower is better) for neural network, logistic regression, linear vector quantization, projection pursuit regression, and decision tree on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets]
source: Top Ten Data Mining Mistakes, John Elder, Elder Research (© 2004 Elder Research, Inc.)
Essentially every bundling (ensemble) method improves performance
[Figure: error relative to peer techniques (lower is better) for ensemble methods (advisor perceptron, AP weighted average, vote, average) on the same six datasets]
source: Top Ten Data Mining Mistakes, John Elder, Elder Research (© 2004 Elder Research, Inc.)
Ensemble design
• Treatment of input data: sampling, variable selection
• Choice of base classifier: decision tree, perceptron, ...
• Prediction aggregation: averaging, weighted vote, ...
Bagging
• Bootstrap aggregating
• Main assumption
• Combining many unstable predictors in an ensemble produces a stable predictor (i.e., reduces variance)
• Unstable predictor: small changes in training data produces large changes in the model (e.g., trees)
• Model space: non-parametric, can model any function if an appropriate base model is used
Bagging design:
• Treatment of input data: sample with replacement
• Choice of base classifier: unstable predictor, e.g., decision tree
• Prediction aggregation: averaging
Bagging
• Given a training data set D={(x1,y1),..., (xN,yN)}
• For m=1:M
• Obtain a bootstrap sample Dm by drawing N instances with replacement from D
• Learn model Mm from Dm
• To classify test instance t, apply each model Mm to t and use majority prediction or average prediction
• Models have (approximately) uncorrelated errors due to differences in training sets (each bootstrap sample contains ~63% of the distinct instances in D)
[Figure: sampling with replacement to create altered training data sets]
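A sketch of this procedure with rpart trees as the unstable base predictor (M = 25 bootstrap replicates):

library(rpart)

# bagging: learn M trees on bootstrap samples of the training data
bagTrees <- function(form, data, M=25) {
  lapply(1:M, function(m) {
    boot <- data[sample(nrow(data), replace=TRUE), ]  # bootstrap sample Dm
    rpart(form, data=boot, method="class")            # model Mm
  })
}

# classify by majority vote over the M models
predictBag <- function(models, newdata) {
  votes <- sapply(models, function(m) as.character(predict(m, newdata, type="class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# e.g., models <- bagTrees(Class ~ ., data=d3); predictBag(models, d3[1:5, ])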
Random forests
• Given a training data set D={(x1,y1),..., (xN,yN)}
• For m=1:M
• Obtain a bootstrap sample Dm by drawing N instances with replacement from D
• Learn decision tree Mm from Dm using k randomly selected features at each split
• To classify test instance t, apply each model Mm to t and use majority prediction or average prediction
install.packages("randomForest")library(randomForest)
# random forest ensembledForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, method=“class", data=d3)
# view resultsprint(dForest)
# importance of each predictorimportance(dForest)
# calculate learning curve for random forest
allResultsTe <- matrix(numeric(0), 0, 2)

trainSetSizes <- c(0.025, 0.05, 0.1, 0.2, 0.4, 0.8)
for (i in trainSetSizes) {
  trainIndex <- createDataPartition(d3$SepalLength, times=1, p=i, list=FALSE)
  dTrain <- d3[ trainIndex,]
  dTest  <- d3[-trainIndex,]
  # note: train on dTrain (not d3), otherwise the test set leaks into training
  dForest <- randomForest(Class ~ SepalLength + SepalWidth + PetalLength + PetalWidth + SepalArea + PetalArea + pcaF, data=dTrain)

  # evaluate on test set
  testPreds <- predict(dForest, dTest, type="class")
  evalResults <- confusionMatrix(testPreds, dTest$Class)
  tmpAcc <- evalResults$overall[1]
  allResultsTe <- rbind(allResultsTe, as.vector(c(i, tmpAcc)))
}