
Computational Intelligence: Methods and Applications

Page 1: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 16 Model evaluation and ROC

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 2: Computational Intelligence:  Methods and Applications

Confusion matrices

For n observations, including n+ from class ω+ and n− from the other classes, the labeled predictions of a model M are counted and put in a confusion matrix (rows = true, columns = predicted by model M):

$$P(\omega_i, \omega_{Mj}) = \begin{pmatrix} P_{++} & P_{+-} \\ P_{-+} & P_{--} \end{pmatrix} = \frac{1}{n} \begin{pmatrix} n_{++} & n_{+-} \\ n_{-+} & n_{--} \end{pmatrix} = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix}$$

The rows sum to the class priors: P++ + P+− = P+ and P−+ + P−− = P−.

P+ (P−) is the a priori probability of class + (−), estimated from the data: P± = n±/n.
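To make this notation concrete, here is a minimal Python sketch (names and data are illustrative, not from the lecture) that counts the n_ij and normalizes them to the probabilities P_ij used below:

```python
def confusion_matrix(y_true, y_pred, labels=("+", "-")):
    """Count n_ij (rows = true class, columns = predicted class)
    and divide by n to get the probabilities P_ij."""
    n = len(y_true)
    counts = {(t, p): 0 for t in labels for p in labels}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1
    return {key: c / n for key, c in counts.items()}

# Toy example: n = 10 observations, n+ = 5, n- = 5
y_true = ["+"] * 5 + ["-"] * 5
y_pred = ["+", "+", "+", "-", "-", "-", "-", "-", "+", "-"]
P = confusion_matrix(y_true, y_pred)
print(P[("+", "+")], P[("+", "-")])   # TP and FN rates: 0.3 0.2
print(P[("-", "+")], P[("-", "-")])   # FP and TN rates: 0.1 0.4
```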

Page 3: Computational Intelligence:  Methods and Applications

FP and FN

Confusion matrix, including the possibility of rejection ("don't know" answer):

$$P(\text{true} \,|\, \text{predicted}) = \begin{pmatrix} P_{++} & P_{+-} & P_{+r} \\ P_{-+} & P_{--} & P_{-r} \end{pmatrix} = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} & \cdot \\ \mathrm{FP} & \mathrm{TN} & \cdot \end{pmatrix}$$

This notation is used especially in medical applications:

P++ is a hit, or true positive (TP); P++/P+ is the true positive rate.
P−− is a hit, or true negative (TN); P−−/P− is the true negative rate.
P−+ is a false alarm, or false positive (FP); e.g. a healthy person predicted as sick.
P+− is a miss, or false negative (FN); e.g. a sick person predicted as healthy.

Page 4: Computational Intelligence:  Methods and Applications

Accuracy and errors

Models are frequently evaluated on the basis of their accuracy. The elements Pij of the confusion matrix depend on the evaluated model: Pij(M).

Accuracy, error rate and rejection rate:

$$A(M) = P_{++}(M) + P_{--}(M)$$

$$L(M) = P_{+-}(M) + P_{-+}(M)$$

$$R(M) = P_{+r}(M) + P_{-r}(M) = 1 - L(M) - A(M)$$

The accuracy for class k, and the balanced accuracy used for unbalanced data (i.e. data with a small number of samples in some classes):

$$A_k(M) = P_{kk}(M)/P_k, \qquad B(M) = \tfrac{1}{2}\left[A_+(M) + A_-(M)\right]$$
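A short sketch of these formulas in Python (the dictionary layout follows the earlier snippet; the rejection column is omitted here, so R(M) = 0):

```python
def accuracy(P):            # A(M) = P++ + P--
    return P[("+", "+")] + P[("-", "-")]

def error_rate(P):          # L(M) = P+- + P-+
    return P[("+", "-")] + P[("-", "+")]

def balanced_accuracy(P):   # B(M) = (A+ + A-)/2, with A_k = P_kk / P_k
    p_plus = P[("+", "+")] + P[("+", "-")]
    p_minus = P[("-", "+")] + P[("-", "-")]
    return 0.5 * (P[("+", "+")] / p_plus + P[("-", "-")] / p_minus)

# Unbalanced toy data: 90% of all cases belong to class -
P = {("+", "+"): 0.05, ("+", "-"): 0.05,
     ("-", "+"): 0.00, ("-", "-"): 0.90}
print(accuracy(P), balanced_accuracy(P))   # 0.95 vs. 0.75
```

The plain accuracy looks excellent only because class − dominates; the balanced accuracy exposes that half of class + is missed.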

Page 5: Computational Intelligence:  Methods and Applications

What is better?

Rank 4 classifiers producing the following confusion matrices:

Accuracy 0.5:
$$P(M_1) = \begin{pmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{pmatrix}, \qquad P(M_3) = \begin{pmatrix} 0.25 & 0.5 \\ 0.0 & 0.25 \end{pmatrix}$$

Accuracy 0.8:
$$P(M_2) = \begin{pmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{pmatrix}, \qquad P(M_4) = \begin{pmatrix} 0.4 & 0.0 \\ 0.2 & 0.4 \end{pmatrix}$$

But which one is better for a given accuracy: M1 or M3? M2 or M4? Ordering by accuracy alone is meaningless, because there are two parameters, not one: for example, the accuracy for class 1 and the accuracy for class 2.
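Computing the per-class accuracies A_k makes the point explicit; a quick check in Python (rows = true class, columns = predicted):

```python
matrices = {
    "M1": [[0.25, 0.25], [0.25, 0.25]],
    "M2": [[0.40, 0.10], [0.10, 0.40]],
    "M3": [[0.25, 0.50], [0.00, 0.25]],
    "M4": [[0.40, 0.00], [0.20, 0.40]],
}
for name, m in matrices.items():
    acc = m[0][0] + m[1][1]                 # A(M)
    a1 = m[0][0] / (m[0][0] + m[0][1])      # class-1 accuracy
    a2 = m[1][1] / (m[1][0] + m[1][1])      # class-2 accuracy
    print(f"{name}: A={acc:.2f}  A1={a1:.2f}  A2={a2:.2f}")
```

M1 and M3 share A = 0.5 but differ strongly in (A1, A2), and likewise M2 and M4 at A = 0.8, so a single number cannot rank them.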

Page 6: Computational Intelligence:  Methods and Applications

Other evaluation measures

In medical applications the accuracy for class + is called sensitivity: what percentage of really sick people does this test recognize? The accuracy for class − is called specificity: is this test specific to class +, or does it always say +? In information retrieval, sensitivity is called recall: what percentage of the truly relevant information has been recalled? Precision measures the percentage of really relevant information among all recalled information. The F-measure is the harmonic mean of recall and precision.

$$S_+(M) = \frac{P_{++}(M)}{P_+} \qquad \text{(sensitivity, recall)}$$

$$S_-(M) = \frac{P_{--}(M)}{P_-} \qquad \text{(specificity)}$$

$$\mathrm{Prec}(M) = \frac{P_{++}(M)}{P_{++}(M) + P_{-+}(M)} \qquad \text{(precision)}$$

$$F(M) = \frac{2\,\mathrm{Recall} \cdot \mathrm{Prec}}{\mathrm{Recall} + \mathrm{Prec}} \qquad \text{(F-measure)}$$
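These four measures in Python (a standalone sketch over the 2×2 confusion probabilities, without a rejection column):

```python
def scores(P):
    """Sensitivity (recall), specificity, precision and F-measure."""
    tp, fn = P[("+", "+")], P[("+", "-")]
    fp, tn = P[("-", "+")], P[("-", "-")]
    recall = tp / (tp + fn)             # S+(M) = P++ / P+
    specificity = tn / (tn + fp)        # S-(M) = P-- / P-
    precision = tp / (tp + fp)          # P++ / (P++ + P-+)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, specificity, precision, f_measure
```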

Page 7: Computational Intelligence:  Methods and Applications

Example

N=100 observations; for x=0 let 10 be from class ω+ and 30 from class ω−, and for x=1 there are 40 and 20 cases from the ω+ and ω− classes:

$$N(x, \omega) = \begin{pmatrix} 10 & 40 \\ 30 & 20 \end{pmatrix}, \qquad P(x, \omega) = \begin{pmatrix} 0.1 & 0.4 \\ 0.3 & 0.2 \end{pmatrix}$$

(columns x = 0, 1; rows ω+, ω−), hence

$$P(x=0) = 0.4, \quad P(x=1) = 0.6, \quad P(\omega_+) = P(\omega_-) = 0.5$$

$$P(\omega \,|\, x) = P(x, \omega)/P(x) = \begin{pmatrix} 1/4 & 2/3 \\ 3/4 & 1/3 \end{pmatrix}$$

The MAP rule for these data is:

if x=0 then select ω−, because P(ω−|x=0) > P(ω+|x=0);
if x=1 then select ω+, because P(ω+|x=1) > P(ω−|x=1).

If x=0 then the 30 ω− vectors are assigned to ω−, so P(ω−,ω−)=0.3, and the 10 ω+ vectors are also assigned to ω−, so P(ω+,ω−)=0.1.

Page 8: Computational Intelligence:  Methods and Applications

Example (continued)

If x=1 then the 40 ω+ vectors are assigned to ω+, so P(ω+,ω+)=0.4, and the 20 ω− vectors are also assigned to ω+, so P(ω−,ω+)=0.2.

Therefore the MAP decisions lead to the confusion matrix:

$$P(\omega_i, \omega_j | M) = \begin{pmatrix} 0.4 & 0.1 \\ 0.2 & 0.3 \end{pmatrix}$$

with P+ = 0.5, P− = 0.5, and:

Accuracy = 0.4 + 0.3 = 0.7, Error = 0.1 + 0.2 = 0.3
Sensitivity = 0.4/0.5 = 0.8 (recall)
Specificity = 0.3/0.5 = 0.6
Balanced accuracy = (0.8 + 0.6)/2 = 0.7
Precision = 0.4/0.6 = 2/3 ≈ 0.67
F-measure = 2·0.8·(2/3)/(0.8 + 2/3) = 16/22 = 8/11 ≈ 0.73
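These numbers are easy to verify mechanically; a standalone check in Python:

```python
P = {("+", "+"): 0.4, ("+", "-"): 0.1,
     ("-", "+"): 0.2, ("-", "-"): 0.3}
accuracy = P[("+", "+")] + P[("-", "-")]                       # 0.7
recall = P[("+", "+")] / (P[("+", "+")] + P[("+", "-")])       # 0.8
specificity = P[("-", "-")] / (P[("-", "+")] + P[("-", "-")])  # 0.6
precision = P[("+", "+")] / (P[("+", "+")] + P[("-", "+")])    # 2/3
f_measure = 2 * recall * precision / (recall + precision)      # 8/11
print(accuracy, recall, specificity, precision, round(f_measure, 3))
```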

Page 9: Computational Intelligence:  Methods and Applications

Error functions

The accuracy of the model M is usually maximized. It is the sum of the TP and TN rates, i.e. a combination of sensitivity and specificity with the class priors as weights:

$$A(M) = P_{++}(M) + P_{--}(M) = P_+ S_+(M) + P_- S_-(M)$$

This is natural: class + contains a fraction P+ of all cases and the sensitivity S+ is the accuracy for this class; the same holds for class −, so the weighted sum gives the overall accuracy.

Equivalently, the error may be minimized. The error is the sum of the off-diagonal elements of the confusion matrix P(ωi, ωMj) (true vs. predicted) and the accuracy is the sum (trace) of its diagonal elements, so

$$E(M) = L(M) = \sum_{i \neq j} P(\omega_i, \omega_{Mj}) = 1 - A(M) = P_{+-}(M) + P_{-+}(M)$$

Sometimes we would like to make decisions only when we are quite sure that they are correct; confidence in the model M may be increased by rejecting some cases.

Page 10: Computational Intelligence:  Methods and Applications

More error functions

The rejection rate, i.e. the fraction of the samples that will not be classified, is:

$$R(M) = P_{+r}(M) + P_{-r}(M) = 1 - L(M) - A(M)$$

Minimization of error minus accuracy is thus equivalent to minimization of error plus rejection:

$$\min_M \left[L(M) - A(M)\right] = \min_M \left[2L(M) + R(M) - 1\right]$$

This combination for 2 classes (e.g. one class against all the rest) is:

$$\min_M E(M;\lambda) = \min_M \left[\lambda L(M) + (1-\lambda) R(M)\right] = \min_M \left\{\lambda \left[P_{+-}(M) + P_{-+}(M)\right] + (1-\lambda) \left[P_{+r}(M) + P_{-r}(M)\right]\right\}$$

For λ = 1/2 this is proportional to error + rejection; for large λ minimization of this error function over the parameters of the model M reduces the sum of the FP and FN errors, but at the cost of a growing rejection rate. For example, if the model M = {IF x > 5 THEN ω+ ELSE ω−} makes 10 errors, all for x ∈ [5,7], then leaving samples in this range unclassified gives M′ = {IF x > 7 THEN ω+; IF x ≤ 5 THEN ω−} with no errors, but lower accuracy, a higher rejection rate and high confidence.
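The x > 5 example can be simulated directly. A toy sketch (all data values invented for illustration) showing how a rejection window trades errors for rejections:

```python
def classify(x, reject_low=None, reject_high=None):
    """Threshold rule with an optional 'don't know' window."""
    if reject_low is not None and reject_low < x <= reject_high:
        return None                       # rejected
    return "+" if x > 5 else "-"

# Toy data: the classes overlap in [5, 7]
data = [(x, "-") for x in [1, 2, 3, 4, 5, 5.5, 6, 6.5]] + \
       [(x, "+") for x in [5.2, 6.2, 7.2, 8, 9, 10, 11, 12]]

for window in [(None, None), (5, 7)]:
    preds = [(classify(x, *window), y) for x, y in data]
    errors = sum(1 for p, y in preds if p is not None and p != y)
    rejected = sum(1 for p, _ in preds if p is None)
    print(window, "errors:", errors, "rejected:", rejected)
```

With no window the rule makes 3 errors; rejecting (5, 7] removes all errors at the price of leaving 5 samples unclassified.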

Page 11: Computational Intelligence:  Methods and Applications

Errors and costs

Optimization with explicit costs: assume that FP errors (false alarms) cost α times more than FN errors (misses). Then minimize

$$\min_M E(M;\alpha) = \min_M \left[\alpha\, P_{-+}(M) + P_{+-}(M)\right] = \min_M \left[\alpha\, P_{-+}(M) + P_+\bigl(1 - S_+(M)\bigr) - P_{+r}(M)\right]$$

For α = 0 this is equivalent to the maximization of sensitivity:

$$\min_M E(M; \alpha = 0) = \min_M \left[P_+\bigl(1 - S_+(M)\bigr) - P_{+r}(M)\right]$$

and for large α to the maximization of specificity:

$$\min_M E(M; \alpha \gg 1) \approx \min_M \left[P_-\bigl(1 - S_-(M)\bigr) - P_{-r}(M)\right]$$

Page 12: Computational Intelligence:  Methods and Applications

Lifts and cumulative gains

A technique popular in marketing, where cumulative gains and "lifts" are graphically displayed: lift is a measure of how effective model predictions are = (results obtained with the predictive model)/(results obtained without it).

Example: is Xi likely to respond? Should I send him an offer? Suppose that 20% of people respond to your offer. Sending the offer to N randomly chosen people gives Y0 = 0.2·N replies.

A predictive model (called a "response model" in marketing), P(ω+|X;M), uses the information X to predict who will respond.

Order the predictions from the most likely to the least likely: P(ω+|X1;M) > P(ω+|X2;M) > ... > P(ω+|Xk;M).

The ideal model should put the 20% that will reply in front, so that the number of replies Y(Xj) grows to Y0 = 0.2·N for j = 1..Y0. In this ideal case the cumulative gain is a linear curve reaching Y0 and then remaining constant; the lift is the ratio Y(Xj)/(0.2·j).
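A minimal sketch of the cumulative-gains and lift computation (illustrative data; the responses are assumed already sorted by the model score):

```python
def cumulative_gains(ranked_responses):
    """ranked_responses: 1/0 outcomes ordered by P(+|X), best first.
    Returns the cumulative number of replies after each contact."""
    gains, total = [], 0
    for r in ranked_responses:
        total += r
        gains.append(total)
    return gains

# 10 contacts, 20% base response rate, ideal ordering
responses = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(responses)
base = [0.2 * (j + 1) for j in range(len(responses))]  # random mailing
lift = [g / b for g, b in zip(gains, base)]            # Y(X_j) / (0.2 j)
print(gains)                          # [1, 2, 2, 2, ..., 2]
print([round(l, 2) for l in lift])    # starts at 5.0, decays to 1.0
```

For an ideal model the lift starts at 1/P+ = 5 and decays to 1 once all responders have been contacted.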

Page 13: Computational Intelligence:  Methods and Applications

Cumulative gains chart

There is no ideal model, so check your predictions P(ω+|X;M) against reality: if the prediction was true, plot the next point one unit to the right and one unit up; if the prediction was false, plot it just one unit to the right. The vertical axis counts the responders found so far, out of Y0 = the number of all who responded (in this case Y0 = 1000 out of N = 5000 people contacted).

Rank    P(ω+|X)   True
1       0.98      +
2       0.96      +
3       0.96      −
4       0.94      +
5       0.92      −
...     ...       ...
1000    0.08      −

Page 14: Computational Intelligence:  Methods and Applications

ROC intro

Receiver Operating Characteristic: evaluate the TP rate (or P++) as a function of some threshold θ, for example using the likelihood ratio rule: decide ω1 when

$$\frac{P(\mathbf{X}|\omega_1)}{P(\mathbf{X}|\omega_2)} > \theta\, \frac{P(\omega_2)}{P(\omega_1)}$$

The threshold θ may be treated as a variable measuring confidence: for 1D distributions it is clear that for a large threshold only positive cases will be left, but their proportion S+ (recall, sensitivity) decreases to zero.

What is the optimal choice? It depends on the rate of false alarms (FP, false positives, P−+) we are willing to accept, i.e. on the proportion 1 − S− = P−+/P−.

Page 15: Computational Intelligence:  Methods and Applications

ROC curves

ROC curves display (S+, 1 − S−), i.e. the error on class − versus the accuracy on class +, for different thresholds:

Ideal classifier: below some threshold S+ = 1 (all positive cases recognized) with 1 − S− = 0 (no false alarms).

Useless classifier (blue): the same number of true positives as false alarms for any threshold.

Reasonable classifier (red): no errors until some threshold that allows for recognition of 0.5 of the positive cases, no errors if 1 − S− > 0.6, and slowly rising errors in between.

A good measure of quality: high AUC, the Area Under the ROC Curve.

AUC = 0.5 is random guessing; AUC = 1 is perfect prediction.
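A sketch of the ROC construction and the trapezoid AUC in plain Python (ties in the scores are stepped through one sample at a time, which is a simplification):

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores P(+|X) and return the
    (1 - S-, S+) points: false-alarm rate vs. sensitivity."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.98, 0.96, 0.96, 0.94, 0.92, 0.50, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0]
print(auc(roc_points(scores, labels)))   # 0.75 for this toy data
```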

Page 16: Computational Intelligence:  Methods and Applications

Realistic ROC

A ROC curve for a realistic case: the finite number of thresholds and data points makes it rugged. Instead of the percentage of true positives (recall, sensitivity), precision is sometimes used. Each ROC point captures all the information contained in the confusion matrix for some parameters (thresholds) of the model, and shows the confidence in the predictions of the classifier at different probabilities. Try the YALE tutorial 14 showing ROC plots.

Page 17: Computational Intelligence:  Methods and Applications

ROC for combination of classifiers

More convex ROC curves show the superiority of a model for all thresholds. Here the ROC curves for two models show strong and weak areas; a combination of the results of the two models may give a ROC curve covering the grey area.

Page 18: Computational Intelligence:  Methods and Applications

ROC for logical rules

Classifiers that give only yes/no predictions, for example logical rules, give only a single point on the ROC curve, at (S+, 1 − S−). Then AUC = (S+ + S−)/2, and it is identical along lines of S+ + S− = const.

Try the demo of AccuROC for Windows (sample output above): http://www.accumetric.com/accurocw.htm

Page 19: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 17WEKA/RapidMiner

Knowledge extraction from simplest decision trees

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 20: Computational Intelligence:  Methods and Applications

A few data mining packages

A large number of data mining packages that include many CI models for data analysis are available. See the long list of DM software, including large commercial packages.

GhostMiner, from Fujitsu (created by our group); please get it.

WEKA started the trend to collect many packages in one system. RapidMiner, formerly YALE (initially a better front-end to WEKA), includes all WEKA models, free source; please get it.

New interesting projects (see my list of software): Orange, component-based data mining software, includes visualizations and SOM/MDS modules. KNIME, based on the Eclipse platform, includes Weka and R scripts; a modular data exploration platform with visual data flows. R-project, a language for statistical computing and graphics.

Page 21: Computational Intelligence:  Methods and Applications

WEKA Project

Machine learning algorithms in Java: I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

Project Web page: www.cs.waikato.ac.nz/ml/weka. One of the most popular packages; essentially a collection of Java class libraries implementing various computational intelligence algorithms.

ARFF data format: the data itself is in CSV form (comma separated, exportable from spreadsheets), preceded by additional information about the data, the type of each feature, etc.
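For illustration, a minimal ARFF file in this format might look like the fragment below (hand-made in the spirit of the weather data, not copied from the WEKA distribution):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, 86, yes
rainy, 96, no
```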

CLI, or command line interface (only for Unix lovers), e.g.:

java weka.classifiers.j48.J48 -t data/weather.arff

calls one of the methods (here J48) from the library.

Page 22: Computational Intelligence:  Methods and Applications

WEKA Software

The "Explorer" GUI for basic calculations, recently much improved. "Experimenter" and "Knowledge Flow" environments are provided for more complex experiments; they allow averaging over crossvalidation results or combining different models. On-line documentation exists for the library classes, but with little description of the methods.

The WEKA/RM software contains:
• preprocessing filters, supervised and unsupervised
• many classification models
• rule-based models for knowledge discovery
• association rules (one method)
• regression, or numerical prediction, models
• 3 clusterization (unsupervised learning) methods
• scatterogram visualization (2D)
• a collection of simple sample problems (from the UCI repository)

Page 23: Computational Intelligence:  Methods and Applications

More WEKA/RM Software

A simple "Explorer" GUI for basic calculations. A rather rough "Experimenter" environment is provided for more complex experiments, such as averaging over crossvalidation results or combining different models. On-line documentation for the library classes.

The WEKA/RM software contains:
• preprocessing filters, supervised and unsupervised
• classification models
• rule-based models for knowledge discovery
• association rules (one method)
• regression, or numerical prediction, models
• 3 clusterization (unsupervised learning) methods
• scatterogram visualization (2D)
• a collection of sample problems (from the UCI repository)

Page 24: Computational Intelligence:  Methods and Applications

WEKA strong/weak points

Strong points: platform independent (Java!); many projects created around it (listed here); free, with a large collection of filters and algorithms; may be extended by a serious user.

But... Java programs are not as stable as native Windows programs and there are problems with some Java versions; the visualization of data and results is rather poor; RapidMiner is a big improvement. The simple user interface has been improved recently, and the Knowledge Flow GUI changes this. Performing experiments requires tedious programming (RM is easier). Algorithms are not described in detail in the documentation or in the book (only in the class libraries).

Page 25: Computational Intelligence:  Methods and Applications

RapidMiner

Like WEKA: the same models plus a few more, and easier to use. Free; contains a large collection of filters and algorithms; may be extended by a serious user.

Download RapidMiner, start it and read the tutorial!

Includes some 20 visual data exploration methods: scatter, scatter matrix, interactive 3D scatter, parallel coordinates, 2D density, radial radviz, gradviz, SOM (U-distance and P-density).

Unfortunately the algorithms are not described in detail in the documentation: you have to study the class libraries to understand what exactly GridViz or RadViz does, or read the original papers to understand what the U, U* and P matrix SOM visualizations are.

Check the much better descriptions of methods in Orange!

Page 26: Computational Intelligence:  Methods and Applications

Knowledge representation

Knowledge representation is an important subject in Artificial Intelligence; here only simple forms of knowledge are considered.

Decision rules:
• propositional rules: IF (all conditions are true) THEN facts
• M-of-N rules: IF (M conditions of N are true) THEN facts
• fuzzy rules: IF (conditions are true to some degree) THEN facts are true to some degree

Linguistic variables: favorite-colors, low-noise-level, young-age, etc.:

• subsets of nominal or discrete values,

• intervals of numerical values, e.g. teenager = {T if age < 20},

• constrained subsets of numerical values.

Page 27: Computational Intelligence:  Methods and Applications

WEKA/RM filters

Many filters can be applied to attributes (features) or to instances (vectors, samples); some are specific to signal/time-series data. They are divided into supervised/unsupervised, and attribute or instance filters:

• create a new attribute from existing ones using algebraic operations;
• remove instances with attribute values in some range, for example missing values; delete attributes of a specific type (e.g. binary);
• change nominal values to binary combinations, e.g. Xi ∈ {a,b,c,d} => (Xi1, Xi2) ∈ ({0,1},{0,1});
• rank the usefulness of attributes (several schemes);
• evaluate the usefulness of subsets of features (several schemes);
• perform PCA; normalize features in many ways;
• discretize attributes: define simple bins or look for a more natural discretization, for example bins created by the Minimum Description Length (MDL) principle (called "use Kononenko");
• many others...

Page 28: Computational Intelligence:  Methods and Applications

Classification algorithms

Divided into:
• Bayes: versions of probabilistic Bayesian methods
• Functions: parameterized functions, linear and non-linear
• Lazy: no parameter learning, all work done when classifying
• Meta: committees, voting, boosting, stacking... metamodels
• Misc: untypical models, fuzzy lattice, hyperpipes, voting features
• Tree-building models, recursive partitioning
• Rule-learning models

These algorithms enable:
• knowledge discovery, or data mining (trees, rules);
• predictive modeling in classification or regression tasks.

See the detailed WEKA presentation: http://prdownloads.sourceforge.net/weka/weka.ppt

Page 29: Computational Intelligence:  Methods and Applications

Decision rules

• ZeroR, predicting the majority class (or mean values)
• OneR, the simplest one-level (one-attribute) decision tree
• Decision stump, a one-level tree
• C4.5, called here J48, since this is a Java implementation of version 8 of the C4.5 decision tree algorithm
• M5' model tree learner
• Naive Bayes tree classifier
• PART rule learner (covering algorithm)

Prototype-based algorithms:
• instance-based learners (IB1, IBk, ID3), nearest-neighbor methods
• decision table

Algorithms for knowledge discovery, or data mining: 10 rule-based and 10 tree-based, providing knowledge in the form of logical rules.

Page 30: Computational Intelligence:  Methods and Applications

Regression algorithms

Regression (function) and classification algorithms include:

• Naive Bayes (2 versions)
• Linear Regression, or LDA
• Additive regression
• Logistic regression
• LWR, Locally Weighted Regression
• MLP (multi-layer perceptron) neural network
• VPN, voted perceptron network
• SMO, or Support Vector Machine algorithm
• K*, a similarity-based system with algorithmic complexity minimization

Page 31: Computational Intelligence:  Methods and Applications

Other algorithms

Statistical algorithms for model improvement (meta-algorithms):
• bagging
• boosting
• AdaBoost
• LogitBoost
• stacking

Clusterization:
• k-means
• Expectation Maximization
• Cobweb

Association: find relations between attributes.
Visualization: 2D scatterograms.

Page 32: Computational Intelligence:  Methods and Applications

WEKA/RM example

Contact lenses: do I need hard, soft or none? A very small data set, 24 instances: contact-lens.arff.

What is in the database?
1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
2. spectacle prescription: (1) myope, (2) hypermetrope
3. astigmatic: (1) no, (2) yes
4. tear production rate: (1) reduced, (2) normal

Class distribution:
1. hard contact lenses: 4
2. soft contact lenses: 5
3. no contact lenses: 15

Page 33: Computational Intelligence:  Methods and Applications

ZeroR

The zero-rule method:
• for a small number of classes (categorical class variables), predict the majority class;
• for numerical outputs (regression problems), predict the average.

Useful to establish the base rate; zero variance, large bias: if any method obtains results worse than ZeroR, serious overfitting of the data occurs. For contact-lenses the confusion matrix is:

=== Confusion Matrix ===

a b c <= classified as
0 0 5 | a = soft
0 0 4 | b = hard
0 0 15 | c = none

15 instances classified correctly, 62.5% on the whole data.

What happens in 10xCV?
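ZeroR itself is trivial to express; a minimal Python sketch (illustrative, not WEKA code):

```python
from collections import Counter

def zero_r(train_labels):
    """Return a classifier that always predicts the majority class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority

# Contact-lenses class distribution: 5 soft, 4 hard, 15 none
labels = ["soft"] * 5 + ["hard"] * 4 + ["none"] * 15
predict = zero_r(labels)
correct = sum(1 for y in labels if predict(None) == y)
print(correct / len(labels))   # 0.625 on the whole data
```

The 10xCV question above can then be explored by splitting these labels into folds and repeating the count on each held-out fold.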

Page 34: Computational Intelligence:  Methods and Applications

DT - idea

Class: {cancer, healthy}. Features: cell body {gray, stripes}; nuclei {1, 2}; tails {1, 2}.

[Figure: a simple decision tree over these features, with leaves labeled cancer or healthy.]

Page 35: Computational Intelligence:  Methods and Applications

More ambitious tree

Page 36: Computational Intelligence:  Methods and Applications

1R

1R: the simplest useful tree (Holte 1993); sometimes the results are good. A one-level tree for nominal attributes.

The 1R algorithm:
for every attribute X:
  for every value Xi of the attribute: count the class frequencies N(Xi, c)
  find the most frequent class c = arg max_c N(Xi, c)
  create a rule (majority classifier): IF X = Xi THEN c
Calculate the accuracy of this rule; select the rules of highest accuracy.

A missing value "?" is treated as just another nominal value.
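A compact Python sketch of this algorithm (toy data invented for illustration):

```python
from collections import Counter

def one_r(X, y):
    """1R: for each attribute build a one-level rule mapping each
    attribute value to its majority class; keep the most accurate rule."""
    best = None
    for a in range(len(X[0])):
        counts = {}                       # N(X_i, c) for attribute a
        for row, c in zip(X, y):
            counts.setdefault(row[a], Counter())[c] += 1
        rule = {v: cnt.most_common(1)[0][0] for v, cnt in counts.items()}
        acc = sum(cnt[rule[v]] for v, cnt in counts.items()) / len(y)
        if best is None or acc > best[2]:
            best = (a, rule, acc)
    return best    # (attribute index, value -> class map, accuracy)

# Toy weather-style data: attributes [outlook, windy]
X = [["sunny", "no"], ["sunny", "yes"], ["overcast", "no"],
     ["rainy", "no"], ["rainy", "yes"]]
y = ["no", "no", "yes", "yes", "no"]
print(one_r(X, y))
```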

Page 37: Computational Intelligence:  Methods and Applications

1R example

An example taken from the WEKA book: weather conditions and the decision to play an indoor game (tennis); 14 examples are given. Task: find the decision rule (weather.nominal.arff).

Attribute: Outlook.
Outlook = Sunny has 3 examples with No and 2 with Yes;
Outlook = Overcast has 4 examples with Yes;
Outlook = Rainy has 3 examples with Yes and 2 with No.

Optimal rules use only Outlook, or only Humidity. The dataset is too small to evaluate accuracy, but the rules are reasonable.

Page 38: Computational Intelligence:  Methods and Applications

1R continuous

How should continuous values be treated? Divide the range of a continuous attribute into intervals Ii(X) (discretize the attribute) and treat the intervals as nominal values, i.e. write X = Ii if X ∈ Ii(X). For each attribute X, sort all cases according to increasing X values and define the intervals Ii(X) where one class c dominates, max_c N(Ii(X), c). This should decrease the number of errors of the 1R algorithm.

Problem: if the data is noisy, or some examples are quite untypical, such rules should not be created! A simpler solution: use buckets, i.e. intervals with a minimum number of elements, admitting some "impurities".
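A rough sketch of such bucketing in Python (a simplification of the idea, not WEKA's exact OneR discretization; the class labels are invented):

```python
from collections import Counter

def bucket_discretize(values, classes, min_size=4):
    """Sort cases by value; close the current bucket once it holds at
    least min_size samples and the next sample's class disagrees with
    the bucket's majority class. Returns the cut points."""
    paired = sorted(zip(values, classes))
    cuts, bucket = [], Counter()
    for i, (v, c) in enumerate(paired):
        if sum(bucket.values()) >= min_size and \
           c != bucket.most_common(1)[0][0]:
            cuts.append((paired[i - 1][0] + v) / 2)  # cut between samples
            bucket = Counter()
        bucket[c] += 1
    return cuts

humidity = [65, 70, 70, 70, 75, 80, 80, 85, 86, 90, 90, 91, 95, 96]
play = ["n", "y", "y", "y", "y", "y", "n",
        "y", "y", "y", "n", "y", "n", "n"]    # invented labels
print(bucket_discretize(humidity, play, min_size=4))
```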

Page 39: Computational Intelligence:  Methods and Applications

1R example with continuous variables

Discretization of temperature, instead of hot/mild/cool. To avoid noise, only intervals containing no fewer than 4 elements are used.

WEKA implementation: bucket size = minimum number of elements in an interval. A slightly more accurate solution is found with B = 2-4, but B = 1 makes only 1 error: is that good or bad? Still, it makes no sense...

Page 40: Computational Intelligence:  Methods and Applications

1R example with continuous variables

Discretization of humidity, instead of high/normal (small font = no):

65 70 70 70 75 80 80 85 86 90 90 91 95 96

To avoid noise, buckets containing B ≥ 4 elements are used:

65 70 70 70 75 80 80 | 85 86 90 90 91 95 | 96

This is rather naive: it adds to the bucket until a non-majority-class sample is found.

WEKA implementation: bucket size = minimum number of elements in an interval. A slightly more accurate solution is found with B = 2-4, but B = 1 makes only 1 error: is that good or bad? Still, it makes no sense...

Page 41: Computational Intelligence:  Methods and Applications

Netflix 1M$ Prize

The Netflix Prize, an award of $1 million to the first person or team to achieve certain accuracy goals when recommending movies based on personal preferences, was announced in October 2006. The company made 100 million anonymous movie ratings available to contestants for learning.

Details for registering and competing for the Netflix Prize are at: http://www.netflixprize.com

All members of the CI/machine learning community are invited to participate!

A 10% improvement is needed: the target is RMSE ≤ 0.8563; a 9.65% improvement (RMSE 0.8596) was achieved by April 2009.

Page 42: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 18 Decision trees

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch

Page 43: Computational Intelligence:  Methods and Applications

Multi-level DT

First split on the best test/attribute, then continue recursively.

Children nodes may be split using different tests/attributes, sometimes repeating those already used in the parent nodes.

Page 44: Computational Intelligence:  Methods and Applications

General DT properties

DT: first general, later more specific decisions.

• node: a test on an attribute, selecting subsets or intervals
• branching: splitting the data vectors into subsets
• leaves: associated with decisions (classes)

Tests: on a single attribute, or a combination of attributes; attribute = {value1, value2, ...} or attribute < value1.

Criteria: maximize the information gain, maximize the purity of the new nodes, or maximize the separability of the subset vectors.

Pruning: remove branches that contain only a few cases; simple trees may generalize better (lower variance, higher bias); evaluate the optimal tree complexity on a validation set.

Stop criterion: node purity, accuracy, tree complexity.

Page 45: Computational Intelligence:  Methods and Applications

Attribute selection

Which attribute should be taken next, A1 or A2? Which test? How much information is gained by splitting? Calculate the entropy! The optimal number of bits to code a message with probability P+ is ~ −lg2 P+, and the same for P−.

P+ and P− are the a priori ± class probabilities in the node S; the test divides the set S into St and Sf.

$$H(\omega|S) = -P_+ \lg_2 P_+ - P_- \lg_2 P_-$$

$$G(\omega, A|S) = H(\omega|S) - \frac{|S_t|}{|S|} H(\omega|S_t) - \frac{|S_f|}{|S|} H(\omega|S_f)$$

The entropy of "pure" nodes (vectors from one class) is 0; the maximum entropy is for a node with mixed samples, Pi = 1/2.

[Figure: plot of H(P+|S) for P+ = 1 − P−.]
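Entropy and information gain in a few lines of Python (a generic sketch; the labels can be any hashable class symbols):

```python
from math import log2

def entropy(labels):
    """H = -sum_i p_i lg2 p_i over the class frequencies in a node."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent, subsets):
    """G = H(parent) - sum_t |S_t|/|S| * H(S_t) for a candidate split."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["+", "+", "-", "-"]
print(entropy(parent))                                 # 1.0 bit
print(info_gain(parent, [["+", "+"], ["-", "-"]]))     # 1.0: pure children
print(info_gain(parent, [["+", "-"], ["+", "-"]]))     # 0.0: useless split
```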

Page 46: Computational Intelligence:  Methods and Applications

Entropy content

Information is just negative entropy, so the entropy change gives the information gain:

$$H(p_1, p_2, \ldots, p_n) = -\sum_i p_i \lg_2 p_i$$

Weather example (from the WEKA book).

Page 47: Computational Intelligence:  Methods and Applications

Creation of a decision tree

Creating a tree = searching the hypothesis space for the simplest set of hierarchical rules (the most compact tree).

ID3 tree: split criterion based on information gain. Bias: smaller trees are better; for equal information gain, select the split with the lower number of branches. No backtracking. Rather robust in the presence of noise, although the local (greedy) search does not guarantee that the final tree will be optimal.

Page 48: Computational Intelligence:  Methods and Applications

Decision borders

Hierarchical partitioning of the feature space into hyper-rectangles. Example: Iris flowers data, with 4 features, displayed in 2D.

Page 49: Computational Intelligence:  Methods and Applications

Uni- and multivariate criteria

Figure from Duda, Hart & Stork, Chap. 8.

Univariate, or monothetic, trees; multivariate, or oblique, trees.

Page 50: Computational Intelligence:  Methods and Applications

DTs are not stable

Figure from Duda, Hart & Stork, Chap. 8.

Moving just one example slightly may lead to quite different trees and space partitions! DTs lack stability against small perturbations of the data.

Page 51: Computational Intelligence:  Methods and Applications

Ockham's razor

Why should simple trees be preferred?

1. The number of simple hypotheses that may accidentally fit the data is small, so the chance that a simple hypothesis uncovers some interesting knowledge about the data is larger.

2. Simpler trees have higher bias and thus lower variance; they should not overfit the data so easily.

3. Simpler trees do not partition the feature space into too many small boxes and may generalize better; complex trees may create a separate box for each training data sample.

Still, even if the tree is small... a sufficiently long search may find a false solution by pure chance; for small datasets with many attributes several descriptions that are equivalent (from the accuracy point of view) may exist.

=> One tree is not sufficient; we need a forest of healthy trees!

Page 52: Computational Intelligence:  Methods and Applications

Overfitting

A model H overfits the data if a model H' exists such that:
Training-error(H) < Training-error(H')
Test-error(H) > Test-error(H')

Model H draws conclusions (makes hypotheses) that are too detailed for the amount of evidence available.

Accuracy as a function of the number of tree nodes: on the training data it may grow up to 100%, but the final result may be worse than for the majority classifier!

Page 53: Computational Intelligence:  Methods and Applications

Random data example

Generate data samples and label them randomly as class 1 or class 2, with the a priori probability of the majority class P(1) = p > 0.5.

The majority classifier makes Emaj = 1 − p errors.

An overfitted tree that classifies all training data correctly has N·p nodes from class 1 and N − N·p nodes from class 2. A random X is then assigned to class 1 with probability p and to class 2 with probability 1 − p, giving the confusion matrix:

$$P = \begin{pmatrix} p^2 & p(1-p) \\ (1-p)p & (1-p)^2 \end{pmatrix}$$

The ratio of the tree error to the majority-classifier error is:

$$\frac{E_{\text{tree}}}{E_{\text{maj}}} = \frac{2p(1-p)}{1-p} = 2p > 1 \quad \text{for } p > 0.5$$

For p = 0.75 the overfitted tree makes 37.5% errors, while the majority classifier makes only 25%.
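This can be verified with a quick simulation (a minimal sketch; the "tree" is modeled as guessing class 1 with probability p, which is what a tree memorizing random labels does on new data):

```python
import random

def overfit_errors(p, n=100_000, seed=0):
    rng = random.Random(seed)
    true = [1 if rng.random() < p else 2 for _ in range(n)]
    guess = [1 if rng.random() < p else 2 for _ in range(n)]
    e_maj = sum(1 for t in true if t != 1) / n                  # ~ 1 - p
    e_tree = sum(1 for t, g in zip(true, guess) if t != g) / n  # ~ 2p(1-p)
    return e_maj, e_tree

print(overfit_errors(0.75))   # approximately (0.25, 0.375)
```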

Page 54: Computational Intelligence:  Methods and Applications

Some examples

Please run a few examples of decision tree solutions on benchmark data, using the WEKA knowledge explorer or RapidMiner.

