DECISION TREES
Learning what questions to ask

8/29/03

Decision tree

The job is to build a tree that represents a series of questions that the classifier will ask of a data instance that is to be classified
Each node is a question about the value that the instance to be classified has in a particular dimension

Outlook Humidity Wind Play Tennis?

Sunny Normal Weak ???

How would the decision tree classify this data instance?

Discrete Data

The fan-out of each node is determined by how many different values that dimension can take on

Play Tennis?


Training

Training data is used to build the tree
How to decide what question to ask first?
Remember the curse of dimensionality

There might be just a few dimensions that are important and the rest could be random

Training builds the tree

Classifying means using the tree
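To make "using the tree" concrete, here is a minimal sketch of classification as a walk from the root to a leaf. The nested-dict layout and the particular tree are assumptions made for illustration (the tree on the slide is a figure and is not reproduced here); `TREE` and `classify` are hypothetical names.

```python
# Hypothetical tree for illustration only; its shape is an assumption.
# An internal node is {"attribute": name, "branches": {value: subtree}};
# a leaf is simply a class label.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {
            "attribute": "Humidity",
            "branches": {"High": "No", "Normal": "Yes"},
        },
        "Overcast": "Yes",
        "Rain": {
            "attribute": "Wind",
            "branches": {"Strong": "No", "Weak": "Yes"},
        },
    },
}


def classify(tree, instance):
    """Walk from the root to a leaf, answering one question per node."""
    node = tree
    while isinstance(node, dict):
        answer = instance[node["attribute"]]  # value in this node's dimension
        node = node["branches"][answer]       # follow the matching branch
    return node


instance = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}
print(classify(TREE, instance))  # Yes
```

With this assumed tree, the Sunny / Normal / Weak instance never has its Wind value consulted: the walk ends as soon as a leaf is reached.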


What Question to Ask

What question can I ask about the data that will give me the most information gain?
The one that gets me closer to being able to classify…

Identifying the most important dimension (most important question)

What to ask next…

What is the outlook?

How humid is it?

How windy is it?


Information Theory

Another statistical approach

The approach comes out of information theory
From Wikipedia: developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data

Basically, how much information can I cram into a given signal (how many bits can I encode)?


Entropy

Starts with entropy…

Entropy is a measure of the impurity (lack of homogeneity) of the data

Purely random (nothing but noise) is maximum entropy

Linearly separable data is minimum entropy
What does that mean with discrete data?

Given all instances with a sunny outlook, what if every “low humidity” instance was classified “yes, play tennis” and every “high humidity” instance was classified “no, do not play tennis”?

High entropy or low?

Given all instances with a sunny outlook, what if half were “yes, play tennis” and half “no, don’t play” no matter what the humidity?

High entropy or low?


Entropy

S is a collection of training samples
$p_{+}$ is the proportion of positives
$p_{-}$ is the proportion of negatives

$$Entropy(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

We define $0 \log_2 0$ as 0

If going to measure…

Want a statistical approach that yields…

Example: 100% positives
Example: 0% positives
Example: 50% positives


Example

What if a sample was 20% positive, 80% negative?
log2(.2) = log(.2)/log(2) = -2.321928
log2(.8) = -0.3219281
Entropy = -(.2)*(-2.321928) - (.8)*(-0.3219281) = 0.7219281

What if 80% positive, 20% negative? Same

What if 50% / 50%? Highest entropy, 1
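The arithmetic above can be checked with a few lines of Python. This is a small sketch; `binary_entropy` is a name chosen here, not something from the slides.

```python
import math


def binary_entropy(p_pos):
    """Entropy in bits of a two-class sample with proportion p_pos of positives."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:  # 0 * log2(0) is defined as 0, so skip empty classes
            total -= p * math.log2(p)
    return total


for p in (0.2, 0.8, 0.5, 1.0, 0.0):
    print(p, round(binary_entropy(p), 7))
```

The 20% and 80% cases give the same value because the formula is symmetric in the two proportions, and the pure samples (100% or 0% positives) come out at zero.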


If Not Binary

$$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

Can extend to more classes, not just positive and negative

• If the base is set to the number of classes, the maximum goes back to 1
• If we stick with base 2, the maximum is $\log_2 c$ for $c$ classes
• From book: entropy is a measure of the expected encoding length measured in bits
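A short sketch of the multi-class case, to show how the choice of logarithm base changes the maximum. The function name and the `base` parameter are illustrative choices.

```python
import math


def entropy(proportions, base=2):
    """Entropy of a class distribution; zero proportions contribute nothing."""
    return sum(-p * math.log(p, base) for p in proportions if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely classes
print(entropy(uniform))              # base 2: the maximum, log2(4) bits
print(entropy(uniform, base=4))      # base = number of classes: maximum of 1
```

A uniform distribution over the classes is the worst case, so it is the one that reaches the maximum in either base.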


Humidity question or Windy question?


Information Gain

Simply, the expected reduction in entropy caused by partitioning the examples according to this attribute

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Scales the contribution of each answer according to membership

If entropy of S is 1 and each of the entropies for the answers is 1 then … 1 – 1 so zero

Information gain is zero

If entropy of S is 1 and each of the entropies for the answers is 0 then … 1 – 0 so one

Information gain is 1
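The two extreme cases can be reproduced with a small sketch of the gain formula. The sample data here is made up purely to hit those extremes, and the representation (a list of dicts with a named target column) is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(samples, attribute, target):
    """Expected reduction in entropy from partitioning samples on attribute."""
    labels = [s[target] for s in samples]
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        # Weight each answer's entropy by its share of the samples.
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - remainder


# Made-up samples: attribute "a" predicts the class perfectly, "b" not at all.
samples = [
    {"a": "x", "b": "u", "play": "yes"},
    {"a": "x", "b": "v", "play": "yes"},
    {"a": "y", "b": "u", "play": "no"},
    {"a": "y", "b": "v", "play": "no"},
]
print(gain(samples, "a", "play"))  # 1.0
print(gain(samples, "b", "play"))  # 0.0
```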


Example

S: 14 training samples, 9 yeses to tennis, 5 nos

What is the information gain?
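Whatever attribute is being considered, the first term of the gain is the entropy of the whole sample. For 9 positives and 5 negatives out of 14, the binary entropy works out as:

```latex
Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}
           \approx 0.410 + 0.531
           \approx 0.940
```

The gain for a given attribute then subtracts the membership-weighted entropies of its answers from this value.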


The algorithm

Recursive algorithm: ID3 (Iterative Dichotomiser 3)

ID3(S, attributes yet to be processed)
  Create a Root node for the tree
  Base cases
    If S are all the same class, return the single-node tree Root with that label
    If attributes is empty, return the Root node with label equal to the most common class
  Otherwise
    Find the attribute with the greatest information gain
    Set it as the decision attribute for Root
    For each value of the chosen attribute
      Add a new branch below Root
      Determine Sv for that value
      If Sv is empty
        Add a leaf with label of the most common class
      Else
        Add subtree to this branch: ID3(Sv, attributes - this attribute)
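The pseudocode above can be sketched in Python roughly as follows. This is one possible rendering, not a reference implementation: samples are assumed to be dicts, the tree is a nested dict, and the `values` argument (every value each attribute can take) is an added parameter so that a branch exists even for answers with no training samples.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(samples, attribute, target):
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy([s[target] for s in samples]) - remainder


def id3(samples, attributes, target, values):
    """samples: list of dicts; attributes: names not yet used;
    values: dict mapping each attribute to all values it can take."""
    labels = [s[target] for s in samples]
    most_common = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return most_common                      # base cases: return a leaf
    best = max(attributes, key=lambda a: gain(samples, a, target))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in values[best]:                  # one branch per possible answer
        subset = [s for s in samples if s[best] == value]
        if not subset:                          # no samples gave this answer
            node["branches"][value] = most_common
        else:
            node["branches"][value] = id3(subset, remaining, target, values)
    return node


# Made-up data: Humidity decides the class, Wind is irrelevant.
data = [
    {"Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
    {"Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
]
values = {"Humidity": ["High", "Normal"], "Wind": ["Weak", "Strong"]}
print(id3(data, ["Humidity", "Wind"], "Play", values))
```

On this toy data the tree stops after a single question about Humidity, because both of its answers are already pure.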


Another example

Which attribute next?


Another Example

Next attribute?


An issue

Is there a branch for every answer?

What if no training samples had overcast as their outlook?

Could you classify a new unknown or test instance if it had overcast in that dimension?


An issue

Tree often perfectly classifies training data
Not guaranteed but usually: if every dimension is exhausted during drill-down, the last decision node might have answers that are still “impure” but are labeled with the most abundant class

For instance: on the cancer data my tree had no leaves deeper than 4 levels

It basically memorizes the training data
Is this the best policy?
What if a node “should” be pure but had a single exception?

Overfitting


Visualizing Overfitting

[Figure: scatter plot titled "Two Classes", axes X and Y, with a decision boundary drawn between the classes]

Sometimes it is better to live with a little error than to try to get perfection


Overfitting

Wikipedia: In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.

[Figure: two plots of Y against X, with X from -10 to 10 and Y from -3000 to 3000]


How to Fix

Bayesian finds the boundary that minimizes error
If we trim the decision tree’s leaves, we get a similar effect
i.e. don’t try to memorize every single training sample


Don’t know until you know
Withhold some data
Use it to test

Definition

Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$, such that $h$ has smaller error than $h'$ over the training examples, but $h'$ has a smaller error than $h$ over the entire distribution of instances.
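The definition can be turned into a direct check, as a rough sketch. The entire distribution of instances is never available in practice, so the withheld data stands in for it here; that substitution, the function names, and the toy data are all assumptions.

```python
def error_rate(predict, samples, target):
    """Fraction of samples that predict() labels incorrectly."""
    wrong = sum(1 for s in samples if predict(s) != s[target])
    return wrong / len(samples)


def overfits(h, h_alt, train, held_out, target):
    """True if h beats h_alt on the training data but loses to it on held-out data."""
    return (error_rate(h, train, target) < error_rate(h_alt, train, target)
            and error_rate(h_alt, held_out, target) < error_rate(h, held_out, target))


# Made-up data: the x == 3 training sample is a noisy exception.
train = [{"x": 1, "y": "a"}, {"x": 2, "y": "a"}, {"x": 3, "y": "b"}]
held_out = [{"x": 3, "y": "a"}, {"x": 3, "y": "a"}, {"x": 1, "y": "a"}]


def memorizer(s):
    return "b" if s["x"] == 3 else "a"   # fits the training data perfectly


def simple(s):
    return "a"                           # ignores the exception


print(overfits(memorizer, simple, train, held_out, "y"))  # True
```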


How to prevent?

Stop growing the tree early
Set some threshold for allowable entropy

Post pruning
Build the tree, then remove nodes as long as it improves


Reduced Error Pruning

Remove each decision node in turn and check performance on withheld data
Removing a decision node means removing all sub-trees below it and assigning the most common class
Remove (permanently) the decision node that caused the greatest increase in accuracy
Rinse and repeat
Try it and see
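A minimal sketch of reduced error pruning, under a few assumptions that are not in the slides: the tree is a nested dict, each decision node carries a precomputed `"majority"` label (the most common class of the training samples that reached it), and accuracy is measured on a separate validation set.

```python
def classify(tree, instance):
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def accuracy(tree, samples, target):
    return sum(classify(tree, s) == s[target] for s in samples) / len(samples)


def decision_paths(tree, path=()):
    """Yield the branch-value path leading to every decision node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["branches"].items():
            yield from decision_paths(child, path + (value,))


def collapse(tree, path):
    """Copy of tree with the node at path replaced by its majority-class leaf."""
    if not path:
        return tree["majority"]
    new = dict(tree)
    new["branches"] = dict(tree["branches"])
    new["branches"][path[0]] = collapse(tree["branches"][path[0]], path[1:])
    return new


def reduced_error_prune(tree, validation, target):
    while isinstance(tree, dict):
        base = accuracy(tree, validation, target)
        candidates = [collapse(tree, p) for p in decision_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) <= base:
            break                 # no removal increases accuracy: stop
        tree = best               # keep the best removal, then repeat
    return tree


# Made-up tree in which the Wind question under "High" only fits noise.
tree = {
    "attribute": "Humidity", "majority": "Yes",
    "branches": {
        "High": {"attribute": "Wind", "majority": "No",
                 "branches": {"Weak": "Yes", "Strong": "No"}},
        "Normal": "Yes",
    },
}
validation = [
    {"Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
]
print(reduced_error_prune(tree, validation, "Play"))
```

This version stops as soon as no single removal strictly increases validation accuracy, which matches the "remove as long as it improves" wording; other variants also accept removals that leave accuracy unchanged.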


Rule Post Pruning

Build the complete (overtrained) tree
Convert the learned tree into a set of rules
One rule per path from root to leaf
Each rule is a conjunction of conditions
Remove any clause from a rule chain if removing it increases accuracy
Remember each rule chain provides a full classification
Sort rules by accuracy and classify in that order
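The conversion from tree to rules, and the clause-dropping step, can be sketched as below. The rule representation (a tuple of attribute/value tests plus a label), the accuracy measure (accuracy over the samples a rule covers), and the example tree and samples are all assumptions for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """One rule per root-to-leaf path: (tuple of (attribute, value) tests, label)."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    rules = []
    for value, child in tree["branches"].items():
        rules += tree_to_rules(child, conditions + ((tree["attribute"], value),))
    return rules


def rule_accuracy(rule, samples, target):
    """Accuracy of a rule over the samples it covers (0.0 if it covers none)."""
    conditions, label = rule
    covered = [s for s in samples if all(s[a] == v for a, v in conditions)]
    if not covered:
        return 0.0
    return sum(s[target] == label for s in covered) / len(covered)


def prune_rule(rule, samples, target):
    """Greedily drop any clause whose removal increases the rule's accuracy."""
    conditions, label = rule
    improved = True
    while improved:
        improved = False
        current = rule_accuracy((conditions, label), samples, target)
        for i in range(len(conditions)):
            shorter = conditions[:i] + conditions[i + 1:]
            if rule_accuracy((shorter, label), samples, target) > current:
                conditions, improved = shorter, True
                break
    return conditions, label


# Hypothetical tree and samples, for illustration only.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
    },
}
samples = [
    {"Outlook": "Sunny", "Humidity": "High", "Play": "Yes"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Play": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Play": "No"},
]
rules = [prune_rule(r, samples, "Play") for r in tree_to_rules(tree)]
rules.sort(key=lambda r: rule_accuracy(r, samples, "Play"), reverse=True)
for conditions, label in rules:
    print(conditions, "->", label)
```

Because each rule is pruned on its own, two rules that started from the same root question can end up with different clauses, which is why the result is no longer a tree.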


Not really a tree any more
A series of rules
A node could both be present and not be present
Imagine a bifurcation where one track has only the first and last “node”


Bagging

Bootstrap aggregating (bagging)

Helps to avoid overfitting

Usually applied to decision tree models (though not exclusively)


Bagging

Machine learning ensemble meta-algorithm
Create a bunch of models
Do so by bootstrap sampling the training data
Let all the models vote

[Figure: seven copies of a tree asking questions Q1, Q2, Q3, Q4, and a crowd of training samples each calling "Pick me!"]
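Bagging itself is only a few lines once a learner is available. The sketch below assumes a `train` function that takes a list of samples and returns a callable model; the majority-class learner is a deliberately trivial stand-in for a decision tree learner, and all names are illustrative.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) samples with replacement."""
    return [rng.choice(samples) for _ in samples]


def bagging(samples, train, n_models, seed=0):
    """Train n_models models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(samples, rng)) for _ in range(n_models)]


def vote(models, instance):
    """Let every model predict, then return the most common answer."""
    return Counter(m(instance) for m in models).most_common(1)[0][0]


def train_majority(samples):
    # Stand-in learner for illustration: always predicts the majority class
    # of the bootstrap sample it was trained on.
    label = Counter(s["Play"] for s in samples).most_common(1)[0][0]
    return lambda instance: label


data = [{"Wind": "Weak", "Play": "Yes"}] * 9 + [{"Wind": "Strong", "Play": "No"}] * 5
models = bagging(data, train_majority, n_models=7)
print(vote(models, {"Wind": "Weak"}))
```

Sampling with replacement means some training samples are picked several times and others not at all, so each model sees a slightly different view of the data.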


Random Forest

Forest is a bunch of trees

Each tree has access to a random subset of attributes/dimensions
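Giving each tree its own random subset of attributes could look like the sketch below. It follows the per-tree description on this slide; some random forest variants instead draw a fresh subset at every split. The function name and parameters are illustrative.

```python
import random


def random_attribute_subsets(attributes, n_trees, subset_size, seed=0):
    """Pick, for each tree, a random subset of the attributes without repeats."""
    rng = random.Random(seed)
    return [rng.sample(attributes, subset_size) for _ in range(n_trees)]


attributes = ["Outlook", "Humidity", "Wind"]
for i, subset in enumerate(random_attribute_subsets(attributes, n_trees=4, subset_size=2)):
    print("tree", i, "may ask about", subset)
```

Each subset would then be passed to the tree learner as the only attributes that tree is allowed to ask about.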


The nature of Decision Trees

Greedy algorithm
Tries to race to an answer
Finds the next question that best splits the data into classes by answer

Result: short trees are preferred


Occam’s razor

The simplest answer is often the best

But does this lead to the best classifier

Book has a philosophical discussion about this without resolving the issue


Coolness factor

Many classifiers simply give an answer, with no reason
Decision trees are one of the few that provide such insights


Date post:	27-Dec-2015
Upload:	esther-terry