Machine Learning
Sections 18.1 - 18.4
What is learning?
“changes in a system that enable a system to do the same task more efficiently the next time” -- Herbert Simon
“constructing or modifying representations of what is being experienced” -- Ryszard Michalski
“making useful changes in our minds” -- Marvin Minsky
Why learn?
Understand and improve human learning
• learn to teach: CAI (computer-aided instruction), CBT (computer-based training)
Discover new things
• data mining
Fill in skeletal information about a domain
• incorporate new information in real time
• make systems less "finicky" or "brittle" by making them better able to generalize
Components of a learning system
Evaluating Performance
Several possible criteria:
• predictive accuracy of the classifier
• speed of the learner
• speed of the classifier
• space requirements
Most common criterion is predictive accuracy.
Major Paradigms of ML
Rote Learning
• memorize examples; association-based storage and retrieval
Induction
• learn from examples to reach general conclusions
Clustering
Analogy
• determine correspondence between representations
Major Paradigms (cont.)
Discovery
• unsupervised, no specific goal
Genetic Algorithms
• combine successful behaviors; only the fittest survive
Reinforcement
• feedback (reward) given at the end of a sequence of steps
• assign rewards by solving a Credit Assignment Problem
Inductive Learning
Extrapolate from a given set of examples so that we can make accurate predictions about future examples.
Types:
Supervised
• teacher tells us the answer, y = f(x)
Unsupervised
• predict a future value and then validate
Concept Learning
• given examples in a class, determine whether a test example is in the class (P) or not (N)
Supervised Concept Learning
Given a training set of positive and negative examples of a concept, construct a description that will accurately classify future examples.
Learn some good estimate of the function f given a training set:
{(x1, y1), (x2, y2), . . ., (xn, yn)}
where each yi is either + (positive) or - (negative).
Inductive Bias
Inductive learning generalizes from specific facts; a generalization cannot be proven true, but it can be proven false (it is falsity preserving).
Learning is like searching a hypothesis space H of possible f functions.
A bias allows us to pick which hypothesis h is preferable: define a metric for comparing candidate functions to find the best.
Inductive learning framework
Raw input is a feature vector, x, that describes the relevant attributes of an example.
Each x is a list of n (attribute, value) pairs:
x = (person=Sue, major=CS, age=Young, Gender=F)
Attributes have discrete values, and all examples have all attributes.
Each example is a point in n-dimensional feature space.
Case-based idea
Maintain a library of previous cases.
When a new problem arises:
• find the most similar case(s) in the library
• adapt the similar cases to solving the current problem
Nearest Neighbor
Save each training example as a point in n-space.
For each testing example, measure the "distance" to each training example.
Classify the example the same as its nearest neighbor.
Suffers from the curse of high dimensionality.
Doesn't generalize well if examples are not clustered tightly.
k-nearest neighbor
What should the value of k be? That is, how many "close" examples should the algorithm consider? It is problem dependent.
Using the k nearest neighbors helps reduce the effect of noise in the data.
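The nearest-neighbor procedure can be sketched in a few lines of Python. This is a minimal sketch assuming numeric feature vectors and Euclidean distance; for the symbolic attributes used elsewhere in these slides, a count of mismatched attribute values could substitute as the distance.

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Classify query by majority vote among the k nearest training examples.

    train: list of (feature_vector, label) pairs with numeric features.
    """
    # Sort the training examples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), '-'), ((0, 1), '-'), ((5, 5), '+'), ((6, 5), '+'), ((5, 6), '+')]
print(knn_classify(train, (5.5, 5.5), k=3))   # falls in the '+' cluster
```

With k = 1 the classifier tracks every noisy point; a larger k smooths over isolated mislabeled neighbors, as the slide notes.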
Nearest-neighbor problems
Storing a large number of examples
• need a strategy for deciding whether to keep or discard an example
• one idea: store part of the training data, use the stored part to predict the rest of the training data, and on a mistake add the misclassified example to the stored set
Irrelevant features
• use a tuning set to add or remove features to/from the feature set
• distance function: how much should each dimension be weighted?
Nearest-neighbor results
Testbed          Nearest-Neighbor   Decision Trees   Neural Nets
Wisc. Cancer     98%                95%              96%
Heart Disease    78%                76%              ?
Tumor            37%                38%              ?
Appendicitis     83%                85%              86%
Learning Decision Trees
Goal: build a decision tree for classifying examples as positive or negative instances of a concept.
Supervised; batch processing of training examples, using a preference bias.
Decision Tree Example
Building Decision Trees
Preference bias is Ockham's Razor: the simplest explanation that is consistent with the observations is probably the best.
Finding the smallest decision tree is NP-hard, so we'll settle for pretty small.
Construction Overview
Top-down, recursive:
• pick the "best" attribute for the current node
• generate child nodes, one for each possible value of the selected attribute
• partition the examples on that attribute, assigning each subset to the child it goes with
• repeat for each child until the examples are homogeneous
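The top-down recursion can be sketched as follows. This is a minimal sketch: the attribute-selection heuristic is left as a caller-supplied parameter, and the (attribute, {value: subtree}) tuple encoding of the tree is an illustrative assumption, not something fixed by the slides.

```python
def build_tree(examples, attributes, pick_attribute):
    """Top-down recursive decision-tree construction (sketch).

    examples: list of ({attribute: value}, label) pairs.
    pick_attribute(examples, attributes): heuristic choosing the "best"
    attribute (e.g. Max-Gain).
    Returns a label (leaf) or an (attribute, {value: subtree}) pair.
    """
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # homogeneous: stop and make a leaf
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return max(set(labels), key=labels.count)
    best = pick_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]
    children = {}
    for value in {x[best] for x, _ in examples}:   # partition on `best`
        subset = [(x, y) for x, y in examples if x[best] == value]
        children[value] = build_tree(subset, remaining, pick_attribute)
    return (best, children)

examples = [
    ({'Color': 'red',   'Shape': 'square', 'Size': 'big'},   '+'),
    ({'Color': 'blue',  'Shape': 'square', 'Size': 'big'},   '+'),
    ({'Color': 'red',   'Shape': 'round',  'Size': 'small'}, '-'),
    ({'Color': 'green', 'Shape': 'square', 'Size': 'small'}, '-'),
    ({'Color': 'red',   'Shape': 'round',  'Size': 'big'},   '+'),
    ({'Color': 'green', 'Shape': 'square', 'Size': 'big'},   '-'),
]
# Trivial picker (first remaining attribute) just to exercise the recursion.
tree = build_tree(examples, ['Color', 'Shape', 'Size'], lambda ex, attrs: attrs[0])
```

The trivial picker here just takes the first remaining attribute; Max-Gain, developed in the following slides, is the picker the lecture actually advocates.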
How to pick the "best" attribute?
Random
• just pick one
Least Values
• narrowest branching of the tree
Most Values
• shallowest tree (fewest levels)
Max-Gain
• largest expected information gain, i.e., smallest expected size of subtrees
Max-Gain background
Use information theory.
Expected work to guess whether an example x in a set S matches a concept:
log2 |S| questions
At each step, we can ask a yes/no question that eliminates at most 1/2 of the elements remaining.
Expected questions remaining
Given S = P union N, with P and N disjoint:
• if x is in P, then log2 |P| questions are needed
• if x is in N, then log2 |N| questions are needed
Expected number of questions:
prob(x in P) * log2 |P| + prob(x in N) * log2 |N|
or, equivalently, with p = |P| and n = |N|,
(p / (p+n)) * log2 p + (n / (p+n)) * log2 n
Information Content
How many questions do we save by knowing whether x is in P or N?
I(P, N) = log2 |S| - (|P|/|S|) log2 |P| - (|N|/|S|) log2 |N|
or, equivalently, with %P = |P|/|S| and %N = |N|/|S|,
I(%P, %N) = - %P log2 %P - %N log2 %N
Note that 0 <= I(P, N) <= 1: 0 is no information, 1 is maximum information.
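Both forms of I can be checked numerically. A minimal helper, assuming the usual convention 0 * log2(0) = 0:

```python
import math

def information_content(p, n):
    """I(%P, %N) for a set with p positive and n negative examples.

    Takes 0 * log2(0) = 0 by convention, so homogeneous sets give 0 bits.
    """
    total = p + n
    def term(count):
        fraction = count / total
        return 0.0 if count == 0 else fraction * math.log2(fraction)
    return max(0.0, -(term(p) + term(n)))   # max() guards against -0.0

print(information_content(1, 1))   # perfectly balanced set: 1 bit
print(information_content(4, 0))   # homogeneous set: 0 bits
```

These two calls reproduce the Perfect Balance and Homogeneity slides that follow.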
Perfect Balance
I(1/2, 1/2) = - 1/2 log2(1/2) - 1/2 log2(1/2)
            = - 1/2 (log2 1 - log2 2) - 1/2 (log2 1 - log2 2)
            = - 1/2 (0 - 1) - 1/2 (0 - 1)
            = 1/2 + 1/2
            = 1
The information content is maximal.
Example: Homogeneity
If all of the samples in S are positive and none are negative, the information content is low:
I(1, 0) = - 1 log2(1) - 0 log2(0)
        = - 0 - 0
        = 0
(taking 0 log2 0 = 0 by convention)
Low Information Content
Low information content is desirable in order to build the smallest tree:
• most of the examples are classified the same
• the subtree under this node will probably be small
Information Gained
For a given attribute, measure the difference in information content after a node splits up the examples:
• measure the information content at each child
• weight the information by the proportion of examples that will go there
Max-Gain definitions
Si = subset of S with value i, for i = 1, . . ., m
Pi = subset of Si that are +
Ni = subset of Si that are -
qi = |Si| / |S| = % of examples on branch i
%Pi = |Pi| / |Si| = % of + examples on branch i
%Ni = |Ni| / |Si| = % of - examples on branch i
Information remaining
Weighted sum of the information content of each child node generated by that attribute:
Remainder(A) = sum over i = 1, . . ., m of qi * I(%Pi, %Ni)
Information Gain
Subtract the expected information content after the node from the information content at the entrance to the node to get the gain at that node:
Gain(A) = I(%P, %N) - Remainder(A)
Select the best attribute
Of all the remaining attributes, select the attribute with the highest gain for this location in the decision tree. Since the entrance information is constant:
• select the A with the minimum Remainder(A)
Example Data
Example  Color  Shape   Size   Class
1        red    square  big    +
2        blue   square  big    +
3        red    round   small  -
4        green  square  small  -
5        red    round   big    +
6        green  square  big    -
Remainder(Color)
3 of 6 are red; 2 of the 3 are +
• 3/6 * I(2/3, 1/3) = 0.5 * 0.918 = 0.459
1 of 6 is blue; all are +
• 1/6 * I(1/1, 0/1) = 0.000
2 of 6 are green; all are -
• 2/6 * I(0/2, 2/2) = 0.000
Remainder(Color) = 0.459 + 0.0 + 0.0 = 0.459
Gain Result
Attribute  Remainder  Gain
Color      0.459      0.541
Shape      1.000      0.000
Size       0.541      0.459
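Recomputing the remainders and gains for this data confirms the table; the function and variable names here are illustrative. Note that I(2/3, 1/3) = log2 3 - 2/3 ≈ 0.918.

```python
import math

def info(p, n):
    """I(%P, %N): information content of a p-positive, n-negative set."""
    total = p + n
    term = lambda c: 0.0 if c == 0 else (c / total) * math.log2(c / total)
    return max(0.0, -(term(p) + term(n)))   # max() guards against -0.0

# (Color, Shape, Size, Class) rows from the Example Data slide.
examples = [
    ('red', 'square', 'big', '+'), ('blue', 'square', 'big', '+'),
    ('red', 'round', 'small', '-'), ('green', 'square', 'small', '-'),
    ('red', 'round', 'big', '+'), ('green', 'square', 'big', '-'),
]

def remainder(attr):
    """Weighted information content left after splitting on one attribute."""
    column = {'Color': 0, 'Shape': 1, 'Size': 2}[attr]
    total = len(examples)
    result = 0.0
    for value in {row[column] for row in examples}:
        subset = [row for row in examples if row[column] == value]
        positives = sum(1 for row in subset if row[3] == '+')
        result += (len(subset) / total) * info(positives, len(subset) - positives)
    return result

for attr in ('Color', 'Shape', 'Size'):
    gain = info(3, 3) - remainder(attr)   # entrance information I(3/6, 3/6) = 1
    print(f'{attr}: remainder {remainder(attr):.3f}, gain {gain:.3f}')
```

Color has the highest gain, which is why it ends up at the root of the final tree.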
Final Decision Tree
Root: Color
• R -> test Size: Big -> +, Small -> -
• B -> +
• G -> -
Extensions
Real-valued data
• choose thresholds; each interval becomes a discrete value
Noisy data and overfitting
• two examples have identical evidence but different classifications
• some values are inaccurate (the teacher is wrong)
• some attributes are irrelevant
Pruning
To avoid overfitting:
choose a threshold for information gain
• if the best remaining attribute is not very good, prune here by making the node a leaf rather than generating children
choose a depth limit
use a tuning set
Generation of rules
For each path from the root to a leaf, translate to a rule: if color=red and size=big then +
The collection of rules for all paths from the root to leaves is an interpretation of what the tree means
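Path-to-rule translation can be sketched as a small generator. The (attribute, {value: subtree}) tuple encoding of the tree is an illustrative assumption, not something fixed by the slides.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one "if ... then ..." rule per root-to-leaf path.

    tree is either a class label (a leaf) or an
    (attribute, {value: subtree}) pair.
    """
    if not isinstance(tree, tuple):          # leaf: emit the accumulated tests
        tests = ' and '.join(f'{a}={v}' for a, v in conditions)
        yield f'if {tests} then {tree}'
    else:
        attribute, children = tree
        for value, subtree in children.items():
            yield from tree_to_rules(subtree, conditions + ((attribute, value),))

# The final decision tree from the slides, in the assumed encoding.
tree = ('Color', {'red': ('Size', {'big': '+', 'small': '-'}),
                  'blue': '+', 'green': '-'})
for rule in tree_to_rules(tree):
    print(rule)
```

The first rule printed is "if Color=red and Size=big then +", matching the example in the slide.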
Setting Parameters
Some algorithms require setting learning parameters, which must be set without looking at the test data. One method: tuning sets.
• partition the data into a train set and a tune set
• for each candidate parameter value, generate a decision tree using the train set
• use the tune set to evaluate error rates and determine which parameter value is best
• compute a new decision tree using the selected parameter value and the entire training data
Cross Validation
Divide all examples into N disjoint sets E = {E1, E2, E3, . . ., EN}.
For each i = 1, . . ., N:
• train set = E - {Ei}, test set = {Ei}
• compute a decision tree using the train set
• determine the performance accuracy Pi using the test set
Compute the N-fold cross-validation estimate of performance:
(P1 + P2 + P3 + . . . + PN) / N
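The procedure above can be sketched as follows. The learner and accuracy measure are caller-supplied stand-ins (here a toy majority-class "learner"), not part of the slides.

```python
def cross_validation(examples, n_folds, train, accuracy):
    """N-fold cross-validation estimate of performance (sketch).

    train(train_set) returns a classifier; accuracy(classifier, test_set)
    returns its accuracy on held-out data. Both are caller-supplied.
    """
    # Partition the examples into N disjoint sets E1, ..., EN.
    folds = [examples[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        test_set = folds[i]                       # hold out Ei for testing
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(accuracy(train(train_set), test_set))
    return sum(scores) / n_folds                  # (P1 + P2 + ... + PN) / N

# Tiny demonstration: a majority-class "learner" on labels alone.
data = ['+', '+', '+', '+', '+', '-']
majority = lambda train_set: max(set(train_set), key=train_set.count)
fraction_right = lambda label, test_set: sum(y == label for y in test_set) / len(test_set)
estimate = cross_validation(data, 3, majority, fraction_right)
```

Each example is held out exactly once, so the estimate uses all the data for testing without ever testing on an example that was trained on.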
WillWait from 12 Examples
Increasing Training Set
Summary
Decision trees are widely used:
• easy to understand the rationale
• can out-perform humans
• fast and simple to implement
• handle noisy data well
Weaknesses:
• univariate (uses only one variable at a time)
• batch (non-incremental)