Machine Learning
Lecture # 4: Multilayer Perceptron & Decision Trees
Artificial Neural Networks (ANN)
• Neural computing requires a number of neurons to be connected together into a neural network.
• A neural network consists of:
– layers
– links between layers
• The links are weighted.
• There are three kinds of layers:
1. input layer
2. hidden layer
3. output layer
From Human Neurones to Artificial Neurones
A simple neuron
• At each neuron, every input has an associated weight which modifies the strength of each input.
• The neuron simply adds together all the inputs and calculates an output to be passed on.
Activation function
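A minimal sketch of this computation in Python (an illustration, not from the slides; it assumes the logistic sigmoid activation, and borrows the H1 weights from the worked example later in this lecture):

import math

def sigmoid(x):
    # Logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    # Weighted sum of the inputs, passed through the activation function
    s = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)

print(neuron_output([10, 30, 20], [0.2, -0.1, 0.4]))  # sigmoid(7), roughly 0.999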
MultiLayer Perceptron (MLP)
Motivation
• Perceptrons are limited because they can only solve problems that are linearly separable
• We would like to build more complicated learning machines to model our data
• One way to do this is to build multiple layers of perceptrons
Brief History
• 1985 Ackley, Hinton and Sejnowski propose the Boltzmann machine
– This was a multi-layer step perceptron
– More powerful than perceptron
– Successful application: NETtalk
• 1986 Rumelhart, Hinton and Williams invent the Multi-Layer Perceptron (MLP) with backpropagation
– Dominant neural net architecture for 10 years
Multi-layer networks
• So far we discussed networks with one layer.
• But these networks can be extended to combine several layers, increasing the set of functions that can be represented using a NN
MLP
Multilayer Neural Network
Sigmoid Response Functions
MLP
Simple example: AND
x1 x2 | output
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
A single sigmoid unit with bias weight -30 and input weights 20, 20 outputs ≈1 only when x1 = x2 = 1.

Example: OR function
x1 x2 | output
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1
Weights: bias -10, inputs 20, 20.

Negation:
x | output
0 | 1
1 | 0
Weights: bias 10, input -20.

Putting it together:
x1 x2 | output
0  0  | 1
0  1  | 0
1  0  | 0
1  1  | 1
Hidden unit a1 computes AND(x1, x2) with weights (-30, 20, 20); hidden unit a2 computes (NOT x1) AND (NOT x2) with weights (10, -20, -20); the output unit computes OR(a1, a2) with weights (-10, 20, 20). The result is XNOR, as the sketch below verifies.
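A quick Python check of these hand-chosen weights (a sketch; the sigmoid unit and the weight values are exactly the ones listed above):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit(bias, weights, xs):
    # One sigmoid unit: bias plus weighted inputs
    return sigmoid(bias + sum(w * x for w, x in zip(weights, xs)))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = unit(-30, (20, 20), (x1, x2))      # AND: ~1 only for (1, 1)
        o = unit(-10, (20, 20), (x1, x2))       # OR: ~1 unless both inputs are 0
        a2 = unit(10, (-20, -20), (x1, x2))     # (NOT x1) AND (NOT x2)
        xnor = unit(-10, (20, 20), (a1, a2))    # OR over the two hidden units
        print(x1, x2, round(a1), round(o), round(xnor))

Rounding the unit outputs reproduces the truth tables above.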
Example of multilayer Neural Network
• Suppose input values are 10, 30, 20
• The weighted sum coming into H1
SH1 = (0.2 × 10) + (-0.1 × 30) + (0.4 × 20) = 2 - 3 + 8 = 7
• The σ function is applied to SH1:
σ(SH1) = 1/(1 + e^-7) = 1/(1 + 0.000912) = 0.999
• Similarly, the weighted sum coming into H2:
SH2 = (0.7 × 10) + (-1.2 × 30) + (1.2 × 20) = 7 - 36 + 24 = -5
• σ applied to SH2:
σ(SH2) = 1/(1 + e^5) = 1/(1 + 148.4) = 0.0067
• Now the weighted sum into output unit O1:
SO1 = (1.1 × 0.999) + (0.1 × 0.0067) = 1.0996
• The weighted sum into output unit O2:
SO2 = (3.1 × 0.999) + (1.17 × 0.0067) = 3.1047
• The output of sigmoid unit O1:
σ(SO1) = 1/(1 + e^-1.0996) = 1/(1 + 0.333) = 0.750
• The output from the network for O2:
σ(SO2) = 1/(1 + e^-3.1047) = 1/(1 + 0.045) = 0.957
• The input triple (10, 30, 20) would be categorised as O2, because this unit has the larger output.
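The forward pass can be reproduced with a few lines of Python (a sketch using only the weights stated in the example):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [10, 30, 20]                              # I1, I2, I3
w_hidden = [[0.2, -0.1, 0.4], [0.7, -1.2, 1.2]]    # weights into H1, H2
w_output = [[1.1, 0.1], [3.1, 1.17]]               # weights into O1, O2 (from H1, H2)

# Hidden layer: weighted sum, then sigmoid
h = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
print(h)    # approximately [0.999, 0.0067]

# Output layer: O1 ~ 0.750, O2 ~ 0.957, so the input is categorised as O2
o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_output]
print(o)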
Training Parametric Model
Minimizing Error
Least Squares Gradient
Single Layer Perceptron
Single layer Perceptrons
Different Response Functions
Learning a Logistic Perceptron
Back Propagation
Back Propagation
A Worked Example:
• We propagated the values (10, 30, 20) through the network.
• Suppose now that the target categorisation for the example was the one associated with O1 (using a learning rate of η = 0.1),
• i.e. the target output for O1 was 1, and the target output for O2 was 0:
t1(E) = 1; t2(E) = 0; o1(E) = 0.750; o2(E) = 0.957
• Error values for the output units O1 and O2:
– δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750 × (1 - 0.750) × (1 - 0.750) = 0.0469
– δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957 × (1 - 0.957) × (0 - 0.957) = -0.0394
Input units         Hidden units                     Output units
Unit  Input         Unit  Weighted Sum  Output       Unit  Weighted Sum  Output
I1    10            H1    7             0.999        O1    1.0996        0.750
I2    30            H2    -5            0.0067       O2    3.1047        0.957
I3    20
• To propagate this information backwards to the hidden nodes H1 and H2:
– Multiply the error term for O1 by the weight from H1 to O1, then add the product of the error term for O2 and the weight from H1 to O2: (1.1 × 0.0469) + (3.1 × -0.0394) = -0.0706
– δH1 = -0.0706 × (0.999 × (1 - 0.999)) = -0.0000705
– Similarly for H2: (0.1 × 0.0469) + (1.17 × -0.0394) = -0.0414
– δH2 = -0.0414 × (0.0067 × (1 - 0.0067)) = -0.000276
A Worked Example:
Input unit  Hidden unit  η    δH          xi  Δ = η·δH·xi  Old weight  New weight
I1          H1           0.1  -0.0000705  10  -0.0000705   0.2         0.1999295
I1          H2           0.1  -0.000276   10  -0.000276    0.7         0.699724
I2          H1           0.1  -0.0000705  30  -0.0002115   -0.1        -0.1002115
I2          H2           0.1  -0.000276   30  -0.000828    -1.2        -1.200828
I3          H1           0.1  -0.0000705  20  -0.000141    0.4         0.399859
I3          H2           0.1  -0.000276   20  -0.000552    1.2         1.199448
Hidden unit  Output unit  η    δO       hi(E)   Δ = η·δO·hi(E)  Old weight  New weight
H1           O1           0.1  0.0469   0.999   0.004685        1.1         1.104685
H1           O2           0.1  -0.0394  0.999   -0.003936       3.1         3.096064
H2           O1           0.1  0.0469   0.0067  0.0000314       0.1         0.1000314
H2           O2           0.1  -0.0394  0.0067  -0.0000264      1.17        1.1699736
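The whole backward pass can be checked with a short Python sketch (it repeats the forward pass above, then applies the delta rules and weight updates just tabulated):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

eta = 0.1
x = [10, 30, 20]                              # inputs I1, I2, I3
w_h = [[0.2, -0.1, 0.4], [0.7, -1.2, 1.2]]    # weights into H1, H2
w_o = [[1.1, 0.1], [3.1, 1.17]]               # weights into O1, O2 (from H1, H2)
t = [1, 0]                                    # targets for O1, O2

# Forward pass
h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_o]

# Output deltas: o(1 - o)(t - o), roughly [0.0469, -0.0394]
d_o = [oi * (1 - oi) * (ti - oi) for oi, ti in zip(o, t)]

# Hidden deltas: h(1 - h) * sum of (weight to each output * that output's delta)
d_h = [hi * (1 - hi) * sum(w_o[k][j] * d_o[k] for k in range(2))
       for j, hi in enumerate(h)]

# Weight updates: delta_w = eta * delta * input carried by that weight
for j in range(2):
    for i in range(3):
        w_h[j][i] += eta * d_h[j] * x[i]
for k in range(2):
    for j in range(2):
        w_o[k][j] += eta * d_o[k] * h[j]
print(w_h, w_o)   # matches the 'New weight' columns up to rounding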
A Worked Example:
When to Learn
Online Learning
Batch Learning
Early Stopping
Self Study Examples
XOR Example
Linear separation
Can AND, OR and NOT be represented?
• Is it possible to represent every boolean function by simply combining these?
• Every boolean function can be composed using AND, OR and NOT (or even only NAND).
Linear separation
• How can we learn the XOR function?
Linear separation
X1 X2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
Linear separation
X1 X2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
It is impossible to find values of the weights Wi with which a single-layer perceptron learns XOR.
Linear separation
X1 X2 X1*X2 XOR
0  0  0     0
1  0  0     1
0  1  0     1
1  1  1     0
By adding the product X1*X2 as a third input, the function becomes linearly separable, so we can learn weights W1, W2 and W3.
Example: Back-Propagation learning the XOR function
• Training samples (bipolar):
in_1 in_2 d
P0 -1 -1 -1
P1 -1 1 1
P2 1 -1 1
P3 1 1 -1
• Network: 2-2-1 with threshold (bias) inputs fixed at 1
• Initial weights W(0):
w1(1,0) = (-0.5, 0.5, -0.5)
w2(1,0) = (-0.5, -0.5, 0.5)
w(2,1) = (-1, 1, 1)
• Learning rate η = 0.2
• Node function: hyperbolic tangent
g(x) = (1 - e^-x)/(1 + e^-x) = 2s(x) - 1, where s(x) = 1/(1 + e^-x)
g(x) → 1 as x → ∞ and g(x) → -1 as x → -∞
s'(x) = s(x)(1 - s(x)); the derivative is applied below as g'(x) = (1 - g(x))(1 + g(x))
Forward computing: present P0 = (1, -1, -1) (bias input fixed at 1), target d = -1
net1 = w1(1,0) · p0 = (-0.5, 0.5, -0.5) · (1, -1, -1) = -0.5
net2 = w2(1,0) · p0 = (-0.5, -0.5, 0.5) · (1, -1, -1) = -0.5
x1(1) = g(net1) = 2/(1 + e^0.5) - 1 = -0.24492
x2(1) = g(net2) = 2/(1 + e^0.5) - 1 = -0.24492
net_o = w(2,1) · (1, x1(1), x2(1)) = (-1, 1, 1) · (1, -0.24492, -0.24492) = -1.48984
o = g(net_o) = -0.63211

Error back-propagating:
l = d - o = -1 - (-0.63211) = -0.36789
δ_o = l · g'(net_o) = -0.36789 × (1 - (-0.63211))(1 + (-0.63211)) = -0.2209
δ1 = δ_o · w1(2,1) · g'(net1) = -0.2209 × 1 × (1 - (-0.24492))(1 + (-0.24492)) = -0.20765
δ2 = δ_o · w2(2,1) · g'(net2) = -0.2209 × 1 × (1 - (-0.24492))(1 + (-0.24492)) = -0.20765

Weight update:
Δw(2,1) = η · δ_o · (1, x1(1), x2(1)) = 0.2 × (-0.2209) × (1, -0.2449, -0.2449) = (-0.0442, 0.0108, 0.0108)
w(2,1) = (-1, 1, 1) + (-0.0442, 0.0108, 0.0108) = (-1.0442, 1.0108, 1.0108)
Δw1(1,0) = η · δ1 · p0 = 0.2 × (-0.2077) × (1, -1, -1) = (-0.0415, 0.0415, 0.0415)
w1(1,0) = (-0.5, 0.5, -0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, 0.5415, -0.4585)
Δw2(1,0) = η · δ2 · p0 = 0.2 × (-0.2077) × (1, -1, -1) = (-0.0415, 0.0415, 0.0415)
w2(1,0) = (-0.5, -0.5, 0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, -0.4585, 0.5415)
The error for P0 (l²) is reduced from 0.135345 to 0.102823.
[Plot: MSE reduction, every 10 epochs]
Output: every 10 epochs
epoch 1 10 20 40 90 140 190 d
P0 -0.63 -0.05 -0.38 -0.77 -0.89 -0.92 -0.93 -1
P1 -0.63 -0.08 0.23 0.68 0.85 0.89 0.90 1
P2 -0.62 -0.16 0.15 0.68 0.85 0.89 0.90 1
P3 -0.38 0.03 -0.37 -0.77 -0.89 -0.92 -0.93 -1
MSE 1.44 1.12 0.52 0.074 0.019 0.010 0.007
Weights after each training sample in epoch 1, and after later epochs:

        w1(1,0)                      w2(1,0)                      w(2,1)
init    (-0.5, 0.5, -0.5)            (-0.5, -0.5, 0.5)            (-1, 1, 1)
p0      (-0.5415, 0.5415, -0.4585)   (-0.5415, -0.4585, 0.5415)   (-1.0442, 1.0108, 1.0108)
p1      (-0.5732, 0.5732, -0.4266)   (-0.5732, -0.4268, 0.5732)   (-1.0787, 1.0213, 1.0213)
p2      (-0.3858, 0.7607, -0.6142)   (-0.4617, -0.3152, 0.4617)   (-0.8867, 1.0616, 0.8952)
p3      (-0.4591, 0.6874, -0.6875)   (-0.5228, -0.3763, 0.4005)   (-0.9567, 1.0699, 0.9061)

# Epoch  w1(1,0)                      w2(1,0)                      w(2,1)
13       (-1.4018, 1.4177, -1.6290)   (-1.5219, -1.8368, 1.6367)   (0.6917, 1.1440, 1.1693)
40       (-2.2827, 2.5563, -2.5987)   (-2.3627, -2.6817, 2.6417)   (1.9870, 2.4841, 2.4580)
90       (-2.6416, 2.9562, -2.9679)   (-2.7002, -3.0275, 3.0159)   (2.7061, 3.1776, 3.1667)
190      (-2.8594, 3.1874, -3.1921)   (-2.9080, -3.2403, 3.2356)   (3.1995, 3.6531, 3.6468)
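A Python sketch of a single training step for this example (weights, learning rate and the derivative form are taken from the slides above; looping the same update over P0…P3 for many epochs should reproduce the tables):

import math

def g(x):
    # Node function used here: g(x) = 2/(1 + e^-x) - 1
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

eta = 0.2
w1 = [-0.5, 0.5, -0.5]     # weights into hidden unit 1 (bias, x1, x2)
w2 = [-0.5, -0.5, 0.5]     # weights into hidden unit 2
wo = [-1.0, 1.0, 1.0]      # weights into the output unit (bias, h1, h2)

p, d = [1, -1, -1], -1     # sample P0 with bias input 1, target d = -1

# Forward computing
net1 = sum(w * xi for w, xi in zip(w1, p))        # -0.5
net2 = sum(w * xi for w, xi in zip(w2, p))        # -0.5
h = [1.0, g(net1), g(net2)]                       # hidden outputs, ~ -0.24492 each
net_o = sum(w * hi for w, hi in zip(wo, h))       # -1.48984
o = g(net_o)                                      # -0.63211

# Error back-propagating, with g'(net) taken as (1 - g)(1 + g) as in the slides
l = d - o                                         # -0.36789
d_out = l * (1 - o) * (1 + o)                     # -0.2209
d_h1 = d_out * wo[1] * (1 - h[1]) * (1 + h[1])    # -0.20765
d_h2 = d_out * wo[2] * (1 - h[2]) * (1 + h[2])    # -0.20765

# Weight update
wo = [w + eta * d_out * hi for w, hi in zip(wo, h)]   # (-1.0442, 1.0108, 1.0108)
w1 = [w + eta * d_h1 * xi for w, xi in zip(w1, p)]    # (-0.5415, 0.5415, -0.4585)
w2 = [w + eta * d_h2 * xi for w, xi in zip(w2, p)]    # (-0.5415, -0.4585, 0.5415)
print(wo, w1, w2)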
Decision Trees
Decision Tree Classifier
Ross Quinlan
[Scatter plot: insects plotted by Antenna Length (y-axis, 1-10) against Abdomen Length (x-axis, 1-10)]
Abdomen Length > 7.1?
no: Antenna Length > 6.0?
    no: Grasshopper
    yes: Katydid
yes: Katydid
What is a Decision Tree?
• An inductive learning task
– Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
– These smaller Boolean tests are less complex than a one-stage classifier
• Let's look at a sample decision tree…
Predicting Commute Time
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Inductive Learning
• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take
Decision Trees as Rules
• So far we have represented this tree graphically.
• We could also represent it as a set of rules. However, these may be much harder to read…
Decision Tree as a Rule Set
if hour == 8am
commute time = long
else if hour == 9am
if accident == yes
commute time = long
else
commute time = medium
else if hour == 10am
if stall == yes
commute time = long
else
commute time = short
• Notice that not all attributes have to be used in each path of the decision tree.
• As we will see, some attributes may not appear in the tree at all.
Weather Example
Objective
• From a set of observations, the objective is to predict whether we will be able to play tennis, based on the past examples (inductive principle).
– For this, we will automatically build a decision tree.
– The decision concerns the Play Tennis attribute. It splits the dataset into two classes: play = Yes and play = No.
Attribute-values
• 14 instances described by 4 categorical (nominal) attributes.
• Each attribute is associated with a set of values.
• One attribute is selected as the class attribute, for which we make the decision.
Decision Tree
• An internal node is a test on an attribute
• A branch represents an outcome of the test, e.g. outlook = sunny
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible
• A new case is classified by following a matching path to a leaf node
Building a Decision Tree
Building a Decision Tree
• One approach is to generate all possible trees and find the best: too expensive in general!
• There must be a better way:
– explore top-down or bottom-up
– to form a decision tree
• The main problem:
– during construction, at each step, choose a good attribute on which a test is to be performed
Building a Decision Tree
Top-down Tree Construction
• Initially, all the training examples are at the root.
• Then, the examples are recursively partitioned by choosing one attribute at a time.
Bottom-up Tree Pruning
• Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
When Should Building Stop?
• There are several possible stopping criteria
– All samples for a given node belong to the same class
– If there are no remaining attributes for further partitioning, majority voting is employed
– There are no samples left
– Or there is nothing to gain in splitting
Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
– Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
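A compact Python sketch of this recursion (illustrative only; the helper names are hypothetical, and examples are assumed to be dicts mapping attribute names to values):

import math
from collections import Counter

def entropy(rows, target):
    # Entropy of the target-attribute distribution over rows
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all instances share the target value
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def expected_entropy(a):                  # weighted sum of subset entropies
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [r for r in rows if r[a] == v]
            total += len(subset) / len(rows) * entropy(subset, target)
        return total

    # Lowest expected entropy = highest information gain (ID3's choice)
    best = min(attributes, key=expected_entropy)
    rest = [a for a in attributes if a != best]
    # Branches are only grown for values that actually occur, so subsets are never empty
    return {(best, v): build_tree([r for r in rows if r[best] == v], rest, target)
            for v in set(r[best] for r in rows)}

Called on the experience table below with attributes Hour, Weather, Accident and Stall and target Commute, this should reproduce the commute-time tree shown earlier.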
Sample Experience Table
Example Attributes Target
Hour Weather Accident Stall Commute
D1 8 AM Sunny No No Long
D2 8 AM Cloudy No Yes Long
D3 10 AM Sunny No No Short
D4 9 AM Rainy Yes No Long
D5 9 AM Sunny Yes Yes Long
D6 10 AM Sunny No No Short
D7 10 AM Cloudy No No Short
D8 9 AM Rainy No No Medium
D9 9 AM Sunny Yes No Long
D10 10 AM Cloudy Yes Yes Long
D11 10 AM Rainy No No Short
D12 8 AM Cloudy Yes No Long
D13 9 AM Sunny No No Medium
Predicting Commute Time
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Choosing Attributes
• The previous experience decision table showed 4 attributes: hour, weather, accident and stall
• But the decision tree only showed 3 attributes: hour, accident and stall
• Why is that?
Choosing Attributes
• Methods for selecting attributes (which will be described later) show that weather is not a discriminating attribute
Choosing Attributes
• The basic structure of creating a decision tree is the same for most decision tree algorithms
• The difference lies in how we select the attributes for the tree
• We will focus on the ID3 algorithm developed by Ross Quinlan in 1975
Identifying the Best Attributes
• Refer back to our original decision tree
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

How did we know to split on "leave at" first, and then on "stall" and "accident", and not on "weather"?
Which is the splitting (best) attribute?
ID3 Heuristic
• To determine the best attribute, we look at the ID3 heuristic.
• ID3 splits attributes based on their entropy.
• Entropy is a measure of disorder (uncertainty) in the data…
Entropy
• Entropy is minimized when all values of the target attribute are the same.
– If we know that commute time will always be short, then entropy = 0.
• Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random).
– If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized.
Entropy
• Calculation of entropy:
– Entropy(S) = -∑_{i=1}^{l} (|Si|/|S|) × log2(|Si|/|S|)
• S = set of examples
• Si = subset of S with value vi under the target attribute
• l = size of the range of the target attribute
The Entropy Function Relative to Boolean Classification
[Plot: the entropy function for a Boolean classification; entropy rises from 0.0 at proportion 0.0 of positive examples to 1.0 at proportion 0.5, then falls back to 0.0 at proportion 1.0. Example taken from Tom Mitchell's Machine Learning]
ID3
• ID3 splits on the attribute with the lowest expected entropy.
• We calculate the expected entropy for an attribute as the weighted sum of subset entropies:
∑_{i=1}^{k} (|Si|/|S|) × Entropy(Si)
• We can also measure information gain, which is high exactly when the expected entropy after the split is low:
– Gain(S, A) = Entropy(S) - ∑_{i=1}^{k} (|Si|/|S|) × Entropy(Si)
ID3
• Given our commute time sample set, we can calculate the entropy of each attribute at the root node
Attribute Expected Entropy Information Gain
Hour 0.6511 0.768449
Weather 1.28884 0.130719
Accident 0.92307 0.496479
Stall 1.17071 0.248842
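These figures can be verified with a short Python sketch over the experience table (a check, not part of the original slides; values agree with the table up to rounding):

import math
from collections import Counter

# Columns: Hour, Weather, Accident, Stall, Commute (target)
data = [
    ("8 AM", "Sunny", "No", "No", "Long"),    ("8 AM", "Cloudy", "No", "Yes", "Long"),
    ("10 AM", "Sunny", "No", "No", "Short"),  ("9 AM", "Rainy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "Yes", "Yes", "Long"),  ("10 AM", "Sunny", "No", "No", "Short"),
    ("10 AM", "Cloudy", "No", "No", "Short"), ("9 AM", "Rainy", "No", "No", "Medium"),
    ("9 AM", "Sunny", "Yes", "No", "Long"),   ("10 AM", "Cloudy", "Yes", "Yes", "Long"),
    ("10 AM", "Rainy", "No", "No", "Short"),  ("8 AM", "Cloudy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "No", "No", "Medium"),
]
TARGET = 4

def entropy(rows):
    counts = Counter(r[TARGET] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

base = entropy(data)    # entropy at the root, about 1.42 bits
for name, col in [("Hour", 0), ("Weather", 1), ("Accident", 2), ("Stall", 3)]:
    expected = 0.0
    for v in set(r[col] for r in data):
        subset = [r for r in data if r[col] == v]
        expected += len(subset) / len(data) * entropy(subset)
    print(name, round(expected, 5), round(base - expected, 6))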
Entropy: weather example
Which is the splitting (best) attribute?
Which is the splitting (best) attribute?
• At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
• A purity or impurity measure is used for this purpose.
• Information Gain: increases with the average purity of the subsets that an attribute produces.
• Splitting Strategy: choose the attribute that results in the greatest information gain.
• Typical goodness functions: information gain (ID3), information gain ratio (C4.5), gini index (CART).
Which is the splitting (best) attribute?
Which is the splitting (best) attribute?
• Entropy is a measure of the disorder prevailing in a collection of objects. If all objects belong to the same class, there is no disorder.
• Quinlan proposed to select the attribute that minimizes the disorder of the resulting partition.
The attribute “outlook”
• "outlook" = "sunny": info([2,3]) = 0.971 bits
• "outlook" = "overcast": info([4,0]) = 0 bits
• "outlook" = "rainy": info([3,2]) = 0.971 bits
• Expected information for the attribute: (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Information Gain
• Difference between the information before the split and the information after the split:
gain(A) = info(D) - info_A(D)
• The information before the split, info(D), is the entropy of the class distribution in D.
• The information after the split using attribute A is computed as the weighted sum of the entropies on each split, given n splits:
info_A(D) = ∑_{j=1}^{n} (|Dj|/|D|) × info(Dj)
Information Gain
• Difference between the information before split and the information after split
• Information gain for the attributes from the weather data:
– gain("outlook") = 0.247
– gain("temperature") = 0.029
– gain("humidity") = 0.152
– gain("windy") = 0.048
The Final Decision Tree
• Not all the leaves need to be pure
• Splitting stops when the data cannot be split any further
Rule extraction from Tree
if (outlook = sunny ^ humidity = normal) v (outlook = overcast) v (outlook = rainy ^ windy = false)
Then PlayTennis = yes
ID3 in Gaming
• Black & White, developed by Lionhead Studios and released in 2001, used ID3
• Used to predict a player’s reaction to a certain creature’s action
• In this model, a greater feedback value means the creature should attack
ID3 in Black & White
Example Attributes Target
Allegiance Defense Tribe Feedback
D1 Friendly Weak Celtic -1.0
D2 Enemy Weak Celtic 0.4
D3 Friendly Strong Norse -1.0
D4 Enemy Strong Norse -0.2
D5 Friendly Weak Greek -1.0
D6 Enemy Medium Greek 0.2
D7 Enemy Strong Greek -0.4
D8 Enemy Medium Aztec 0.0
D9 Friendly Weak Aztec -1.0
ID3 in Black & White
Allegiance?
Friendly: -1.0
Enemy: Defense?
    Weak: 0.4
    Medium: 0.1
    Strong: -0.3
Note that this decision tree does not even use the tribe attribute
ID3 in Black & White
• Now suppose we don't want the entire decision tree, but just the 2 highest feedback values.
• We can create a Boolean expression, such as:
((Allegiance = Enemy) ^ (Defense = Weak)) v ((Allegiance = Enemy) ^ (Defense = Medium))
Deciding when a tree is complete
• Continue splitting nodes until some goodness-of-split criterion fails to be met.
– When the quality of a particular split falls below the threshold, the tree is not grown further along that branch.
– When all branches from the root reach terminal nodes, the tree is complete.
Deciding when a tree is complete
• Grow the tree too large and then prune nodes off.
– After tree construction stops, create a sequence of subtrees from the original tree.
• Choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves).
– Once the sequence of subtrees is established, select which subtree to use according to some criterion, e.g. the best assessment value.
Evaluation methodology
• Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Randomly divide collection into two disjoint sets: training and test
3. Apply learning algorithm to training set giving hypothesis H
4. Measure performance of H w.r.t. test set
Important: keep the training and test sets disjoint!
• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets.
• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection.
Another Version of the Weather Dataset
Decision Tree for the New Dataset
• Entropy for splitting using “ID Code” is zero, since each leaf node is “pure”
• Information Gain is thus maximal for ID code
Highly-Branching attributes
• Attributes with a large number of values are usually problematic
E.g. id, primary keys, or almost primary key attributes
• Subsets are likely to be pure if there is a large number of values
• Information Gain is biased towards choosing attributes with a large number of values
• This may result in overfitting (selection of an attribute that is non-optimal for prediction)
Solution: Information Gain Ratio
• Modification of the Information Gain that reduces the bias toward highly-branching attributes
• The intrinsic information of a split is
– large when the data is spread evenly over the branches
– small when all the data belongs to one branch
• Information Gain Ratio takes number and size of branches into account when choosing an attribute
• It corrects the information gain by taking the intrinsic information of a split into account
Information Gain Ratio and Intrinsic Information
• Intrinsic information: the entropy of the distribution of instances into branches,
IntrinsicInfo(S, A) = -∑_i (|Si|/|S|) × log2(|Si|/|S|)
• Information Gain Ratio normalizes the Information Gain by the intrinsic information:
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
Computing the Information Gain Ratio
• The intrinsic information for ID code is 14 × (-(1/14) × log2(1/14)) = log2(14) ≈ 3.807 bits.
• The importance of an attribute decreases as its intrinsic information gets larger.
• The Information Gain Ratio of "ID code" is therefore 0.940/3.807 ≈ 0.247.
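A quick numeric check (a sketch; it assumes the weather data's 9 yes / 5 no class split, which is consistent with gain("outlook") = 0.247 above):

import math

n = 14                                                  # 14 instances, unique ID codes
info_before = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)   # about 0.940 bits
gain_id = info_before - 0.0                             # every single-instance branch is pure
intrinsic = n * (-(1/n) * math.log2(1/n))               # log2(14), about 3.807 bits
print(round(gain_id, 3), round(intrinsic, 3), round(gain_id / intrinsic, 3))
# 0.94 3.807 0.247 -- the huge apparent gain shrinks once the split's
# intrinsic information is taken into account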
Information Gain Ratio for Weather Data
Acknowledgements
Material in these slides has been taken from the following resources:
• Introduction to Machine Learning, Alpaydin
• Statistical Pattern Recognition: A Review, A.K. Jain et al., PAMI (22), 2000
• Pattern Recognition and Analysis Course, A.K. Jain, MSU
• Pattern Classification, Duda et al., John Wiley & Sons
• http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
• Some material adopted from Dr. Adam Prugel-Bennett, Dr. Andrew Ng and Dr. Amanullah's slides