Machine Learning
Lecture # 4: Multilayer Perceptron & Decision Trees
Artificial Neural Networks (ANN)
• Neural computing requires a number of neurons to be connected together into a neural network.
• A neural network consists of:
– layers
– links between layers
• The links are weighted.
• There are three kinds of layers:
1. input layer
2. hidden layer
3. output layer
From Human Neurones to Artificial Neurones
A simple neuron
• At each neuron, every input has an associated weight which modifies the strength of each input.
• The neuron simply adds together all the inputs and calculates an output to be passed on.
Activation function
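A minimal sketch of this computation in Python (an illustration, not from the slides; it assumes the logistic sigmoid activation, and borrows the H1 weights from the worked example later in this lecture):

import math

def sigmoid(x):
    # Logistic activation: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    # Weighted sum of the inputs, passed through the activation function
    s = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)

print(neuron_output([10, 30, 20], [0.2, -0.1, 0.4]))  # sigmoid(7), roughly 0.999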
MultiLayer Perceptron (MLP)
Motivation
• Perceptrons are limited because they can only solve problems that are linearly separable
• We would like to build more complicated learning machines to model our data
• One way to do this is to build multiple layers of perceptrons
Brief History
• 1985 Ackley, Hinton and Sejnowski propose the Boltzmann machine
– This was a multi-layer step perceptron
– More powerful than perceptron
– Successful application: NETtalk
• 1986 Rumelhart, Hinton and Williams invent the Multi-Layer Perceptron (MLP) with backpropagation
– Dominant neural net architecture for 10 years
Multi-layer networks
• So far we discussed networks with one layer.
• But these networks can be extended to combine several layers, increasing the set of functions that can be represented using a NN
MLP
Multilayer Neural Network
Sigmoid Response Functions
MLP
Simple example: AND
x1 x2 | output
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
A single sigmoid unit with bias weight -30 and input weights 20, 20 outputs ≈1 only when x1 = x2 = 1.

Example: OR function
x1 x2 | output
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1
Weights: bias -10, inputs 20, 20.

Negation:
x | output
0 | 1
1 | 0
Weights: bias 10, input -20.

Putting it together:
x1 x2 | output
0  0  | 1
0  1  | 0
1  0  | 0
1  1  | 1
Hidden unit a1 computes AND(x1, x2) with weights (-30, 20, 20); hidden unit a2 computes (NOT x1) AND (NOT x2) with weights (10, -20, -20); the output unit computes OR(a1, a2) with weights (-10, 20, 20). The result is XNOR, as the sketch below verifies.
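A quick Python check of these hand-chosen weights (a sketch; the sigmoid unit and the weight values are exactly the ones listed above):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit(bias, weights, xs):
    # One sigmoid unit: bias plus weighted inputs
    return sigmoid(bias + sum(w * x for w, x in zip(weights, xs)))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = unit(-30, (20, 20), (x1, x2))      # AND: ~1 only for (1, 1)
        o = unit(-10, (20, 20), (x1, x2))       # OR: ~1 unless both inputs are 0
        a2 = unit(10, (-20, -20), (x1, x2))     # (NOT x1) AND (NOT x2)
        xnor = unit(-10, (20, 20), (a1, a2))    # OR over the two hidden units
        print(x1, x2, round(a1), round(o), round(xnor))

Rounding the unit outputs reproduces the truth tables above.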
Example of multilayer Neural Network
• Suppose input values are 10, 30, 20
• The weighted sum coming into H1
SH1 = (0.2 × 10) + (-0.1 × 30) + (0.4 × 20) = 2 - 3 + 8 = 7
• The σ function is applied to SH1:
σ(SH1) = 1/(1 + e^-7) = 1/(1 + 0.000912) = 0.999
• Similarly, the weighted sum coming into H2:
SH2 = (0.7 × 10) + (-1.2 × 30) + (1.2 × 20) = 7 - 36 + 24 = -5
• σ applied to SH2:
σ(SH2) = 1/(1 + e^5) = 1/(1 + 148.4) = 0.0067
• Now the weighted sum into output unit O1:
SO1 = (1.1 × 0.999) + (0.1 × 0.0067) = 1.0996
• The weighted sum into output unit O2:
SO2 = (3.1 × 0.999) + (1.17 × 0.0067) = 3.1047
• The output of sigmoid unit O1:
σ(SO1) = 1/(1 + e^-1.0996) = 1/(1 + 0.333) = 0.750
• The output from the network for O2:
σ(SO2) = 1/(1 + e^-3.1047) = 1/(1 + 0.045) = 0.957
• The input triple (10, 30, 20) would be categorised as O2, because this unit has the larger output.
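The forward pass can be reproduced with a few lines of Python (a sketch using only the weights stated in the example):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [10, 30, 20]                              # I1, I2, I3
w_hidden = [[0.2, -0.1, 0.4], [0.7, -1.2, 1.2]]    # weights into H1, H2
w_output = [[1.1, 0.1], [3.1, 1.17]]               # weights into O1, O2 (from H1, H2)

# Hidden layer: weighted sum, then sigmoid
h = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
print(h)    # approximately [0.999, 0.0067]

# Output layer: O1 ~ 0.750, O2 ~ 0.957, so the input is categorised as O2
o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_output]
print(o)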
Training Parametric Model
Minimizing Error
Least Squares Gradient
Single Layer Perceptron
Single layer Perceptrons
Different Response Functions
Learning a Logistic Perceptron
Back Propagation
Back Propagation
A Worked Example:
• We propagated the values (10, 30, 20) through the network.
• Suppose now that the target categorisation for the example was the one associated with O1 (using a learning rate of η = 0.1),
• i.e. the target output for O1 was 1, and the target output for O2 was 0:
t1(E) = 1; t2(E) = 0; o1(E) = 0.750; o2(E) = 0.957
• Error values for the output units O1 and O2:
– δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750 × (1 - 0.750) × (1 - 0.750) = 0.0469
– δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957 × (1 - 0.957) × (0 - 0.957) = -0.0394
Input units         Hidden units                     Output units
Unit  Input         Unit  Weighted Sum  Output       Unit  Weighted Sum  Output
I1    10            H1    7             0.999        O1    1.0996        0.750
I2    30            H2    -5            0.0067       O2    3.1047        0.957
I3    20
• To propagate this information backwards to the hidden nodes H1 and H2:
– Multiply the error term for O1 by the weight from H1 to O1, then add the product of the error term for O2 and the weight from H1 to O2: (1.1 × 0.0469) + (3.1 × -0.0394) = -0.0706
– δH1 = -0.0706 × (0.999 × (1 - 0.999)) = -0.0000705
– Similarly for H2: (0.1 × 0.0469) + (1.17 × -0.0394) = -0.0414
– δH2 = -0.0414 × (0.0067 × (1 - 0.0067)) = -0.000276
A Worked Example:
Input unit  Hidden unit  η    δH          xi  Δ = η·δH·xi  Old weight  New weight
I1          H1           0.1  -0.0000705  10  -0.0000705   0.2         0.1999295
I1          H2           0.1  -0.000276   10  -0.000276    0.7         0.699724
I2          H1           0.1  -0.0000705  30  -0.0002115   -0.1        -0.1002115
I2          H2           0.1  -0.000276   30  -0.000828    -1.2        -1.200828
I3          H1           0.1  -0.0000705  20  -0.000141    0.4         0.399859
I3          H2           0.1  -0.000276   20  -0.000552    1.2         1.199448
Hidden unit  Output unit  η    δO       hi(E)   Δ = η·δO·hi(E)  Old weight  New weight
H1           O1           0.1  0.0469   0.999   0.004685        1.1         1.104685
H1           O2           0.1  -0.0394  0.999   -0.003936       3.1         3.096064
H2           O1           0.1  0.0469   0.0067  0.0000314       0.1         0.1000314
H2           O2           0.1  -0.0394  0.0067  -0.0000264      1.17        1.1699736
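The whole backward pass can be checked with a short Python sketch (it repeats the forward pass above, then applies the delta rules and weight updates just tabulated):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

eta = 0.1
x = [10, 30, 20]                              # inputs I1, I2, I3
w_h = [[0.2, -0.1, 0.4], [0.7, -1.2, 1.2]]    # weights into H1, H2
w_o = [[1.1, 0.1], [3.1, 1.17]]               # weights into O1, O2 (from H1, H2)
t = [1, 0]                                    # targets for O1, O2

# Forward pass
h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_h]
o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_o]

# Output deltas: o(1 - o)(t - o), roughly [0.0469, -0.0394]
d_o = [oi * (1 - oi) * (ti - oi) for oi, ti in zip(o, t)]

# Hidden deltas: h(1 - h) * sum of (weight to each output * that output's delta)
d_h = [hi * (1 - hi) * sum(w_o[k][j] * d_o[k] for k in range(2))
       for j, hi in enumerate(h)]

# Weight updates: delta_w = eta * delta * input carried by that weight
for j in range(2):
    for i in range(3):
        w_h[j][i] += eta * d_h[j] * x[i]
for k in range(2):
    for j in range(2):
        w_o[k][j] += eta * d_o[k] * h[j]
print(w_h, w_o)   # matches the 'New weight' columns up to rounding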
A Worked Example:
When to Learn
Online Learning
Batch Learning
Early Stopping
Self Study Examples
XOR Example
Linear separation
Can AND, OR and NOT be represented?
• Is it possible to represent every boolean function by simply combining these?
• Every boolean function can be composed using AND, OR and NOT (or even only NAND).
Linear separation
• How can we learn the XOR function?
Linear separation
X1 X2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
Linear separation
X1 X2 XOR
0 0 0
1 0 1
0 1 1
1 1 0
It is impossible to find values of the weights Wi with which a single-layer perceptron learns XOR.
Linear separation
X1 X2 X1*X2 XOR
0  0  0     0
1  0  0     1
0  1  0     1
1  1  1     0
By adding the product X1*X2 as a third input, the function becomes linearly separable, so we can learn weights W1, W2 and W3.
Example: Back-Propagation learning the XOR function
• Training samples (bipolar):
in_1 in_2 d
P0 -1 -1 -1
P1 -1 1 1
P2 1 -1 1
P3 1 1 -1
• Network: 2-2-1 with threshold (bias) inputs fixed at 1
• Initial weights W(0):
w1(1,0) = (-0.5, 0.5, -0.5)
w2(1,0) = (-0.5, -0.5, 0.5)
w(2,1) = (-1, 1, 1)
• Learning rate η = 0.2
• Node function: hyperbolic tangent
g(x) = (1 - e^-x)/(1 + e^-x) = 2s(x) - 1, where s(x) = 1/(1 + e^-x)
g(x) → 1 as x → ∞ and g(x) → -1 as x → -∞
s'(x) = s(x)(1 - s(x)); the derivative is applied below as g'(x) = (1 - g(x))(1 + g(x))
Forward computing: present P0 = (1, -1, -1) (bias input fixed at 1), target d = -1
net1 = w1(1,0) · p0 = (-0.5, 0.5, -0.5) · (1, -1, -1) = -0.5
net2 = w2(1,0) · p0 = (-0.5, -0.5, 0.5) · (1, -1, -1) = -0.5
x1(1) = g(net1) = 2/(1 + e^0.5) - 1 = -0.24492
x2(1) = g(net2) = 2/(1 + e^0.5) - 1 = -0.24492
net_o = w(2,1) · (1, x1(1), x2(1)) = (-1, 1, 1) · (1, -0.24492, -0.24492) = -1.48984
o = g(net_o) = -0.63211

Error back-propagating:
l = d - o = -1 - (-0.63211) = -0.36789
δ_o = l · g'(net_o) = -0.36789 × (1 - (-0.63211))(1 + (-0.63211)) = -0.2209
δ1 = δ_o · w1(2,1) · g'(net1) = -0.2209 × 1 × (1 - (-0.24492))(1 + (-0.24492)) = -0.20765
δ2 = δ_o · w2(2,1) · g'(net2) = -0.2209 × 1 × (1 - (-0.24492))(1 + (-0.24492)) = -0.20765

Weight update:
Δw(2,1) = η · δ_o · (1, x1(1), x2(1)) = 0.2 × (-0.2209) × (1, -0.2449, -0.2449) = (-0.0442, 0.0108, 0.0108)
w(2,1) = (-1, 1, 1) + (-0.0442, 0.0108, 0.0108) = (-1.0442, 1.0108, 1.0108)
Δw1(1,0) = η · δ1 · p0 = 0.2 × (-0.2077) × (1, -1, -1) = (-0.0415, 0.0415, 0.0415)
w1(1,0) = (-0.5, 0.5, -0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, 0.5415, -0.4585)
Δw2(1,0) = η · δ2 · p0 = 0.2 × (-0.2077) × (1, -1, -1) = (-0.0415, 0.0415, 0.0415)
w2(1,0) = (-0.5, -0.5, 0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, -0.4585, 0.5415)
The error for P0 (l²) is reduced from 0.135345 to 0.102823.
[Plot: MSE reduction, every 10 epochs]
Output: every 10 epochs
epoch 1 10 20 40 90 140 190 d
P0 -0.63 -0.05 -0.38 -0.77 -0.89 -0.92 -0.93 -1
P1 -0.63 -0.08 0.23 0.68 0.85 0.89 0.90 1
P2 -0.62 -0.16 0.15 0.68 0.85 0.89 0.90 1
P3 -0.38 0.03 -0.37 -0.77 -0.89 -0.92 -0.93 -1
MSE 1.44 1.12 0.52 0.074 0.019 0.010 0.007
Weights after each training sample in epoch 1, and after later epochs:

        w1(1,0)                      w2(1,0)                      w(2,1)
init    (-0.5, 0.5, -0.5)            (-0.5, -0.5, 0.5)            (-1, 1, 1)
p0      (-0.5415, 0.5415, -0.4585)   (-0.5415, -0.4585, 0.5415)   (-1.0442, 1.0108, 1.0108)
p1      (-0.5732, 0.5732, -0.4266)   (-0.5732, -0.4268, 0.5732)   (-1.0787, 1.0213, 1.0213)
p2      (-0.3858, 0.7607, -0.6142)   (-0.4617, -0.3152, 0.4617)   (-0.8867, 1.0616, 0.8952)
p3      (-0.4591, 0.6874, -0.6875)   (-0.5228, -0.3763, 0.4005)   (-0.9567, 1.0699, 0.9061)

# Epoch  w1(1,0)                      w2(1,0)                      w(2,1)
13       (-1.4018, 1.4177, -1.6290)   (-1.5219, -1.8368, 1.6367)   (0.6917, 1.1440, 1.1693)
40       (-2.2827, 2.5563, -2.5987)   (-2.3627, -2.6817, 2.6417)   (1.9870, 2.4841, 2.4580)
90       (-2.6416, 2.9562, -2.9679)   (-2.7002, -3.0275, 3.0159)   (2.7061, 3.1776, 3.1667)
190      (-2.8594, 3.1874, -3.1921)   (-2.9080, -3.2403, 3.2356)   (3.1995, 3.6531, 3.6468)
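A Python sketch of a single training step for this example (weights, learning rate and the derivative form are taken from the slides above; looping the same update over P0…P3 for many epochs should reproduce the tables):

import math

def g(x):
    # Node function used here: g(x) = 2/(1 + e^-x) - 1
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

eta = 0.2
w1 = [-0.5, 0.5, -0.5]     # weights into hidden unit 1 (bias, x1, x2)
w2 = [-0.5, -0.5, 0.5]     # weights into hidden unit 2
wo = [-1.0, 1.0, 1.0]      # weights into the output unit (bias, h1, h2)

p, d = [1, -1, -1], -1     # sample P0 with bias input 1, target d = -1

# Forward computing
net1 = sum(w * xi for w, xi in zip(w1, p))        # -0.5
net2 = sum(w * xi for w, xi in zip(w2, p))        # -0.5
h = [1.0, g(net1), g(net2)]                       # hidden outputs, ~ -0.24492 each
net_o = sum(w * hi for w, hi in zip(wo, h))       # -1.48984
o = g(net_o)                                      # -0.63211

# Error back-propagating, with g'(net) taken as (1 - g)(1 + g) as in the slides
l = d - o                                         # -0.36789
d_out = l * (1 - o) * (1 + o)                     # -0.2209
d_h1 = d_out * wo[1] * (1 - h[1]) * (1 + h[1])    # -0.20765
d_h2 = d_out * wo[2] * (1 - h[2]) * (1 + h[2])    # -0.20765

# Weight update
wo = [w + eta * d_out * hi for w, hi in zip(wo, h)]   # (-1.0442, 1.0108, 1.0108)
w1 = [w + eta * d_h1 * xi for w, xi in zip(w1, p)]    # (-0.5415, 0.5415, -0.4585)
w2 = [w + eta * d_h2 * xi for w, xi in zip(w2, p)]    # (-0.5415, -0.4585, 0.5415)
print(wo, w1, w2)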
Decision Trees
Decision Tree Classifier
Ross Quinlan
[Scatter plot: insects plotted by Antenna Length (y-axis, 1-10) against Abdomen Length (x-axis, 1-10)]
Abdomen Length > 7.1?
no: Antenna Length > 6.0?
    no: Grasshopper
    yes: Katydid
yes: Katydid
What is a Decision Tree?
• An inductive learning task
– Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
– These smaller Boolean tests are less complex than a one-stage classifier
• Let's look at a sample decision tree…
Predicting Commute Time
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Inductive Learning
• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch– Did we leave at 10 AM?
– Did a car stall on the road?
– Is there an accident on the road?
• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take
Decision Trees as Rules
• So far we have represented this tree graphically.
• We could also represent it as a set of rules. However, these may be much harder to read…
Decision Tree as a Rule Set
if hour == 8am
commute time = long
else if hour == 9am
if accident == yes
commute time = long
else
commute time = medium
else if hour == 10am
if stall == yes
commute time = long
else
commute time = short
• Notice that not all attributes have to be used in each path of the decision tree.
• As we will see, some attributes may not appear in the tree at all.
Weather Example
Objective
• From a set of observations, the objective is to predict whether we will be able to play tennis, based on the past examples (inductive principle).
– For this, we will automatically build a decision tree.
– The decision concerns the Play Tennis attribute. It splits the dataset into two classes: play = Yes and play = No.
Attribute-values
• 14 instances described by 4 categorical (nominal) attributes.
• Each attribute is associated with a set of values.
• One attribute is selected as the class attribute, for which we make the decision.
Decision Tree
• An internal node is a test on an attribute
• A branch represents an outcome of the test, e.g. outlook = sunny
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible
• A new case is classified by following a matching path to a leaf node
Building a Decision Tree
Building a Decision Tree
• One approach is to generate all possible trees and find the best: too expensive in general!
• There must be a better way:
– explore top-down or bottom-up
– to form a decision tree
• The main problem:
– during construction, at each step, choose a good attribute on which a test is to be performed
Building a Decision Tree
Top-down Tree Construction
• Initially, all the training examples are at the root.
• Then, the examples are recursively partitioned by choosing one attribute at a time.
Bottom-up Tree Pruning
• Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
When Should Building Stop?
• There are several possible stopping criteria
– All samples for a given node belong to the same class
– If there are no remaining attributes for further partitioning, majority voting is employed
– There are no samples left
– Or there is nothing to gain in splitting
Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
– Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
– Repeat this process recursively for each child
– Stop when:
• All the instances have the same target attribute value
• There are no more attributes
• There are no more instances
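A compact Python sketch of this recursion (illustrative only; the helper names are hypothetical, and examples are assumed to be dicts mapping attribute names to values):

import math
from collections import Counter

def entropy(rows, target):
    # Entropy of the target-attribute distribution over rows
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all instances share the target value
        return labels[0]
    if not attributes:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def expected_entropy(a):                  # weighted sum of subset entropies
        total = 0.0
        for v in set(r[a] for r in rows):
            subset = [r for r in rows if r[a] == v]
            total += len(subset) / len(rows) * entropy(subset, target)
        return total

    # Lowest expected entropy = highest information gain (ID3's choice)
    best = min(attributes, key=expected_entropy)
    rest = [a for a in attributes if a != best]
    # Branches are only grown for values that actually occur, so subsets are never empty
    return {(best, v): build_tree([r for r in rows if r[best] == v], rest, target)
            for v in set(r[best] for r in rows)}

Called on the experience table below with attributes Hour, Weather, Accident and Stall and target Commute, this should reproduce the commute-time tree shown earlier.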
Sample Experience Table
Example Attributes Target
Hour Weather Accident Stall Commute
D1 8 AM Sunny No No Long
D2 8 AM Cloudy No Yes Long
D3 10 AM Sunny No No Short
D4 9 AM Rainy Yes No Long
D5 9 AM Sunny Yes Yes Long
D6 10 AM Sunny No No Short
D7 10 AM Cloudy No No Short
D8 9 AM Rainy No No Medium
D9 9 AM Sunny Yes No Long
D10 10 AM Cloudy Yes Yes Long
D11 10 AM Rainy No No Short
D12 8 AM Cloudy Yes No Long
D13 9 AM Sunny No No Medium
Predicting Commute Time
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Choosing Attributes
• The previous experience decision table showed 4 attributes: hour, weather, accident and stall
• But the decision tree only showed 3 attributes: hour, accident and stall
• Why is that?
Choosing Attributes
• Methods for selecting attributes (which will be described later) show that weather is not a discriminating attribute
Choosing Attributes
• The basic structure of creating a decision tree is the same for most decision tree algorithms
• The difference lies in how we select the attributes for the tree
• We will focus on the ID3 algorithm developed by Ross Quinlan in 1975
Identifying the Best Attributes
• Refer back to our original decision tree
Leave At?
8 AM: Long
9 AM: Accident?
    Yes: Long
    No: Medium
10 AM: Stall?
    Yes: Long
    No: Short

How did we know to split on "leave at" first, and then on "stall" and "accident", and not on "weather"?
Which is the splitting (best) attribute?
ID3 Heuristic
• To determine the best attribute, we look at the ID3 heuristic.
• ID3 splits attributes based on their entropy.
• Entropy is a measure of disorder (uncertainty) in the data…
Entropy
• Entropy is minimized when all values of the target attribute are the same.
– If we know that commute time will always be short, then entropy = 0.
• Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random).
– If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized.
Entropy
• Calculation of entropy:
– Entropy(S) = -∑_{i=1}^{l} (|Si|/|S|) × log2(|Si|/|S|)
• S = set of examples
• Si = subset of S with value vi under the target attribute
• l = size of the range of the target attribute
The Entropy Function Relative to Boolean Classification
[Plot: the entropy function for a Boolean classification; entropy rises from 0.0 at proportion 0.0 of positive examples to 1.0 at proportion 0.5, then falls back to 0.0 at proportion 1.0. Example taken from Tom Mitchell's Machine Learning]
ID3
• ID3 splits on the attribute with the lowest expected entropy.
• We calculate the expected entropy for an attribute as the weighted sum of subset entropies:
∑_{i=1}^{k} (|Si|/|S|) × Entropy(Si)
• We can also measure information gain, which is high exactly when the expected entropy after the split is low:
– Gain(S, A) = Entropy(S) - ∑_{i=1}^{k} (|Si|/|S|) × Entropy(Si)
ID3
• Given our commute time sample set, we can calculate the entropy of each attribute at the root node
Attribute Expected Entropy Information Gain
Hour 0.6511 0.768449
Weather 1.28884 0.130719
Accident 0.92307 0.496479
Stall 1.17071 0.248842
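These figures can be verified with a short Python sketch over the experience table (a check, not part of the original slides; values agree with the table up to rounding):

import math
from collections import Counter

# Columns: Hour, Weather, Accident, Stall, Commute (target)
data = [
    ("8 AM", "Sunny", "No", "No", "Long"),    ("8 AM", "Cloudy", "No", "Yes", "Long"),
    ("10 AM", "Sunny", "No", "No", "Short"),  ("9 AM", "Rainy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "Yes", "Yes", "Long"),  ("10 AM", "Sunny", "No", "No", "Short"),
    ("10 AM", "Cloudy", "No", "No", "Short"), ("9 AM", "Rainy", "No", "No", "Medium"),
    ("9 AM", "Sunny", "Yes", "No", "Long"),   ("10 AM", "Cloudy", "Yes", "Yes", "Long"),
    ("10 AM", "Rainy", "No", "No", "Short"),  ("8 AM", "Cloudy", "Yes", "No", "Long"),
    ("9 AM", "Sunny", "No", "No", "Medium"),
]
TARGET = 4

def entropy(rows):
    counts = Counter(r[TARGET] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

base = entropy(data)    # entropy at the root, about 1.42 bits
for name, col in [("Hour", 0), ("Weather", 1), ("Accident", 2), ("Stall", 3)]:
    expected = 0.0
    for v in set(r[col] for r in data):
        subset = [r for r in data if r[col] == v]
        expected += len(subset) / len(data) * entropy(subset)
    print(name, round(expected, 5), round(base - expected, 6))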
Entropy: weather example
Which is the splitting (best) attribute?
Which is the splitting (best) attribute?
• At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
• A purity or impurity measure is used for this purpose.
• Information Gain: increases with the average purity of the subsets that an attribute produces.
• Splitting Strategy: choose the attribute that results in the greatest information gain.
• Typical goodness functions: information gain (ID3), information gain ratio (C4.5), gini index (CART).
Which is the splitting (best) attribute?
Which is the splitting (best) attribute?
• Entropy is a measure of the disorder prevailing in a collection of objects. If all objects belong to the same class, there is no disorder.
• Quinlan proposed to select the attribute that minimizes the disorder of the resulting partition.
The attribute “outlook”
• "outlook" = "sunny": info([2,3]) = 0.971 bits
• "outlook" = "overcast": info([4,0]) = 0 bits
• "outlook" = "rainy": info([3,2]) = 0.971 bits
• Expected information for the attribute: (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits
Information Gain
• Difference between the information before the split and the information after the split:
gain(A) = info(D) - info_A(D)
• The information before the split, info(D), is the entropy of the class distribution in D.
• The information after the split using attribute A is computed as the weighted sum of the entropies on each split, given n splits:
info_A(D) = ∑_{j=1}^{n} (|Dj|/|D|) × info(Dj)
Information Gain
• Difference between the information before split and the information after split
• Information gain for the attributes from the weather data:
– gain("outlook") = 0.247
– gain("temperature") = 0.029
– gain("humidity") = 0.152
– gain("windy") = 0.048
The Final Decision Tree
• Not all the leaves need to be pure
• Splitting stops when the data cannot be split any further
Rule extraction from Tree
if (outlook = sunny ^ humidity = normal) v (outlook = overcast) v (outlook = rainy ^ windy = false)
Then PlayTennis = yes
ID3 in Gaming
• Black & White, developed by Lionhead Studios and released in 2001, used ID3
• Used to predict a player’s reaction to a certain creature’s action
• In this model, a greater feedback value means the creature should attack
ID3 in Black & White
Example Attributes Target
Allegiance Defense Tribe Feedback
D1 Friendly Weak Celtic -1.0
D2 Enemy Weak Celtic 0.4
D3 Friendly Strong Norse -1.0
D4 Enemy Strong Norse -0.2
D5 Friendly Weak Greek -1.0
D6 Enemy Medium Greek 0.2
D7 Enemy Strong Greek -0.4
D8 Enemy Medium Aztec 0.0
D9 Friendly Weak Aztec -1.0
ID3 in Black & White
Allegiance?
Friendly: -1.0
Enemy: Defense?
    Weak: 0.4
    Medium: 0.1
    Strong: -0.3
Note that this decision tree does not even use the tribe attribute
ID3 in Black & White
• Now suppose we don't want the entire decision tree, but just the 2 highest feedback values.
• We can create a Boolean expression, such as:
((Allegiance = Enemy) ^ (Defense = Weak)) v ((Allegiance = Enemy) ^ (Defense = Medium))
Deciding when a tree is complete
• Continue splitting nodes until some goodness-of-split criterion fails to be met.
– When the quality of a particular split falls below the threshold, the tree is not grown further along that branch.
– When all branches from the root reach terminal nodes, the tree is complete.
Deciding when a tree is complete
• Grow the tree too large and then prune nodes off.
– After tree construction stops, create a sequence of subtrees from the original tree.
• Choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves).
– Once the sequence of subtrees is established, select which subtree to use according to some criterion, e.g. the best assessment value.
Evaluation methodology
• Standard methodology:
1. Collect a large set of examples (all with correct classifications)
2. Randomly divide collection into two disjoint sets: training and test
3. Apply learning algorithm to training set giving hypothesis H
4. Measure performance of H w.r.t. test set
Important: keep the training and test sets disjoint!
• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets.
• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection.
Another Version of the Weather Dataset
Decision Tree for the New Dataset
• Entropy for splitting using “ID Code” is zero, since each leaf node is “pure”
• Information Gain is thus maximal for ID code
Highly-Branching attributes
• Attributes with a large number of values are usually problematic
E.g. id, primary keys, or almost primary key attributes
• Subsets are likely to be pure if there is a large number of values
• Information Gain is biased towards choosing attributes with a large number of values
• This may result in overfitting (selection of an attribute that is non-optimal for prediction)
Solution: Information Gain Ratio
• Modification of the Information Gain that reduces the bias toward highly-branching attributes
• The intrinsic information of a split is
– large when the data is spread evenly over the branches
– small when all the data belongs to one branch
• Information Gain Ratio takes number and size of branches into account when choosing an attribute
• It corrects the information gain by taking the intrinsic information of a split into account
Information Gain Ratio and Intrinsic Information
• Intrinsic information: the entropy of the distribution of instances into branches,
IntrinsicInfo(S, A) = -∑_i (|Si|/|S|) × log2(|Si|/|S|)
• Information Gain Ratio normalizes the Information Gain by the intrinsic information:
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
Computing the Information Gain Ratio
• The intrinsic information for ID code is 14 × (-(1/14) × log2(1/14)) = log2(14) ≈ 3.807 bits.
• The importance of an attribute decreases as its intrinsic information gets larger.
• The Information Gain Ratio of "ID code" is therefore 0.940/3.807 ≈ 0.247.
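A quick numeric check (a sketch; it assumes the weather data's 9 yes / 5 no class split, which is consistent with gain("outlook") = 0.247 above):

import math

n = 14                                                  # 14 instances, unique ID codes
info_before = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)   # about 0.940 bits
gain_id = info_before - 0.0                             # every single-instance branch is pure
intrinsic = n * (-(1/n) * math.log2(1/n))               # log2(14), about 3.807 bits
print(round(gain_id, 3), round(intrinsic, 3), round(gain_id / intrinsic, 3))
# 0.94 3.807 0.247 -- the huge apparent gain shrinks once the split's
# intrinsic information is taken into account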
Information Gain Ratio for Weather Data
Acknowledgements
Material in these slides has been taken from the following resources:
• Introduction to Machine Learning, Alpaydin
• Statistical Pattern Recognition: A Review, A.K. Jain et al., PAMI (22), 2000
• Pattern Recognition and Analysis Course, A.K. Jain, MSU
• Pattern Classification, Duda et al., John Wiley & Sons
• http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
• Some material adopted from Dr. Adam Prugel-Bennett, Dr. Andrew Ng and Dr. Amanullah's slides