
MACHINE LEARNING

2

What is learning?

A computer program learns if it improves its performance at some task through experience (T. Mitchell, 1997).

Any change in a system that allows it to perform better (Simon, 1983).

3

What do we learn:

Descriptions

Rules how to recognize/classify objects, states, events

Rules how to transform an initial situation to achieve a goal (final state)

4

How do we learn:

Rote learning - storage of computed information.

Taking advice from others. (Advice may need to be operationalized.)

Learning from problem solving experiences - remembering experiences and generalizing from them. (May add efficiency but not new knowledge.)

Learning from examples. (May or may not involve a teacher.)

Learning by experimentation and discovery. (Decreasing burden on teacher, increasing burden on learner.)

5

Approaches to Machine Learning

• Symbol-based learning
• Connectionist learning
• Evolutionary learning

6

Inductive Symbol-Based Machine Learning

Concept learning
Version space search
Decision trees: ID3 algorithm
Explanation-based learning
Supervised learning
Reinforcement learning

7

Version space search for concept learning

Concepts describe classes of objects.

Concepts consist of feature sets.

Operations on concept descriptions:

Generalization: replace a feature with a variable.

Specialization: instantiate a variable with a feature.
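A minimal Python sketch of these two operations, assuming a hypothesis is a tuple of feature values in which the placeholder '?' plays the role of a variable (the representation, function names and example values here are illustrative, not prescribed by the slides):

def generalize(hypothesis, position):
    # Generalization: replace the feature at the given position with a variable.
    h = list(hypothesis)
    h[position] = '?'
    return tuple(h)

def specialize(hypothesis, position, feature):
    # Specialization: instantiate the variable at the given position with a feature.
    h = list(hypothesis)
    h[position] = feature
    return tuple(h)

h = ('Japan', 'Honda', 'Blue', '1980', 'Economy')
g = generalize(h, 2)            # ('Japan', 'Honda', '?', '1980', 'Economy')
s = specialize(g, 2, 'White')   # ('Japan', 'Honda', 'White', '1980', 'Economy')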

8

Positive and negative examples of a concept

The concept description has to match all positive examples.

The concept description has to be false for the negative examples.

9

Plausible descriptions

The version space represents all the alternative plausible descriptions of the concept

A plausible description is one that is applicable to all known positive examples and no known negative example.

10

Basic Idea

Given:
A representation language
A set of positive and negative examples expressed in that language

Compute:
A concept description that is consistent with all the positive examples and none of the negative examples

11

Hypotheses

The version space contains two sets of hypotheses:

G – the most general hypotheses that match the training data

S – the most specific hypotheses that match the training data

Each hypothesis is represented as a vector of values of the known attributes

12

Example of Version Space

Consider the task of obtaining a description of the concept: Japanese economy car.

The attributes under consideration are:

Origin, Manufacturer, Color, Decade, Type

Training data:

Positive ex: (Japan, Honda, Blue, 1980, Economy)

Positive ex: (Japan, Honda, White, 1980, Economy)

Negative ex: (Japan, Toyota, Green, 1970, Sports)

13

Example continued

The most general hypothesis that matches the positive data and does not match the negative data is:

(?, Honda, ?, ?, Economy)

The symbol '?' means that the attribute may take any value.

The most specific hypothesis that matches the positive examples is:

(Japan, Honda, ?, 1980, Economy)

14

Algorithm: Candidate Elimination

Initialize G to contain one element: the most general description (all features are variables).

Initialize S to empty. Accept a new training example.

15

Process positive examples

Remove from G any descriptions that do not cover the example.

Generalize S as little as possible so that the new training example is covered.

Remove from S all elements that cover negative examples.

16

Process negative examples

Remove from S any descriptions that cover the negative example.

Specialize G as little as possible so that the negative example is not covered.

Remove from G all elements that do not cover the positive examples.

17

Algorithm continued

Continue processing new training examples until one of the following occurs:

Either S or G becomes empty: there are no consistent hypotheses over the training space. Stop.

S and G are both singleton sets. If they are identical, output their value and stop. If they are different, the training cases were inconsistent. Output this result and stop.

No more training examples, and G has several hypotheses. The version space is a disjunction of hypotheses. If the hypotheses agree on a new example, we can classify it; if they disagree, we can take the majority vote.
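As an illustration, here is a compact Python sketch of the loop described on the last few slides. It keeps a single most specific hypothesis S (assuming the first training example is positive) and a set G of most general hypotheses, each represented as a tuple with '?' as a wildcard. It is a simplified sketch of the idea, not a complete candidate elimination implementation:

def covers(h, example):
    # A hypothesis covers an example if every non-'?' attribute matches.
    return all(hv == '?' or hv == ev for hv, ev in zip(h, example))

def more_general(h1, h2):
    # h1 is at least as general as h2 (covers everything h2 describes).
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def generalize_minimally(s, example):
    # Generalize S as little as possible so the positive example is covered.
    return tuple(sv if sv == ev else '?' for sv, ev in zip(s, example))

def specialize_minimally(g, s, example):
    # Specialize g as little as possible so the negative example is excluded,
    # staying consistent with the specific boundary s.
    return [g[:i] + (sv,) + g[i + 1:]
            for i, (gv, sv, ev) in enumerate(zip(g, s, example))
            if gv == '?' and sv != '?' and sv != ev]

def candidate_elimination(examples, n_attrs):
    G = [('?',) * n_attrs]   # most general boundary
    S = None                 # most specific boundary, taken from the first positive example
    for example, positive in examples:
        if positive:
            G = [g for g in G if covers(g, example)]
            S = example if S is None else generalize_minimally(S, example)
        else:
            if S is not None and covers(S, example):
                raise ValueError("training examples are inconsistent")
            new_G = []
            for g in G:
                new_G.extend(specialize_minimally(g, S, example) if covers(g, example) else [g])
            # Drop hypotheses that are strictly less general than another member of G.
            G = [g for g in new_G
                 if not any(o != g and more_general(o, g) for o in new_G)]
    return S, G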

18

Learning the concept of "Japanese economy car"

Features: Origin, Manufacturer, Color, Decade, Type

POSITIVE EXAMPLE: (Japan, Honda, Blue, 1980, Economy)

Initialize G to the singleton set that includes everything:
G = {(?, ?, ?, ?, ?)}

Initialize S to the singleton set that contains the first positive example:
S = {(Japan, Honda, Blue, 1980, Economy)}

19

Example continued

NEGATIVE EXAMPLE: (Japan, Toyota, Green, 1970, Sports)

Specialize G to exclude the negative example:

G = {(?, Honda, ?, ?, ?), (?, ?, Blue, ?, ?), (?, ?, ?, 1980, ?), (?, ?, ?, ?, Economy)}
S = {(Japan, Honda, Blue, 1980, Economy)}

20

Example continued

POSITIVE EXAMPLE: (Japan, Toyota, Blue, 1990, Economy)

Remove from G descriptions inconsistent with the positive example.

Generalize S to include the positive example:

G = {(?, ?, Blue, ?, ?), (?, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}

21

Example continued

NEGATIVE EXAMPLE: (USA, Chrysler, Red, 1980, Economy)

Specialize G to exclude the negative example (but staying within the version space, i.e., staying consistent with S):

G = {(?, ?, Blue, ?, ?), (Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, Blue, ?, Economy)}

22

Example continued

POSITIVE EXAMPLE: (Japan, Honda, White, 1980, Economy)

Remove from G descriptions inconsistent with the positive example.

Generalize S to include the positive example:

G = {(Japan, ?, ?, ?, Economy)}
S = {(Japan, ?, ?, ?, Economy)}

S = G, both singleton => done!
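For instance, feeding this walkthrough's training sequence to the candidate_elimination sketch given earlier reproduces the boundary sets shown on the slides:

examples = [
    (('Japan', 'Honda',    'Blue',  '1980', 'Economy'), True),
    (('Japan', 'Toyota',   'Green', '1970', 'Sports'),  False),
    (('Japan', 'Toyota',   'Blue',  '1990', 'Economy'), True),
    (('USA',   'Chrysler', 'Red',   '1980', 'Economy'), False),
    (('Japan', 'Honda',    'White', '1980', 'Economy'), True),
]
S, G = candidate_elimination(examples, n_attrs=5)
print(S)    # ('Japan', '?', '?', '?', 'Economy')
print(G)    # [('Japan', '?', '?', '?', 'Economy')]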

23

Decision trees

A decision tree is a structure that represents a procedure for classifying objects based on their attributes.

Each object is represented as a set of attribute/value pairs and a classification.

24

Example

A set of medical symptoms might be represented as follows:

Name    Cough   Fever   Weight   Pain      Classification
Mary    no      yes     normal   throat    flu
Fred    no      yes     normal   abdomen   appendicitis
Julie   yes     yes     skinny   none      flu
Elvis   yes     no      obese    chest     heart disease

The system is given a set of training instances along with their correct classifications and develops a decision tree based on these examples.
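For illustration, these training instances could be encoded as attribute/value pairs plus a classification, e.g. as Python dictionaries (the field names are just one possible encoding, not prescribed by the slides):

training_instances = [
    {"cough": "no",  "fever": "yes", "weight": "normal", "pain": "throat",  "class": "flu"},
    {"cough": "no",  "fever": "yes", "weight": "normal", "pain": "abdomen", "class": "appendicitis"},
    {"cough": "yes", "fever": "yes", "weight": "skinny", "pain": "none",    "class": "flu"},
    {"cough": "yes", "fever": "no",  "weight": "obese",  "pain": "chest",   "class": "heart disease"},
]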

25

Attributes

If a crucial attribute is not represented, then no decision tree will be able to learn the concept.

If two training instances have the same representation but belong to different classes, then the attribute set is said to be inadequate. It is impossible for the decision tree to distinguish the instances.

26

ID3 Algorithm (Quinlan, 1986)

ID3(R, C, S)   // R - list of attributes, C - categorical attribute, S - examples

If all examples from S belong to the same class Cj, return a leaf labeled Cj.

If R is empty, return a node labeled with the most frequent value of C in S.

Else:
  Select the "best" decision attribute A in R, with values v1, v2, ..., vn, for the next node.
  Divide the training set S into S1, ..., Sn according to the values v1, ..., vn.
  Call ID3(R - {A}, C, S1), ID3(R - {A}, C, S2), ..., ID3(R - {A}, C, Sn), i.e. recursively build the subtrees T1, ..., Tn for S1, ..., Sn.
  Return a node labeled A with the subtrees T1, T2, ..., Tn as children.
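A runnable Python sketch of this recursion, using entropy reduction to pick the "best" attribute as described on the following slides. The representation and names are illustrative (not Quinlan's original code); examples are assumed to be attribute/value dictionaries like the ones shown earlier:

from collections import Counter
from math import log2

def entropy(examples, target):
    # Entropy of the class distribution (values of the categorical attribute).
    counts = Counter(ex[target] for ex in examples)
    n = len(examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def id3(attributes, target, examples):
    # attributes = R, target = C (categorical attribute), examples = S
    classes = [ex[target] for ex in examples]
    if len(set(classes)) == 1:                      # all examples in the same class
        return classes[0]
    if not attributes:                              # R is empty: most frequent value of C
        return Counter(classes).most_common(1)[0][0]
    def remaining_info(a):                          # weighted entropy after splitting on a
        total = 0.0
        for v in set(ex[a] for ex in examples):
            part = [ex for ex in examples if ex[a] == v]
            total += (len(part) / len(examples)) * entropy(part, target)
        return total
    best = min(attributes, key=remaining_info)      # maximal gain = minimal remaining info
    subtree = {}
    for v in set(ex[best] for ex in examples):      # divide S into S1, ..., Sn
        part = [ex for ex in examples if ex[best] == v]
        subtree[v] = id3([a for a in attributes if a != best], target, part)
    return {best: subtree}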

27

Entropy

S - a sample of training examples

Entropy(S) = expected number of bits needed to encode the classification of an arbitrary member of S

Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p.

Generally, for c different classes:

Entropy(S) = Σ_{i=1..c} (- p_i * log2 p_i)
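In Python this formula can be written directly as follows (a small illustrative helper; log2 gives the result in bits, and zero probabilities are skipped, following the convention 0*log(0) = 0):

from math import log2

def entropy(class_probabilities):
    # Entropy(S) = sum over the c classes of -p_i * log2(p_i)
    return -sum(p * log2(p) for p in class_probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely classes
print(entropy([3/8, 5/8]))   # about 0.954 bits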

28

Entropy of the Training Set

T: a set of records partitioned into classes C1, C2, …, Ck on the basis of the categorical attribute C.

Probability of each class: Pi = |Ci| / |T|

Info(T) = -P1*log(P1) - … - Pk*log(Pk)

Info(T) is the information needed to classify an element.

29

How helpful is an attribute?

X: a non-categorical attribute; {T1, …, Tn} is the split of T according to the values of X.

The entropy of each Tk is:

Info(Tk) = -(|Tk1| / |Tk|)*log(|Tk1| / |Tk|) - … - (|Tkc| / |Tk|)*log(|Tkc| / |Tk|)

where c is the number of partitions of Tk produced by the categorical attribute C.

For any k, Info(Tk) reflects how the categorical attribute C splits the set Tk.

30

Information Gain

Info(X, T) = |T1|/|T| * Info(T1) + |T2|/|T| * Info(T2) + … + |Tn|/|T| * Info(Tn)

Gain(X, T) = Info(T) - Info(X, T) = Entropy(T) - Σ_i (|Ti|/|T|) * Entropy(Ti)

31

Information Gain

Gain(X, T) - the expected reduction in entropy caused by partitioning the examples of T according to the attribute X.

Gain(X, T) - a measure of the effectiveness of an attribute in classifying the training data.

The best attribute has maximal Gain(X, T).
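A sketch of these two quantities in Python, for records stored as attribute/value dictionaries with a designated class attribute (illustrative names, and log base 2 as on the earlier entropy slide):

from collections import Counter
from math import log2

def info(records, target):
    # Info(T): entropy of the class distribution given by the categorical attribute.
    counts = Counter(r[target] for r in records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(records, attribute, target):
    # Gain(X, T) = Info(T) - sum_i |Ti|/|T| * Info(Ti), where the Ti partition T by X.
    n = len(records)
    info_x = 0.0
    for value in set(r[attribute] for r in records):
        part = [r for r in records if r[attribute] == value]
        info_x += (len(part) / n) * info(part, target)
    return info(records, target) - info_x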

32

Example (1)

Name    Hair     Height    Weight    Lotion   Result
Sarah   blonde   average   light     no       sunburned (positive)
Dana    blonde   tall      average   yes      none (negative)
Alex    brown    short     average   yes      none
Annie   blonde   short     average   no       sunburned
Emily   red      average   heavy     no       sunburned
Pete    brown    tall      heavy     no       none
John    brown    average   heavy     no       none
Katie   blonde   short     light     yes      none

33

Example (2)

Attribute "hair":

Blonde: T1 = {Sarah, Dana, Annie, Katie}
Brown:  T2 = {Alex, Pete, John}
Red:    T3 = {Emily}

T1 is split by C into two sets: T11 = {Sarah, Annie}, T12 = {Dana, Katie}

Info(T1) = -2/4 * log(2/4) - 2/4 * log(2/4) = -log(1/2) = 1
In a similar way we compute Info(T2) = 0 and Info(T3) = 0.

Info('hair', T) = |T1|/|T| * Info(T1) + |T2|/|T| * Info(T2) + |T3|/|T| * Info(T3)
               = 4/8 * Info(T1) + 3/8 * Info(T2) + 1/8 * Info(T3)
               = 4/8 * 1 = 0.50

This happens to be the best attribute.
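Plugging the eight records above into the gain function sketched earlier (keeping only the attributes needed for this check) confirms the computation:

# Assumes gain() from the information-gain sketch above.
records = [
    {"name": "Sarah", "hair": "blonde", "lotion": "no",  "result": "sunburned"},
    {"name": "Dana",  "hair": "blonde", "lotion": "yes", "result": "none"},
    {"name": "Alex",  "hair": "brown",  "lotion": "yes", "result": "none"},
    {"name": "Annie", "hair": "blonde", "lotion": "no",  "result": "sunburned"},
    {"name": "Emily", "hair": "red",    "lotion": "no",  "result": "sunburned"},
    {"name": "Pete",  "hair": "brown",  "lotion": "no",  "result": "none"},
    {"name": "John",  "hair": "brown",  "lotion": "no",  "result": "none"},
    {"name": "Katie", "hair": "blonde", "lotion": "yes", "result": "none"},
]
print(gain(records, "hair", "result"))     # about 0.45 (Info(T) of about 0.954 minus 0.50)
print(gain(records, "lotion", "result"))   # about 0.35, so "hair" has the larger gain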

34

Example (3)

The resulting decision tree:

Hair color:
  blonde -> Lotion:
              no  -> sunburn
              yes -> none
  red    -> sunburn
  brown  -> none

35

Split Ratio

GainRatio(D, T) = Gain(D, T) / SplitInfo(D, T)

where SplitInfo(D, T) is the information due to the split of T when D is considered as the categorical attribute.
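Under the same illustrative record representation as before, SplitInfo and the gain ratio can be sketched as follows (a sketch only; it reuses the gain function from the earlier information-gain example):

from collections import Counter
from math import log2

def split_info(records, attribute):
    # SplitInfo(D, T): information of the split of T when D is treated as the categorical attribute.
    counts = Counter(r[attribute] for r in records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(records, attribute, target):
    # GainRatio(D, T) = Gain(D, T) / SplitInfo(D, T); assumes gain() from the earlier sketch.
    si = split_info(records, attribute)
    return gain(records, attribute, target) / si if si > 0 else 0.0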

36

Split Ratio Tree

[Figure: the decision tree obtained with the split-ratio criterion, rooted at Color with branches blonde, red and brown; the blonde branch tests Lotion (no/yes), with leaves labeled sunburn and none.]

37

More Training Examples