Page 1: Data Mining (and machine learning)

David Corne and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

Data Mining (and machine learning)

DM Lecture 7: Feature Selection

Page 2: Data Mining (and machine learning)


Today

• Finishing correlation/regression
• Feature Selection
• Coursework 2

Page 3: Data Mining (and machine learning)


Remember how to calculate r

If we have pairs of (x, y) values, Pearson's r is:

r = (1/(n−1)) × Σ_{i=1..n} [ (x_i − x̄)(y_i − ȳ) / (std_x · std_y) ]

(where x̄, ȳ are the sample means and std_x, std_y the sample standard deviations)

Interpretation of this should be obvious (?)

Page 4: Data Mining (and machine learning)


Equivalently, you can do it like this

Looking at it another way: after z-normalisation, X is the z-normalised x value in the sample, indicating how many stds away from the mean it is. Same for Y. The formula for r on the last slide is equivalent to this:

r = (1/(n−1)) × Σ_{i=1..n} X_i Y_i
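To make this concrete, here is a minimal Python sketch (my own illustration, not part of the slides and not the coursework solution; the function name and toy data are made up) that computes r via the z-normalised form above:

import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sample standard deviations (divide by n - 1), as in the formula above
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # r = 1/(n-1) times the sum of products of z-normalised values
    return sum(((x - mean_x) / std_x) * ((y - mean_y) / std_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(pearson_r(xs, ys), 4))   # close to +1 for this nearly linear toy data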

Page 5: Data Mining (and machine learning)


attribute min max mean std correlation median mode

Population 0 1 0.06 0.13 0.37 0.02 0.01

Householdsize 0 1 0.46 0.16 -0.03 0.44 0.41

Racepctblack 0 1 0.18 0.25 0.63 0.06 0.01

racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98

racePctAsian 0 1 0.15 0.21 0.04 0.07 0.02

racePctHisp 0 1 0.14 0.23 0.29 0.04 0.01

agePct12t21 0 1 0.42 0.16 0.06 0.4 0.38

agePct12t29 0 1 0.49 0.14 0.15 0.48 0.49

agePct16t24 0 1 0.34 0.17 0.1 0.29 0.29

agePct65up 0 1 0.42 0.18 0.07 0.42 0.47

numbUrban 0 1 0.06 0.13 0.36 0.03 0

pctUrban 0 1 0.7 0.44 0.08 1 1

medIncome 0 1 0.36 0.21 -0.42 0.32 0.23

pctWWage 0 1 0.56 0.18 -0.31 0.56 0.58

pctWFarmSelf 0 1 0.29 0.2 -0.15 0.23 0.16

pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41

pctWSocSec 0 1 0.47 0.17 0.12 0.475 0.56

pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1

pctWRetire 0 1 0.48 0.17 -0.1 0.47 0.44

medFamInc 0 1 0.38 0.2 -0.44 0.33 0.25

perCapInc 0 1 0.35 0.19 -0.35 0.3 0.23

whitePerCap 0 1 0.37 0.19 -0.21 0.32 0.3

blackPerCap 0 1 0.29 0.17 -0.28 0.25 0.18

indianPerCap 0 1 0.2 0.16 -0.09 0.17 0

AsianPerCap 0 1 0.32 0.2 -0.16 0.28 0.18

OtherPerCap 0 1 0.28 0.19 -0.13 0.25 0

HispPerCap 0 1 0.39 0.18 -0.24 0.345 0.3

NumUnderPov 0 1 0.06 0.13 0.45 0.02 0.01

PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08

PctLess9thGrade 0 1 0.32 0.21 0.41 0.27 0.19

The names file in the C&C dataset has correlation values (with the class/target) for each field

Page 6: Data Mining (and machine learning)


(The same table as on the previous slide is shown again, pointing out the correlation column: "here …")

Page 7: Data Mining (and machine learning)


ViolentCrimesPerPop 0 1 0.24 0.23 1 0.15 0.03 0

PctIlleg 0 1 0.25 0.23 0.74 0.17 0.09 0

PctKids2Par 0 1 0.62 0.21 -0.74 0.64 0.72 0

PctFam2Par 0 1 0.61 0.2 -0.71 0.63 0.7 0

racePctWhite 0 1 0.75 0.24 -0.68 0.85 0.98 0

PctYoungKids2Par 0 1 0.66 0.22 -0.67 0.7 0.91 0

PctTeen2Par 0 1 0.58 0.19 -0.66 0.61 0.6 0

racepctblack 0 1 0.18 0.25 0.63 0.06 0.01 0

pctWInvInc 0 1 0.5 0.18 -0.58 0.48 0.41 0

pctWPubAsst 0 1 0.32 0.22 0.57 0.26 0.1 0

FemalePctDiv 0 1 0.49 0.18 0.56 0.5 0.54 0

TotalPctDiv 0 1 0.49 0.18 0.55 0.5 0.57 0

PctPolicBlack 0 1 0.22 0.24 0.54 0.12 0 1675

MalePctDivorce 0 1 0.46 0.18 0.53 0.47 0.56 0

PctPersOwnOccup 0 1 0.56 0.2 -0.53 0.56 0.54 0

PctPopUnderPov 0 1 0.3 0.23 0.52 0.25 0.08 0

PctUnemployed 0 1 0.36 0.2 0.5 0.32 0.24 0

PctHousNoPhone 0 1 0.26 0.24 0.49 0.185 0.01 0

PctPolicMinor 0 1 0.26 0.23 0.49 0.2 0.07 1675

PctNotHSGrad 0 1 0.38 0.2 0.48 0.36 0.39 0

Here are the top 20 by correlation (although the first, the target itself, doesn't count) - this hints at how we might use correlation for feature selection
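As a hedged illustration (not from the slides) of using these values for ranking, the sketch below sorts a handful of (field, correlation) pairs copied from the table above by absolute correlation and keeps the top k, skipping the target itself:

stats = [
    ("ViolentCrimesPerPop", 1.00),   # the target itself - "doesn't count"
    ("PctIlleg", 0.74),
    ("PctKids2Par", -0.74),
    ("PctFam2Par", -0.71),
    ("racePctWhite", -0.68),
    ("PctYoungKids2Par", -0.67),
]
TARGET = "ViolentCrimesPerPop"

def top_k_by_correlation(pairs, k):
    # rank by |r|, strongest first, ignoring the target field itself
    ranked = sorted((p for p in pairs if p[0] != TARGET),
                    key=lambda p: abs(p[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

print(top_k_by_correlation(stats, 3))   # ['PctIlleg', 'PctKids2Par', 'PctFam2Par']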

Page 8: Data Mining (and machine learning)


Can anyone see a potential problem with choosing only (for example) the 20 features that correlate best with the target class?

Page 9: Data Mining (and machine learning)


Feature Selection: What

You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer)

Page 10: Data Mining (and machine learning)


Feature Selection: What

You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer)

The data has 10,000 fields (features)

Page 11: Data Mining (and machine learning)


Feature Selection: What

You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer)

The data has 10,000 fields (features)

You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?

Page 12: Data Mining (and machine learning)


Feature Selection: What

You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer)

The data has 10,000 fields (features)

You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?

The process of choosing the 1,000 fields to use is called Feature Selection

Page 13: Data Mining (and machine learning)


Datasets with many features

Gene expression datasets (~10,000 features)

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds

Proteomics data (~20,000 features)

http://www.ebi.ac.uk/pride/

Page 14: Data Mining (and machine learning)


Feature Selection: Why?

Page 15: Data Mining (and machine learning)


Feature Selection: Why?

[Figure: accuracy of all test Web URLs as the number of top words for the category file is changed; the accuracy axis runs from roughly 74% to 90%.]

Page 16: Data Mining (and machine learning)


Feature Selection: Why?

From http://elpub.scix.net/data/works/att/02-28.content.pdf

Page 17: Data Mining (and machine learning)


It is quite easy to find many more cases in papers where experiments show that accuracy reduces when you use more features

Page 18: Data Mining (and machine learning)


• Why does accuracy reduce with more features?

• How does it depend on the specific choice of features?

• What else changes if we use more features?

• So, how do we choose the right features?

Page 19: Data Mining (and machine learning)


Why accuracy reduces:

• Note: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of the learned model will typically drop. But you still have the original 20 features! Why does this happen?

Page 20: Data Mining (and machine learning)


Noise / Explosion

• The additional features typically add noise. Machine learning will pick up on spurious correlations that hold in the training set but not in the test set.

• For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…) – the increased space of possibilities is more difficult to search.

Page 21: Data Mining (and machine learning)


Feature selection methods

Page 22: Data Mining (and machine learning)


Feature selection methods

A big research area!

This diagram is from Dash & Liu (1997)

We’ll look briefly at parts of it

Page 23: Data Mining (and machine learning)


Feature selection methods

Page 24: Data Mining (and machine learning)


Feature selection methods

Page 25: Data Mining (and machine learning)


Correlation-based feature ranking

This is what you will use in CW2. It is indeed used often by practitioners (who perhaps don't understand the issues involved in FS).

It is actually fine for certain datasets. It is not even considered in Dash & Liu's survey.

Page 26: Data Mining (and machine learning)


A made-up dataset

f1 f2 f3 f4 … class

0.4 0.6 0.4 0.6 1

0.2 0.4 1.6 -0.6 1

0.5 0.7 1.8 -0.8 1

0.7 0.8 0.2 0.9 2

0.9 0.8 1.8 -0.7 2

0.5 0.5 0.6 0.5 2

Page 27: Data Mining (and machine learning)


Correlated with the class

f1 f2 f3 f4 … class

0.4 0.6 0.4 0.6 1

0.2 0.4 1.6 -0.6 1

0.5 0.7 1.8 -0.8 1

0.7 0.8 0.2 0.9 2

0.9 0.8 1.8 -0.7 2

0.5 0.5 0.6 0.5 2

Page 28: Data Mining (and machine learning)


uncorrelated with the class / seemingly random

f1 f2 f3 f4 … class

0.4 0.6 0.4 0.6 1

0.2 0.4 1.6 -0.6 1

0.5 0.7 1.8 -0.8 1

0.7 0.8 0.2 0.9 2

0.9 0.8 1.8 -0.7 2

0.5 0.5 0.6 0.5 2

Page 29: Data Mining (and machine learning)


Correlation-based FS reduces the dataset to this:

f1 f2 … class

0.4 0.6 1

0.2 0.4 1

0.5 0.7 1

0.7 0.8 2

0.9 0.8 2

0.5 0.5 2

Page 30: Data Mining (and machine learning)


But column 5 shows us f3 + f4 – which is perfectly correlated with the class! (A quick check of this follows the table.)

f1 f2 f3 f4 f3+f4 class

0.4 0.6 0.4 0.6 1 1

0.2 0.4 1.6 -0.6 1 1

0.5 0.7 1.8 -0.8 1 1

0.7 0.8 0.2 0.9 1.1 2

0.9 0.8 1.8 -0.7 1.1 2

0.5 0.5 0.6 0.5 1.1 2
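A quick check of this claim (my own sketch, using only the six rows above; statistics.correlation needs Python 3.10+):

from statistics import correlation

rows = [
    # f1,  f2,  f3,   f4,  class
    (0.4, 0.6, 0.4,  0.6, 1),
    (0.2, 0.4, 1.6, -0.6, 1),
    (0.5, 0.7, 1.8, -0.8, 1),
    (0.7, 0.8, 0.2,  0.9, 2),
    (0.9, 0.8, 1.8, -0.7, 2),
    (0.5, 0.5, 0.6,  0.5, 2),
]
cls = [r[4] for r in rows]
for i, name in enumerate(["f1", "f2", "f3", "f4"]):
    print(name, round(correlation([r[i] for r in rows], cls), 2))

combo = [r[2] + r[3] for r in rows]                # the derived column f3 + f4
print("f3+f4", round(correlation(combo, cls), 2))  # exactly 1.0: perfectly correlated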

Page 31: Data Mining (and machine learning)


Good FS Methods therefore:

• Need to consider how well features work together

• As we have noted before, if you take 100 features that are each well correlated with the class, they may simply be strongly correlated with each other, and so provide little more information than just one of them (a quick check of this is sketched below)
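A small, hedged illustration of this point (made-up values, not from the slides): two strongly inter-correlated features carry nearly the same information, which a quick pairwise check exposes (statistics.correlation needs Python 3.10+):

from statistics import correlation

chosen = {
    "featA": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "featB": [0.12, 0.19, 0.33, 0.41, 0.52, 0.58],   # nearly a copy of featA
    "featC": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
}
names = list(chosen)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = correlation(chosen[names[i]], chosen[names[j]])
        print(names[i], names[j], round(r, 2))   # featA-featB comes out close to 1

In practice, a good feature-set score has to penalise this kind of redundancy rather than reward it.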

Page 32: Data Mining (and machine learning)


`Complete’ methods

Original dataset has N features

You want to use a subset of k features

A complete FS method means: try every subset of k features, and choose the best!

the number of subsets is N! / (k! (N−k)!)

what is this when N is 100 and k is 5?

Page 33: Data Mining (and machine learning)


`Complete’ methods

Original dataset has N features

You want to use a subset of k features

A complete FS method means: try every subset of k features, and choose the best!

the number of subsets is N! / (k! (N−k)!)

what is this when N is 100 and k is 5?

75,287,520 -- almost nothing
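You can verify this count with the standard library (a small sketch, not from the slides):

from math import comb

print(comb(100, 5))   # 75287520 - the "almost nothing" case above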

Page 34: Data Mining (and machine learning)


`Complete’ methods

Original dataset has N features

You want to use a subset of k features

A complete FS method means: try every subset of k features, and choose the best!

the number of subsets is N! / (k! (N−k)!)

what is this when N is 10,000 and k is 100?

Page 35: Data Mining (and machine learning)


`Complete’ methods

Original dataset has N features

You want to use a subset of k features

A complete FS method means: try every subset of k features, and choose the best!

the number of subsets is N! / (k! (N−k)!)

what is this when N is 10,000 and k is 100?

5,000,000,000,000,000,000,000,000,000,

Page 36: Data Mining (and machine learning)


000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

Page 37: Data Mining (and machine learning)


000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

Page 38: Data Mining (and machine learning)


000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,

Page 39: Data Mining (and machine learning)


… continued for another 114 slides.

Actually it is around 5 × 10^35,101

(there are around 10^80 atoms in the universe)

Page 40: Data Mining (and machine learning)


Can you see a problem with complete methods?

Page 41: Data Mining (and machine learning)


`Forward’ methods

These methods `grow’ a set S of features –

• S starts empty

• Find the best feature to add (by checking which one gives best performance on a test set when combined with S).

• If overall performance has improved, return to step 2; else stop
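A minimal sketch of this greedy loop (my own illustration, not from the slides; it assumes you supply an evaluate(subset) function, e.g. the accuracy of a classifier trained on just those features and measured on a held-out set):

def forward_selection(all_features, evaluate):
    selected = []                    # S starts empty
    best_score = float("-inf")
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected
        # step 2: find the single best feature to add to the current set S
        f_best, score = max(((f, evaluate(selected + [f])) for f in candidates),
                            key=lambda pair: pair[1])
        if score > best_score:       # step 3: overall performance improved
            selected.append(f_best)
            best_score = score
        else:                        # no improvement: stop
            return selected

# toy demonstration: features 'a' and 'c' are the (hidden) useful ones
useful = {"a", "c"}
toy_eval = lambda subset: len(set(subset) & useful) - 0.1 * len(subset)
print(forward_selection(["a", "b", "c", "d"], toy_eval))   # ['a', 'c']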

Page 42: Data Mining (and machine learning)


`Backward’ methods

These methods remove features one by one.

• S starts with the full feature set

• Find the best feature to remove (by checking which removal from S gives best performance on a test set).

• If overall performance has improved, return to step 2; else stop
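And a mirror-image sketch for the backward case, under the same assumed evaluate(subset) function:

def backward_elimination(all_features, evaluate):
    selected = list(all_features)    # S starts with the full feature set
    best_score = evaluate(selected)
    while len(selected) > 1:
        # step 2: find the best single feature to remove from S
        f_best, score = max(((f, evaluate([g for g in selected if g != f]))
                             for f in selected), key=lambda pair: pair[1])
        if score > best_score:       # step 3: removal improved performance
            selected.remove(f_best)
            best_score = score
        else:                        # no improvement: stop
            return selected
    return selected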

Page 43: Data Mining (and machine learning)


• When might you choose forward instead of backward?

Page 44: Data Mining (and machine learning)

Random(ised) methods, aka stochastic methods


Suppose you have 1,000 features.

There are 2^1000 possible subsets of features.

One way to try to find a good subset is to run a stochastic search algorithm

E.g. hill-climbing, simulated annealing, genetic algorithm, particle swarm optimisation, …

Page 45: Data Mining (and machine learning)

One slide introduction to (most) stochastic search algorithms


A search algorithm:

BEGIN:
1. Initialise a random population P of N candidate solutions (maybe just 1) (e.g. each solution is a random subset of features)
2. Evaluate each solution in P (e.g. accuracy of 3-NN using only the features in that solution)

ITERATE:
1. Generate a set C of new solutions, using the good ones in P (e.g. choose a good one and mutate it, combine bits of two or more solutions, etc.)
2. Evaluate each of the new solutions in C
3. Update P, e.g. by choosing the best N from all of P and C
4. If we have iterated a certain number of times, or accuracy is good enough, stop
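As a rough illustration (not from the slides), the simplest instance of this loop is a hill-climber over feature subsets; evaluate(subset) is again assumed to be supplied (e.g. cross-validated 3-NN accuracy using only those features):

import random

def hillclimb_feature_subset(n_features, evaluate, iterations=1000, seed=0):
    rng = random.Random(seed)
    # BEGIN: one random candidate solution - a random subset, held as a set of feature indices
    current = {i for i in range(n_features) if rng.random() < 0.5}
    current_score = evaluate(current)
    for _ in range(iterations):
        # GENERATE: mutate the current solution by flipping one feature in or out
        neighbour = set(current)
        neighbour.symmetric_difference_update({rng.randrange(n_features)})
        # TEST: evaluate the new candidate
        score = evaluate(neighbour)
        # UPDATE: keep whichever of the two is better
        if score >= current_score:
            current, current_score = neighbour, score
    return current, current_score

# toy fitness: reward features 0 and 1, lightly penalise subset size
toy_eval = lambda s: len(s & {0, 1}) - 0.05 * len(s)
print(hillclimb_feature_subset(10, toy_eval, iterations=500))   # converges to ({0, 1}, 1.9)

A genetic algorithm or simulated annealing replaces the GENERATE and UPDATE steps with population-based or temperature-based variants, but the loop is the same.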

Page 46: Data Mining (and machine learning)

One slide introduction to (most) stochastic search algorithms


(The same outline as on the previous slide, annotated with its three phases: GENERATE, TEST, UPDATE.)

Page 47: Data Mining (and machine learning)

Why randomised/search methods are good for FS


Usually you have a large number of features (e.g. 1,000).

You can give each feature a score (e.g. correlation with the target, Relief weight (see end slides), etc.), and choose the best-scoring features. This is very fast.

However, this does not evaluate how well features work with other features. You could give combinations of features a score, but there are too many combinations of multiple features.

Search algorithms are the only suitable approach that gets to grips with evaluating combinations of features.

Page 48: Data Mining (and machine learning)

CW2


Page 49: Data Mining (and machine learning)

CW2 involves:

- Some basic dataset processing on the CandC dataset

- Applying a DMML technique called Naïve Bayes (NB: already implemented by me)

- Implementing your own script/code that can work out the correlation (Pearson’s r) between any two fields

- Running experiments to compare the results of NB when using the ‘top 5’, ‘top 10’ and ‘top 20’ fields according to correlation with the class field.


Page 50: Data Mining (and machine learning)

CW2: Naïve Bayes

- Described in the next lecture.

- It only works on discretized data, and predicts the class value of a target field.

- It uses Bayesian probability in a simple way to come up with a best guess for the class value, based on the proportions in exactly the type of histograms you are doing for CW1

- My NB awk script builds its probability model on the first 80% of data, and then outputs its average accuracy when applying this model to the remaining 20% of the data.

- It also outputs a confusion matrix


Page 51: Data Mining (and machine learning)

CW2


Page 52: Data Mining (and machine learning)


Page 53: Data Mining (and machine learning)

If time: the classic example of an instance-based heuristic method


Page 54: Data Mining (and machine learning)


The Relief method

An instance-based, heuristic method – it works out weight values for each feature, based on how important they seem to be in discriminating between near neighbours.

Page 55: Data Mining (and machine learning)


The Relief method

There are two features here – the x and the y co-ordinate. Initially they each have zero weight: wx = 0; wy = 0.

Page 56: Data Mining (and machine learning)


The Relief method

wx = 0; wy = 0; choose an instance at random

Page 57: Data Mining (and machine learning)


The Relief method

wx = 0; wy = 0; choose an instance at random, call it R

Page 58: Data Mining (and machine learning)


The Relief method

wx = 0; wy = 0; find H (hit: the nearest to R of the same class) and M (miss: the nearest to R of a different class)

Page 59: Data Mining (and machine learning)


The Relief method


wx = 0; wy = 0; find H (hit: the nearest to R of the same class) and M (miss: the nearest to R of different class)

Page 60: Data Mining (and machine learning)


The Relief method


wx = 0; wy = 0; now we update the weights based on the distances between R and H and between R and M. This happens one feature at a time.

Page 61: Data Mining (and machine learning)


The Relief method


To change wx, we add to it (MR − HR)/n, where HR and MR are the distances from R to H and from R to M measured along the x direction; so, the further away the `miss' is in the x direction, the higher the weight of x – the more important x is in terms of discriminating the classes.

Page 62: Data Mining (and machine learning)


The Relief method


To change wy, we add (MR − HR)/n again, but this time calculated in the y dimension; here the difference is clearly smaller – differences in this feature don't seem important in terms of class value.

Page 63: Data Mining (and machine learning)


The Relief method

Maybe now we have wx = 0.07, wy = 0.002.

Page 64: Data Mining (and machine learning)


The Relief method

wx = 0.07, wy = 0.002; pick another instance at random, and do the same again.

Page 65: Data Mining (and machine learning)


The Relief method

wx = 0.07, wy = 0.002; identify H and M

Page 66: Data Mining (and machine learning)


The Relief method

wx = 0.07, wy = 0.002; add the (MR − HR) differences divided by n, for each feature, again …

Page 67: Data Mining (and machine learning)


The Relief method

In the end, we have a weight value for each feature. The higher the value, the more relevant the feature.

We can use these weights for feature selection, simply by choosing the features with the S highest weights (if we want to use S features)

NOTE:
- It is important to use Relief only on min-max normalised data in [0,1]. However, it is fine if categorical attributes are involved, in which case use Hamming distance for those attributes.
- Why divide by n? Then the weight values can be interpreted as a difference in probabilities.
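For reference, here is a compact Python sketch of the basic two-class Relief procedure described on these slides (my own illustration, not the original Kira and Rendell code; it assumes min-max normalised numeric data and n sampling rounds):

import random

def relief_weights(data, n, seed=0):
    # data is a list of (feature_vector, class_label) pairs, features in [0, 1]
    rng = random.Random(seed)
    num_feats = len(data[0][0])
    w = [0.0] * num_feats
    for _ in range(n):
        r_vec, r_cls = data[rng.randrange(len(data))]   # a random instance R
        def nearest(same_class):
            pool = [(x, c) for (x, c) in data
                    if (c == r_cls) == same_class and x is not r_vec]
            return min(pool, key=lambda xc: sum((a - b) ** 2
                                                for a, b in zip(xc[0], r_vec)))[0]
        hit = nearest(True)      # H: nearest neighbour of the same class
        miss = nearest(False)    # M: nearest neighbour of a different class
        for f in range(num_feats):
            # the further the miss and the closer the hit in this feature,
            # the more this feature's weight grows (the (MR - HR)/n update above)
            w[f] += (abs(miss[f] - r_vec[f]) - abs(hit[f] - r_vec[f])) / n
    return w

Choosing the S features with the largest weights then gives the selected subset.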

Page 68: Data Mining (and machine learning)


The Relief method, plucked directly from the original paper (Kira and Rendell, 1992)

Page 69: Data Mining (and machine learning)

Some recommended reading, if you are interested, is on the website


