A More Principled Approach to Machine Learning
Michael R. Smith
Brigham Young University, Department of Computer Science
2 February 2015
Machine Learning
- Learn from past experience
- Change behavior without being explicitly programmed
- Optimization techniques: maximize accuracy, minimize error
- Mine data
Machine Learning Example: I, Robot
Machine Learning

Training Data:
Weight | Height | Blood Press | Temp  | Has Disease
205    | 78     | good        | 98.2  | yes
157    | 65     | bad         | 100.7 | yes
185    | 71     | mod         | 99.5  | no

The learning algorithm is trained on these data, then asked to classify an unseen instance:
Weight | Height | Blood Press | Temp  | Has Disease
172    | 67     | bad         | 100.1 | ?
Machine Learning
- Learning Algorithm
- Training Data
- Hyper-parameters
- Hypothesis/model: the output of the learning algorithm applied to the training data under a choice of hyper-parameters
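This pipeline can be sketched in toy code. Below, a learning algorithm `g` takes training data `t` and a hyper-parameter (the number of neighbors, `k`) and returns a model `h` as a closure; the data mirror the disease table from the previous slide, with blood pressure encoded numerically (good=0, mod=1, bad=2). The names `g`, `t`, `h` and the encoding are illustrative assumptions, not the talk's own notation.

```python
def g(t, k):
    """Learning algorithm: training data t + hyper-parameter k -> model h."""
    def h(x):
        # Sort training rows by squared distance to x and vote among the k nearest.
        nearest = sorted(t, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        labels = [label for _, label in nearest[:k]]
        return max(set(labels), key=labels.count)
    return h

# Training data t: (features, label); blood pressure encoded good=0, mod=1, bad=2.
t = [((205, 78, 0, 98.2), "yes"),
     ((157, 65, 2, 100.7), "yes"),
     ((185, 71, 1, 99.5), "no")]

h = g(t, k=1)                  # fix the hyper-parameter, induce the model
print(h((172, 67, 2, 100.1)))  # nearest training row is (185, 71, mod, 99.5) -> "no"
```

Note that with `k=1` the answer is decided entirely by the raw squared distance, where weight dominates; real pipelines would scale the features first.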
Machine Learning

Weight | Height | Blood Press | Temp  | Has Disease
205    | 78     | good        | 98.2  | yes
157    | 65     | bad         | 100.7 | yes
185    | 71     | mod         | 99.5  | no

The learning algorithm, given training data and hyper-parameters, classifies the unseen instance:
172 | 67 | bad | 100.1 | ?

Meta-data recorded from such experiments:
Data Set | # Features | # Classes | Entropy | … | # Nodes | Learning Rate | … | Accuracy
Disease  | 4          | 2         | 0.24    | … | 3       | 0.1           | … | 83.4
Iris     | 4          | 3         | 0.76    | … | 7       | 0.2           | … | 97.4
Meta-Learning

Meta-data from previous experiments:
Data Set | # Features | # Classes | Entropy | … | # Nodes | Learning Rate | … | Accuracy
Disease  | 4          | 2         | 0.24    | … | 3       | 0.1           | … | 83.4
Iris     | 4          | 3         | 0.76    | … | 7       | 0.2           | … | 97.4

A learning algorithm trained on the meta-data predicts settings for a new data set from its meta-features:
Data Set | # Features | # Classes | Entropy | …
Ecology  | 17         | 3         | 0.5     | …
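The meta-features in the table above (number of features, number of classes, class entropy) can be computed mechanically from any labeled data set. A minimal sketch, with toy rows standing in for the disease data; class entropy is one common choice, computed here over the label distribution:

```python
import math
from collections import Counter

def meta_features(X, y):
    """Derive simple meta-features: #features, #classes, class entropy."""
    counts = Counter(y)
    n = len(y)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"n_features": len(X[0]), "n_classes": len(counts), "entropy": entropy}

# Toy rows standing in for the disease data set.
X = [[205, 78, 0, 98.2], [157, 65, 2, 100.7], [185, 71, 1, 99.5]]
y = ["yes", "yes", "no"]
print(meta_features(X, y))  # entropy of a 2/3 vs 1/3 label split is about 0.918
```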
Meta-Learning
- Learning how to learn
- Learn from previous experiments: a data-driven approach
- Given a data set, automatically:
  - Preprocess the data: select features, create new features, select/discard instances
  - Select a learning algorithm
  - Set its hyper-parameters
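The data-driven loop above can be sketched as follows: store results of previous experiments as (meta-features, algorithm, accuracy) triples and, for a new data set, recommend the algorithm that worked on the most similar previously seen data set. The stored numbers, meta-feature choices, and nearest-neighbor rule are illustrative assumptions, not the system described in the talk.

```python
# Meta-data from previous experiments: (meta-features, algorithm, accuracy).
history = [
    ({"n_features": 4, "entropy": 0.24}, "MLP", 83.4),
    ({"n_features": 4, "entropy": 0.76}, "C4.5", 97.4),
    ({"n_features": 17, "entropy": 0.50}, "RandF", 88.1),
]

def recommend(meta):
    """Recommend the algorithm used on the nearest data set in meta-feature space."""
    def dist(rec):
        m = rec[0]
        return sum((m[k] - meta[k]) ** 2 for k in meta)
    nearest = min(history, key=dist)
    return nearest[1]

print(recommend({"n_features": 4, "entropy": 0.70}))  # -> C4.5
```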
Meta-Learning Difficulties
- Large space of possibilities: effectively infinite choices of learning algorithm and hyper-parameters, plus a large search space over the data itself (instances, features, etc.)
- The hyper-parameters cannot be selected until the learning algorithm is chosen
- Performance depends jointly on the algorithm, its hyper-parameters, and the training data
- Meta-features are not predictive of performance
- Gathering meta-data is computationally expensive
Previous Work
- Model Selection: predict the best learning algorithm with the hyper-parameters fixed
- Hyper-parameter Selection: predict hyper-parameters for a given learning algorithm
- Hyper-parameter Optimization: search the hyper-parameter space
  - Grid Search, Random Search, Bayesian Optimization
- No learning from previous experiments
- OpenML.org stores results from previous experiments
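Grid search and random search, the two simplest searches of the hyper-parameter space, can be sketched as below. `score` is a made-up surrogate for "train with these settings and measure validation accuracy", so the function, its arguments, and its optimum are illustrative assumptions.

```python
import itertools
import random

def score(lr, depth):
    # Surrogate for validation accuracy; peaks at lr=0.1, depth=5 by construction.
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * abs(depth - 5)

# Grid search: exhaustively evaluate a fixed grid of settings.
grid = itertools.product([0.01, 0.1, 1.0], [3, 5, 7])
best_grid = max(grid, key=lambda p: score(*p))

# Random search: sample settings at random from the space.
random.seed(0)
samples = [(random.uniform(0.0, 1.0), random.randint(1, 10)) for _ in range(20)]
best_rand = max(samples, key=lambda p: score(*p))

print(best_grid)   # -> (0.1, 5)
print(best_rand)
```

Neither search learns from previous experiments; each new data set restarts from scratch, which is the gap meta-learning targets.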
Instance Hardness
- Learning algorithms are generally evaluated at the data set level
- Are some instances intrinsically hard to classify?
- Why are some instances misclassified?
- Are there instances that are misclassified that should not be?
- Are some instances misclassified by all learning algorithms? If so, why?
Data Set (figure)

Overfit (figure)

Linear Classifier (figure)

Detrimental Instances (figure)
Instance Hardness
- Better intuition of learning algorithms and why instances are misclassified
  - Can learning algorithms be improved? Where?
- Informed analysis of learning algorithm performance
  - Is the classification reasonable?
  - Where can the quality of the data be improved?
- Empirical analysis of the classification of 57 data sets by 9 learning algorithms: 10-fold cross-validation, 178,109 instances, 5,310 models created
Instance Hardness
Measure how difficult an instance is to classify correctly.
Instance Hardness
- 9 learning algorithms: C4.5, MLP, RIPPER, NNge, Ridor, 5-NN, Random Forest, LWL, Naïve Bayes
- Unsupervised meta-learning: cluster the learning algorithms based on diversity; the hardness intuition then holds for all of the algorithms in a cluster
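A minimal sketch of an instance-hardness estimate: the fraction of the learning algorithms that misclassify each instance. The per-algorithm prediction vectors below are invented stand-ins for the cross-validated predictions used in the actual study.

```python
true_labels = ["yes", "yes", "no", "no"]
predictions = {                      # one prediction vector per algorithm
    "C4.5": ["yes", "yes", "no", "yes"],
    "MLP":  ["yes", "no",  "no", "yes"],
    "5NN":  ["yes", "yes", "no", "yes"],
}

def hardness(i):
    """Fraction of algorithms that misclassify instance i (0 = easy, 1 = hard)."""
    wrong = sum(preds[i] != true_labels[i] for preds in predictions.values())
    return wrong / len(predictions)

print([hardness(i) for i in range(len(true_labels))])  # -> [0.0, 0.333..., 0.0, 1.0]
```

Instance 3 is misclassified by every algorithm, mirroring the 5% of instances reported on the next slide.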
Existence of Instance Hardness
- 53% of instances correctly classified by all algorithms
- 5% misclassified by all algorithms
- Learning algorithms disagree on 42% of the instances
- 15% misclassified by the majority of algorithms
Modeling Detrimental Instances
Each instance is composed of:
- x – the input features
- y – the true, unobserved class label
- ŷ – the observed class label

The true class label is generally ignored; noise is handled only indirectly, via regularization, validation sets, and pruning.
Modeling Detrimental Instances

(Figure: graphical models relating x, y, and ŷ)

p(ŷ|x) p(x) expands, by introducing the latent true label y, to p(ŷ|x, y) p(y|x) p(x).

How can the true class label be taken into account?
- Filtering
- Data polishing
- A specific learning algorithm (e.g., boosting)
- Weight instances by p(y|x)
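Weighting by p(y|x) can be illustrated with a toy weighted objective: an instance whose observed label is probably wrong (low estimated p(y|x)) contributes little to the training loss, so it barely influences the induced model. All numbers below are invented.

```python
qualities = [0.98, 0.05, 0.90]        # estimated p(y|x) per instance
per_instance_loss = [0.1, 0.9, 0.2]   # loss the model incurs on each instance

# Unweighted objective: the likely-mislabeled instance dominates.
plain = sum(per_instance_loss) / len(per_instance_loss)

# Quality-weighted objective: low-quality instances are discounted.
weighted = (sum(q * l for q, l in zip(qualities, per_instance_loss))
            / sum(qualities))

print(round(plain, 3), round(weighted, 3))  # 0.4 vs 0.167
```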
Instance Quality Learning
- Incorporate instance quality into the learning process: maximize the fit to the quality-weighted data rather than to the raw data
- Detrimental instances should have less of an effect on the induced model
- Inequality learning: do not treat all of the instances "equally"
Inequality Learning (figure)

Inequality Learning (figure: example instance quality weights 0.00019, 0.678, 0.054)
Results: Original

       | MLP     | C4.5    | 5-NN    | LWL      | NB      | Nnge    | RandF   | Ridor   | Rip
Orig   | 80.7    | 80.1    | 79      | 69.4     | 75.7    | 79.4    | 81.6    | 76.6    | 77.8
QW-L   | 83.8    | 80.1    | 80      | 70.4     | 77.2    | 79.4    | 83.3    | 78.6    | 79.7
p-val  | <0.001  | 0.045   | 0.015   | 0.014    | <0.001  | 0.788   | <0.001  | 0.036   | <0.001
g,e,l  | 47,0,5  | 32,0,20 | 35,1,16 | 28,10,14 | 35,1,16 | 20,1,27 | 33,1,18 | 31,1,19 | 38,0,14
QW-B   | 84.6    | 82.3    | 80.3    | 68.2     | 75.2    | 79.4    | 83.5    | 78.6    | 78.8
p-val  | <0.001  | <0.001  | 0.016   | 0.590    | 0.858   | 0.877   | <0.001  | 0.013   | <0.001
g,e,l  | 49,0,3  | 37,1,14 | 32,0,20 | 22,12,18 | 19,1,32 | 21,1,26 | 32,2,18 | 34,1,16 | 37,3,12
Filter | 82.9    | 81.8    | 82.3    | 70.0     | 77.3    | 82.4    | 83.2    | 79.5    | 79.7
p-val  | <0.001  | <0.001  | <0.001  | 0.032    | <0.001  | <0.001  | <0.001  | <0.001  | <0.001
g,e,l  | 39,0,13 | 38,3,11 | 38,4,10 | 26,12,14 | 36,1,15 | 40,0,12 | 33,1,18 | 35,3,14 | 40,2,10

(Orig: unmodified accuracy; QW-L/QW-B: quality weighting with continuous/binary weights; g,e,l: number of data sets with greater/equal/lower accuracy than Orig.)
Inequality Learning
- Increases the accuracy for all of the investigated learning algorithms
- Advantage to using a continuous quality value rather than a binary one
- Most effective in global learning algorithms such as backpropagation; this could be a side effect of how instance quality was integrated into the learning algorithm (future work)
- Focusing on the data: how does it compare with hyper-parameter optimization (HPO)?
Comparison of HPO and Filtering
HPO and filtering can both be expensive; which has the greater benefit?
- Standard approach: a perspective on the current state, based on performance on a validation set
- Optimistic approach: a perspective on the potential of a technique, based on 10-fold cross-validation performance
K-Fold Cross-Validation
Create K partitions of the data set. Each partition is used once for testing while the remaining K-1 partitions are used for training.
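The procedure above, sketched as index bookkeeping; partitioning by interleaving is an arbitrary choice for illustration:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for K-fold cross-validation over n items."""
    folds = [list(range(i, n, k)) for i in range(k)]  # K interleaved partitions
    for i, test in enumerate(folds):
        # The current partition tests; the other K-1 partitions train.
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(sorted(test), "tested; trained on the other", len(train))
```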
K-Fold Cross-Validation
Use a validation set, held out from the training folds, to determine which set of hyper-parameters to use.
Experimental Methodology
- Hyper-parameter optimization: Bayesian optimization (more than 512 hyper-parameter settings explored for most learning algorithms)
  - Standard: uses the accuracy on a validation set
  - Optimistic: uses the 10-fold cross-validation accuracy
- Filtering:
  - Ensemble Filter (L-Filter): removes instances that are misclassified by the majority of a set of learning algorithms
  - Adaptive Filter (A-Filter): greedy search among candidate learning algorithms
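The Ensemble Filter's majority rule can be sketched directly; the prediction vectors below stand in for real trained models.

```python
true_labels = ["yes", "yes", "no", "no"]
predictions = [
    ["yes", "no",  "no", "yes"],  # algorithm 1
    ["yes", "yes", "no", "yes"],  # algorithm 2
    ["yes", "no",  "no", "no"],   # algorithm 3
]

def keep(i):
    """Keep instance i unless a majority of the algorithms misclassify it."""
    wrong = sum(p[i] != true_labels[i] for p in predictions)
    return wrong <= len(predictions) / 2

filtered = [i for i in range(len(true_labels)) if keep(i)]
print(filtered)  # -> [0, 2]: instances 1 and 3 are removed before training
```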
Results: Standard Approach

(Figure: accuracy of Orig, L-Filter, and HPO for MLP, C4.5, kNN, NB, RF, and RIPPER; y-axis 75-93% accuracy.)

Data sets with greater/equal/lower accuracy vs Orig:
     | L-Filter | HPO
MLP  | 44,1,7   | 47,0,5
C4.5 | 45,1,6   | 39,0,13
kNN  | 44,2,6   | 41,2,9
NB   | 42,0,10  | 42,1,9
RF   | 38,3,11  | 37,2,13
RIP  | 50,0,2   | 47,1,4
Results: Optimistic Approach

(Figure: accuracy of HPO, L-Filter, and A-Filter for MLP, C4.5, kNN, NB, RF, and RIPPER; y-axis 75-93% accuracy.)

No one filtering approach is best for all data sets and learning algorithms.

Data sets with greater/equal/lower accuracy vs HPO:
     | L-Filter | A-Filter
MLP  | 27,3,22  | 45,0,7
C4.5 | 33,4,15  | 48,2,2
kNN  | 30,2,20  | 51,0,1
NB   | 22,2,28  | 34,0,18
RF   | 27,1,24  | 46,0,6
RIP  | 34,1,17  | 48,0,4
Why does filtering have such a significant effect?
Recall: maximize the probability of the hypothesis given the data. By Bayes' rule,

p(h|D) = p(D|h) p(h) / p(D)

and, at the instance level,

p(h|D) = ∏_{i=1}^{|D|} p(⟨x_i, y_i⟩|h) p(h) / p(D)
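A quick numeric illustration of why this product matters: because p(h|D) multiplies per-instance likelihoods, a single instance that the hypothesis cannot explain drags the whole product toward zero, which is exactly what filtering removes. The likelihood values are invented.

```python
import math

clean = [0.9, 0.8, 0.85, 0.9]   # per-instance likelihoods p(<x_i, y_i>|h)
noisy = clean + [0.01]          # one detrimental instance added

prod = math.prod
print(prod(clean))  # ~0.55
print(prod(noisy))  # ~0.0055 -- two orders of magnitude smaller
```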
Example Data Set (figure)
A Need for Better Understanding
- Filtering has a much higher potential than HPO
- But no principled examination of why: exceptions vs. noise

HPO
- Pros: significantly increases accuracy; one pass; uses all of the instances
- Cons: uses all of the instances, so noisy instances are used to induce the model

Filtering
- Pros: significantly increases accuracy; noisy instances are not used to induce the model
- Cons: requires multiple passes through the training set (find noisy instances, then train the learning algorithm); can remove good instances
The Need for a Repository (figures)
Benefits of a Repository
- Better science: reproducible, saved results
- Save time
- Build reputation
- Easier to compare with other work
- Gives a snapshot of the current state, overall and for a specific data set
- Meta-learning: provides a data set
Machine Learning Results Repository (figure)

Machine Learning Results Repository
- Data set level
- Learning algorithm level
- Instance level
Future Directions and Projects
- MLRR: data quality, linking with papers, creating user profiles, anonymous postings for supplemental material
- Meta-learning: combine learning with optimization techniques; meta-features; deep learning; collaborative filtering
- Automate machine learning
Future Directions and Projects
- Incorporate information into the learning process
- Use cases of machine learning: how is machine learning actually used? How can it be made easier to use?
- Collaboration/application to other fields: bioinformatics, social media, sports statistics
Thank you