A More Principled Approach to Machine Learning
Michael R. Smith
Brigham Young University, Department of Computer Science
2 February 2015
Machine Learning
- Learn from past experience
- Change behavior without being explicitly programmed
- Optimization techniques: maximize accuracy, minimize error
- Mine data
Machine Learning Example: I, Robot
Machine Learning

Training Data:
Weight | Height | Blood Press | Temp  | Has Disease
205    | 78     | good        | 98.2  | yes
157    | 65     | bad         | 100.7 | yes
185    | 71     | mod         | 99.5  | no

The learning algorithm is trained on these data, then asked to classify an unseen instance:
Weight | Height | Blood Press | Temp  | Has Disease
172    | 67     | bad         | 100.1 | ?
Machine Learning
- Learning Algorithm
- Training Data
- Hyper-parameters
- Hypothesis/model: the output of the learning algorithm applied to the training data under a choice of hyper-parameters
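This pipeline can be sketched in toy code. Below, a learning algorithm `g` takes training data `t` and a hyper-parameter (the number of neighbors, `k`) and returns a model `h` as a closure; the data mirror the disease table from the previous slide, with blood pressure encoded numerically (good=0, mod=1, bad=2). The names `g`, `t`, `h` and the encoding are illustrative assumptions, not the talk's own notation.

```python
def g(t, k):
    """Learning algorithm: training data t + hyper-parameter k -> model h."""
    def h(x):
        # Sort training rows by squared distance to x and vote among the k nearest.
        nearest = sorted(t, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        labels = [label for _, label in nearest[:k]]
        return max(set(labels), key=labels.count)
    return h

# Training data t: (features, label); blood pressure encoded good=0, mod=1, bad=2.
t = [((205, 78, 0, 98.2), "yes"),
     ((157, 65, 2, 100.7), "yes"),
     ((185, 71, 1, 99.5), "no")]

h = g(t, k=1)                  # fix the hyper-parameter, induce the model
print(h((172, 67, 2, 100.1)))  # nearest training row is (185, 71, mod, 99.5) -> "no"
```

Note that with `k=1` the answer is decided entirely by the raw squared distance, where weight dominates; real pipelines would scale the features first.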
Machine Learning

Weight | Height | Blood Press | Temp  | Has Disease
205    | 78     | good        | 98.2  | yes
157    | 65     | bad         | 100.7 | yes
185    | 71     | mod         | 99.5  | no

The learning algorithm, given training data and hyper-parameters, classifies the unseen instance:
172 | 67 | bad | 100.1 | ?

Meta-data recorded from such experiments:
Data Set | # Features | # Classes | Entropy | … | # Nodes | Learning Rate | … | Accuracy
Disease  | 4          | 2         | 0.24    | … | 3       | 0.1           | … | 83.4
Iris     | 4          | 3         | 0.76    | … | 7       | 0.2           | … | 97.4
Meta-Learning

Meta-data from previous experiments:
Data Set | # Features | # Classes | Entropy | … | # Nodes | Learning Rate | … | Accuracy
Disease  | 4          | 2         | 0.24    | … | 3       | 0.1           | … | 83.4
Iris     | 4          | 3         | 0.76    | … | 7       | 0.2           | … | 97.4

A learning algorithm trained on the meta-data predicts settings for a new data set from its meta-features:
Data Set | # Features | # Classes | Entropy | …
Ecology  | 17         | 3         | 0.5     | …
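The meta-features in the table above (number of features, number of classes, class entropy) can be computed mechanically from any labeled data set. A minimal sketch, with toy rows standing in for the disease data; class entropy is one common choice, computed here over the label distribution:

```python
import math
from collections import Counter

def meta_features(X, y):
    """Derive simple meta-features: #features, #classes, class entropy."""
    counts = Counter(y)
    n = len(y)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"n_features": len(X[0]), "n_classes": len(counts), "entropy": entropy}

# Toy rows standing in for the disease data set.
X = [[205, 78, 0, 98.2], [157, 65, 2, 100.7], [185, 71, 1, 99.5]]
y = ["yes", "yes", "no"]
print(meta_features(X, y))  # entropy of a 2/3 vs 1/3 label split is about 0.918
```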
Meta-Learning
- Learning how to learn
- Learn from previous experiments: a data-driven approach
- Given a data set, automatically:
  - Preprocess the data: select features, create new features, select/discard instances
  - Select a learning algorithm
  - Set its hyper-parameters
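The data-driven loop above can be sketched as follows: store results of previous experiments as (meta-features, algorithm, accuracy) triples and, for a new data set, recommend the algorithm that worked on the most similar previously seen data set. The stored numbers, meta-feature choices, and nearest-neighbor rule are illustrative assumptions, not the system described in the talk.

```python
# Meta-data from previous experiments: (meta-features, algorithm, accuracy).
history = [
    ({"n_features": 4, "entropy": 0.24}, "MLP", 83.4),
    ({"n_features": 4, "entropy": 0.76}, "C4.5", 97.4),
    ({"n_features": 17, "entropy": 0.50}, "RandF", 88.1),
]

def recommend(meta):
    """Recommend the algorithm used on the nearest data set in meta-feature space."""
    def dist(rec):
        m = rec[0]
        return sum((m[k] - meta[k]) ** 2 for k in meta)
    nearest = min(history, key=dist)
    return nearest[1]

print(recommend({"n_features": 4, "entropy": 0.70}))  # -> C4.5
```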
Meta-Learning Difficulties
- Large space of possibilities: effectively infinite choices of learning algorithm and hyper-parameters, plus a large search space over the data itself (instances, features, etc.)
- The hyper-parameters cannot be selected until the learning algorithm is chosen
- Performance depends jointly on the algorithm, its hyper-parameters, and the training data
- Meta-features are not predictive of performance
- Gathering meta-data is computationally expensive
Previous Work
- Model Selection: predict the best learning algorithm with the hyper-parameters fixed
- Hyper-parameter Selection: predict hyper-parameters for a given learning algorithm
- Hyper-parameter Optimization: search the hyper-parameter space
  - Grid Search, Random Search, Bayesian Optimization
- No learning from previous experiments
- OpenML.org stores results from previous experiments
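Grid search and random search, the two simplest searches of the hyper-parameter space, can be sketched as below. `score` is a made-up surrogate for "train with these settings and measure validation accuracy", so the function, its arguments, and its optimum are illustrative assumptions.

```python
import itertools
import random

def score(lr, depth):
    # Surrogate for validation accuracy; peaks at lr=0.1, depth=5 by construction.
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * abs(depth - 5)

# Grid search: exhaustively evaluate a fixed grid of settings.
grid = itertools.product([0.01, 0.1, 1.0], [3, 5, 7])
best_grid = max(grid, key=lambda p: score(*p))

# Random search: sample settings at random from the space.
random.seed(0)
samples = [(random.uniform(0.0, 1.0), random.randint(1, 10)) for _ in range(20)]
best_rand = max(samples, key=lambda p: score(*p))

print(best_grid)   # -> (0.1, 5)
print(best_rand)
```

Neither search learns from previous experiments; each new data set restarts from scratch, which is the gap meta-learning targets.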
Instance Hardness
- Learning algorithms are generally evaluated at the data set level
- Are some instances intrinsically hard to classify?
- Why are some instances misclassified?
- Are there instances that are misclassified that should not be?
- Are some instances misclassified by all learning algorithms? If so, why?
Data Set (figure)

Overfit (figure)

Linear Classifier (figure)

Detrimental Instances (figure)
Instance Hardness
- Better intuition of learning algorithms and why instances are misclassified
  - Can learning algorithms be improved? Where?
- Informed analysis of learning algorithm performance
  - Is the classification reasonable?
  - Where can the quality of the data be improved?
- Empirical analysis of the classification of 57 data sets by 9 learning algorithms: 10-fold cross-validation, 178,109 instances, 5,310 models created
Instance Hardness
Measure how difficult an instance is to classify correctly.
Instance Hardness
- 9 learning algorithms: C4.5, MLP, RIPPER, NNge, Ridor, 5-NN, Random Forest, LWL, Naïve Bayes
- Unsupervised meta-learning: cluster the learning algorithms based on diversity; the hardness intuition then holds for all of the algorithms in a cluster
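A minimal sketch of an instance-hardness estimate: the fraction of the learning algorithms that misclassify each instance. The per-algorithm prediction vectors below are invented stand-ins for the cross-validated predictions used in the actual study.

```python
true_labels = ["yes", "yes", "no", "no"]
predictions = {                      # one prediction vector per algorithm
    "C4.5": ["yes", "yes", "no", "yes"],
    "MLP":  ["yes", "no",  "no", "yes"],
    "5NN":  ["yes", "yes", "no", "yes"],
}

def hardness(i):
    """Fraction of algorithms that misclassify instance i (0 = easy, 1 = hard)."""
    wrong = sum(preds[i] != true_labels[i] for preds in predictions.values())
    return wrong / len(predictions)

print([hardness(i) for i in range(len(true_labels))])  # -> [0.0, 0.333..., 0.0, 1.0]
```

Instance 3 is misclassified by every algorithm, mirroring the 5% of instances reported on the next slide.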
Existence of Instance Hardness
- 53% of instances correctly classified by all algorithms
- 5% misclassified by all algorithms
- Learning algorithms disagree on 42% of the instances
- 15% misclassified by the majority of algorithms
Modeling Detrimental Instances
Each instance is composed of:
- x – the input features
- y – the true, unobserved class label
- ŷ – the observed class label

The true class label is generally ignored; noise is handled only indirectly, via regularization, validation sets, and pruning.
Modeling Detrimental Instances

(Figure: graphical models relating x, y, and ŷ)

p(ŷ|x) p(x) expands, by introducing the latent true label y, to p(ŷ|x, y) p(y|x) p(x).

How can the true class label be taken into account?
- Filtering
- Data polishing
- A specific learning algorithm (e.g., boosting)
- Weight instances by p(y|x)
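Weighting by p(y|x) can be illustrated with a toy weighted objective: an instance whose observed label is probably wrong (low estimated p(y|x)) contributes little to the training loss, so it barely influences the induced model. All numbers below are invented.

```python
qualities = [0.98, 0.05, 0.90]        # estimated p(y|x) per instance
per_instance_loss = [0.1, 0.9, 0.2]   # loss the model incurs on each instance

# Unweighted objective: the likely-mislabeled instance dominates.
plain = sum(per_instance_loss) / len(per_instance_loss)

# Quality-weighted objective: low-quality instances are discounted.
weighted = (sum(q * l for q, l in zip(qualities, per_instance_loss))
            / sum(qualities))

print(round(plain, 3), round(weighted, 3))  # 0.4 vs 0.167
```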
Instance Quality Learning
- Incorporate instance quality into the learning process: maximize the fit to the quality-weighted data rather than to the raw data
- Detrimental instances should have less of an effect on the induced model
- Inequality learning: do not treat all of the instances "equally"
Inequality Learning (figure)

Inequality Learning (figure: example instance quality weights 0.00019, 0.678, 0.054)
Results: Original

       | MLP     | C4.5    | 5-NN    | LWL      | NB      | Nnge    | RandF   | Ridor   | Rip
Orig   | 80.7    | 80.1    | 79      | 69.4     | 75.7    | 79.4    | 81.6    | 76.6    | 77.8
QW-L   | 83.8    | 80.1    | 80      | 70.4     | 77.2    | 79.4    | 83.3    | 78.6    | 79.7
p-val  | <0.001  | 0.045   | 0.015   | 0.014    | <0.001  | 0.788   | <0.001  | 0.036   | <0.001
g,e,l  | 47,0,5  | 32,0,20 | 35,1,16 | 28,10,14 | 35,1,16 | 20,1,27 | 33,1,18 | 31,1,19 | 38,0,14
QW-B   | 84.6    | 82.3    | 80.3    | 68.2     | 75.2    | 79.4    | 83.5    | 78.6    | 78.8
p-val  | <0.001  | <0.001  | 0.016   | 0.590    | 0.858   | 0.877   | <0.001  | 0.013   | <0.001
g,e,l  | 49,0,3  | 37,1,14 | 32,0,20 | 22,12,18 | 19,1,32 | 21,1,26 | 32,2,18 | 34,1,16 | 37,3,12
Filter | 82.9    | 81.8    | 82.3    | 70.0     | 77.3    | 82.4    | 83.2    | 79.5    | 79.7
p-val  | <0.001  | <0.001  | <0.001  | 0.032    | <0.001  | <0.001  | <0.001  | <0.001  | <0.001
g,e,l  | 39,0,13 | 38,3,11 | 38,4,10 | 26,12,14 | 36,1,15 | 40,0,12 | 33,1,18 | 35,3,14 | 40,2,10

(Orig: unmodified accuracy; QW-L/QW-B: quality weighting with continuous/binary weights; g,e,l: number of data sets with greater/equal/lower accuracy than Orig.)
Inequality Learning
- Increases the accuracy for all of the investigated learning algorithms
- Advantage to using a continuous quality value rather than a binary one
- Most effective in global learning algorithms such as backpropagation; this could be a side effect of how instance quality was integrated into the learning algorithm (future work)
- Focusing on the data: how does it compare with hyper-parameter optimization (HPO)?
Comparison of HPO and Filtering
HPO and filtering can both be expensive; which has the greater benefit?
- Standard approach: a perspective on the current state, based on performance on a validation set
- Optimistic approach: a perspective on the potential of a technique, based on 10-fold cross-validation performance
K-Fold Cross-Validation
Create K partitions of the data set. Each partition is used once for testing while the remaining K-1 partitions are used for training.
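The procedure above, sketched as index bookkeeping; partitioning by interleaving is an arbitrary choice for illustration:

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for K-fold cross-validation over n items."""
    folds = [list(range(i, n, k)) for i in range(k)]  # K interleaved partitions
    for i, test in enumerate(folds):
        # The current partition tests; the other K-1 partitions train.
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(10, 5):
    print(sorted(test), "tested; trained on the other", len(train))
```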
K-Fold Cross-Validation
Use a validation set, held out from the training folds, to determine which set of hyper-parameters to use.
Experimental Methodology
- Hyper-parameter optimization: Bayesian optimization (more than 512 hyper-parameter settings explored for most learning algorithms)
  - Standard: uses the accuracy on a validation set
  - Optimistic: uses the 10-fold cross-validation accuracy
- Filtering:
  - Ensemble Filter (L-Filter): removes instances that are misclassified by the majority of a set of learning algorithms
  - Adaptive Filter (A-Filter): greedy search among candidate learning algorithms
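The Ensemble Filter's majority rule can be sketched directly; the prediction vectors below stand in for real trained models.

```python
true_labels = ["yes", "yes", "no", "no"]
predictions = [
    ["yes", "no",  "no", "yes"],  # algorithm 1
    ["yes", "yes", "no", "yes"],  # algorithm 2
    ["yes", "no",  "no", "no"],   # algorithm 3
]

def keep(i):
    """Keep instance i unless a majority of the algorithms misclassify it."""
    wrong = sum(p[i] != true_labels[i] for p in predictions)
    return wrong <= len(predictions) / 2

filtered = [i for i in range(len(true_labels)) if keep(i)]
print(filtered)  # -> [0, 2]: instances 1 and 3 are removed before training
```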
Results: Standard Approach

(Figure: accuracy of Orig, L-Filter, and HPO for MLP, C4.5, kNN, NB, RF, and RIPPER; y-axis 75-93% accuracy.)

Data sets with greater/equal/lower accuracy vs Orig:
     | L-Filter | HPO
MLP  | 44,1,7   | 47,0,5
C4.5 | 45,1,6   | 39,0,13
kNN  | 44,2,6   | 41,2,9
NB   | 42,0,10  | 42,1,9
RF   | 38,3,11  | 37,2,13
RIP  | 50,0,2   | 47,1,4
Results: Optimistic Approach

(Figure: accuracy of HPO, L-Filter, and A-Filter for MLP, C4.5, kNN, NB, RF, and RIPPER; y-axis 75-93% accuracy.)

No one filtering approach is best for all data sets and learning algorithms.

Data sets with greater/equal/lower accuracy vs HPO:
     | L-Filter | A-Filter
MLP  | 27,3,22  | 45,0,7
C4.5 | 33,4,15  | 48,2,2
kNN  | 30,2,20  | 51,0,1
NB   | 22,2,28  | 34,0,18
RF   | 27,1,24  | 46,0,6
RIP  | 34,1,17  | 48,0,4
Why does filtering have such a significant effect?
Recall: maximize the probability of the hypothesis given the data. By Bayes' rule,

p(h|D) = p(D|h) p(h) / p(D)

and, at the instance level,

p(h|D) = ∏_{i=1}^{|D|} p(⟨x_i, y_i⟩|h) p(h) / p(D)
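A quick numeric illustration of why this product matters: because p(h|D) multiplies per-instance likelihoods, a single instance that the hypothesis cannot explain drags the whole product toward zero, which is exactly what filtering removes. The likelihood values are invented.

```python
import math

clean = [0.9, 0.8, 0.85, 0.9]   # per-instance likelihoods p(<x_i, y_i>|h)
noisy = clean + [0.01]          # one detrimental instance added

prod = math.prod
print(prod(clean))  # ~0.55
print(prod(noisy))  # ~0.0055 -- two orders of magnitude smaller
```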
Example Data Set (figure)
A Need for Better Understanding
- Filtering has a much higher potential than HPO
- But no principled examination of why: exceptions vs. noise

HPO
- Pros: significantly increases accuracy; one pass; uses all of the instances
- Cons: uses all of the instances, so noisy instances are used to induce the model

Filtering
- Pros: significantly increases accuracy; noisy instances are not used to induce the model
- Cons: requires multiple passes through the training set (find noisy instances, then train the learning algorithm); can remove good instances
The Need for a Repository (figures)
Benefits of a Repository
- Better science: reproducible, saved results
- Save time
- Build reputation
- Easier to compare with other work
- Gives a snapshot of the current state, overall and for a specific data set
- Meta-learning: provides a data set
Machine Learning Results Repository (figure)

Machine Learning Results Repository
- Data set level
- Learning algorithm level
- Instance level
Future Directions and Projects
- MLRR: data quality, linking with papers, creating user profiles, anonymous postings for supplemental material
- Meta-learning: combine learning with optimization techniques; meta-features; deep learning; collaborative filtering
- Automate machine learning
Future Directions and Projects
- Incorporate information into the learning process
- Use cases of machine learning: how is machine learning actually used? How can it be made easier to use?
- Collaboration/application to other fields: bioinformatics, social media, sports statistics
Thank you