Experimental Design for Machine Learning on Multimedia Data
Lecture 5
Dr. Gerald Friedland, [email protected]
Website: http://www.icsi.berkeley.edu/~fractor/fall2019/
Discuss Homework

Project proposal deadline: October 11th, 2019.
Email project proposals to me and Rishi.
Project Questions 1-4 (Data & Problem Inspection)
1) What is the variable the machine learner should predict? What is the required accuracy for success? What impact will adversarial examples have?
2) How much data do we have to train the prediction of the variable? Are the classes balanced? How many modalities could be exploited in the data? Is there temporal information? How much noise are we expecting? Do you expect bias?
3) How well is the data annotated (anecdotally)? What is the annotator agreement (measured)?
4) Given questions 1-3: Are we reducing information (pattern matching) or do we need to infer information (statistical machine learner)? As a consequence, what seems to be the best choice of machine learner for each modality?
Project Questions 5-8 (Training for Generalization)
5) Estimate the memory equivalent capacity needed for the machine learner of your choice. What is the expected generalization? What does the progression look like: is there enough data?
6) Train your machine learner for accuracy at memory equivalent capacity. Can you reach near 100% memorization? If not, why (diagnose)?
7) Train your machine learner for generalization: plot the accuracy/capacity curve. What are the expected accuracy and generalization ratio at the point you decided to stop? Do you need to try a different machine learner (if so, redo from 5)? Should you extract features (if so, redo from 5)?
8) How well did your generalization prediction hold on the independent test data? Explain the results. How confident are you in them?
Project Questions 9-10 (Finishing Touch)
9) How do you combine the models of the modalities? Explain your choice. How confident are you in the combination results (i.e., does it make sense to combine)?
10) What are the final combined results of the system? Are the experiments documented and repeatable (if not, please make sure they are, even for bad results)? Are the experiments reproducible (speculate)?
Generic Project Workflow for Accuracy

[Diagram: Training — development data feeds machine learning, which produces statistical models. Testing — the models are applied to test data, producing results. Evaluation — the results are compared against the ground truth with an error metric, yielding accuracy scores.]
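The training/testing/evaluation flow above can be sketched end-to-end. The nearest-centroid "model" below is my own minimal stand-in for any statistical model, purely to make the three phases concrete:

```python
# Minimal sketch of the generic accuracy workflow:
# training -> testing -> evaluation against ground truth.
# The nearest-centroid classifier is a stand-in for any machine learner.

def train(dev_data):
    """Training: build statistical models (here, one centroid per class)."""
    sums, counts = {}, {}
    for features, label in dev_data:
        sums.setdefault(label, [0.0] * len(features))
        counts[label] = counts.get(label, 0) + 1
        sums[label] = [s + f for s, f in zip(sums[label], features)]
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def apply_model(model, features):
    """Testing: apply the model to one unseen sample."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def evaluate(model, test_data):
    """Evaluation: compare predictions to ground truth (error metric)."""
    correct = sum(apply_model(model, x) == y for x, y in test_data)
    return correct / len(test_data)

dev = [([0.0, 0.1], "a"), ([0.1, 0.0], "a"), ([1.0, 0.9], "b"), ([0.9, 1.0], "b")]
test = [([0.05, 0.05], "a"), ([0.95, 0.95], "b")]
model = train(dev)
print(evaluate(model, test))  # accuracy score on the held-out test data
```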
Generic Project Workflow for Generalization

[Flowchart: "Supervised Machine Learning Engineering Process — Maximizing the Chance for Generalization / Minimizing Adversarial Examples". Summarized:
1. Start from labelled data. Is the amount of labelled data high enough? Are the classes balanced? If not, subsample to balance the classes.
2. Check annotator agreement. If too low: "clean" the data or annotate again. If undetermined: annotate redundantly.
3. If not committed to a specific ML approach, estimate the best approach using capacity estimators; the approach with the lowest capacity estimate wins.
4. Estimate the generalization progression. If the progression is not good, acquire more labelled data (a hard decision; possible causes: insufficient labelled data, distrust of the estimators, or mismatched representation function(s)).
5. Split into train and test data, run the capacity estimator on the training data, and train with the training data at memory capacity. If accuracy is not high, start debugging.
6. Train the machine learner. High accuracy with small capacity: congrats, keep the model! High accuracy only with large capacity (aka "overfitting"): reduce the machine learner's capacity, train on the training data, test on the test data, and repeat until accuracy drops; then use the model from the previous iteration. Low accuracy even with large capacity: start debugging; if many different ML approaches have already been tried, acquire more labelled data.]

Gerald Friedland, v0.4, Jan 2nd, 2019, [email protected]
Conclusions so far

▪ The lower limit of generalization is memorization. That is, the upper limit for the size of a machine learner is its memory capacity.
▪ The memory capacity is measurable in bits.
▪ Using a machine learner that is over capacity is a waste of resources and increases the risk of failure!
▪ Alchemy was converted into chemistry by measuring: it's time to convert guessing and checking in machine learning into science! Let's call it data science?
▪ To do:
  ▪ Non-binary classifiers, regression
  ▪ Convolutional networks, other machine learners
  ▪ Re-thinking training
  ▪ Explainable adversarial examples
Predicting Capacity Requirements

Given data and labels: how much actual capacity do I need to memorize the function?

Theoretical answer: the minimum description length of the table representing the function f (that is, its Shannon entropy).

Practical answer:
1) Worst case: build a neural network where only the biases are trained.
2) Expected case: how much parameter reduction can (exponential) training buy us?
Predicting Maximum Memory Capacity

[Diagram: the "dumb" network — inputs x1, x2, … feed a hidden layer of threshold units whose input weights are fixed to 1 and whose biases b1, b2, …, b_{m×n} are the only trained parameters; a second layer with weights of +1 and -1 combines the threshold outputs into a single ±1 output.]

Runtime: O(n log n)
Predicting Memory Capacity

Dumb network:
• Highly inefficient.
• Potentially not 100% accurate (hash collisions).
• We can assume that training the weights (and biases) gets to 100% accuracy while reducing parameters.

Expected reduction: exponential! n thresholds should be representable with log2 n weights and biases (search tree!).
Empirical Results

All results repeatable at: https://github.com/fractor/nntailoring
From Memorization to Generalization

Good news:
• Real-world data is not random.
• The information capacity of a perceptron is usually > 1 bit per parameter (Cover, MacKay).

This means we should be able to use fewer parameters than predicted by memory capacity calculations.

Memorization is worst-case generalization.
Suggested Engineering Process for Generalization

• Start at the approximate expected capacity.
• Train to > 98% accuracy. If impossible, increase parameters.
• Retrain iteratively with decreased capacity while testing against a validation set. You should see a decrease in training accuracy together with an increase in validation-set accuracy.
• Stop at the minimum capacity that gives the best held-out-set accuracy.

Best-case scenario: as parameters are reduced, the neural network fails to memorize only the insignificant (noise) bits.
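The steps above amount to an outer loop over model capacity. The sketch below shows only that control flow; `train_and_score` and the hard-coded accuracies are hypothetical stand-ins for real training and validation runs:

```python
def tune_capacity(capacities, train_and_score, target_train_acc=0.98):
    """Start at the approximate expected capacity, then iteratively retrain
    with fewer parameters, keeping the model with the best validation
    accuracy. train_and_score(c) returns (train_acc, val_acc)."""
    best = None  # (val_acc, capacity)
    for c in sorted(capacities, reverse=True):  # shrink capacity each round
        train_acc, val_acc = train_and_score(c)
        if train_acc < target_train_acc and best is not None:
            break  # model can no longer fit the training data; stop shrinking
        if best is None or val_acc > best[0]:
            best = (val_acc, c)
    return best

# Hypothetical scores: training accuracy falls and validation accuracy
# peaks as parameters are removed (the expected curve from the lecture).
scores = {256: (1.00, 0.85), 128: (1.00, 0.90), 64: (0.99, 0.93),
          32: (0.95, 0.88)}
print(tune_capacity(scores, lambda c: scores[c]))  # -> (0.93, 64)
```

The loop stops at the minimum capacity that still trains above the target and keeps the best held-out accuracy seen so far.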
Generalization Process: Expected Curve

[Figure: expected accuracy/capacity curve.]
Overcapacity Machine Learning: Issues

▪ Waste of money, energy, and time. Bad for the environment.
▪ Unseen (redundant) bits are a necessary condition for adversarial examples. See: B. Li, G. Friedland, J. Wang, R. Jia, C. Spanos, D. Song: "One Bit Matters: Explaining Adversarial Examples as the Abuse of Redundancy", submitted to ICLR 2019.
▪ Fewer parameters give a higher chance of explainability (Occam's Razor). See: G. Friedland, A. Metere: "Machine Learning for Science", UQ SciML Workshop, Los Angeles, June 2018.
Reminder: Occam's Razor

Among competing hypotheses, the one with the fewest assumptions should be selected.

For each accepted explanation of a phenomenon, there may be an extremely large, perhaps even incomprehensible, number of possible and more complex alternatives, because one can always burden failing explanations with ad hoc hypotheses to prevent them from being falsified; therefore, simpler theories are preferable to more complex ones because they are more testable. (Wikipedia, Sep. 2017)
Demo time!

▪ Intro to tools on GitHub
▪ T(n,k) calculation
▪ Capacity estimation given a table
▪ Capacity progression
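For the T(n,k) item — assuming it refers to the function-counting theorem (Cover, MacKay) that the lecture's capacity arguments build on — a minimal reference implementation:

```python
from math import comb

def T(n, k):
    """Function-counting theorem (Cover/MacKay): the number of distinct
    labelings of n points in general position that a k-parameter
    perceptron (hyperplane through the origin) can realize:
    T(n, k) = 2 * sum_{i=0}^{k-1} C(n-1, i)."""
    return 2 * sum(comb(n - 1, i) for i in range(k))

print(T(4, 4))  # 16 = 2^4: all labelings realizable when k >= n
print(T(4, 2))  # 8 of the 16 labelings of 4 points
```

Note T(n, k) = 2^n whenever k >= n, matching the memorization limit: with enough parameters, every labeling of the table can be stored.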